In our case, clusters are built from genes (hybridizations), in such a way that genes (hybridizations) with the highest similarity - or, conversely, the smallest distance - are grouped together into clusters.

The aim is to organize all gene vectors (hybridization vectors) into a binary tree (dendrogram).

The clustering process:

Initialize a distance matrix by computing the vector-vector distances between all pairs of objects (genes/hybridizations).

Iteratively compute:

- find the pair of vectors with the smallest distance
- combine the two closest vectors - single genes or average vectors from already assembled subclusters - into a new "subcluster" and remove the original objects
- recompute all distances between the new "subcluster" and all remaining objects
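The steps above can be sketched in Python (a naive average-linkage implementation with the Manhattan distance; all function and variable names are illustrative, not the program's actual code):

```python
# Naive average-linkage agglomerative clustering (illustrative sketch).
# Each cluster is a pair: (list of original object indices, cluster vector).

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cluster(vectors):
    # start: every object is its own cluster
    clusters = [([i], list(v)) for i, v in enumerate(vectors)]
    merges = []                      # records the dendrogram structure
    while len(clusters) > 1:
        # 1. find the pair of clusters with the smallest distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = manhattan(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (mi, vi), (mj, vj) = clusters[i], clusters[j]
        # 2. combine the two closest clusters into a new "subcluster"
        #    whose vector is the average of the two cluster vectors
        merged = ([*mi, *mj], [(a + b) / 2 for a, b in zip(vi, vj)])
        merges.append((mi, mj, d))
        # 3. remove the originals and add the subcluster; distances to it
        #    are recomputed in the next pass of the loop
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# 6 "genes" with two expression values each (cf. the simple example below)
genes = [(1, 1), (1, 2), (5, 5), (5, 6), (9, 1), (9, 2)]
tree = cluster(genes)
```

Note that this sketch averages the two cluster vectors directly; real implementations differ in how subcluster distances are updated (the linkage method).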

The clustering process is memory- and time-intensive. The distance matrix for 10,000 genes requires ~400 MByte of RAM; for 30,000 genes, ~3 GByte are required.

Clustering 1,000 genes is done in a few seconds; 10,000 genes require several minutes, 30,000 genes hours.

A key question in clustering is how to measure distances/similarities between the individual data vectors.

Different distance measures may generate quite different cluster results.

Thus, it is important to select the distance metric which is most adequate for the specific dataset and question.

See below for the supported distance metrics.

See a very simple example illustrating the overall clustering process.

(6 "genes" with two expression values each, applying Manhattan distance metric and average linkage)

Select Cluster from the main menu:

The cluster dialog opens up:

Define the parameters used for clustering:

Select to cluster Genes and/or Conditions (=Hybridizations).

Clustering may be performed in one go, or you can first cluster genes, then
cluster conditions with a different distance metric or linkage.

Several distance metrics may be selected. See below for more details about the metrics.

At present, three linkage methods may be selected (Group average does not work yet). See below for more details about the linkage methods.

Run clustering algorithm with specified parameters.

A progress indicator box will show up.

The lower progress indicator bar shows the main processing steps:

- gene distance calculation
- gene clustering
- condition distance calculation
- condition clustering

The upper progress indicator bar shows the progress of each of the (up-to) four steps.

After clustering, the heatmap view will update and show the computed trees:

To make this precise, one can define the mathematical concept of a metric.

A metric on a set is a function (the distance function d(x,y) between two vectors x and y) with the fundamental properties:

1. | d(x,y) ≥ 0 | non-negativity |

2. | d(x,y) = d(y,x) | symmetry |

3. | d(x,x) = 0 | identity |

4. | d(x,y) = 0, if and only if x = y | definiteness |

5. | d(x,z) ≤ d(x,y) + d(y,z) | triangle inequality |
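As a quick, purely illustrative numeric check, the Manhattan distance satisfies all five properties on a set of random integer vectors:

```python
import itertools
import random

def d(x, y):
    """Manhattan distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

random.seed(0)
pts = [tuple(random.randint(-5, 5) for _ in range(3)) for _ in range(20)]
for x, y, z in itertools.product(pts, repeat=3):
    assert d(x, y) >= 0                      # 1. non-negativity
    assert d(x, y) == d(y, x)                # 2. symmetry
    assert d(x, x) == 0                      # 3. identity
    assert (d(x, y) == 0) == (x == y)        # 4. definiteness
    assert d(x, z) <= d(x, y) + d(y, z)      # 5. triangle inequality
```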

Now you can

Commonly used methods are:

**Minkowski distances** - measure mainly the absolute similarity of genes

Typical representatives:

- Euclidean distance
- Manhattan distance
- Chebyshev distance

**Gower distance** - a variant of the Euclidean distance

**Canberra distance** - a variant of the Manhattan distance

**Correlation distances** - measure mainly the similarity in the shape of the profiles

**Cosine distance** - measures the angle between the two vectors (very similar to correlation distance)

**"Dot-product" like**

**"Jaccard"** - binary distance

**"Hamming"** - binary distance

Different distance metrics will measure different properties of the data.

As a result, clustering with different distance metrics may generate completely different trees.

Which distance metric to use depends on the question you want your clustering to answer.

Assume two vectors P,Q with dimension n:

The Minkowski distance of order p is defined as:

d(P,Q) = ( Σ_{i=1..n} |p_i − q_i|^p )^(1/p)

Most commonly used are the orders p = 1, 2 and infinity.

Let's look at a very simple example in two-dimensional space (n=2):

Manhattan distance (p=1): d(P,Q) = |p_1 − q_1| + |p_2 − q_2|

Euclidean distance (p=2): d(P,Q) = sqrt( (p_1 − q_1)² + (p_2 − q_2)² )

Chebyshev distance (p=infinite): d(P,Q) = max( |p_1 − q_1|, |p_2 − q_2| )
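The three special cases above can be sketched with the standard Minkowski formula (function and variable names are illustrative):

```python
def minkowski(p_vec, q_vec, p):
    """Minkowski distance of order p (p = float('inf') gives Chebyshev)."""
    diffs = [abs(a - b) for a, b in zip(p_vec, q_vec)]
    if p == float('inf'):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

P, Q = (1, 2), (4, 6)                      # two vectors in 2-D space
manhattan = minkowski(P, Q, 1)             # 3 + 4 = 7
euclidean = minkowski(P, Q, 2)             # sqrt(9 + 16) = 5
chebyshev = minkowski(P, Q, float('inf'))  # max(3, 4) = 4
```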

Let's take our two vectors (P, Q) and plot their components against each other in a scatterplot (x-axis: P, y-axis: Q).

If the shape of the data vectors is similar, the data points should be closely scattered around a line (linear regression).

E.g. the two genes (P and Q) are always expressed / regulated in the same way (up / down) for each single hybridisation.

Now we can test whether there is a certain dependence (= correlation) between the two data vectors:

- **Pearson correlation** - linear dependence
- **Pearson Jackknife** - linear dependence, with permutation test
- **Spearman correlation** - monotonic dependence
- **Kendall's Tau correlation** - local monotonic dependence

The Pearson correlation coefficient is:

r = Σ (p_i − p̄)(q_i − q̄) / ( sqrt(Σ (p_i − p̄)²) · sqrt(Σ (q_i − q̄)²) )

The values for r:

- 1 = identical shape of the vectors
- 0 = no similarity in the vectors' shape
- -1 = completely opposite shape of the vectors

The corresponding correlation distance is d = 1 − r:

- 0 = identical shape of the vectors
- 1 = no similarity in the vectors' shape
- 2 = completely opposite shape of the vectors

Note also:

- the slope of the regression line might be different from 1
- the absolute values of the two vectors might be quite different

The picture illustrates in more detail the result of the Pearson correlation (taken from Wikipedia):

If the data vectors form, e.g., a perfect banana- or sigmoid-shaped scatterplot, the Pearson correlation coefficient will return values well below 1, even though the vectors are clearly (non-linearly) dependent.
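A minimal sketch of the Pearson correlation distance d = 1 − r (the function name is illustrative):

```python
from math import sqrt

def pearson_distance(x, y):
    """d = 1 - r: identical shape -> ~0, completely opposite shape -> ~2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    r = cov / (sx * sy)
    return 1 - r

d_same = pearson_distance((1, 2, 3), (10, 20, 30))   # same shape -> ~0
d_opposite = pearson_distance((1, 2, 3), (3, 2, 1))  # opposite shape -> ~2
```

Note that the first pair has very different absolute values but identical shape, so its distance is (nearly) zero.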

Spearman's rank correlation is a non-parametric measure of statistical dependence and tests whether the relation between our two vectors may be described by a monotonic function.

Data vectors are transformed into ranked variables.

Assume two vectors:

P | 4 | 2 | 10 | 6 | 8 |

Q | 5 | 3 | 42 | 20 | 40 |

Let's rank both vectors (i.e. sort by magnitude):

P | 4 | 2 | 10 | 6 | 8 |

Ranks(P) | 2 | 1 | 5 | 3 | 4 |

Q | 5 | 3 | 42 | 20 | 40 |

Ranks(Q) | 2 | 1 | 5 | 3 | 4 |

With the new rank vectors we compute the Pearson correlation coefficient:

Ranks(P) | 2 | 1 | 5 | 3 | 4 |

Ranks(Q) | 2 | 1 | 5 | 3 | 4 |
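The ranking step can be sketched as follows (illustrative; ties are not handled). Spearman's coefficient is then simply the Pearson correlation of the rank vectors, and since the two rank vectors here are identical, ρ = 1:

```python
def ranks(v):
    """Rank vector: the smallest value gets rank 1 (no ties assumed)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

P = [4, 2, 10, 6, 8]
Q = [5, 3, 42, 20, 40]
rp, rq = ranks(P), ranks(Q)   # both are [2, 1, 5, 3, 4]
```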

In the present version,

The non-parametric tests measure the rank correlation, i.e. the similarity in the order of the data after ranking (= sorting).

Assume two data vectors:

P | 4 | 2 | 10 | 6 | 8 |

Q | 5 | 20 | 42 | 40 | 3 |

Let's rank both vectors:

P | 4 | 2 | 10 | 6 | 8 |

Ranks(P) | 2 | 1 | 5 | 3 | 4 |

Q | 5 | 20 | 42 | 40 | 3 |

Ranks(Q) | 2 | 3 | 5 | 4 | 1 |

Now we analyze the ranks:

A | B | C | D | E | |

Ranks(P) | 2 | 1 | 5 | 3 | 4 |

Ranks(Q) | 2 | 3 | 5 | 4 | 1 |

Let's analyze each possible pair of data values from vectors P and Q.

- Count the pairs where the order of values is identical (concordant pairs, n_c)
- Count the pairs where the order of values is inverted (discordant pairs, n_d)

Pair | Order in P | Order in Q | concordant | discordant |

A-B | 2 > 1 | 2 < 3 | | X |

A-C | 2 < 5 | 2 < 5 | X | |

A-D | 2 < 3 | 2 < 4 | X | |

A-E | 2 < 4 | 2 > 1 | | X |

B-C | 1 < 5 | 3 < 5 | X | |

B-D | 1 < 3 | 3 < 4 | X | |

B-E | 1 < 4 | 3 > 1 | | X |

C-D | 5 > 3 | 5 > 4 | X | |

C-E | 5 > 4 | 5 > 1 | X | |

D-E | 3 < 4 | 4 > 1 | | X |

Sum | | | 6 | 4 |

4 data pairs have inverted order; thus the Kendall tau distance is 4.

Normalized to the number of pairs (10), the Kendall tau distance is 0.4, indicating low association between the rankings.

Similarly we compute the correlation coefficient:

with | n_{c}=number of concordant pairs | |

n_{d}=number of disconcordant pairs) | ||

n=dimension of data vectors | ||

In the example τ = 0.44, indicating nearly no correlation.
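The pair counting can be sketched as follows (illustrative; this is the standard tau-a formula, and ties are not handled):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (n_c - n_d) / (n(n-1)/2), ignoring ties."""
    nc = nd = 0
    for i, j in combinations(range(len(x)), 2):
        # a pair is concordant if both vectors order i and j the same way
        if (x[i] - x[j]) * (y[i] - y[j]) > 0:
            nc += 1
        else:
            nd += 1
    return nc, nd, (nc - nd) / (len(x) * (len(x) - 1) / 2)

P = [4, 2, 10, 6, 8]
Q = [5, 20, 42, 40, 3]
nc, nd, tau = kendall_tau(P, Q)   # nc = 6, nd = 4, tau = 0.2
```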

In the present version,

The formula and result are very similar to correlation distance.

An alternative measure could combine both pieces of information.

The dot product between two vectors is defined as:

P · Q = Σ p_i q_i = |P| |Q| cos(θ)

The dot product between two identical vectors is > 0; the dot product between two orthogonal vectors is 0.

But we can use a modified "dot product", where we replace the cosine function with a sine:
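For reference, a minimal sketch of the cosine distance 1 − cos(θ) (illustrative; the sine-based variant mentioned above is not shown):

```python
from math import sqrt

def cosine_distance(x, y):
    """1 - cos(angle): ~0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

d_parallel = cosine_distance((1, 2), (2, 4))    # same direction -> ~0
d_orthogonal = cosine_distance((1, 0), (0, 1))  # 90 degrees -> 1
```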

Special metrics for binary data - a bit unconventional for gene expression data.

For these metrics, the data should contain only "1" and "0" values - binary data.

- If all data are binary, i.e. contain only "1" and "0": the data are clustered directly.
- Anything else: a temporary binarisation is performed.

Define a threshold value - any data cell >= threshold will be transformed to "1";

- all other data cells will be transformed to "0".

Clustering is then performed with the temporarily binarised data.

After clustering, the "normal" continuous data are restored.
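The temporary binarisation step can be sketched as (threshold value and names are illustrative):

```python
def binarize(matrix, threshold):
    """Temporary binarisation: cell >= threshold -> 1, otherwise 0."""
    return [[1 if cell >= threshold else 0 for cell in row] for row in matrix]

data = [[0.2, 1.7, 3.4],
        [2.1, 0.0, 1.0]]
binary = binarize(data, 1.0)   # [[0, 1, 1], [1, 0, 1]]
```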

Use

Assume two gene / condition vectors x_i and x_j:

x_i = ( 1,1,0,1,0,0,1,1,0,1,0,1 )

x_j = ( 1,0,1,1,0,1,1,0,0,0,1,1 )

Now we can count:

n_11 | number of positions where both vectors have a "1" ("1"-"1"). In the example: n_11 = 4 |

n_00 | number of positions where both vectors have a "0" ("0"-"0"). In the example: n_00 = 2 |

n_10 | number of positions with a "1" only in the first vector ("1"-"0"). In the example: n_10 = 3 |

n_01 | number of positions with a "1" only in the second vector ("0"-"1"). In the example: n_01 = 3 |
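Counting the four match types, and deriving the Jaccard distance from them with the standard definition J = n_11 / (n_11 + n_10 + n_01) (function names are illustrative):

```python
def binary_counts(xi, xj):
    """Count the four match types between two binary vectors."""
    n11 = sum(a == 1 and b == 1 for a, b in zip(xi, xj))
    n00 = sum(a == 0 and b == 0 for a, b in zip(xi, xj))
    n10 = sum(a == 1 and b == 0 for a, b in zip(xi, xj))
    n01 = sum(a == 0 and b == 1 for a, b in zip(xi, xj))
    return n11, n00, n10, n01

xi = (1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1)
xj = (1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1)
n11, n00, n10, n01 = binary_counts(xi, xj)   # 4, 2, 3, 3

# Jaccard index and distance from the counts
jaccard_index = n11 / (n11 + n10 + n01)      # 4 / 10 = 0.4
jaccard_distance = 1 - jaccard_index         # 0.6
```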

With the count numbers n_11, n_00, n_10 and n_01, the binary distance measures can be computed.

Jaccard index: J = n_11 / (n_11 + n_10 + n_01)

Here: J = 4 / 10 = 0.4

Jaccard distance: D = 1 − J

Here: D = 0.6

Russel-Rao = Dice / 2 (equivalently: Dice = Russel-Rao * 2)


The Phi coefficient gives the same result as the Pearson correlation with binary vectors.
