Clustering

Agglomerative hierarchical clustering is a distance-based method for finding structure in data sets.

In our case, clusters are built from genes (or hybridizations), in such a way that genes (hybridizations) with the highest similarity - or, equivalently, the smallest distance - are grouped together into clusters.

The aim is to organize all gene vectors (hybridization vectors) into a binary tree (dendrogram).

The clustering process:

Initialize a distance matrix by computing all vector-vector distances between all objects (genes/hybridizations).
Iteratively merge: find the pair of clusters with the smallest distance, join them into a new cluster, and recompute its distances to all remaining clusters. Repeat this iteration until only one cluster - containing all objects - is left over.
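As an illustration only - this is not SUMO's internal code - the same agglomerative procedure can be sketched with SciPy. The example matrix, the Manhattan metric and average linkage are arbitrary choices matching the simple example mentioned further below:

  import numpy as np
  from scipy.cluster.hierarchy import linkage, dendrogram
  from scipy.spatial.distance import pdist

  # 6 example "genes" with two expression values each
  expr = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
                   [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

  dists = pdist(expr, metric="cityblock")   # step 1: all pairwise Manhattan distances
  tree = linkage(dists, method="average")   # step 2: iterative merging, average linkage
  dendrogram(tree, no_plot=True)            # layout of the binary tree (dendrogram)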

The clustering process is memory and time intensive. The distance matrix for 10000 genes requires ~400 MByte of RAM; for 30000 genes, ~3 GigaByte are required.
Clustering 1000 genes is done in a few seconds, 10000 genes require several minutes, and 30000 genes take hours.

A key point in clustering is how to measure distances/similarities between the individual data vectors.

Different distance measures may generate quite different cluster results.
Thus, it is important to select the distance metric which is most adequate for the specific data set and question.

See below for the distance metrics supported by SUMO.

See a very simple example illustrating the overall clustering process.
(6 "genes" with two expression values each, applying Manhattan distance metric and average linkage)






Hierarchical clustering with SUMO

Select Cluster from the main menu:


Cluster

The cluster dialog opens up:

Define the parameters used for clustering:

Cluster

Select whether to cluster Genes and/or Conditions (= Hybridizations).
Clustering may be performed in one go, or you can first cluster genes, then cluster conditions with a different distance metric or linkage.







Distance metric

Several distance metrics may be selected. See below for more details about the metrics.


 

Linkage

At present, three linkage methods may be selected (Group average does not work yet). See below for more details about the linkage methods.


 

OK-Cluster button

Runs the clustering algorithm with the specified parameters.


 

A progress indicator box will show up.

The lower progress indicator bar shows the main processing steps.

The upper progress indicator bar shows the progress of each of the (up to) four steps.

After clustering, the heatmap view will update and show the computed trees:








Distance Metric

A key question in clustering is how to measure similarity / distance between data vectors (a gene's expression profile over multiple conditions, or a condition's profile over multiple genes).

For this, one can use the mathematical concept of a metric.
A metric on a set is a function (the distance function d(x,y) between two vectors x and y) with the following fundamental properties:

1. d(x,y) ≥ 0                        non-negativity
2. d(x,y) = d(y,x)                   symmetry
3. d(x,x) = 0                        identity
4. d(x,y) = 0 if and only if x = y   definiteness
5. d(x,z) ≤ d(x,y) + d(y,z)          triangle inequality

Any function fulfilling the above properties can be used as a distance measure.

Commonly used methods are described in the following sections.

Different distance metrics measure different properties of the data.
As a result, clustering with different distance metrics may generate completely different trees.
Which distance metric to use depends on the question you want your clustering to answer.




Geometric distances

Minkowski distance

The Minkowski distance is a generalized distance metric in Euclidean space.

Assume two vectors P, Q with dimension n:

P = (P1, P2, ..., Pn),   Q = (Q1, Q2, ..., Qn)

The Minkowski distance of order p is defined as:

Dp(P,Q) = ( Σ |Pi - Qi|^p )^(1/p)     (sum over i = 1 ... n)

Most commonly used are the orders p = 1, 2 and infinity.

Let's look at a very simple example in two-dimensional space (n = 2), with P = (P1, P2) and Q = (Q1, Q2):

Manhattan distance (p = 1):         D1 = |P1 - Q1| + |P2 - Q2|

Euclidean distance (p = 2):         D2 = sqrt( (P1 - Q1)² + (P2 - Q2)² )

Chebyshev distance (p = infinity):  D∞ = max( |P1 - Q1| , |P2 - Q2| )
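As a small sketch, these special cases can be computed with SciPy's distance functions; the vectors P and Q below are arbitrary example values:

  import numpy as np
  from scipy.spatial.distance import cityblock, euclidean, chebyshev, minkowski

  P = np.array([1.0, 2.0])
  Q = np.array([4.0, 6.0])

  print(cityblock(P, Q))       # Manhattan (p=1):   |1-4| + |2-6|       = 7.0
  print(euclidean(P, Q))       # Euclidean (p=2):   sqrt(3² + 4²)       = 5.0
  print(chebyshev(P, Q))       # Chebyshev (p=inf): max(|1-4|, |2-6|)   = 4.0
  print(minkowski(P, Q, p=3))  # general Minkowski distance of order 3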





Distance metrics similar to the Manhattan distance:

Gower distance

A variant of the Manhattan distance, divided by the vector dimension:

DGower = (1/r) * Σ |Pi - Qi|      with r = n

Lorentzian distance



Canberra distance

A variant of the Manhattan distance, where each component is normalized to its "length":

DCanberra = Σ |Pi - Qi| / ( |Pi| + |Qi| )

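A brief sketch using SciPy's built-in Canberra distance (the vectors are example values; whether SciPy's definition matches SUMO's implementation exactly is an assumption):

  import numpy as np
  from scipy.spatial.distance import canberra

  P = np.array([4.0, 2.0, 10.0, 6.0, 8.0])
  Q = np.array([5.0, 3.0, 42.0, 20.0, 40.0])

  # sum over i of |Pi - Qi| / (|Pi| + |Qi|)
  print(canberra(P, Q))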

Bray-Curtis / Sorensen / Soergel / Czekanowski distance






Intersection distance






Penrose shape distance






Meehl distance






Hellinger distance






Clark distance








Correlation distance

A correlation coefficient is probably more familiar from (linear) regression.

Let's take our two vectors P and Q and plot their components pairwise - (P1,Q1), (P2,Q2), (P3,Q3), ... - in a 2-d scatter plot:

If the shapes of the data vectors are similar, the data points should be closely scattered around a line (linear regression).
E.g. the two genes (P and Q) are always expressed / regulated in the same way (up / down) in each single hybridization.

Now we can test whether there is a certain dependence (= correlation) between the two data vectors:




Pearson correlation coefficient

We can test how well the components lie on a line by computing the Pearson correlation coefficient:

r = Σ (Pi - mean(P)) * (Qi - mean(Q)) / ( sqrt( Σ (Pi - mean(P))² ) * sqrt( Σ (Qi - mean(Q))² ) )

with mean(P) and mean(Q) the arithmetic means of the components of P and Q.
The values of r range from -1 to +1:
 1 = identical shape of the vectors
 0 = no similarity in the vectors' shape
-1 = completely opposite shape of the vectors
To transform the Pearson correlation coefficient into a distance we compute:

dPC = 1 - r
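A minimal sketch of this transformation in Python (the vectors are example values, the same ones used in the Spearman example further below):

  import numpy as np
  from scipy.stats import pearsonr

  P = np.array([4.0, 2.0, 10.0, 6.0, 8.0])
  Q = np.array([5.0, 3.0, 42.0, 20.0, 40.0])

  r, _ = pearsonr(P, Q)   # Pearson correlation coefficient, here ~0.96
  d_pc = 1 - r            # Pearson correlation distance dPC
  print(r, d_pc)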

All three examples show pairs of nicely correlated vectors:

The picture below (taken from Wikipedia) illustrates the behavior of the Pearson correlation in more detail:


Several sets of (x, y) points, with the correlation coefficient of x and y for each set.
Note that the correlation reflects the
  • noisiness and direction of a linear relationship (top row),
  • but not the slope of that relationship (middle),
  • nor many aspects of nonlinear relationships (bottom).
N.B.: the figure in the center has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.




Pearson Jackknife correlation coefficient





Spearman's rank correlation coefficient

As discussed above, the Pearson correlation can only give r = 1 if the components of both vectors are related by a linear function.
If the data vectors form, e.g., a perfect banana- or sigmoid-shaped scatter plot, the Pearson correlation coefficient will return values r < 1.

Spearman's rank correlation is a non-parametric measure of statistical dependence and tests whether the relation between the two vectors may be described by a monotonic function.

Data vectors are transformed into ranked variables.

Assume two vectors:

P:  4   2  10   6   8
Q:  5   3  42  20  40

Pearson correlation coefficient: r ~ 0.96

Let's rank both vectors (i.e. sort by magnitude):

P:         4   2  10   6   8
Ranks(P):  2   1   5   3   4

Q:         5   3  42  20  40
Ranks(Q):  2   1   5   3   4

With the new rank vectors we compute the Pearson correlation coefficient:

Ranks(P):  2   1   5   3   4
Ranks(Q):  2   1   5   3   4

Spearman's rank correlation coefficient = 1

In the present version, SUMO does not handle ties (= identical numerical values for multiple vector components), as they are unlikely to appear in quasi-continuous floating point expression intensities or ratios.
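A minimal sketch of the same computation with SciPy, using the example vectors from above:

  import numpy as np
  from scipy.stats import spearmanr

  P = np.array([4, 2, 10, 6, 8])
  Q = np.array([5, 3, 42, 20, 40])

  rho, _ = spearmanr(P, Q)   # rank both vectors, then correlate the ranks -> 1.0
  print(rho)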




Kendall's Tau correlation

The Kendall tau correlation coefficient measures the association between two data sets in an even more general way.
This non-parametric test measures the rank correlation, i.e. the similarity in the order of the data after ranking (= sorting).

Assume two data vectors:

P:  4   2  10   6   8
Q:  5  20  42  40   3

Let's rank both vectors:

P:         4   2  10   6   8
Ranks(P):  2   1   5   3   4
Q:         5  20  42  40   3
Ranks(Q):  2   3   5   4   1

Now we analyze the ranks:

            A   B   C   D   E
Ranks(P):   2   1   5   3   4
Ranks(Q):   2   3   5   4   1

Let's analyze each possible pair of data values from vectors P and Q:

Pair   Ranks(P)   Ranks(Q)   concordant   discordant
A-B    2 > 1      2 < 3                   X
A-C    2 < 5      2 < 5      X
A-D    2 < 3      2 < 4      X
A-E    2 < 4      2 > 1                   X
B-C    1 < 5      3 < 5      X
B-D    1 < 3      3 < 4      X
B-E    1 < 4      3 > 1                   X
C-D    5 > 3      5 > 4      X
C-E    5 > 4      5 > 1      X
D-E    3 < 4      4 > 1                   X

Sum                          6            4

4 data pairs have inverted order, thus the Kendall tau distance is 4.
Normalized to the number of pairs (10), the Kendall tau distance is 0.4, indicating low association between the rankings.

Similarly, we compute the correlation coefficient:

τ = (nc - nd) / ( n*(n-1)/2 )

with   nc = number of concordant pairs
       nd = number of discordant pairs
       n  = dimension of the data vectors

In the example τ = (6 - 4) / 10 = 0.2, indicating nearly no correlation.

In the present version, SUMO does not handle ties (= identical numerical values for multiple vector components), as they are unlikely to appear in quasi-continuous floating point expression intensities or ratios.
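The same result can be reproduced with SciPy (example vectors from above):

  import numpy as np
  from scipy.stats import kendalltau

  P = np.array([4, 2, 10, 6, 8])
  Q = np.array([5, 20, 42, 40, 3])

  tau, _ = kendalltau(P, Q)   # (nc - nd) / (n*(n-1)/2) = (6 - 4)/10 = 0.2
  print(tau)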




Cosine distance

The cosine distance measures the angle between the two vectors P and Q.
The formula and result are very similar to the correlation distance.

cos(θ) = Σ Pi*Qi / ( sqrt(Σ Pi²) * sqrt(Σ Qi²) )
DCosine = 1 - cos(θ)
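A small numpy sketch of this computation (example vectors; the 1 - cos(θ) distance convention follows the correlation-distance pattern above and is an assumption about SUMO's exact definition):

  import numpy as np

  P = np.array([4.0, 2.0, 10.0, 6.0, 8.0])
  Q = np.array([5.0, 3.0, 42.0, 20.0, 40.0])

  cos_theta = np.dot(P, Q) / (np.linalg.norm(P) * np.linalg.norm(Q))
  d_cos = 1 - cos_theta    # cosine distance, analogous to dPC = 1 - r
  print(cos_theta, d_cos)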






"Dot product"

Minkowski distances are sensitive to the length of the vectors, correlation distances to the angle between the vectors.

An alternative measure could combine both pieces of information.

The dot product between two vectors is defined as:

P · Q = Σ Pi*Qi = |P| * |Q| * cos(θ)

Unfortunately, the dot product does not fulfill the requirements of a distance metric:
the dot product of two identical vectors is > 0, while the dot product of two orthogonal vectors is 0.

But we can use a modified "dot product", where we replace the cosine function with a sine:

D = |P| * |Q| * sin(θ)





Binary distances


Special metrics for binary data - a bit unconventional for gene expression data.
For these metrics, the data should contain only "1" and "0" values - binary data.
Continuous data vs. binary distance metrics: before clustering with a binary distance, SUMO checks the data.
It is advisable to permanently transform continuous expression data into binary (1 or 0) values before clustering with a binary distance metric.
Use SUMO Main menu | Adjust data | Data transformation | Binarization


Assume two gene / condition vectors xi and xj with p elements (in the example p=12 elements):

xi = ( 1,1,0,1,0,0,1,1,0,1,0,1 )
xj = ( 1,0,1,1,0,1,1,0,0,0,1,1 )

Now we can count:

n11   number of elements existing in both vectors ("1" - "1").

In the example:
xi = ( 1,1,0,1,0,0,1,1,0,1,0,1 )
xj = ( 1,0,1,1,0,1,1,0,0,0,1,1 )
n11= 4

n00   number of elements missing in both vectors ("0" - "0").

In the example:
xi = ( 1,1,0,1,0,0,1,1,0,1,0,1 )
xj = ( 1,0,1,1,0,1,1,0,0,0,1,1)
n00= 2

n10   number of elements only existing in the first vector ("1" - "0").

In the example:
xi = ( 1,1,0,1,0,0,1,1,0,1,0,1 )
xj = ( 1,0,1,1,0,1,1,0,0,0,1,1 )
n10= 3

n01   number of elements only existing in the second vector ("0" - "1").

In the example:
xi = ( 1,1,0,1,0,0,1,1,0,1,0,1 )
xj = ( 1,0,1,1,0,1,1,0,0,0,1,1 )
n01= 3

With the count numbers n11, n00, n01, n10 we can compute several similarity / distance measures:
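A minimal sketch of this counting step, using the example vectors xi and xj from above:

  import numpy as np

  xi = np.array([1,1,0,1,0,0,1,1,0,1,0,1])
  xj = np.array([1,0,1,1,0,1,1,0,0,0,1,1])

  n11 = int(np.sum((xi == 1) & (xj == 1)))   # common "1"s           -> 4
  n00 = int(np.sum((xi == 0) & (xj == 0)))   # common "0"s           -> 2
  n10 = int(np.sum((xi == 1) & (xj == 0)))   # only in first vector  -> 3
  n01 = int(np.sum((xi == 0) & (xj == 1)))   # only in second vector -> 3
  p = xi.size                                # vector length         -> 12
  print(n11, n00, n10, n01, p)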




Jaccard distance

Jaccard Index (similarity):

JI = n11 / ( n11 + n10 + n01 )

here: JI = 4 / (4+3+3) = 0.4

Jaccard distance:

DJaccard = 1 - JI

Here: DJaccard = 1-0.4 = 0.6




Hamming distance

Count the number of components in which the two vectors differ:

DHamming = n10 + n01

Here: DHamming = 3 + 3 = 6




Simple Matching Distance

Normalized Hamming distance.

DSimpleMatching = ( n10 + n01 ) / p
Here: DSimpleMatching = ( 3+3 ) / 12 = 0.5
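Using the counts from the example above, the three distances defined so far can be computed directly (plain Python sketch):

  # counts from the example vectors xi, xj above
  n11, n00, n10, n01, p = 4, 2, 3, 3, 12

  d_jaccard = 1 - n11 / (n11 + n10 + n01)    # 1 - 4/10 = 0.6
  d_hamming = n10 + n01                      # 3 + 3    = 6
  d_simple_matching = (n10 + n01) / p        # 6/12     = 0.5
  print(d_jaccard, d_hamming, d_simple_matching)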



Russel-Rao

Normalized count of positive matches.

DRussel-Rao = 1 - n11 / p
Here: DRussel-Rao = 1 - 4/12 = 0.667
(DRussel-Rao = DDice / 2, with the Dice distance as defined below.)




Dice

Normalized count of positive matches; twice the Russel-Rao distance (DDice = 2 * DRussel-Rao).

DDice = 2 - 2*n11 / p
Here: DDice = 2 - 2*4/12 = 1.333




Tanimoto

A variant that also counts common absences (n00) and gives double weight to mismatches.

DTanimoto = 1 - (n11 + n00) / (n00 + 2*(n01+n10) + n11)
Here: DTanimoto = 1 - (4+2)/(2 + 2*(3+3) + 4) = 1 - 6/18 = 0.667




Braun

DBraun = 1 - n11 / max( (n11+n01), (n11+n10) )
Here: DBraun = 1 - 4/7 = 0.429




Kulczynski

Kulczynski-similarity = n11 / (n01 + n10)

DKulczynski = 1 - Kulczynski-similarity
Here: DKulczynski = 1 - 4/6 = 0.333




Simpson

Simpson-similarity = n11 / min( (n11+n01), (n11+n10) )

DSimpson = 1 - Simpson-similarity

Here: DSimpson = 1 - 4/7 = 0.429




Kappa

Kappa-similarity = 1 / ( 1 + n*( (n01+n10) / ( 2*(n00+n11) - (n01*n10) ) ) )

DKappa = (1 - Kappa-similarity) / 2

Here: DKappa = (1 - 1/(1 + 12*(6/3))) / 2 = (1 - 0.04) / 2 = 0.48




Hamann

Hamann-similarity = ( (n11+n00) - (n01+n10) ) / p

DHamann = (1 - Hamann-similarity) / 2

Here: DHamann = (1 - (6-6)/12) / 2 = 0.5




Ochiai

Ochiai-similarity = n11 / sqrt( (n11+n01) * (n11+n10) )

DOchiai = 1 - Ochiai-similarity

Here: DOchiai = 1 - 4/sqrt(7*7) = 0.429




Sneath

Sneath-similarity = n11 / ( n11 + n01 + n10 )

DSneath = 1 - Sneath-similarity

Here: DSneath = 1 - 4/(4+3+3) = 0.6




Yule

Yule-similarity = ( (n11*n00) - (n01*n10) ) / ( (n11*n00) + (n01*n10) )
DYule = (1 - Yule-similarity) / 2

Here: DYule = (1 - (4*2 - 3*3)/(4*2 + 3*3)) / 2 = (1 + 1/17) / 2 = 0.529




Pearson Phi coefficient

Also called "mean square contingency coefficient" or "Matthews correlation coefficient".

Phi-coefficient = ( (n11*n00) - (n01*n10) ) / sqrt( (n11+n01)*(n11+n10)*(n00+n01)*(n00+n10) )

The Phi coefficient gives the same result as the Pearson correlation applied to binary vectors.

DPhi = (1 - Phi-coefficient) / 2

Here: DPhi = (1 - (4*2 - 3*3)/sqrt((4+3)*(4+3)*(2+3)*(2+3))) / 2 = (1 + 1/35) / 2 = 0.514
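A short numpy sketch (not SUMO code) verifying this equivalence on the example vectors from above:

  import numpy as np

  xi = np.array([1,1,0,1,0,0,1,1,0,1,0,1])
  xj = np.array([1,0,1,1,0,1,1,0,0,0,1,1])

  n11 = np.sum((xi == 1) & (xj == 1))
  n00 = np.sum((xi == 0) & (xj == 0))
  n10 = np.sum((xi == 1) & (xj == 0))
  n01 = np.sum((xi == 0) & (xj == 1))

  phi = (n11*n00 - n01*n10) / np.sqrt((n11+n01)*(n11+n10)*(n00+n01)*(n00+n10))
  r = np.corrcoef(xi, xj)[0, 1]    # Pearson correlation of the binary vectors
  print(phi, r)                    # both ~ -0.0286  ->  DPhi = (1 - phi)/2 ~ 0.514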




Accuracy

Accuracy = (n11+n00) / p

DAccuracy = 1 - Accuracy

Here: DAccuracy = 1 - (4+2)/12 = 0.5




F1-score

F1-score = 2*n11 / (2*n11 + n10 + n01)

DF1-score = 1 - F1-score

Here: DF1-score = 1 - 8/(8+3+3) = 0.429





Linkage