K-means / -medians Clustering

K-means clustering is an iterative, distance-based clustering method. In contrast to fully unsupervised hierarchical clustering, k-means clustering requires one key parameter: K, the number of clusters to search for.
The goal is to divide the objects (genes in the expression matrix) into a user-defined number K of clusters in such a way that all members of a cluster share a common pattern that differs from the patterns of all other clusters.

The user has to define the number of clusters K for which the algorithm shall try to build clusters.
Often the number of clusters can be guessed from a hypothesis, but in most cases it is hard to determine the correct number of clusters, even for well-known problems:
Define the number of clusters for a deck of playing cards

Regardless of the internal structure of the data, k-means clustering will always try to dissect the data into the requested number of K clusters.


One way to resolve this dilemma may be the Silhouette method (see below).

 

Basic algorithm:

Initialize the cluster centroids with random values
Iterate:
  1. Compute the distance of each feature to all cluster centroids
  2. Assign each feature to the cluster with the smallest feature-centroid distance
  3. Recompute new centroids from all member features
  4. If the STOP criterion is reached - finalize;
     otherwise continue with step 1.
The overall computational burden is proportional to the number of features (O(n)) and thus much smaller than for hierarchical clustering (O(n²)).
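
The loop above can be sketched in a few lines of Python/NumPy. This is a minimal illustration of the described algorithm, not the tool's actual implementation; it assumes Euclidean distances, arithmetic-mean centroids, and "no feature changed its cluster" as the stop criterion:

    import numpy as np

    def kmeans(data, k, max_iter=100, seed=None):
        # data: features (genes) in rows, samples in columns
        rng = np.random.default_rng(seed)
        # initialize centroids with randomly chosen rows of the data matrix ("hot deck")
        centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iter):
            # 1. distance of each feature to all cluster centroids
            dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
            # 2. assign each feature to the nearest centroid
            new_labels = dists.argmin(axis=1)
            # 4. stop criterion: no feature was exchanged between clusters
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # 3. recompute centroids from the member features
            for j in range(k):
                members = data[labels == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        return labels, centroids

For the k-medians variant mentioned in the title, replace the mean in the centroid update with np.median(members, axis=0).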

Imagine a dataset which shall be divided into two clusters:


Randomly assign centroids to the two clusters:


1. Compute distances from each gene to all cluster centroids, and assign each gene to the cluster whose centroid distance is minimal:


Until all genes are uniquely assigned to one of the k clusters:


2. Re-compute cluster centroids with new cluster members:
3. Reassign genes to new cluster centroids








Parameters

Having selected KMC by clicking the KMC button (or from Main menu | Analyse | KMC), the Group Selection / Parameter dialog pops up.

On the Groups tab sheet, the hybridisations used for a KMC analysis may be selected:

On the Parameters tab sheet, basic computation parameters may be set:

Number of clusters  K, the number of clusters to build; default value = 5
Init cluster  Method how to initialize the cluster centroids:
 - Random: random values taken from the data matrix ("hot deck") are assigned to the centroids
 - Regular: "wave" patterns scaled to the data in the matrix
Centroid computation  The method how to compute the cluster centroids from the member genes:
 - Arithmetic mean
 - Median
Distance metric  The method how to compute centroid <=> feature distances:
 - Euclidean distance
 - Correlation distance
Same metrics as used for hierarchical clustering.
Stop criterion  Stop criterion 1:
 - Count the number of features exchanged between the clusters during one iteration.
As soon as this number is small enough, KMC stops.
Maximum number of iterations  Stop criterion 2:
Maximum number of iterations of "Assign genes / Re-compute centroids".
Useful in cases where KMC does not converge, i.e. there is no preferable gene => cluster assignment.
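
As a sketch of how the dialog's choices map onto the algorithm (hypothetical helper functions, not the tool's code; the correlation distance assumes profiles with non-zero variance):

    import numpy as np

    def feature_centroid_distances(data, centroids, metric="euclidean"):
        if metric == "euclidean":
            return np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        # correlation distance: 1 - Pearson correlation of profile and centroid
        d = data - data.mean(axis=1, keepdims=True)
        c = centroids - centroids.mean(axis=1, keepdims=True)
        denom = np.linalg.norm(d, axis=1)[:, None] * np.linalg.norm(c, axis=1)[None, :]
        return 1.0 - (d @ c.T) / denom

    def compute_centroid(members, method="mean"):
        # "Arithmetic mean" gives k-means, "Median" gives k-medians
        return members.mean(axis=0) if method == "mean" else np.median(members, axis=0)

    # stop criterion 1: number of features exchanged between clusters
    # exchanged = np.count_nonzero(new_labels != labels)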


Data views

After the analysis, a new branch is added to the Analysis tree:

As usual, gene lists, sub-expression matrices, gene/centroid profiles, or heat maps for the respective cluster may be displayed by clicking the respective tree item.

Convergence table:

The convergence table displays graphically how fast each single cluster, as well as the whole KMC analysis, converges to its final classification:












Silhouette

One way to estimate clustering success is to analyse the distances between features that are members of the same cluster (INTRA-cluster distances ~ signal) versus the distances between features from different clusters (BETWEEN-cluster distances ~ noise).

Wikipedia explains:

Silhouette refers to a method of interpretation and validation of consistency within clusters of data. It was proposed by Belgian statistician Peter Rousseeuw in 1987.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
The method in brief:

For each feature Fi (gene) a Silhouette score is computed.
The WITHIN-cluster distance ai is computed as the average of all distances between feature Fi and the features forming the SAME cluster to which Fi belongs.
The BETWEEN-cluster distance bi is computed as the average of all distances between feature Fi and the features belonging to all other clusters.



The Silhouette score is the normalized difference between BETWEEN and WITHIN average distances:

si = (bi - ai) / max(ai, bi)


It indicates whether Fi fits nicely into its cluster ( si ~ 1 ), could fit anywhere ( si ~ 0 ), or does not belong to its cluster at all ( si ~ -1 ).

Next, an average of the Silhouette scores of all features of a cluster is derived, indicating how tightly the members of a cluster are matched (very well ~ 1, very badly ~ -1).

Finally, the Silhouette Coefficient (SC) is computed as the maximum of all cluster Silhouette score averages.
A high SC indicates that the clustering is able to reflect the real structure in the data,
whereas a small SC indicates that too few - or too many - clusters were generated by e.g. k-means;
in other words, the number k was not optimally selected.
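
A short sketch of this computation with scikit-learn's silhouette_samples (it follows the definition above and takes the maximum of the per-cluster averages, whereas scikit-learn's own silhouette_score would return the mean over all features):

    import numpy as np
    from sklearn.metrics import silhouette_samples

    def silhouette_coefficient(X, labels, metric="euclidean"):
        si = silhouette_samples(X, labels, metric=metric)   # score si per feature
        cluster_avg = {c: si[labels == c].mean()            # average per cluster
                       for c in np.unique(labels)}
        return max(cluster_avg.values())                    # SC as defined above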

Thus we can run multiple k-means clusterings with increasing k (e.g. k=2, k=3, ..., k=10) and compute the corresponding SC.

The k at which SC becomes maximal suggests the best clustering (depending on the other clustering parameters, e.g. distance metric, ...).
Nevertheless, if even the best SC is << 1, the clustering is probably meaningless.
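
Such a scan could look as follows with scikit-learn (silhouette_score averages over all features, so its values may differ slightly from the SC defined above):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(200, 8)   # stand-in for the expression matrix
    for k in range(2, 11):       # k = 2 .. 10
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))
    # the k with the highest score suggests the best clustering;
    # if even that score is << 1, the clustering is probably meaningless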

Analysis progress is shown in the Log tab sheet:



The final table summarizes:
  1. column - k - number of clusters
  2. column - Silhouette Coefficient
  3. column - number of KMC cycles to reach stable clustering


The highest Silhouette Coefficient is found at k=11, suggesting 11 clusters as the best-suited value of k, given the selected distance metric, initialization, mean computation, ...





There are alternative suggestions for how to compute cluster validity indices.

Calinski-Harabasz index

This index is computed as:

CH = ( B / (k-1) ) / ( (W1 + W2 + ... + Wk) / (N-k) )

with: N = number of features (genes), k = number of clusters, B = between-cluster distance, Wk = within-cluster distance for cluster k

The larger the index, the better the clustering.
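
scikit-learn implements the standard form of this index as calinski_harabasz_score; a minimal check of a clustering result (X and labels as in the silhouette scan above):

    from sklearn.metrics import calinski_harabasz_score

    ch = calinski_harabasz_score(X, labels)   # ratio of between- to within-cluster dispersion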



Dunn index

Dunn index = (minimum cluster-cluster centroid distance) / (maximum intra-cluster feature-feature distance)

The larger the Dunn index, the better the clustering.
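
scikit-learn has no ready-made Dunn index; a small SciPy-based sketch of the definition above (hypothetical helper; it assumes every cluster has at least two members):

    import numpy as np
    from scipy.spatial.distance import cdist, pdist

    def dunn_index(X, labels, centroids):
        between = cdist(centroids, centroids)        # cluster-cluster centroid distances
        np.fill_diagonal(between, np.inf)            # ignore the zero diagonal
        diameter = max(pdist(X[labels == c]).max()   # largest intra-cluster distance
                       for c in np.unique(labels))
        return between.min() / diameter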



Davies-Bouldin Index

For each cluster i we find the maximum separation Di to any other cluster j (i, j = 1..k, i ≠ j) as

Di = max over j ( (Intra-Cluster-Distance i + Intra-Cluster-Distance j) / Between-Cluster-Distance i,j )

Davies-Bouldin index = Average( Di )

The smaller the Davies-Bouldin index, the better the clustering.
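
scikit-learn provides this index as davies_bouldin_score, which follows the standard definition (using the mean distance of the cluster members to their centroid as the intra-cluster distance):

    from sklearn.metrics import davies_bouldin_score

    db = davies_bouldin_score(X, labels)   # smaller values indicate better-separated clusters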