Classification and Prediction

A challenge might be to classify uncharacterized data based on gene expression data.

Imagine you have two well stratified groups of samples with a clear phenotype.
E.g. Two groups of patients with Long vs. Short survival.
A set of genes differentially transcribed/expressed/methylated between the two groups might be used to predict the outcome of a not yet characterized new sample.

Gene sets may be defined as highly abundant, low noise, best scoring genes from e.g. statistical class test, K-emans clusterng, ...

With SUMO you may perform 2-class classification and prediction, fr most test with muliple classes, too.

SUMO supports a few classificatin methods / algorithms:

LDA - Linear Discriminant Analysis
KNN - K Nearest Neighbours
CC - Closest Centroid
LDA - Artificial Neural Network
SVM - Support Vector Machines
RF - Random Forests

Prepare an expression matrix containing:

Training set: 2 groups (or more) of samples to be used as classifier
(obviously there should be enough replicates)
Test set: samples which shall be predicted and characterized as: similar to class 1 or class 2

Click the Classification button.
Select the desired method from the pull-down menu:

Analysis

Define the (two) training groups.
Build the predictor and test it with the training set.

In the optimal case, all training samples should be correctly assigned to their respective classes.
If not, try to adjust parameters or test another method.

Analysis & Prediction

Define the training groups and an additional group of unknown samples which shall be assigned.

Predict

Some methods allow to save the "classifier" derived from the training set.
Here you can load a previously created predictor and use it directly to classify unknown samples.
Obviously, the features, order of features, method and parameter shold be identical to those used when buiding the predictior !!

The group selecton dialog opens up.

Assign samples to the respective groups:

Go to the Parameters tab-sheet and set method as well as respective parameters.

Click Run button to perform the Analysis / Prediction

SUMO runs the requested training / prediction tasks and shows results in the Log tabsheet and additional graph viewers.

In Log tab-sheet:

First section: test set-up:

Second section: application specific messages. E.g. Ranger64's training results:

Third section: Application of the just built predictor to the training set:

If a pre-computed predictor was used, the training section does not show up.

Fourth section: Application of the predictor to the Query set:

This section only shows up if Prediction was performed.

Classification graph:

Ths graph shows the class vector of all samples (training + prediction).

Red markers represent vectors assigned to class 1
Blue markers represent vectors assigned to class 2
Empty boxes = training vectors
Solid diamonds = query vectors
horizontal lines indicate custom defined confidence interval

In the example

20 traning sample were used (empty boxes)
all 10 Class-2 samples (blue boxes) were correctly assigned to class in the crossvalidation (positive score)
but 2 Class-2 samples with very low confidence (Score~0)
from the 10 Class-1 training samples, (red-boxes
one was wrongly assigned to Class-2 but with low confidence
all 5 query samples (solid diamonds) were assigned to Class-2 (blue)
but 2 at very low confidence (score~0)

In a perfect training set, all red boxes should show up in the lower part of the graph, wheras all blue boxes shouls show up in the top part of the graph.

SVM and Random forest generate binary values - thus the trival graph is not generated.

LDA - Linear Discriminant Analysis

Discriminant analysis is a statistical technique to classify objects into mutually exclusive and exhaustive groups based on a set of measurable object's features.
Linear Discriminant Analysis (LDA) is used as dimensionality reduction technique for pattern-classification and machine learning applications - similar to Principal Component Analysis (PCA). The goal is to project a dataset onto a lower-dimensional space with good class-separability.

LDA tries to find the axes that maximize the separation between the classes, wheras PCA tries to find the axis which maximize the variance in the data.

LDA is a parametric method assumung unimodal Gaussian likelihoods which are linearly separable.

For significantly non-Gaussian distrbutions, LDA might not generate expected results:

CC - Closest Centroid

CC is an easy to understand straight forward method.

For each of the two traning sets (T1,T2) with multiple replicates for the two penotypes, compute the controids C1,C2.
I.e. component wise the mean for all features (genes) within the sets T1,T2 individually.
Depending of data type and assumend distribution use

arithmetic mean - Gaussion distributed data (linear/log)
median mean - non parametric (non Gaussian) disributed data (linear/log)
outlier insensitive
geometric mean - ratio data

For all new samples, compute the similarity (distance) to all centroids d_i.

Assign the new samples to that class where similarity is largest (distance is smallest).

Distances may be computed as simple geometric distances (Euclidean, Manhattan, Chebyshev, Canberrra, Gowan) or as correlation distances (Pearson, Spearman, Kendall's Tau).

It is also straightforward to normalize the distance of an individual sample to the distance between the centroids, thus allowing to score the quality of the classification for each new sample.

Centroids may be reused for other non-classified samples.

CC uses the information from ALL class members to assign the new samples independant of the internal distribution of training set members.

Thus, CC should give best results for homogenousliy, symmetric datasets - similar to LDA.

KNN - K Nearest Neighbours

KNN computes the distances of the new unclassified sample (X_s) to all members of the training sets.
Next we take the K most similar (smallest distance = neighbouring) vectors, and count their class assignments.
The new sample is assigned to the class with highest count number.

Let's look at a (constructed) example:
Traning set 1 and 2 with vectors a,b,c,d, ... each.
C₁, C₂ the centroids of the training sets (computed with arithmetic mean).
The Black, Green lines represent euclidean distance from the uncharacterized sample X_s to all training vectors:

Table of distances from X_s to all members of the training sets T_1i, T_2i;

Sample:	T₁	T₂
a	2.5	6
b	4	3
c	1	5
d	2.5	2
e	5	4
f	5	1
g	2	1
h	4.5	6
i	4	7
j	4

Lets rank the trainng set vectors by their distances to X_s:

Sample	T_1c	T_2f	T_2g	T_1g	T_2d	T_1d	T_2b	...
Rank	1	2	3	4	5	6	7	...
Distance	1	1	1	2	2	2.5	3	...

For this example we choose K=5.
I.e. we take the 5 traning set vectors with smallest distance (T_1c, T_2f, T_2g, T_1g, T_2d) and count their class membership:

Class	Count
T₁	2
T₂	3

Thus we would assign the new sample to class 2 with KNN classification.

In opposite to Closest Centroid/LDA: there, we would assign the sample to class 1.

Obviously, KNN may be better suited to predict non-homogenious non symetric training data sets.

ANN - Artificial Neural Network

SVM - Support Vector Machines

RF - Random Forests

The original publication summarizes:
"Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large.
The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Freund and Schapire[1996]), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression."

From the original publication:
LEO BREIMAN
Random Forests
Machine Learning, 45, 5-32, 2001, Kluwer Academic Publishers.

SUMO utilizies Ranger64 Version 0.2.7, a standalone C++ implementation of random forests:

On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data
Daniel F. Schwarz, Inke R. König and Andreas Ziegler
Bioinformatics, Volume 26, Issue 14, Pp. 1752-1758.

The basic part of a random forest is a decision tree.

Lets look at a simple example.
A decision tree with low complexity to predict whether an Apple tree will bear fruit.

As input, the tree requires a vector containing information about the attributes of an Apple tree (e.g. age of the tree, strain: natural or variety, rich or poor soil).
Starting with the root node of the decision-making rules of the tree be applied to the input vector.
At each node one of the attributes of the input vector is queried. In the example, at the root node the age of the Apple tree.
The answer decides: old=>no fruit, young=>proceed to next node.
After a series of evaluated rules, you have the answer to the original question.
Not always, all levels (nodes) of the decision tree must be traversed.

How to build a decision tree (automatically) from a suited date set?
E.g. the set of genes best suited to distinguish two phenotpes ?

One method could be a recursive top-down approach.
It's crucail to have a suitable record of reliable empirical data to the decision problem (the training data set). The classification of the target attribute must be known to any object (sample) of the training data set.
In each iteration step, the attribute which best classifies the training samples will be identified and used as the a node in the tree.
This is repeated until all samples from the trainng set are classified.
At the end, a decision tree is built, who describes the experience and knowledge of the training data set in formal rules.

One problem could be, that thew decision tree becomes overfitted - it can exactly and perfectly classify the training set- but fails with other data.

To circumvent this problem a randomisation with subsampling is applied:
A variety of random subsample of attributes as well as of traing samples is used to build multiple decision trees.

Mutliple random decision trees are grown and combined to the random forest.

From this random forest, we can deduce weight factors for the individual attributes and extract an "averaged" decision tree which should be able to classify our class of problems.
(Adopted from Wikidpedia))

Select the respective Random Forest method from SUMO's classification menu.

Define the sample groups according to the seleted method:

Training: 2 or mopre groups groups (Class1, Class2, ...)
Training + Predicition (Class1, Class2, ... + Query)
Prediction with forest : 1 group (Query)
NB: - Query data set should contain the same features in same order as the training set
- Use the same parameters as used in the training

A parameter dialog opens up:

Set the parameters accordingly:
Multiple Random decision trees are grown random selected trraining data subsets and combined.
To regenerate the random training subsets, eg for testing of different parameters - it may be recommended to initialize the random number generator with the same constant value (seed).
Thus the classification differs only depending on parameter settings and not on training subset generation.

Type	Prediction method: Classification into two or more distinct groups Regression to a continuous numeric paramter and preodict outcome Survival - learn sutvival curve from (censored) survivla dat and profeict outcome
RegressionSurvial/censor column ID: - Regressiona: the annotation row ID containing the numerical parameter for regression analysis - Survial: specify spevify comma seaprated: + surval row ID - time to evnet (typical time to death of sample) + censor row ID - defines whether sample experienced event > NOT censored ( value = 0,no,not,false) sample did NOT experience event => sanmple censored (value=1,yes,ja,true)
Number of trees	Number of random trees to grow as more as better as more as more memory and cpu-time required adopt to number of features and samples in trainng set
# vars to split	At each branch of the decision tree, only a part of the variates (features) is used for classification.
Importance measure	Method how to measure influence of the individual features on the overall classifier
Random seed
Gene name column ID	ID of gene annotation column containing unique Gene / Feature names / IDs These IDs show up in the importance file.
View log/importance	Yes/no : Automatically show Ranger64 log and importance file.
Forest file name	Define name and path to forest file generated / used in ther present run Default: "ranger.forest"

Now, Ranger64 is called to build the forest and SUMO harvests the results to the Log-tabsheet - see above.

Several data files are created in the folder where SUMO.exa as well as Ranger64.exe are localized.

The forest
A binary representation of the extracted decision tree.
The forest file may be reused to classify new samples without the time consuming recomputing of the forest again.

New samples for predicition should have:

same number and order of features
same platform (e.g. transcriptome array)
same data normalization / transformation.

It might be not recommended to apply a RF deduced from virtual-pool-normalized/log2-transformed gene expression array ratios to rlog transformed RNASeq counts.

The importance file:
A tab-delimited list of importance values for each feature in the training set.
Each line contains the features name (=> Gene name column ID) and its respective importance values.

High importance values indicate high correlation of the particular feature to the grouping/regresson: they are helpful for classification.
Features with importance value ~0 are not helpful for classification
Features with importance value << 0 are contraproductive

Thus it may be helpful to filter the importance file, as well as traing data set und later test data sets for high impotant features.

The graph shows the sorted importance values from a small gene expression data set (nb. not all gene names are shown on the x-axes).
- the break point at importance ~2
- or the plateau at importance ~1
may indicate positions, where to filter the most valuable features for later classification.

Now you may generate a new RF from the filtered training set and use to predict new filtered test data sets.

Furthermore, the highly important features may be regarded as most singnifcant features of a kind of non-parametric test (class-test/regresson response).

Trouble shooting

In case classification didn't succed or results seems to be weired, view all Ranger64 data files.
Select View last result data from SUMO's classification menu.

All result files are opened with Windows' Notepad.