Non negative Matrix Factorization (NMF)

Genome wide molecular analysis (e.g. gene expression ~30000 features), mehthylation (~450000 features)of hundreds samples results in huge datasets wich are not easy to inspect or interpret.
Dimensionalty reduction techniques may help to simplify data interpretaion while keeping the underlying patterns / forces.

Non negative Matrix Factorization (NMF) tries to express the complex source matrix as a product of two matrices with much lower dimensionalty:


Wi,j = Wi,r x Hr,j

With R= Rank of the factorization (r<<i and r<&t;j).

E.g.: Factorizing a complete geneExpression data set with r=2 would operate with matrices like:
Dimensionalty of V: ~30000x500
Dimensionalty of W: ~30000x2
Dimensionalty of H: ~2x500

The property on non negativity allows to apply a straight forward updating-rule based algorithm to find a solution for the factorization.
Algorithms for Non-negative Matrix Factorization Daniel D. Lee, H. Sebastian Seung
Advances in Neural Information Processing Systems 13 edited by T.K. Leen and T.G. Dietterich and V. Tresp.
proceedings from the conference, "Neural Information Processing Systems 2000.

In an iterative way, W and H are recomputed.
Due to the update function and non-negative values, the iteration will converge to a stable solution of NMF.
A cost function is used to estimate the monotonic decreasing difference between V and W x H after each iteration.

The extracted simplified matrices W and H can be used to assign conditions (samples) to classes defined by the derived meta-features (genes) in H, as well as to assign features (genes) to classes defined by derived meta-conditons (samples in) W.

The high dimensonalty of source and even result data, lead to the fact, that multiple (similar) approximative solutions for the factorization exist.

The most suited assignment of conditions/features may be found when running multiple randomly initialized factorization runs.
From all runs the pairwise co-assignments of conditions/features is collected in a consensus similarity matrix.
The matrix may be sorted (clustered) to visualize underlying structure of the factorization.


Like with any other clustering approach, remove data which are not supportive:
Best, reduce the starting data matrix to a few thousand rows/columns to enable fast and efficient analysis.






NMF with SUMO

From the main tool bar select Clustering | NMF | NMF:


a parameter dialog opens up.
Define how to perform NMF:



MethodDefine how the iterative is performed.
1=Lee, Seung: Multiplicative update with Educledean distance as cost function
2=Lee, Seung: Multiplicative update with Kullback-Leibler divergence as cost function
...

RankNumber of factors for Conditions/features
Should be a small number (2..10..)
Give a range (e.g. "2,5") to run 4 complete rounds of NMF.
Starting with NMF rank=2, followed by NMF rank=3, ....
Result files are generated with a tag defining the rank (e.g. *_r2.txt for all results from NMF with rank=2)
Max. iterationsMaximumn number of iterations to perform in a single NMF run
In case the NMF problem can not be solved the process might run infinitely. After user defned maximumn number of iteratons the update process is interrupted independent of convergence.

Convergence radiusAfter each update cycle, the relative change of the cost function is computed.
The cost function should assymptotically approach Zero (but probably never reach it).
Define the maximal accepted change (in %) of the cost funcion to stop the updating, harvesting a close approximation to a solution of NMF.
As smaller the convergence radus as better the approximated solution, and as more update loops are required

Random seedW and H are initialized with random numbers.
Define a fixed start value to ensure the same sequence of random numbers.
This may be hepful, to compare diffrent NMF runs with varying parameters.
Leave the field empty for completely random sequences.

Random cyclesNumber of NMF cycles with different randomly initialized W and H matrices.
Condition/Feature associations to factors are collected and will be used to build a consensus similarity matrix.

Build consensus matrix
C: only for the conditons
F: only for features
CF: both
empty: none

Factor asignments for condtitons/features are collected after each random cylce and saved in the condition-/feature factor file.
Additionally you may build the consensus similartity matrix and save it (condition & feature consensus matrix).
Building of consensus matrices becomes time consuming (dim > 10000) and requires a lot of memory which may crash the program (dim > 35000).
Thus, you may only build one (the smaller) matrix, or build the matrix later, independently reusing the condition-/feature factor files.

Show consensus matrix
C: only for the conditons
F: only for features
CF: both
empty: none

Having build the consensus similarity matrix, you may immediately show data in a heatmap viewer.
With matrices dim > 20000 this may crash SUMO.
FilesBase names for result files.
SUMO will use the base names to store results from the different randomization rounds at the specified location.
(e.g. "d:\nmf_Condition_cm_r1.txt" as consensus similarity matrix for ramdom round 1.


Click OK-button to run NMF.

Specified number of Random cycles NMF are performned. For each rond, W and H matrix are intitalized with a new set of random numbers (in the same numerical range as data matrix V).
Update iterations are run until relative change of cost function falls belowe user defined convergence radius or user defined max number of iterations is reached.
Next all Conditons are assigned to feature factors:
For each conditon, all "meta features" are checked.
The respective condition is assign to the meta-feature with highest numerical values.
Same procedure is applied to assign features to meta conditions.
Conditon/feature asignment to factors is saved using user defined file names.
If requested, consensus similarity matrices are generated and saved.
If requersted, consensus matrices are sown as heatmap views.

Computation progress is shown and summarized in Log tab-sheet and status bar:



For each random round, basic results are shown:
#Number of random round
(e.g. 1=first round)
nNumber of iterations in this round.
(e.g. 94 iterations for round 1, if n equals user defined number of max iterations, NMF didn't reach user defined accuracy. May be adjust parameters and run again.
DeltaValue of cost functions at last performed round.
Delta=abs(Costfunction[i-1]-Costfunction[i];
Delta %Relative change of cost function between secod last and last random round.
Delta%=(Delta / CostFunction[i-1])*100; Here, cost function changed between last two random rounds by ~0.097%.
This is smaller user defined convergence radius => stop iterative update scheme and finalize randoum round 1.

Next name of created result files and some timings are shown.


Timings / performance:
NMF update scheme runs in linear time and linear memory consumption.
Consensus matrices require sqared time and memory consumption.

Some Timings (I7, 1 thread, 2.6GHz, rank=5, 100 random cycles, delta%=0.1):
Features
(Rows)
Conditions
(Columns)
of iterations t NMF (s) Condition matrix
t (s)
Save C-CM
t (s)
Feature matrix
t (s)
Save F-CM
t (s)
Total
t (s)
1001001200.0470.0150.1870.0160.01614.820
20010010524.1340.0470.0150.0160.01624.695
50010010155.4580.0470.0160.0310.04756.004
1000100106115.1130.0630.0150.1090.156115.878
2000100107231.5680.0630.0160.8730.578233.643
5000100104565.0200.1090.0156.4893.526576.080
100001001031432.1830.1720.01530.76314.3371480.060
150001001042094.5630.2190.01575.25531.8872207.351