CO-LOcalisation Mapper (COLOM)

Genome-wide DNA microarrays and Next Generation Sequencing (NGS) make it possible to systematically acquire information about tens to hundreds of thousands of features (e.g. genes, exons, miRNAs, ...) from dozens or hundreds of samples (e.g. cancer patient biopsies).

Typically, samples are grouped according to a specific phenotype, or correlated to an outcome or to the time at which a certain endpoint is reached:
  • healthy tissue vs. tumor tissue
  • normal cells vs. cells with drug treatment 1
  • tumor volume after
  • time to local recurrent tumor after initial treatment
  • ...

A statistical test value is computed for each individual gene, telling how well this individual gene reflects the phenotype:
  • differential expression between two groups
  • signal-to-noise ratio
  • t-value from a Student's t-test
  • p-value from a Student's t-test
  • correlation coefficient from a Pearson / Spearman correlation
  • ...

A list of the most interesting genes (those with the best test values) may then be filtered.
But such a list does not tell us anything about biological (medical) importance or impact.

    One way to explore this is to evaluate how well the regulated / selected genes co-localise on known biological roles or themes. There are different mathematical models to explore such findings, e.g. over-representation (hypergeometric distribution), Gene Set Enrichment Analysis (GSEA) and Simple Enrichment Analysis (SEA).


    Also, you may use different biological themes for testing, e.g. GO-terms, chromosomal bands, KEGG metabolic pathways, miRNA, Transfac, MSig-DB, Wiki Pathways or Reactome.
    In principle, you may use any kind of biological information with COLOM, as long as you can bring it into a form which COLOM expects.
    E.g. if you have hierarchically structured themes, format them as OBO files and use them instead of GO-terms.
    If you have a simple association (list of genes => theme), format it as MSig-DB.
    For more details see:

    Basic idea
    Some mathematical concepts
    COLOM
        Load data
        Data view
        Preferences
        Search
        Copy
        Paste
    Data bases / Data files
        GO-Terms
        Chromosomal bands
        KEGG Metabolic pathways
        miRNA
        Transfac
        MSig-DB
        Wiki Pathways
        Reactome

     

     

     

     

    Basic idea and mathematical concepts

    Over-representation of significant genes - Hypergeometric distribution

    Consider a population of genes (gene-index) representing a diverse set of biological roles or themes (e.g. GO-terms, chromosomal localisation, biochemical pathways, ...) shown below as different colours (e.g. Green=chromosome1, Blue=chromosome2, ...):

    Partition (group) the genes based on expression profiling over multiple conditions (hybridisations) with any algorithm (e.g. 6-class ANOVA, k-Means clustering, ...) into e.g. 6 groups:

    Let's look in detail at the lower right group (Group6):


    Population size (all genes) = 40 genes
    Group size = 12 genes

    Of the 10 genes sharing a common biological role (Chromosome1), 8 occur within Group6.

    Question:
    Is it statistically significant to find this many genes of one biological role within the group?
    Is a biological theme predominantly expressed?

     

    Some mathematical concepts

    Consider gene frequencies:


    10 of the 40 genes in total are found in the "biological role" (Chromosome1) = 25%


    8 of the 12 genes from our Group6 are found in Chromosome1 = 67%

    A 2x2 contingency matrix is typically used to evaluate the relationship between group membership and biological role.

                     Group
                     in     out
    Theme    in       8       2
             out      4      26

    The probability of obtaining exactly these numbers can be calculated with the hypergeometric distribution (Fisher's exact test):

    a     b     a+b
    c     d     c+d
    a+c   b+d   n

    with n = a+b+c+d and

    p = (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! n!)

    The hypergeometric distribution only gives the probability for one contingency matrix (8,2,4,26).
    But we are interested in the probability of this and all better (more in-in hits) combinations.
    Therefore, we build all "better" combinations and sum up the individual p-values:

     8   2        9   1       10   0
     4  26   +    3  27   +    2  28

    0.0002207 + 0.00000727 + 0.0000000779 ~ 0.000228
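The summation above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library, not COLOM's actual implementation; the function names `hypergeom_pmf` and `enrichment_p` are chosen here for illustration.

```python
from math import comb

def hypergeom_pmf(k, population, theme_size, group_size):
    """Probability of exactly k theme genes in a group drawn from the population."""
    return (comb(theme_size, k)
            * comb(population - theme_size, group_size - k)
            / comb(population, group_size))

def enrichment_p(k, population, theme_size, group_size):
    """Sum the probabilities of k and all 'better' (larger) in-in counts."""
    max_hits = min(theme_size, group_size)
    return sum(hypergeom_pmf(i, population, theme_size, group_size)
               for i in range(k, max_hits + 1))

# Example from the text: 40 genes in total, 10 on Chromosome1,
# Group6 contains 12 genes, 8 of them on Chromosome1.
p_exact = hypergeom_pmf(8, 40, 10, 12)
p_tail = enrichment_p(8, 40, 10, 12)
print(f"exact matrix: {p_exact:.7f}   this and better: {p_tail:.6f}")
```

The first value corresponds to the single matrix (8,2,4,26), the second to the sum over all three matrices shown above.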
    NB: Above pictures taken from a MS-Powerpoint presentation by John Quackenbush.

    To perform such a test we need four pieces of information:

    1. List of biological roles    GO-terms, chromosomal bands, pathways, ...
    2. Gene index                  Maps genes to the above biological roles
    3. Reference gene list         All genes on your microarray (filter, ...)
    4. Selection gene list         Result gene list of your statistical analysis
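How the four inputs combine into one 2x2 contingency matrix can be sketched as follows. This is an illustration under assumptions, not COLOM's code: the function names and the toy gene names (`g0` ... `g39`) are hypothetical, and genes missing from the reference list are simply ignored.

```python
def contingency(theme_genes, reference_genes, selection_genes):
    """Return (a, b, c, d): rows = theme in/out, columns = selection in/out."""
    reference = set(reference_genes)
    # Only genes present in the reference list can be counted
    theme = set(theme_genes) & reference
    selection = set(selection_genes) & reference
    a = len(theme & selection)        # in theme, in selection
    b = len(theme - selection)        # in theme, not in selection
    c = len(selection - theme)        # in selection, not in theme
    d = len(reference) - a - b - c    # in neither
    return a, b, c, d

def expected_hits(theme_size, selection_size, reference_size):
    """Number of theme genes expected in the selection by chance."""
    return theme_size * selection_size / reference_size

# Toy data reproducing the 40-gene example from the text
reference = [f"g{i}" for i in range(40)]
theme = reference[:10]                         # 10 genes on "Chromosome1"
selection = reference[:8] + reference[10:14]   # 12 selected genes, 8 of them in the theme
table = contingency(theme, reference, selection)
print(table, expected_hits(10, 12, 40))
```

With these inputs the matrix is (8, 2, 4, 26) and three theme genes would be expected in the selection by chance.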






    Gene Set Enrichment Analysis (GSEA) - Kolmogorov-Smirnov statistics

    Another approach was suggested by Aravind Subramanian et al.:
    "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles"

    In brief, their method works as follows:
    ".... GSEA considers experiments with genomewide expression profiles from samples belonging to two classes, labeled 1 or 2. Genes are ranked based on the correlation between their expression and the class distinction by using any suitable metric.

    Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. ....

    .... Calculation of an Enrichment Score. We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S. ....

    .... Estimation of Significance Level of ES. We estimate the statistical significance (nominal P value) of the ES by using an empirical phenotype-based permutation test procedure that preserves the complex correlation structure of the gene expression data. ...."
    Figure and text taken from the above publication.
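The running-sum idea quoted above can be sketched as follows. This is a simplified, unweighted variant (every hit counts equally) written for illustration; the published method additionally weights hits by their correlation and estimates significance by permutation, neither of which is shown here. The function name and the toy gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic; returns the maximum deviation from zero."""
    members = set(gene_set)
    n_hits = sum(1 for gene in ranked_genes if gene in members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    score = 0.0
    for gene in ranked_genes:
        # Step up on a member of the set, step down otherwise
        running += 1.0 / n_hits if gene in members else -1.0 / n_miss
        if abs(running) > abs(score):
            score = running
    return score

ranked = list("ABCDEFGHIJ")   # hypothetical ranked list L, best-correlated gene first
es_top = enrichment_score(ranked, {"A", "B", "C"})
es_bottom = enrichment_score(ranked, {"H", "I", "J"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list gives a score near +1, a set concentrated at the bottom a score near -1, and a randomly scattered set a score close to 0.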

    A few ES graphs for typical distributions of regulated gene sets:
    [Figures: up-regulated gene set; down-regulated gene set; gene set with Gaussian regulation distribution; gene set with (artificial) random distribution]





    Simple Enrichment Analysis (SEA) - Chi2 statistics







    COLOM

    In SUMO click the Co-Localization-Button and select the desired analyses:

    The browser (on first start-up empty) opens up:







    Load required data files

    Go to the Preferences page and select default files for 1. Biological role (in the example below: GO-terms file) and
    2. Gene index file mapping genes => biological roles (here: GO-terms):

    For some mappings the gene index file is not required (e.g. KEGG, Transfac). Here, the mapping of genes to biological roles is already included in the biological role files.

    Click the Auto Load buttons. On next start-up of COLOM the respective files are loaded automatically.

    NB: COLOM memorizes all settings for the different analysis types separately.

    Depending on the analysis, COLOM asks you to define the file structure. For details see below in the respective analyses (GO-terms, Chromosomal bands).

    Now click Load GO-terms (or the respective other analysis) and Load gene index.

    3. Load Reference gene list. This is a complete list of the genes analysed in your experiment (e.g. all genes on your microarray, filter, ...).
    COLOM supports three sources for such a list:

    a) Loaded SUMO analysis
       The easiest way, provided the opened SUMO analysis contains the full list of genes analysed on the microarray / filter / ... .
       This option should give the most accurate p-values.
    b) External data file
       Load the complete list of analysed genes from an external data file.
       Typically this could be the array's ADF file, or an expression matrix, ...
       This should give correct p-values, too.
    c) Genes mapped to biological theme
       Simply take all genes which are assigned to the loaded biological themes.
       It may happen that not all genes associated with the biological themes are found on the microarray.
       Thus the computed p-values might be slightly incorrect.
       But with genome-wide analyses this should not generate massively wrong data.

    A dialog opens up, asking how to get the reference gene list:



    Click the corresponding button.

    For option b) and c), COLOM asks to define the column with gene names.
    A file previewer shows up:

    Double-click the column containing gene names (or single-click the column, then click the OK button).


    4. Load selection gene list. This is a subset of your expression data resulting from statistical or cluster analysis. COLOM counts occurrence of selected as well as regulated genes and calculates p-values for all Nodes.

    When loading the Selection List, COLOM asks to define a column with gene names and a column with regulation values.
    A file previewer shows up:

    Click into the column containing the desired regulation values, then click the Regulation Column edit field.
    Now click into the column containing gene names, then click the Gene-ID column edit field.
    Click OK button to continue.


    Data loading and processing can require some time.
    Therefore, COLOM reports in the status line what it is doing and indicates the progress:


     

     

     

     

     

    Data View

    When all (four) data files are loaded and the statistics are calculated, you can browse and analyse the data.
    COLOM now looks like:

    Left panel      Browsable hierarchical tree of biological themes
    Right panel     Sortable table of populated biological themes
    Bottom panel    Info window. Content depends on what was selected in the above panels

    Each of the top panels can show the following columns:

    Column 1: Name of biological role / gene
        Icons indicate:
        • N: this is a node = biological theme. Click the +/- sign to expand / collapse this node
        • G: this is a gene
        Text colours indicate (default colours):
        • Gray: no gene in this biological theme, or this gene was not found in the reference gene list
        • Black: reference genes found for this biological theme
        • Blue: selection genes found for this node, no net regulation
        • Red: selection genes found for this node, net up-regulated
        • Green: selection genes found for this node, net down-regulated
    Column 2: Number of genes in the biological role found in the reference gene list (e.g. ADF file), or
        number of replicates for this gene found in the reference list
    Column 3: Number of genes in the biological role found in the selection list, or
        number of replicates for this gene found in the selection list
    Column 4: Number of up-regulated genes in the biological role found in the selection list, or
        number of up-regulated replicates for this gene found in the selection list
    Column 5: Number of down-regulated genes in the biological role found in the selection list, or
        number of down-regulated replicates for this gene found in the selection list
    Column 6: Net regulation of biological role / gene
        Arithmetic mean of the regulation of all genes populating the respective biological role / gene
    Column 7: Significance indicator
        A coloured disk indicates that this biological role is significantly over- / under-populated (set the critical p-value on the Preferences tab-sheet).
        Colours indicate:
        • Blue: whole selection
        • Red: up-regulated
        • Green: down-regulated
        • Yellow: both up- and down-regulated
    Column 8: Number of genes expected for the biological role
    Column 9: p-value: significance of the population of the biological role with all selection genes
    Column 10: p-value: significance of the population of the biological role with all up-regulated genes
    Column 11: p-value: significance of the population of the biological role with all down-regulated genes
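The gene counts and the net regulation listed above can be illustrated with a short sketch. It assumes that regulation values are signed (e.g. log ratios, positive = up-regulated), which the text does not state explicitly; the function name and the returned keys are hypothetical and do not reflect COLOM's internals.

```python
from statistics import mean

def summarise_node(regulation_values):
    """Counts and net regulation for the selection genes of one node."""
    up = sum(1 for value in regulation_values if value > 0)
    down = sum(1 for value in regulation_values if value < 0)
    # Net regulation: arithmetic mean over all selection genes of the node
    net = mean(regulation_values) if regulation_values else 0.0
    if not regulation_values:
        colour = None          # no selection genes: gray or black, depending on the reference list
    elif net > 0:
        colour = "red"         # net up-regulated
    elif net < 0:
        colour = "green"       # net down-regulated
    else:
        colour = "blue"        # no net regulation
    return {"selected": len(regulation_values), "up": up, "down": down,
            "net": net, "colour": colour}

summary = summarise_node([1.5, 0.5, -1.0])   # hypothetical regulation values
print(summary)
```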

    Resize columns, or drag columns to change their order.
    In the right panel (table), sort by double-clicking a column header (toggles between ascending and descending order).
    Customize the view on the Preferences tab-sheet by changing fonts / colours and the columns / names to display.

     

    The info window (bottom panel) shows various information, depending on what was done:

    E.g. single-click a node, either in the browser or in the table (left / right panel); here: a GO-term:

    The info window shows:

    Double click a Gene node:

    A web browser opens up, showing info for the selected gene available from NCBI'S Entrez Gene.

     

    Double click a line in table view (right panel):

    All nodes located in the same sub-tree with the selected node as root get selected (greyed background) and are listed in the info panel. Thus you can get an impression of whether correlated biological roles are enriched in a certain part of the sorted table.

    Gene-set enrichment:
    Additionally, a p-value according to the Mann-Whitney U test is calculated, indicating the probability that the selected nodes are enriched among the significantly populated nodes.
    NB: This computation requires that the data table is adequately sorted, e.g. sort the data table according to p-Sel (above example). The computed "enrichment" p-value tells you the probability of finding the selected nodes among those with the lowest p-values by chance. If the data are not sorted accordingly, the p-value is meaningless.
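A rank-based test of this kind can be sketched as follows. The sketch uses the normal approximation of the Mann-Whitney U statistic without tie correction, which is an assumption made here for brevity; COLOM's exact computation may differ. Ranks are the 1-based row positions of the selected nodes in the sorted table, and the function name is hypothetical.

```python
from math import erf, sqrt

def mann_whitney_p(selected_ranks, n_rows):
    """One-sided p-value that the selected rows sit nearer the top than the rest."""
    n1 = len(selected_ranks)
    n2 = n_rows - n1
    # U counts the (selected, other) pairs in which the other row ranks higher
    u = sum(selected_ranks) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # Lower tail of the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

p_top = mann_whitney_p([1, 2, 3, 4, 5], 100)          # selected nodes lead the table
p_spread = mann_whitney_p([10, 30, 50, 70, 90], 99)   # selected nodes evenly spread
print(p_top, p_spread)
```

This also shows why the sort order matters: the ranks carry no information about significance unless the table is sorted by the corresponding p-value column.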


     

     

     

     

    Filter

    Filter biological role nodes (displayed in tree-view).
    Click the filter button:

    and select the respective filter:

    Hide nodes without genes: Remove all biological-role nodes where no index genes were found.
    Hide nodes without reference genes: Remove all biological-role nodes where no genes from the reference list were found. Also remove all index genes from the tree which were not found in the reference list.
    Hide nodes without selection genes: Remove all biological-role nodes where no genes from the selection list were found. Also remove all genes from the tree which were not found in the selection list.
    Hide Genes: Remove all genes from the tree. Useful when you want to browse terms only.
    Show all: Remove all filters => show all nodes.

     

    Filter table:

    Open context menu on column headers and select filter:

    For each data column, Min / Max thresholds can be defined independently. Filters from multiple columns are combined with logical AND.
    A " * " in front of a column name indicates that a filter is set.

    Filter Min: Define the minimum value for filtering, i.e. all items with a value smaller than Min are removed
    Filter Max: Define the maximum value for filtering, i.e. all items with a value larger than Max are removed
    Filter All: Disable filtering for this column
    Remove all filter: Disable filters for all columns
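The AND-combination of per-column Min / Max thresholds can be sketched as below. The column names (`p_sel`, `n_sel`) and the data structure are hypothetical and only serve the illustration.

```python
def passes_filters(row, filters):
    """row: column -> value; filters: column -> (minimum or None, maximum or None)."""
    for column, (minimum, maximum) in filters.items():
        value = row[column]
        if minimum is not None and value < minimum:
            return False      # smaller than Min: removed
        if maximum is not None and value > maximum:
            return False      # larger than Max: removed
    return True               # the row passed every column filter (logical AND)

rows = [
    {"name": "A", "p_sel": 0.001, "n_sel": 12},
    {"name": "B", "p_sel": 0.200, "n_sel": 30},
    {"name": "C", "p_sel": 0.004, "n_sel": 2},
]
# Keep rows with p_sel <= 0.01 AND at least 5 selection genes
filters = {"p_sel": (None, 0.01), "n_sel": (5, None)}
kept = [row["name"] for row in rows if passes_filters(row, filters)]
print(kept)
```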

    Filter settings can be seen in Session log:







    Search

    Search in names / descriptions from all loaded genes or biological roles:

    Search result is shown in info-panel.

    Search in names / descriptions from biological roles populated with selection gene. Click the search button on top of table (right) panel:

    Search result is shown in info-panel.







    Copy

    Copy the statistics table (only selected entries; type Ctrl+A to select all entries).
    Click the copy buttons on top of the data table (right panel):

    Left button: Save selection as a tab-delimited text file.
    Right button: Copy selection as tab-delimited text to the clipboard (ready to be pasted into spreadsheet programs).
     







    Paste

    Instead of loading the Selection gene list from a text file, you can also copy a list of genes (with additional regulation information) from any other source and paste it into COLOM.
    COLOM expects a list of genes, one gene per line, optionally followed by a numerical regulation value separated by a TAB.
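The expected clipboard layout can be illustrated with a small parser. This is a sketch of the format described above, not COLOM's own reader; the function name and the example regulation values are made up.

```python
def parse_pasted_genes(text):
    """Map gene name -> regulation value (None when no value is given)."""
    genes = {}
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip empty lines
        fields = line.strip().split("\t")
        name = fields[0].strip()
        has_value = len(fields) > 1 and fields[1].strip()
        genes[name] = float(fields[1]) if has_value else None
    return genes

pasted = "TP53\t2.5\nBRCA1\nMYC\t-1.2\n"   # hypothetical clipboard content
selection = parse_pasted_genes(pasted)
print(selection)
```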







    Preferences

    Select the Preferences tab-sheet to customize COLOM's default data files and settings:

    Data files:
    Define default files for the Biological role tree (here: GO-terms) and the Gene index.
    Click the Auto-load check boxes to automatically load the default files at start-up of COLOM.

    Biological role tree (here: GO-tree)
    Define view style of tree and table panel:

    Suppress underrepresented values
        There may be various reasons why too few genes passed your statistical test or clustering.
        You might therefore see too many under-populated nodes. Click this field to mask underrepresented nodes, i.e. those nodes where the number of mapped genes is lower than the expected number.
        Or more simply: only show over-populated nodes.
    Alpha
        Critical p-value to mark nodes. Only nodes with p-values smaller than Alpha are marked with coloured buttons.
    p-Computation
    Biological role ID (here: GO-ID)
        Additionally display the systematic node ID in tree and table (e.g. "GO:0044464 - cell part")
    View biological role description
        Additionally display the node's description in tree and table (e.g. "cell part - "Any constituent part of a cell, the basic structural and functional unit of all organisms." [GOC:jl]")
    View gene counts
        Display columns containing gene counts in tree and table panel
    View p-values
        Display columns containing p-values in tree and table panel
    Exponential format
        View p-values in exponential format: 3.41E-003 instead of 0.00341.
        The most interesting aspect of a p-value is its order of magnitude (E-003).
        Everything below E-002 is interesting. The smaller the better (E-006, E-007, ...)
       






    Data files and file formats

    As mentioned above COLOM requires 4 data sets to evaluate co-localisation information:

    1. Biological role (e.g. GO-terms, Chromosomal bands, pathways, ...)
    2. Gene index (associates genes to biological roles, e.g. ALB => GO:0019836 hemolysis of symbiont ...)
    3. Reference gene list (all genes on e.g. a micro-array)
    4. Selection list (filtered genes, e.g. significant genes from a t-test)

    Two of them (Biological role and Gene index) are normally derived from commonly available resources. Below is a description of the data file formats COLOM expects for the different kinds of analysis.

    Reference and Selection gene lists are simple tab delimited text files.


    Download the respective Biological role and corresponding Gene index files and "install" them with COLOM:
    Go to the Preferences page and select the downloaded role / gene index files in the respective fields.

    For the gene index, you have to define the respective columns:
    - A file dialog opens up, showing the first few lines of the selected gene index file.
    - Identify the required data columns
    - Set the column IDs in the respective edit fields

    Check both auto-load fields.
    Close and restart COLOM.






    GO-Terms

    Biological Role:
    GO-tree

    Names, descriptions and relationships of GO-terms (i.e. which GO-term is part of another).

    SUMO expects GO-tree data in OBO v1.2 data format.
    A good source for an up-to-date GO download is the Gene Ontology download page.
    Download GO-ontology files in OBO v1.2 format:
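For orientation, an OBO v1.2 file consists of stanzas such as `[Term]`, each holding `tag: value` lines. The sketch below reads only the `id`, `name` and `is_a` tags of `[Term]` stanzas; other relationships (e.g. `relationship: part_of ...`) are ignored, and this is not how COLOM itself reads the file. The function name and the small sample are for illustration only.

```python
def parse_obo_terms(lines):
    """Return {term_id: {"name": ..., "is_a": [parent ids]}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected
            current = {"name": "", "is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "id":
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "is_a":
                # Drop the trailing "! parent name" comment
                current["is_a"].append(value.split(" ! ")[0])
    return terms

sample = """[Term]
id: GO:0044464
name: cell part
is_a: GO:0005575 ! cellular_component

[Typedef]
id: part_of
name: part of
""".splitlines()
go_terms = parse_obo_terms(sample)
print(go_terms)
```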

    Gene-Index:

    Association: which gene is associated with a certain GO-term.

    SUMO expects a tab-delimited data file containing the columns, e.g.:

    GO-ID          GO:0040007
    Gene           BMP10
    Description    Bone morphogenetic protein 10 precursor (BMP-10).[Source:Uniprot/SWISSPROT;Acc:O95393]

    You can generate such files e.g. with Ensembl BioMart.
    One example of a recent (from 11.01.2008) version of such a file may be downloaded from here. (Right mouse click | Save as)

    Another source may be NCBI. From their FTP site, navigate to the Gene-Data folder.
    The gene2go.gz file links Gene-IDs to GO-terms. The gene2go file contains several hundred thousand lines for all organisms. Thus, it may be helpful to keep only those lines containing information for the organism of interest (by Taxonomy-ID, e.g. 9606 = Homo sapiens, 10090 = Mus musculus, 10116 = Rattus norvegicus, ...).
    The gene_info.gz file contains detailed annotations for Gene-IDs (e.g. description, chromosomal localization, gene symbol, ...). The gene_info file contains a few million lines for all organisms. Again, it may be helpful to keep only those lines for the organism of interest by Taxonomy-ID (as above).
    Alternatively, browse to the GENE_INFO folder and download the gene info for the organism of interest (e.g. human: go to subfolder Mammalia and download Homo_sapiens.gene_info.gz).
    Uncompress the files (most Zip compressors, e.g. Winzip, can handle ".gz" files) and merge the data files by the Gene-ID (e.g. with TableButler).

    Recent files (updated every few weeks) can be downloaded from the SUMO site. It is better, however, to visit the above-mentioned sites (or others), and to download and pre-process the data yourself to generate optimised data files for use with COLOM.
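The filter-and-merge step described above can be sketched in plain Python. The sketch assumes that both files are tab-delimited with the Taxonomy-ID in the first and the Gene-ID in the second column, and that header lines start with "#"; the sample rows are shortened, and the Gene-IDs in them are made up for illustration.

```python
def read_rows(lines):
    """Split tab-delimited lines into fields, skipping '#' header lines."""
    return [line.rstrip("\n").split("\t") for line in lines
            if line.strip() and not line.startswith("#")]

def filter_by_taxon(rows, tax_id):
    """Keep rows whose first column is the requested Taxonomy-ID."""
    return [row for row in rows if row[0] == str(tax_id)]

def merge_by_gene_id(gene_info_rows, gene2go_rows):
    """Append the GO columns of each gene2go row to the matching gene_info row."""
    info = {row[1]: row for row in gene_info_rows}     # column 2 = Gene-ID
    return [info[row[1]] + row[2:] for row in gene2go_rows if row[1] in info]

# Shortened, hypothetical sample content (real files have many more columns)
gene_info = ["#tax_id\tGeneID\tSymbol", "9606\t1001\tBMP10", "10090\t2002\tBmp10"]
gene2go = ["#tax_id\tGeneID\tGO_ID", "9606\t1001\tGO:0040007", "10090\t2002\tGO:0040007"]

human_info = filter_by_taxon(read_rows(gene_info), 9606)
human_go = filter_by_taxon(read_rows(gene2go), 9606)
merged = merge_by_gene_id(human_info, human_go)
print(merged)
```

For the real, g-zipped downloads the lines could be read with Python's `gzip.open(path, "rt")` instead of the in-memory lists used here.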

    Data files                      GO-Term   Gene-Name   GO/Gene-Description   Gene aliases             Alias divider
                                    column    column      column                (alternate gene names)
    Biological roles: GO-Tree
        gene_ontology.obo
    Gene index => GO-terms
        Ensembl                     4         3           2
        NCBI                        18        3           9                     5                        | ("Pipe", ASCII=124)
            homo sapiens
            mus musculus
            rattus norvegicus






    Chromosomal bands

    "G-banding is a technique used in cytogenetics to produce differently stained regions on condensed chromosomes. The metaphase chromosomes are treated with trypsin (to partially digest the protein) and stained with Giemsa. Dark bands that take up the stain are strongly A,T rich (gene poor). (...) Banding can be used to identify chromosomal abnormalities, such as translocations, because there is a unique pattern of light and dark bands for each chromosome."
    (from Wikipedia)
    Biological role:
    Ideogram
    List of Chromosomal bands.
    COLOM expects a tab-delimited ideogram data file containing:
    • one chromosomal band per line
    • 1st column: chromosome number (e.g. 1..22, X, Y)
    • 2nd column: arm (e.g. p, q)
    • 3rd column: band (ter, cen, 1, 11, 11.1, 11.1a, ...)

    Such a file may be downloaded from NCBI's FTP site: e.g. ideogram.gz (right mouse-click, Save as):

    #chromosome arm band iscn_start iscn_stop bp_start bp_stop stain density bases
    1 p ter 0 1 0 0     1
    1 p   0 7335 1 124300000     124300000
    1 p 3 0 4852 1 84700000     84700000
    1 p 36 0 1521 1 27800000     27800000
    Gene index:
    Genes-to-ideogram
    Association: which gene is located on a certain chromosomal band.

    COLOM expects a tab delimited data file containing

    • one gene per line
    • one column containing chromosomal band where gene is located
    • one column containing gene name
    • one column containing gene description

    Such a file may be downloaded from NCBI's FTP site, e.g. cyto_gene.md.gz for Homo sapiens. Data files for other organisms may be downloaded from the respective directories.

    #tax_id chromosome iscn_start iscn_end orientation featureName featureId featureType printLocation units
    9606 1 1839 1895 na A3GALT2 GeneID:127550 GENE 1p35.1a BandsAsInt
    9606 1 660 710 na AADACL3 GeneID:126767 GENE 1p36.21d BandsAsInt
    9606 1 660 710 na AADACL4 GeneID:343066 GENE 1p36.21d BandsAsInt
    9606 1 5682 5741 na ABCA4 GeneID:24 GENE 1p22.1a BandsAsInt
    9606 1 13847 13906 na ABCB10 GeneID:23456 GENE 1q42.13e BandsAsInt
    9606 1 5741 5795 na ABCD3 GeneID:5825 GENE 1p21.3d BandsAsInt
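How such rows map genes to chromosomal bands can be sketched as follows. The column positions (gene name in column 6, band location in column 9) follow the sample above; the regular expression for splitting a location such as "1p35.1a" into chromosome, arm and band is an assumption made for this illustration and only covers numbered chromosomes plus X and Y.

```python
import re

# chromosome (digits or X/Y), arm (p or q), remaining band label
BAND_PATTERN = re.compile(r"^(\d+|[XY])([pq])(.*)$")

def split_location(print_location):
    """Split e.g. '1p35.1a' into ('1', 'p', '35.1a'); None if it does not match."""
    match = BAND_PATTERN.match(print_location)
    return match.groups() if match else None

def genes_to_bands(lines, gene_col=6, band_col=9):
    """Map gene name -> (chromosome, arm, band) using 1-based column numbers."""
    index = {}
    for line in lines:
        if line.startswith("#"):
            continue                      # header line
        fields = line.rstrip("\n").split("\t")
        location = split_location(fields[band_col - 1])
        if location is not None:
            index[fields[gene_col - 1]] = location
    return index

sample = [
    "#tax_id\tchromosome\tiscn_start\tiscn_end\torientation\tfeatureName\tfeatureId\tfeatureType\tprintLocation\tunits",
    "9606\t1\t1839\t1895\tna\tA3GALT2\tGeneID:127550\tGENE\t1p35.1a\tBandsAsInt",
    "9606\t1\t13847\t13906\tna\tABCB10\tGeneID:23456\tGENE\t1q42.13e\tBandsAsInt",
]
band_index = genes_to_bands(sample)
print(band_index)
```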

     

    Here are a few ready-to-use recent (updated every few weeks) data files:

    Data file
    Biological roles: Ideogram (G-banding)
        homo sapiens (NCBI)               uses data columns 1, 2, 3 (see above)
        mus musculus (NCBI)
        rattus norvegicus (NCBI)
    Gene index:
        homo sapiens (NCBI)               First row: 2
        mus musculus (NCBI)               Band-ID column: 9
        rattus norvegicus (NCBI)          Gene-ID column: 6
                                          other: 0 (= not used)

    NCBI data files are g-zipped. Common MS-Windows utilities (e.g. the freeware 7-zip, or the commercial Winzip and WinRar) can unpack such archives.

    When running COLOM with chromosomal bands you get a new tool button:

    Click the button to visualise the statistics as "Coloured Chromosomes":


    At the first use of "Coloured Chromosomes", go to the Preferences tab-sheet and define the Ideogram and gene index files in the same way as in COLOM.




    KEGG metabolic pathways

    "KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for:

    1. Metabolism
      Carbohydrate Energy Lipid Nucleotide Amino acid Other amino acid
      Glycan PK/NRP Cofactor/vitamin Secondary metabolite Xenobiotics

    2. Genetic Information Processing

    3. Environmental Information Processing

    4. Cellular Processes

    5. Human Diseases
      and also on the structure relationships (KEGG drug structure maps) in:

    6. Drug Development"

    For more details about KEGG visit their web-site.

    Since 2011, the KEGG pathway database is no longer freely available from the KEGG FTP site via anonymous download.
    Instead, you may subscribe to their FTP site and download KEGG data for personal or institutional use.

    If you have access to the KEGG database, extract the information to a folder on your local computer:
    In COLOM, go to the Preferences page.
    As "KEGG file", select any of the previously unpacked "*.conf" files.

    An outdated set of KEGG pathway files (downloaded in 2011 via anonymous FTP from the KEGG FTP site) may be downloaded from the SUMO site for exploring the functionality of COLOM.
    All data are packed in a self-extracting archive.
    For serious work, it is better to subscribe to KEGG and use updated data.

    Members of DKFZ may use the up-to-date KEGG pathway database which is available via HUSAR (DKFZ-wide subscription to KEGG).






    miRNA

    Wikipedia:
    A microRNA (abbreviated miRNA) is a short ribonucleic acid (RNA) molecule found in eukaryotic cells. A microRNA molecule has very few nucleotides (an average of 22) compared with other RNAs.
    miRNAs are post-transcriptional regulators that bind to complementary sequences on target messenger RNA transcripts (mRNAs), usually resulting in translational repression or target degradation and gene silencing. The human genome may encode over 1000 miRNAs, which may target about 60% of mammalian genes and are abundant in many human cell types.


    Biological Role: a list of miRNAs

    COLOM expects a tab-delimited text file containing a column with miRNA names:

    MicroRNA     miR            GeneID   Gene
    MI0000266    hsa-miR-10a    23054    NCOA6
    MI0000102    hsa-miR-100    2475     FRAP1
    MI0000109    hsa-miR-103    2746     GLUD1
    MI0000109    hsa-miR-103    9493     KIF23
    ....

    Gene index:
    a list of associations miRNA => gene

    MicroRNA     miR            GeneID   Gene
    MI0000266    hsa-miR-10a    23054    NCOA6
    MI0000102    hsa-miR-100    2475     FRAP1
    MI0000109    hsa-miR-103    2746     GLUD1
    MI0000109    hsa-miR-103    9493     KIF23
    ....
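Such an association list can be turned into a miRNA => genes index with a few lines. The sketch assumes a tab-delimited file with one header line, the miRNA name in the second and the gene symbol in the fourth column, as in the sample rows; the function name and the column defaults are chosen for this illustration.

```python
from collections import defaultdict

def build_mirna_index(lines, mirna_col=2, gene_col=4, header_lines=1):
    """Map miRNA name -> set of target gene symbols (1-based column numbers)."""
    index = defaultdict(set)
    for line in lines[header_lines:]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < max(mirna_col, gene_col):
            continue                      # skip short or placeholder lines
        index[fields[mirna_col - 1]].add(fields[gene_col - 1])
    return dict(index)

sample = [
    "MicroRNA\tmiR\tGeneID\tGene",
    "MI0000266\thsa-miR-10a\t23054\tNCOA6",
    "MI0000102\thsa-miR-100\t2475\tFRAP1",
    "MI0000109\thsa-miR-103\t2746\tGLUD1",
    "MI0000109\thsa-miR-103\t9493\tKIF23",
]
mirna_index = build_mirna_index(sample)
print(mirna_index)
```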

    One source for miRNA <=> gene interactions might be the miRWalk database.
    For details on how the data were generated see:
    Dweep, H., Sticht, C., Pandey, P., Gretz, N.,
    miRWalk - database: prediction of possible miRNA binding sites by "walking" the genes of 3 genomes,
    Journal of Biomedical Informatics (2011), doi: 10.1016/j.jbi.2011.05.002 (PMID: 21605702)

    Here are a few association lists extracted from the miRWalk database (built 09-2012, not updated).

    Predicted: different lists, with associations detected by 1 up to all 10 algorithms supported by miRWalk:

    The file names encode the content, e.g. MirToGene_813k_650miR_15294.txt stands for
    - 813000 associations miRNA <=> gene
    - 650 miRNAs
    - 15294 genes

    File                                     Number of algorithms   Number of predictions   Number of miRNAs   Number of genes
    MirToGene_LE1_5641K_652miR_16359G.txt    1 or more              5640798                 652                16359
    MirToGene_LE1_5641K_652miR_16359G.txt    2 or more              3553248                 651                16243
    MirToGene_LE3_2087K_619miR_16109G.txt    3 or more              2086987                 619                16109
    MirToGene_LE4_1259k_568miR_15826G.txt    4 or more              1259129                 568                15826
    MirToGene_LE5_715K_547miR_14866G.txt     5 or more              715864                  547                14866
    MirToGene_LE6_171K_489miR_11745G.txt     6 or more              171095                  489                11745
    MirToGene_LE7_29K_343miR_5563G.txt       7 or more              29249                   343                5563
    MirToGene_LE8_8K_181miR_2090G.txt        8 or more              8125                    181                2050
    MirToGene_LE9_2k_104miR_652G.txt         9 or more              2043                    104                652
    MirToGene_LE10_0,2K_56miR_156G.txt       10                     239                     56                 105

    A list MirToGene_813k_650miR_15294.txt compiled of
    - all miRNAs hit by at least 5 algorithms
    - new miRNAs hit by at least 4 algorithms
    - new miRNAs hit by at least 3 algorithms
    - new miRNAs hit by at least 2 algorithms
    - new miRNAs hit by at least 1 algorithm


    Validated: hsa_mirwalk_validated.txt






    TransFac







    MSig-DB

    The Molecular Signatures Database (MSigDB) is a collection of annotated gene sets for use with GSEA software.
    The MSigDB is maintained by the GSEA team with the support of our MSigDB Scientific Advisory Board.


    For more information on MSigDB visit the GSEA website at the Broad Institute,
    or read their publication or the complementary material.
    The GSEA software and source code and the Molecular Signatures Database (MSigDB) are freely available to individuals in both academia and industry for internal research purposes. Please see the GSEA/MSigDB license for more details.

    Register at the GSEA site, go to the download page and get the respective database for your research.
    Select the databases using GENE SYMBOLS

    CoLoM expects the tab-delimited MSig-DB gene set format (*.gmt): one gene set per line, starting with the set name and a description / URL, followed by the gene symbols.
    E.g.
    TRANSITION_METAL_ION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	http://www.broadinstitute.org/gsea/msigdb/cards/TRANSITION_METAL_ION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	SLC11A2	FXN	SLC30A4	SLC30A5	CCS	SLC30A3	SLC31A2	SLC31A1	...
    CYCLASE_ACTIVITY	http://www.broadinstitute.org/gsea/msigdb/cards/CYCLASE_ACTIVITY	GUCY2F	ADCY9	ADCY7	ADCY8	RTCD1	GUCY1A2	GUCY1A3	GUCY1B3	...
    LOW_DENSITY_LIPOPROTEIN_BINDING	http://www.broadinstitute.org/gsea/msigdb/cards/LOW_DENSITY_LIPOPROTEIN_BINDING	APOA4	CDH13	LDLR	ANKRA2	STAB1	CXCL16	SORL1	LRP6	...
    MAP_KINASE_KINASE_KINASE_ACTIVITY	http://www.broadinstitute.org/gsea/msigdb/cards/MAP_KINASE_KINASE_KINASE_ACTIVITY	MAP3K7	MAP3K6	MAP3K5	MAP3K4	ZAK	MAP3K3	MAP3K9	MAP3K10	...
    ...
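A gene set file of this kind is easy to read or to generate yourself. The following is a minimal reader sketch for tab-delimited lines of the form name, description / URL, gene symbols; it is an illustration, not CoLoM's own loader, and the function name is hypothetical.

```python
def parse_gmt(lines):
    """Map gene set name -> {"description": ..., "genes": [...]}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                      # needs a name, a description and at least one gene
        name, description, *genes = fields
        gene_sets[name] = {"description": description,
                           "genes": [gene for gene in genes if gene]}
    return gene_sets

sample = [
    "CYCLASE_ACTIVITY\thttp://www.broadinstitute.org/gsea/msigdb/cards/CYCLASE_ACTIVITY\tGUCY2F\tADCY9\tADCY7",
    "...",
]
gene_sets = parse_gmt(sample)
print(gene_sets["CYCLASE_ACTIVITY"]["genes"])
```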

    Because of its simple structure, it may be convenient to generate your own signature / theme database in MSig-DB format and use it with CoLoM.







    Wiki pathways

    Another source for publicly available and publicly curated pathways might be the Wiki Pathways project:

    "WikiPathways is an open, public platform dedicated to the curation of biological pathways by and for the scientific community.
    WikiPathways was established to facilitate the contribution and maintenance of pathway information by the biology community. WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways. (...) More importantly, the open, public approach of WikiPathways allows for broader participation by the entire community, ranging from students to senior experts in each field. This approach also shifts the bulk of peer review, editorial curation, and maintenance to the community."


    For more details visit the Wiki Pathways web page.

    To work with Wiki pathways within CoLoM, download the pathways from their download area.

    Pathways for various organisms are available.

    Download:
    Unpack both Zip archives into any folder.
    On CoLoM's Preferences page, click the "..." and navigate to the folder where you just extracted the data.
    Select any of the *.gpml files.
    Check Autoload.

    Close and restart CoLoM.






    Reactome


    From the Reactome website:

    "Reactome is a free, open-source, curated and peer-reviewed pathway database.
    Our goal is to provide intuitive bioinformatics tools for the visualization, interpretation
    and analysis of pathway knowledge to support basic research, genome analysis, modeling,
    systems biology and education."

    from https://reactome.org/

    Reactome data can be downloaded from the Reactome download web page.

    CoLoM uses:
    To translate Gene-IDs into gene names / symbols (e.g. 3251 => HPRT1), CoLoM uses NCBI's Gene-Info data, downloadable from NCBI's FTP site.
    Use species-specific files (e.g. .../Mammalia/Homo_sapiens.gene_info.gz) or a general info file (e.g. .../Mammalia/All_Mammalia.gene_info.gz), and filter for the relevant species (e.g. Homo sapiens => Tax-ID 9606).
    Furthermore, CoLoM extracts alias gene names to resolve alternate namings.

    All data should be combined into a Reactome database to be used with CoLoM.
    The format of this database is adapted from Broad's MSig-DB signature file format (*.GMT):
    The first part contains TAB-separated pathway information:
    
    R-HSA-350054:Notch-HLH transcription pathway	--	HDAC6	HDAC5	MAMLD1	CREBBP	SNW1	KAT2A	HDAC1	HDAC2	RBPJ	NOTCH1	NOTCH2	NOTCH3	...
    R-HSA-350562:Regulation of ornithine decarboxylase (ODC)	--	PSME3	PSMD14	PSMB11	PSMA8	NQO1	PSME4	OAZ1	OAZ2	ODC1	AZIN1	OAZ3	PSMA1	PSMA2	...
    R-HSA-349425:Autodegradation of the E3 ubiquitin ligase COP1	--	PSME3	PSMD14	PSMB11	PSMA8	PSME4	ATM	PSMA1	PSMA2	PSMA3	PSMA4	PSMA5	PSMA6	PSMA7	PSMB1	...
    
    

    Followed by the Alternate name section:
    
    [Gene-Alias-List]
    A1BG	A1B|ABG|GAB|HYST2477
    A1CF	ACF|ACF64|ACF65|APOBEC1CF|ASP
    A2M	A2MD|CPAMD5|FWP007|S863-7
    ...
    //[Gene-Alias-List]
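The two-part layout shown above (pathway lines, then an alias section between `[Gene-Alias-List]` and `//[Gene-Alias-List]`) can be read with a small sketch like the following. It assumes that the fields are TAB-separated, that the second field ("--") is a placeholder, and that aliases are separated by "|"; the function name is hypothetical and the code is not CoLoM's own reader.

```python
def parse_reactome_db(lines):
    """Return (pathways, aliases) from a CoLoM-style Reactome database."""
    pathways, aliases = {}, {}
    in_alias_section = False
    for raw in lines:
        line = raw.strip()                # also removes trailing TABs
        if not line:
            continue
        if line == "[Gene-Alias-List]":
            in_alias_section = True
        elif line == "//[Gene-Alias-List]":
            in_alias_section = False
        elif in_alias_section:
            fields = line.split("\t")
            aliases[fields[0]] = fields[1].split("|") if len(fields) > 1 else []
        else:
            fields = line.split("\t")
            if len(fields) < 3:
                continue                  # e.g. a "..." placeholder line
            key, _placeholder, *genes = fields
            pathway_id, _, name = key.partition(":")
            pathways[pathway_id] = (name, genes)
    return pathways, aliases

sample = [
    "R-HSA-350054:Notch-HLH transcription pathway\t--\tHDAC6\tHDAC5",
    "[Gene-Alias-List]\t\t\t",
    "A1BG\tA1B|ABG|GAB|HYST2477\t\t",
    "//[Gene-Alias-List]",
]
pathways, aliases = parse_reactome_db(sample)
print(pathways, aliases)
```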
    
    

    A prebuilt (from 08-2021), but not regularly updated, database may be downloaded for test cases from the SUMO site:

    Reactome-homo_sapiens.gmt


    But it is preferable to generate the database from freshly downloaded original data files.

    CoLoM provides a tool to download the data and create species-specific databases for use with CoLoM.

    In CoLoM popup menu select Build databases | Reactome:


    A parameter Dialog opens up:

    Specify the three source data files (see above):

    Specify the species (taxon) to filter for.
    Specify "taxon-name:Taxon-ID" (e.g. "homo sapiens:9606").

    CoLoM tries to download / open the files and build the species-specific CoLoM database.
    The resulting CoLoM database is stored as "SUMO-folder\data\colom\reactome\Reactome-SPECIE.gmt".
    Now you may select the newly generated database on CoLoM's Preferences tab-sheet.
    Depending on the internet connection and server load, downloading the data files may take a while, so please be patient.
    Work progress and results are shown in CoLoM's Session-Log tab-sheet / status bar.






    Reference gene lists

     

    Selection gene lists







     

     
