SeedScan - The Graphical User Interface

When you start SeedScan for the first time, all data fields are empty:




Fill in the required files and search parameters.

All red fields HAVE to be updated/specified for each analysis.

Green fields have to be updated/specified when different libraries - barcodes/pairs/seeds - are used.

White fields may be set once and remain unchanged forever.


Multiple sequence files

Define the FastQ files to be analyzed:

  • Drag the files from Windows File Explorer into the respective Sequence file field or

  • Click the "..." buttons. Use the file dialog to find and select the respective file or

  • Type or copy paste a file specification

  • FastQ files may be supplied as plain ASCII text files or as G-Zip compressed archives.
    When G-Zip compressed archives are used, ensure that files have Extension "*.gz".

    Next fill in structure of the constructs:











    Bar codes

    Supply:











    Seed files












    Run the analyses

    Now, after all parameters have been checked and correctly adjusted, you may start the analysis.

    Click the Direct Run button.

    The Library files are loaded.
    Loading and contstruction of 64-bit sorted hashcode transformed seed libraries will take a while.
    Transformation progress is indicated in the status line (bottom of window).

    Next scan is started.
    Scan progress is indicated by regular update of the bar chart and in the program windows title line.
    You may click the Minimize button to reduce SeedScan's window but still see the scan progress:



    We can see:

    At any time, you may:

    The analysis finalizes as soon as:











    ClonTracer - Run the analyses

    For ClonTracer runs, remember the expected structure:

    Now, after parameters have been checked and correctly adjusted: you may start the analysis.

    Click the ClonTracer Run button.

    The Library files are loaded.

    Next scan is started.
    Scan progress is indicated by regular update of the bar chart and in the program windows title line.
    You may click the Minimize button to reduce SeedScan's window but still see the scan progress:

    At any time, you may:

    The analysis finalizes as soon as:

    For each sample a ClonTracer Barcode list is generated, containing all CT-BarCcodes/abundance for this specific samples.
    Additionally a consensus matrix will be generated.
    By default, all CT-Barcodes which show up in at least one sample with abundace>=100 will be collected into the consensus matrix.

    Use ClonTracer Utilities to:











    Result data

    Finally, as soon as the anaylsis has terminated, a set of result files are automatically generated:
    All result files are generated in the location as specified for the Result file.

    The differerent result file names are composed of the base file name - as defined by the user - and an extension specific for the data type:











    View menu

    From the view menu you may ona any of the disbostic plats:











    Utilties menu

    From the utilities menu you may select extended options or additionaL functionality not diretly related to seed scanning











    Rescue no-BC1 sequences

    Due to the implementation of base calling software on Illumina HTS systems, highly similar costructs may result in a mentionable fraction of miscalled sequences.

    To circumvent this problem, it may be helpful ta add 20%-50% of random non-sense sequences to the sequencing library.
    This should increase the number of correctly called sequences - but on the other hand will reduce the fraction of useable sequences.

    In the special case of our barcoded seeds library constructs, the first ~35 bases are mostly common (~30 different version), followed by ~60000 different seeds.

    Under this condition even a Transrciptome/Exome/Genome/Methylome/ ..... library might act as "non-sense" library.
    Thus you might use the capacity of the "non-sense sequences" (~100 Million reads on a HiSeq4000) to run additonal meaningful samples.

    Check the Rescue no-BC1 sequences option.

    All Sequence pairs where no valid Barcode-1 was found will be written into 2 new FastQ files.
    These files may be used subsequently for alignment to a genome or whatever else.
    The new FastQ files will be created with names:
    Writing this files will slow down the analysis progress and create big files (~200 GByte).
    Thus, acitvate this option only if you really want to reuse the "non-sense" sequences.











    Scan seeds <=> Seed-Lib

    Accuracy of seed identification depends among sequencing quality on uniqueness of the seed sequences.

    To check this, you may scan each individual seed sequence against the complete seed library.
    In an ideal case you would only find unique matches
    Even if 1-error searching is enabled.

    One seed file may be checked in a single analysis.
    The seed file is taken from the Seed file text field. If two (or more) files are specified, the first one is used.
    Check the Allow 1 base exhcange box to enable 1-error matches.

    SeedScan generates a brief summary and two result files:



    The histogram indicates the distribution of sizes from crossmatching clusters.
    As exprected far most of the seeds are singular.
    But there are also 10 cluster where 28 seed's targets share the same recognition sequence.

    Two data files are generated:













    Scan seeds <=> Reference library

    Another question arises:


    Therefore, you may scan all seeds against a reference sequence library and find out (for each individual seed sequence:

    Example:
    You have a libray of gene silencing constructs for human genes.
    Each construct contains a guide sequence (the seed sequence) which shall target a single gene.

    Donwload a list of all genes as multiple FASTA file - e.g. RefseqRNA for homo sapiens from NCBI's ftp server:

    ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/RNA/

    Download and uncompress "rna.fa.gz".

    Process your seed library that each line contains tab delimited:
    E.g.:
    Seed_name	Seed_seq
    A1BG	GTCGCTGAGCTCCGATTCGA
    A1BG	ACCTGTAGTTGCCGGCGTGC
    A1BG	CGTCAGCGTCACATTGGCCA
    A1CF	CGCGCACTGGTCCAGCGCAC
    A1CF	CCAAGCTATATCCTGTGCGC
    A1CF	AAGTTGCTTGATTGCATTCT
    A2M	CGCTTCTTAAATTCTTGGGT
    A2M	TCACAGCGAAGGCGACACAG
    A2M	CAAACTCCTTCATCCAAGTC
    A2ML1	AAATTTCCCCTCCGTTCAGA
    ...	
    

    Select Main menu | Utilities | Seeds => RefDB.

    A Parameter dialog opens up:



    Fill in required values for seed-list / reference-db files and seed length.
    Double click the respective fields to open a file selection dialog
    or more easily just drag files from Windows Explorer into the respective fields.

    To check gene alias names too, supply a file mapping "official" gene symbls to commonly used gene aliases in the Gene => Alias file field.
    Such a file may be downloaded from NCBI's ftp server.
    E.g. "Homo_sapines_gene_info":
    #tax_id	GeneID	Symbol	LocusTag	Synonyms	dbXrefs	chromosome	map_location	description	type_of_gene	Symbol_from_nomenclature_authority	Full_name_from_nomenclature_authority	Nomenclature_status	Other_designations	Modification_date	Feature_type
    9606	1	A1BG	-	A1B|ABG|GAB|HYST2477	MIM:138670|HGNC:HGNC:5|...	19	19q13.43	alpha-1-B glycoprotein	protein-coding	A1BG	alpha-1-B glycoprotein	O	alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA	20170903	-
    9606	2	A2M	-	A2MD|CPAMD5|FWP007|S863-7	MIM:103950|HGNC:HGNC:7|...	12	12p13.31	alpha-2-macroglobulin	protein-coding	A2M	alpha-2-macroglobulin	O	alpha-2-macroglobulin|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|alpha-2-M	20170903	-
    9606	3	A2MP1	-	A2MP	HGNC:HGNC:8|Ensembl:ENSG00000256069	12	12p13.31	alpha-2-macroglobulin pseudogene 1	pseudo	A2MP1	alpha-2-macroglobulin pseudogene 1	O	pregnancy-zone protein pseudogene	20170402	-
    9606	9	NAT1	-	AAC1|MNAT|NAT-1|NATI	MIM:108345|HGNC:HGNC:7645|...	8	8p22	N-acetyltransferase 1	protein-coding	NAT1	N-acetyltransferase 1	O	arylamine N-acetyltransferase 1|N-acetyltransferase 1 (arylamine N-acetyltransferase)|N-acetyltransferase type 1|arylamide acetylase 1|monomorphic arylamine N-acetyltransferase	20170906	-
    9606	10	NAT2	-	AAC2|NAT-2|PNAT	MIM:612182|HGNC:HGNC:7646|...	8	8p22	N-acetyltransferase 2	protein-coding	NAT2	N-acetyltransferase 2	O	arylamine N-acetyltransferase 2|N-acetyltransferase 2 (arylamine N-acetyltransferase)|N-acetyltransferase type 2|arylamide acetylase 2	20170903	-
    
    The tab-delimited file should contain:

    In the Gene/Alias column divider field define (delimited by comma (","));

    Click OK-button to start the scan.

    Each individual reference sequence is scanned aginst the seeds, accepting

    For each seed the number of matching reference is counted and the names between the seeds and the reference library are compared.

    Several result files are generated:

    " *_Ref_UniqueMatch - the list of genes which match their target unqiuely on a reference sequence which has the same "name" (or an alias) as the seed.

    " *_Ref_MultipleMatches" - a list of seeds which match multiple reference sequences.
    Each line contains (tab delimited) Seed name, Seed sequence, Seed-ID, List of names from matching sequences
    E.g.:
    SeedName	SeedSeq	SeedID	Matched gene symbols
    ANKRD65	CACCCGCGAGTGTCCGTGCC	2137	ankrd65,loc105378585,
    ANKRD65	CCGCCTGGCACGGACACTCG	2138	ankrd65,loc105378585,
    APITD1	GGCAGCAGTTCACTATACTG	2424	apitd1,apitd1-cort,
    APITD1	GAGCTGACTTTCCGACAGTG	2425	apitd1,apitd1-cort,
    APITD1-CORT	GCAGCGATTCTCTTACCAAC	2427	apitd1,apitd1-cort,
    APITD1-CORT	GGCAGCAGTTCACTATACTG	2428	apitd1,apitd1-cort,
    APITD1-CORT	GAGCTGACTTTCCGACAGTG	2429	apitd1,apitd1-cort,
    ATAD3B	GTACAGGCCCCGGTTCTTCT	3413	atad3b,atad3c,
    BAI2	GCGCCGCTTCCGCATGTGCC	4115	adgrb2,bai2,
    BAI2	CGAAGTTCTTGTCGAAGTGC	4116	adgrb2,bai2,
    BAI2	CTACCTGCGCTTCAACCGCC	4117	adgrb2,bai2,
    BMP8B	GGTCATGAGCTTCGTTAACA	4623	bmp8b,loc105378951,
    C1orf151-NBL1	TAGCCCGATTCATGCCATCT	5776	minos1-nbl1,nbl1,
    C1orf151-NBL1	CGGCAAGGAGCCTAGTCACG	5777	minos1-nbl1,nbl1,
    C1orf151-NBL1	TCGCACCAGGCACTCTTATC	5778	minos1-nbl1,nbl1,
    CCDC28B	TGAGGAGCAGTCCGCTGCGT	7747	ccdc28b,tmem234,
    CDK11A	AAGAAAGAGAGCACGAACGT	8724	cdk11a,cdk11b,
    CDK11A	AAGAACAGGATAAAGCTCGC	8725	cdk11a,cdk11b,
    CDK11A	AACCACCCCAGCAAATGTCT	8726	cdk11a,cdk11b,
    CDK11B	TGCCTCCTCAGAGTTGCTGC	8728	cdk11a,cdk11b,
    CDK11B	ACATCACCGAACGATGAGAG	8729	cdk11a,cdk11b,
    CLCNKB	GAAGACCATGTTGGCGGGTG	9843	clcnka,clcnkb,
    CORT	TGTCCCGGCGCGCAGATTGC	10796	apitd1-cort,cort,
    CORT	CCCCCCAGCAATCTGCGCGC	10798	apitd1-cort,cort,
    DCDC2B	TTAATTCGGTCTGGTTCCTT	12487	dcdc2b,tmem234,
    DCDC2B	GGTAACCCAGCCCTCTCCAA	12488	dcdc2b,tmem234,
    EPHA2	CTACAATGTGCGCCGCACCG	16289	epha2,loc101927479,
    GJA9	GAGCCGCTATTTAAGTGCCA	20153	gja9,gja9-mycbp,
    GJA9	CAGAAATGTATGCTACGACC	20154	gja9,gja9-mycbp,
    GJA9	CCAATCGCAGTTAGCTGGCA	20155	gja9,gja9-mycbp,
    GTF2H2D	CACCTGTAATCCCAGCTACT	21553	c1orf86,cep104,cldn19,draxin,fam110d,hp1bp3,loc105376829,nol9,nudc,otud3,pik3cd,sepn1,slc25a34,smim12,tnfrsf9,zbtb8a,zmym1,
    HMGB4	ACTCGACAAAGCCCGATACC	22698	c1orf94,hmgb4,
    HNRNPCL1	CACTCTTGTTGTCAAGAAAT	22811	hnrnpcl1,hnrnpcl2,hnrnpcl3,hnrnpcl4,
    hsa-mir-1254-1	GTGCCACTGTACTCCAGCCT	23466	loc105376896,pqlc2,psmb2,
    hsa-mir-1273a	TCGCCCAGGCTGGAGTGCAG	23585	akr7l,bmp8a,c1orf109,dffa,fbxo44,loc105376835,loc105378653,man1c1,maneal,meaf6,nfyc-as1,slfnl1-as1,snip1,tnfrsf1b,zbtb8a,
    hsa-mir-1273a	GCGCCACTGCACTCCAGCCT	23586	bmp8a,bmp8b,c1orf109,cdc42,cep104,cfap74,dffa,ece1,fbxo44,h6pd,hes2,loc101928303,loc101928391,loc105376858,loc105378666,nfyc,nol9,pafah2,phactr4,rab42,rhbdl2,rnf19b,slc35e2,slfnl1-as1,smim12,smpdl3b,
    hsa-mir-1273d	TCACCCAGGCTGAAGTGCAG	23595	loc105376864,mir1273d,
    hsa-mir-1302-5	CAGGCATGAGCCACTGTGCC	23763	htr1d,nipal3,snip1,
    hsa-mir-200b	TCTTACTGGGCAGCATTGGA	24208	mir200a,mir200b,mir429,
    ...
    

    " *_Ref_WrongMatch" - list of genes with a unique perfect match to a reference sequence, but different sequence names.
    SeedName	SeedSeq	SeedID	Matched gene Symbol
    ABP1	GATCCAGCGCTGGACGTAGT	319	aoc1,
    ACPL2	GAAACCGTATCACCCAAAAC	520	pxylp1,
    ADC	GTACACGATGGTCTTGGAGG	886	azin2,
    ADORA3	TCCGCAAGGCTGACCGCTCC	1018	tmigd3,
    AGPHD1	ATCATGTTTCTGAAAGCCGC	1237	hykk,
    AGXT2L1	CAACGACTTAGCCTTACGCC	1276	etnppl,
    AGXT2L2	GCAGTACATGTACGATGAAC	1279	phykpl,
    ALS2CR8	TTGACTCTTCACCGTCATTA	1729	carf,
    ANAPC15	CCCACAGCTCCAAAGCATCG	1875	lrtomt,
    ANLN	AACTCACTCACCTCCGTAAA	2168	KIAA0895,
    AQPEP	GTTCCAGCTGGGACGCTAAC	2613	lvrn,
    ATP4B	TCTCTCCAGGGGTAACCTTA	3636	znf510,
    ATPBD4	CTCTATCGCCGAACCATAAG	3803	dph6,
    AZI1	TCACCTTGCCATCCAATGCC	3933	cep131,
    B3GNT1	CTGCCGAAAGCGCTCGTCGA	3980	B4GAT1,
    BCMO1	GTCGAACCAATGGTTGTATC	4355	bco1,
    BET3L	TAACTCAGCTAGCCGCCCCG	4437	trappc3l,
    C10orf114	AATCGCTAGCTGCGCATCCG	5036	CASC10,
    C10orf118	GCAGAACATCGAGTACCAAA	5039	ccdc186,
    C10orf129	AAGTTCCCAAGACAGCGCTC	5053	acsm6,
    C10orf137	ACTCTTTATGAGATCGTCTC	5057	edrf1,
    ...
    

    " *_Ref_NoMatch" - list of genes where theres is no perfect match of the seed sequnece to any sequence in the reference library.
    But there may well be incomplete, shortened or error containing matches.











    Random data

    For demonstration or test purposes you may generate a complete random dataset.

    From main meu select: Main menu | Utilities | Random data.

    A parameter dialog opens up:



    Set all parameters accordingly.
    The File name is appended with respective extension for the different filetypes.
    E.g. Barcode file: "Base_Name_Barcodes.txt", ....











    Preview FastQ

    To verify positions of barcodes/seeds within the FastQ sequences, it may be helpfult to preview a FastQ file.

    What sounds trivial may become complicated with Giga-Byte files or GZ-compressed archives.

    Select Main menu Utilities | Preview FastQ.

    Select a plain text FastQ file or a GZ-compressed file (assure GZ-files have file extension (".GZ").

    The first thousand lines from the selected file are loaded and shown in a generic table viewer.

    Instead of FastQs you may preview any kind of huge (GZ-compresed) text files.











    FastQ => FastA

    You might want to anlysze FastQ files with tools which only accept FastA files.

    This utility converts FastQ => FastA Format.

    It simply removes the Direction and Quality line and converts the sequence start sign "@" into the standard ">" sign.

    Select Main menu | Utilities | FastQ=0> FastA.
    Select one - or multiple - FastQ files (either plain text or GZ-copressed [ with extension ".gz"]).

    The new files will be created with extension .fasta.










    Split FastQ

    This options allows to split a FastQ file using the Sample BarCode pairs.
    Define in the main SeedScan tab: Select Main menu | Utilities | Split Fastq

    For each sample barcode pair - as defined in sample barcode and pairs file - individual FastQ files for forward and reverse sequences are created.
    (Originalname_S01.FastQ ... OriginlaName_SXX.FastQ).

    SeedScan analyzes the input files:
    Reading and Writing of multiple files is time consuming.
    To avoid excessive disk access, SeedScan buffers output and writes result file in 10MByte junks.
    Nevertheless, spliting a FastQ file into 30 samples (i.e. 30 barcode pairs, forward and reverse reads, 250bp) runs at ~25 Kilo-Sequences/s on a local spinning disk.
    Running the same task accross a network mounted disk (1-GigaBit network) reduces the processing speed to ~1 Kilo sequences/s.











    ClonTracer utilities

    Here you may find some options to filter / analyse ClonTracer List files.





    Agglomerate error barcodes

    During a ClonTracer run a library of ClonTracer Barcodes is extracted from the analyzed sequences.
    The library will contain:
    Obviously, there will be:
    At an expected sequencing error rate from 1%-5% we could expect 0.3 to even 2 sequencing errors within a 30 BP ClonTracer Barcode.

    Thus we could apply a filtering strategy:

    Filtering a ClonTracer barcode list with 10 million sequences and 1 be/ins/del requires less than a secod.
    Filtering allowing 2 errors (2xbe /2xins / 2xdel / ins+be / del+be / ...) may require minutes.

    Select SeedScan Main menu | Utilities | CloncTracer Utilities | Agglomerate error barcodes.

    In the file selection dialog, select 1 or multiple ClonTracer Barcode lists.

    A parameter dialog opens up:


    Define the parameters:
    Seed abundance Minimum number of counts for the high abundant barcodes
    Error abundance Maximumn number of counts for the potential error barcodes
    Max base exchanges Max Number of allowed base exhcange errors
    Max insertions and BE on inserted Max Number of allowed insertions,
    and optional number of base exchange errors on the "inserted" barcodes
    Max deletions and BE on inserted Max Number of allowed deletions,
    and optional number of base exchange errors on the "deleted" barcodes

    It may be a wise strategy to:

    Filtered CT-Barcode lists are created as new files like "OldFileName_new".
    Results and statistics are shown in SeedScan's log window:

    SeedScan - ClonTracer utilities - Agglomerate error sequences
    ====================================================================
    Ver. 1.001a from 09.07.2018, C.Schwager@DKFZ.DE
    
    Parsing 1 data files ...
    #	Name	#-TC-BC	#-Seq
    1	U:\User\Project AN4-ClonTrac\HiSeq4000_11120\180607_ST-K00207_0156_BHVLVMBBXX\AS-243465-LR-35510\fastq\Result_CTCounts_1.txt	650509	14116053
    
    Filtering parameters:
    Seed abundance:	1000
    Error abundance:	10
    Max Base Exchanges:	1
    Max Insertions and BE on inserted:	1
    Max Deletions and BE on deleted:	0
    
    Processing file 1:	U:\User\Project AN4-ClonTrac\HiSeq4000_11120\180607_ST-K00207_0156_BHVLVMBBXX\AS-243465-LR-35510\fastq\Result_CTCounts_1.txt
    	# of Base Exchanges:	1
    	#Seeds:	369	# Merged BC:	12252	# Merged Seq	39144	ET:	0.031s
    	# of Insertions:	1 and # of BE ont the inserted:	0
    	#Seeds:	333	# Merged BC:	2483	# Merged Seq	8528	ET:	0.031s
    	#Total Seeds:	702	# Total Merged BC:	14735	# Total Merged Seq	47672	ET:	0.078s
    	REsultfile:	U:\User\Project AN4-ClonTrac\HiSeq4000_11120\180607_ST-K00207_0156_BHVLVMBBXX\AS-243465-LR-35510\fastq\Result_CTCounts_1.txt_new
    	from 650509 CTBC,14116053 Seqs to 635774 CTBC,14116053 Seqs; 2.27%
    Done.
    

    Additionally, the Log is saved to a file "FirstListFileName_Agg_Err_Log.txt" in the folder where the list files are found.











    Filter WSWS

    ClonTracer barcodes are synthesized in such a way that hybridized ClonTracer-Target complexes forms an alternating series of
    Consequnently, extracted ClonTacer barcodes MUST reflect this pattern.
    Extracted barcodes not showing the WSWSWSWS pattern are either synthesis or cloning artifacts or most probably sequening errors.

    It may be recommended to remove such barcodes, especially if multiple errors occur.











    Build consensus matrix

    This option allows to interactively build a filtered consensus matrix from all selected ClonTracer barcode lists.

    You may define filters for:

    Barcodes passing both criteria are added to the consensus matrix.
    Samples which do not contain the particular CT-BC will get "0" for the respective data cell,
    all others will get the respective samples/barcodes count number.

    Select SeedScan main menu | Utilities | ClonTracer Utilities | Build consensus matrix.

    Select at lest two ClonTracer barcode list files you want to merge into a consensus matrix.

    A parameter dialog opens up:


    Adjust the parameters:
    Min count The minimum count number found in at least one of the selected sample for the respective barcodes
    # > 0Minimum number of samples with count > 0 for the respective barcode

    As higher these filter parameters, as lower the nubmer of passed barcodes,
    but, also, as higher the probability the filtered are useful for subsequent analysis and not only data noise.

    Click the OK-button.

    SeedScan counts the passing ClonTracer barcodes and shows the result in the Message log:


    Now you may modify the filter parameters to see the results for higer / less stringent filtering parameters.

    As soons as you are satisfied with filter settings, click the Cancel-button.

    Now the matrix is build and saved, applying the last selected filters.











    Base composition models

    Is there a difference in the ClonTracer barcode library between differently handled/treated samples?

    One way to visualize the overall composition could be to visualize position dependant base-composition as a "Sequence logo":

    Sequence logos are directly extracted from ClonTracer barcode lists (containg hundred thousands of barcode sequences).
    But you may also use other file types:
    Sequence logos for muplitple samples are generated and collected in an html file.

    For each barcode list two sequence logos are generated:
    CT-BC logo is computed from all identified unique barcodes

    Abundance count weighted logo is computed by weighting each individual barode by it's abundance.
    Thus highly abundant barcodes may alter the overall base compostion sequence model.


    Select SeedScan main menu | Utilities | ClonTracer Utilities | Base composition.

    Select one or mutiple ClonTracer barcode list you want to analyze.

    The resulting html file contains:

    for both methods:


    And there may be substantial differences between samples and methods.






    Merge ClonTracer matrices

    This tool may be used to merge consensus matrices build
    with SeedScan (see above).
    The problem here is: result matrices from different ClontTracer analyses may contain differint BC-sequneces.
    Thus, matrices can not be appended row wise, but they must be sorted to combine common CT-BCs, or add "spcers" for unique CT_BCs coming from the individual matrices.

    Although designed for ClonTracer result matrices, But you may merge any kind of isomorphic data matrices.


    SeedScan expects tab delimitd text files with:
    Select SeddScan | Main menu | Utilities | ClonTracer | Merge tables|.

    A parameter dialog opens up:


    In a first step, SeedScan extracts all unique keys (ClonTracer barcode sequences) from all defined data files.
    Next, matching data lines from all selected data matrices are placed into the merged matrix.
    Missing records from a single data file are replaced with "0" cells.
    .
    The final result file contains the columns:
    The final result file contains the rows: