SeedScan - A generic tool to evaluate genome-wide genetic targeting screens

SeedScan - The Graphical User Interface

When you start SeedScan for the first time, all data fields are empty:

Fill in the required files and search parameters.

All red fields HAVE to be updated/specified for each analysis.

Green fields have to be updated/specified when different libraries - barcodes/pairs/seeds - are used.

White fields may be set once and remain unchanged forever.

Multiple sequence files

Define the FastQ files to be analyzed:

Drag the files from Windows File Explorer into the respective Sequence file field or

Click the "..." buttons. Use the file dialog to find and select the respective file or

Type or copy paste a file specification

FastQ files may be supplied as plain ASCII text files or as gzip-compressed files.
When gzip-compressed files are used, ensure that the file names have the extension ".gz".
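The extension rule above can be illustrated with a minimal sketch. It is not SeedScan's own reader; the function names open_fastq and read_fastq are hypothetical, and it assumes standard 4-line FastQ records.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def open_fastq(path: str) -> TextIO:
    """Open a FastQ file as text; gzip is chosen by the ".gz" extension."""
    if path.lower().endswith(".gz"):
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "rt", encoding="ascii")


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) for each 4-line FastQ record."""
    with open_fastq(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # "+" separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality
```

A compressed file without the ".gz" extension would be read as plain text by this sketch, which is why the extension matters.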

Next, fill in the structure of the constructs:

Base position of bar code start within a sequenced construct
Length of Bar code in bases
In case you have barcodes of different lengths, check the Variable barcode length field.
SeedScan will get the individual barcode's length from the barcode list file.
Base position of seed start within the construct.
If required, check the Allow 1 base exchange field.
This may increase the number of identified barcodes, as a single base exchange/insertion/deletion within a sequence against the respective barcode is tolerated.
But this may also generate wrong barcode calls.
Length of the seeds.
If required, check the Allow 1 base exchange field.
This may increase the number of identified seeds, as a single base exchange/insertion/deletion within a target sequence against the respective seed is tolerated.
But this may also generate wrong seed calls.
Set the Shift seed by +/- BP field to a small number (1..5) to tolerate insertions/deletions in front of the target's seed sequence against the expected position.
This may increase the number of identified seeds, but may generate wrong seed calls, especially if the shift range is defined very wide (e.g. >10 bp).
Define the result file, where final seed count matrix is saved.
In case you run multiple instances of SeedScan in parallel, take care to use DIFFERENT result file names for the different instances. Otherwise the result files are overwritten.
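How the start position, the length, the shift range and the 1 base exchange option interact may be easier to see in code. The following is only a sketch under assumptions: the function find_seed is hypothetical, positions are taken as 1-based, offsets are tried in the order 0, -1, +1, ..., only base exchanges (no insertions/deletions) are tolerated, and the library is scanned linearly instead of using SeedScan's hashed seed libraries.

```python
from typing import Dict, Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_seed(read: str, seeds: Dict[str, str], start: int, length: int,
              max_shift: int = 0, allow_exchange: bool = False
              ) -> Optional[Tuple[str, int]]:
    """Return (seed name, shift) for the first hit, or None.

    seeds maps seed sequence -> seed name; start is 1-based (assumption).
    """
    shifts = [0]
    for s in range(1, max_shift + 1):
        shifts += [-s, s]
    for shift in shifts:
        begin = start - 1 + shift
        window = read[begin:begin + length]
        if begin < 0 or len(window) < length:
            continue  # window would fall outside the read
        if window in seeds:                      # exact match
            return seeds[window], shift
        if allow_exchange:                       # one base exchange tolerated
            hits = [name for seq, name in seeds.items()
                    if len(seq) == length and hamming(seq, window) == 1]
            if len(hits) == 1:                   # ignore ambiguous calls
                return hits[0], shift
    return None
```

The wider the shift range, the more windows are compared against the library, which is one way to see why a very wide range raises the risk of wrong seed calls.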

You may save the result at regular intervals, e.g. to rescue a partial result in case the analysis gets interrupted.
Define the number of sequence pairs after which the result file is created/updated automatically.
E.g. define "10000000" to save the result each time another 10 million sequences have been analyzed.
Don't save too often: saving, especially on network-mounted drives, may consume a considerable amount of time.

Click the View button to see the present result matrix.

Bar codes

Supply:

Barcode file: A tab delimited text file containing a list of barcode sequences used for this analysis
Ideally, specify only the barcodes used in this analysis.
Additional barcodes - not used for multiplexing in the particular analysis - will not cause problems but waste processing time.
For details about file format => see Data files.
To preview the selected barcode/pairs list click the View button.
Barcode pairs: A tab delimited text file containing a list of barcode pairs (and an optional target library index).
For details about file format => see Data files.

Seed files

Seed file: A tab delimited text file containing a list of seed sequences used for this analysis.
You may drag multiple seed sequence files into this field, one after the other.
SeedScan will scan each barcode pair's target sequence against the respective library as defined in the pairs list.
Ideally, specify only the seeds used in this analysis.

Additional seed sequences - not used in the particular analysis - might generate erroneous seed calls and waste processing time.
For details about file format => see Data files.
To preview the selected seed list click the View button.
Seed start: The position of the seed sequence within the sequences of the seed list file.

Run the analyses

Now, after all parameters have been checked and correctly adjusted, you may start the analysis.

Click the Direct Run button.

The Library files are loaded.
Loading and construction of the 64-bit sorted, hashcode-transformed seed libraries will take a while.
Transformation progress is indicated in the status line (bottom of window).

Next, the scan is started.
Scan progress is indicated by regular updates of the bar chart and in the program window's title bar.
You may click the Minimize button to reduce SeedScan's window but still see the scan progress:

We can see:

A total of 289.7 million reads have already been scanned.
176.7 million seed <=> target matches have been identified.
The average analysis speed is ~51,000 analyzed sequences/second.
The total elapsed analysis time is ~94:15 min.
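The numbers in this example are consistent with each other, as a short calculation shows. The variable names are only illustrative; the values are those listed above. It yields about 51,200 reads per second and an identified fraction of about 61%.

```python
reads_scanned = 289.7e6          # reads scanned so far
seeds_identified = 176.7e6       # reads with an identified seed
elapsed_s = 94 * 60 + 15         # 94:15 min expressed in seconds

reads_per_second = reads_scanned / elapsed_s
identified_fraction = seeds_identified / reads_scanned

print(f"{reads_per_second:,.0f} reads/s")
print(f"{identified_fraction:.1%} of reads with identified seed")
```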

At any time, you may:

View any of the diagnostic plots
For details see Diagnostic plots.
Pause the process by clicking the Pause button, e.g. to view an intermediate result matrix.
Resume a paused process by clicking the Resume button.
Cancel the ongoing analysis by clicking the Cancel button.
This will stop the analysis and save the partial data generated up to the time of cancellation.

The analysis finishes as soon as:

all data from the FastQ files have been read and scanned, or
the analysis is canceled by the user.

ClonTracer - Run the analyses

For ClonTracer runs, remember the expected structure:

Now, after parameters have been checked and correctly adjusted:

Sequence files, Forward/reverse
Barcode- / Pairs-file
Barcode Start/Length/Variable barcode length
Tag seq start => CT-BC start
Tag length => CT-BC length
Index primer => The last few bases of the ClonTracer 3' vector sequence
(i.e. the next few bases following the ClonTracer barcode).
Index primer shift (+/- BP range): search range for the index primer,
compensating for bases falsely inserted/deleted by sequencing.
Result file

you may start the analysis.

Click the ClonTracer Run button.

The Library files are loaded.

Next, the scan is started.
Scan progress is indicated by regular updates of the bar chart and in the program window's title bar.
You may click the Minimize button to reduce SeedScan's window but still see the scan progress:

At any time, you may:

View any of the diagnostic plots
For details see Diagnostic plots.
Pause the process by clicking the Pause button, e.g. to view an intermediate result matrix.
Resume a paused process by clicking the Resume button.
Cancel the ongoing analysis by clicking the Cancel button.
This will stop the analysis and save the partial data generated up to the time of cancellation.

The analysis finishes as soon as:

all data from the FastQ files have been read and scanned, or
the analysis is canceled by the user.

For each sample a ClonTracer barcode list is generated, containing all CT-Barcodes and their abundance for this specific sample.

A tab-delimited text file
one CT-Barcode per line
tab-delimited:
- Line count
- Barcode - numeric representation
- Barcode sequence
- Abundance of this particular CT-Barcode

Additionally a consensus matrix will be generated.
By default, all CT-Barcodes which show up in at least one sample with abundance >= 100 will be collected into the consensus matrix.
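The list format and the default selection rule can be sketched as follows. This is an illustration only: the function names are hypothetical, and the sketch assumes four tab-delimited columns in the order given above, skipping any line whose fourth field is not a number (e.g. a possible header line).

```python
from typing import Dict, Iterable, Set


def parse_ct_list(lines: Iterable[str]) -> Dict[str, int]:
    """Map barcode sequence -> abundance for one sample's barcode list."""
    counts: Dict[str, int] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 4 or not fields[3].isdigit():
            continue  # skip blank lines or a possible header line
        # fields: line count, numeric representation, sequence, abundance
        counts[fields[2]] = int(fields[3])
    return counts


def consensus_barcodes(samples: Iterable[Dict[str, int]],
                       min_abundance: int = 100) -> Set[str]:
    """Barcodes reaching min_abundance in at least one sample."""
    return {seq for sample in samples
            for seq, n in sample.items() if n >= min_abundance}
```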

Use ClonTracer Utilities to:

Filter potential error CT-Barcode sequences
Create a custom filtered consensus matrix
Generate Base-Composition logos
Generate histograms
Copy sequences from CT-BC list for e.g. Venn analyses.

Result data

Finally, as soon as the analysis has terminated, a set of result files is automatically generated:

Result matrix: A tab delimited table containing count numbers for all samples/seeds specified in the analysis.
For details about file format => see Data files.
("basefilename_Matrix.txt")
Summary: A text file summarizing process parameters and global counts (~values shown in the bar chart)
("basefilename_Summary.txt")
Failed-BC1: A text file with the first ~1000 sequence pairs where no valid Barcode-1 was detected.
May be useful for error analysis
("basefilename_Failed_BC1.txt")
Failed-BC2: A text file with the first ~1000 sequence pairs where a valid Barcode-1 was detected, but no listed Barcode-2
May be useful for error analysis
("basefilename_Failed_BC2.txt")
Failed-BCPairs: A text file with the first ~1000 sequence pairs where Barcode-1 as well as Barcode-2 were within the defined list, but pairing is not "allowed".
May be useful for error analysis
("basefilename_Failed_BCPairs.txt")
Failed-IndexPrimer: A text file with the first ~1000 sequence pairs where the index primer was not found.
Report: A short report collecting process parameters and diagnostic plots.
("basefilename_Report.htm")

All result files are generated in the location specified for the Result file.

The different result file names are composed of the base file name - as defined by the user - and a suffix specific to the data type, as listed above.

View menu

From the View menu you may open any of the diagnostic plots:

Utilities menu

From the Utilities menu you may select extended options or additional functionality not directly related to seed scanning.

Rescue no-BC1 sequences

Due to the implementation of the base calling software on Illumina HTS systems, highly similar constructs may result in a considerable fraction of miscalled sequences.

To circumvent this problem, it may be helpful to add 20%-50% of random non-sense sequences to the sequencing library.
This should increase the number of correctly called sequences, but on the other hand it reduces the fraction of usable sequences.

In the special case of our barcoded seed library constructs, the first ~35 bases are mostly common (~30 different versions), followed by ~60000 different seeds.

Under this condition even a transcriptome/exome/genome/methylome/... library might act as a "non-sense" library.
Thus you might use the capacity of the "non-sense sequences" (~100 million reads on a HiSeq4000) to run additional meaningful samples.

Check the Rescue no-BC1 sequences option.

All Sequence pairs where no valid Barcode-1 was found will be written into 2 new FastQ files.
These files may be used subsequently for alignment to a genome or whatever else.
The new FastQ files will be created with names:

"Base_Result_Name_Rescued1.fastq" for the forward reads
"Base_Result_Name_Rescued2.fastq" for the reverse reads

Writing these files will slow down the analysis and create big files (~200 GByte).
Thus, activate this option only if you really want to reuse the "non-sense" sequences.

Scan seeds <=> Seed-Lib

The accuracy of seed identification depends, besides sequencing quality, on the uniqueness of the seed sequences.

To check this, you may scan each individual seed sequence against the complete seed library.
In an ideal case you would find only unique matches,
even if 1-error searching is enabled.

One seed file may be checked in a single analysis.
The seed file is taken from the Seed file text field. If two (or more) files are specified, the first one is used.
Check the Allow 1 base exchange box to enable 1-error matches.

SeedScan generates a brief summary and two result files:

The histogram indicates the size distribution of the cross-matching clusters.
As expected, by far most of the seeds are singular.
But there are also 10 clusters where 28 seeds' targets share the same recognition sequence.

Two data files are generated:

"*_nonredundant_Lib1" - a list of unique seeds.
"*_Redundant_Lib1" - a list of clusters, each cluster containing a set of seeds sharing the same recognition sequence:

The first three columns show Number, ID and Name of the initial seed building a cluster.
Columns 4 and 5 show ID and name of the other seeds with identical sequence.
Column 6 shows the seeds' sequences.
Clusters are separated by one empty line.
In the example, cluster 120 also contains a 1-error matching seed:
CBWD1... / CBWD6 - AATTATTCTAGGAAGTCGC / ATTTATTCTAGGAAGTCGC
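The idea of the cross-scan can be sketched in a few lines. This is not SeedScan's implementation: the function cross_match and the seed names seed_A/seed_B/seed_C are hypothetical, only base exchanges count as a 1-error match, and the pairwise comparison is quadratic, so it is suitable for small illustrations only. The two sequences of the example above differ in exactly one base.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cross_match(seeds: List[Tuple[str, str]], allow_exchange: bool = False
                ) -> Dict[str, List[str]]:
    """Map each seed name to the names of the other seeds it cross-matches."""
    matches: Dict[str, List[str]] = defaultdict(list)
    for i, (name_a, seq_a) in enumerate(seeds):
        for name_b, seq_b in seeds[i + 1:]:
            if len(seq_a) != len(seq_b):
                continue
            distance = hamming(seq_a, seq_b)
            if distance == 0 or (allow_exchange and distance == 1):
                matches[name_a].append(name_b)
                matches[name_b].append(name_a)
    return dict(matches)


library = [("seed_A", "AATTATTCTAGGAAGTCGC"),   # sequences from the example
           ("seed_B", "ATTTATTCTAGGAAGTCGC"),
           ("seed_C", "GGGGGGGGGGGGGGGGGGG")]   # hypothetical unrelated seed
print(cross_match(library, allow_exchange=True))
```

Seeds that do not appear in the returned mapping are singular, i.e. they would end up in the non-redundant list.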

Scan seeds <=> Reference library

Another question arises:

do the seed sequences uniquely match the desired target sequence, or
multiple sequences, or
none at all?

Therefore, you may scan all seeds against a reference sequence library and find out for each individual seed sequence:

do I get a unique match on the reference library (e.g. seed sequence => a single RefSeq gene)
is the name of the seed sequence the same as the reference's name (or maybe a gene alias name)
do I get multiple matches, and what are the references' gene names

Example:
You have a library of gene silencing constructs for human genes.
Each construct contains a guide sequence (the seed sequence) which shall target a single gene.

Download a list of all genes as a multi-FASTA file - e.g. RefSeq RNA for Homo sapiens from NCBI's FTP server:

ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/RNA/

Download and uncompress "rna.fa.gz".

Process your seed library so that each line contains, tab delimited:

Gene symbol
seed sequence

E.g.:

Seed_name	Seed_seq
A1BG	GTCGCTGAGCTCCGATTCGA
A1BG	ACCTGTAGTTGCCGGCGTGC
A1BG	CGTCAGCGTCACATTGGCCA
A1CF	CGCGCACTGGTCCAGCGCAC
A1CF	CCAAGCTATATCCTGTGCGC
A1CF	AAGTTGCTTGATTGCATTCT
A2M	CGCTTCTTAAATTCTTGGGT
A2M	TCACAGCGAAGGCGACACAG
A2M	CAAACTCCTTCATCCAAGTC
A2ML1	AAATTTCCCCTCCGTTCAGA
...

Select Main menu | Utilities | Seeds => RefDB.

A Parameter dialog opens up:

Fill in required values for seed-list / reference-db files and seed length.
Double click the respective fields to open a file selection dialog
or more easily just drag files from Windows Explorer into the respective fields.

To check gene alias names too, supply a file mapping "official" gene symbols to commonly used gene aliases in the Gene => Alias file field.
Such a file may be downloaded from NCBI's FTP server.
E.g. "Homo_sapiens_gene_info":

#tax_id	GeneID	Symbol	LocusTag	Synonyms	dbXrefs	chromosome	map_location	description	type_of_gene	Symbol_from_nomenclature_authority	Full_name_from_nomenclature_authority	Nomenclature_status	Other_designations	Modification_date	Feature_type
9606	1	A1BG	-	A1B|ABG|GAB|HYST2477	MIM:138670|HGNC:HGNC:5|...	19	19q13.43	alpha-1-B glycoprotein	protein-coding	A1BG	alpha-1-B glycoprotein	O	alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA	20170903	-
9606	2	A2M	-	A2MD|CPAMD5|FWP007|S863-7	MIM:103950|HGNC:HGNC:7|...	12	12p13.31	alpha-2-macroglobulin	protein-coding	A2M	alpha-2-macroglobulin	O	alpha-2-macroglobulin|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|alpha-2-M	20170903	-
9606	3	A2MP1	-	A2MP	HGNC:HGNC:8|Ensembl:ENSG00000256069	12	12p13.31	alpha-2-macroglobulin pseudogene 1	pseudo	A2MP1	alpha-2-macroglobulin pseudogene 1	O	pregnancy-zone protein pseudogene	20170402	-
9606	9	NAT1	-	AAC1|MNAT|NAT-1|NATI	MIM:108345|HGNC:HGNC:7645|...	8	8p22	N-acetyltransferase 1	protein-coding	NAT1	N-acetyltransferase 1	O	arylamine N-acetyltransferase 1|N-acetyltransferase 1 (arylamine N-acetyltransferase)|N-acetyltransferase type 1|arylamide acetylase 1|monomorphic arylamine N-acetyltransferase	20170906	-
9606	10	NAT2	-	AAC2|NAT-2|PNAT	MIM:612182|HGNC:HGNC:7646|...	8	8p22	N-acetyltransferase 2	protein-coding	NAT2	N-acetyltransferase 2	O	arylamine N-acetyltransferase 2|N-acetyltransferase 2 (arylamine N-acetyltransferase)|N-acetyltransferase type 2|arylamide acetylase 2	20170903	-

The tab-delimited file should contain:

one gene per line
one column containing the gene symbol
one column containing multiple aliases/synonyms separated by a special divider character

In the Gene/Alias column divider field define (separated by commas (",")):

ID of column containing the gene Symbol.
In the example column 3
ID of the column containing the aliases/synonyms.
In the example column 5
Divider
In the example pipe ("|")
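To make the three settings concrete, here is a minimal sketch of how such a field could be applied to the gene info table. It assumes the field is entered literally as "3,5,|" and that column IDs are 1-based; the function names are hypothetical and not part of SeedScan.

```python
from typing import Dict, Iterable, Set, Tuple


def parse_divider_field(text: str) -> Tuple[int, int, str]:
    """Split e.g. "3,5,|" into symbol column, alias column and divider."""
    symbol_col, alias_col, divider = text.split(",", 2)
    return int(symbol_col), int(alias_col), divider


def load_aliases(lines: Iterable[str], field: str) -> Dict[str, Set[str]]:
    """Map each gene symbol (lower case) to its set of aliases."""
    symbol_col, alias_col, divider = parse_divider_field(field)
    aliases: Dict[str, Set[str]] = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # header or empty line
        cols = line.rstrip("\r\n").split("\t")
        names = cols[alias_col - 1].split(divider)   # column IDs are 1-based
        aliases[cols[symbol_col - 1].lower()] = {
            n.lower() for n in names if n != "-"}    # "-" marks "no alias"
    return aliases
```

Applied to the first data row of the example, the symbol "A1BG" would be mapped to the aliases A1B, ABG, GAB and HYST2477.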

Click the OK-button to start the scan.

Each individual reference sequence is scanned against the seeds, accepting

perfect matches
forward as well as reverse (reverse/complement) sequence

For each seed the number of matching reference sequences is counted, and the names of the seed and of the matching references are compared.
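Perfect matching on both strands can be illustrated as follows. This is a sketch using plain substring search, with hypothetical function names and toy reference sequences; SeedScan's own search strategy is not described here and may differ.

```python
from typing import Dict, List

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse/complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def matching_references(seed: str, references: Dict[str, str]) -> List[str]:
    """Names of reference sequences containing the seed on either strand."""
    seed = seed.upper()
    rc = reverse_complement(seed)
    return [name for name, ref in references.items()
            if seed in ref.upper() or rc in ref.upper()]


# Toy references (hypothetical): geneA matches forward, geneB reverse.
references = {"geneA": "TTTTACGTGGAAAA",
              "geneB": "CCCCTCCACGTCCCC",
              "geneC": "GGGGGGGG"}
print(matching_references("ACGTGGA", references))
```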

Several result files are generated:

"*_Ref_UniqueMatch" - the list of genes which match their target uniquely on a reference sequence which has the same "name" (or an alias) as the seed.

"*_Ref_MultipleMatches" - a list of seeds which match multiple reference sequences.
Each line contains (tab delimited) Seed name, Seed sequence, Seed-ID, List of names from matching sequences
E.g.:

SeedName	SeedSeq	SeedID	Matched gene symbols
ANKRD65	CACCCGCGAGTGTCCGTGCC	2137	ankrd65,loc105378585,
ANKRD65	CCGCCTGGCACGGACACTCG	2138	ankrd65,loc105378585,
APITD1	GGCAGCAGTTCACTATACTG	2424	apitd1,apitd1-cort,
APITD1	GAGCTGACTTTCCGACAGTG	2425	apitd1,apitd1-cort,
APITD1-CORT	GCAGCGATTCTCTTACCAAC	2427	apitd1,apitd1-cort,
APITD1-CORT	GGCAGCAGTTCACTATACTG	2428	apitd1,apitd1-cort,
APITD1-CORT	GAGCTGACTTTCCGACAGTG	2429	apitd1,apitd1-cort,
ATAD3B	GTACAGGCCCCGGTTCTTCT	3413	atad3b,atad3c,
BAI2	GCGCCGCTTCCGCATGTGCC	4115	adgrb2,bai2,
BAI2	CGAAGTTCTTGTCGAAGTGC	4116	adgrb2,bai2,
BAI2	CTACCTGCGCTTCAACCGCC	4117	adgrb2,bai2,
BMP8B	GGTCATGAGCTTCGTTAACA	4623	bmp8b,loc105378951,
C1orf151-NBL1	TAGCCCGATTCATGCCATCT	5776	minos1-nbl1,nbl1,
C1orf151-NBL1	CGGCAAGGAGCCTAGTCACG	5777	minos1-nbl1,nbl1,
C1orf151-NBL1	TCGCACCAGGCACTCTTATC	5778	minos1-nbl1,nbl1,
CCDC28B	TGAGGAGCAGTCCGCTGCGT	7747	ccdc28b,tmem234,
CDK11A	AAGAAAGAGAGCACGAACGT	8724	cdk11a,cdk11b,
CDK11A	AAGAACAGGATAAAGCTCGC	8725	cdk11a,cdk11b,
CDK11A	AACCACCCCAGCAAATGTCT	8726	cdk11a,cdk11b,
CDK11B	TGCCTCCTCAGAGTTGCTGC	8728	cdk11a,cdk11b,
CDK11B	ACATCACCGAACGATGAGAG	8729	cdk11a,cdk11b,
CLCNKB	GAAGACCATGTTGGCGGGTG	9843	clcnka,clcnkb,
CORT	TGTCCCGGCGCGCAGATTGC	10796	apitd1-cort,cort,
CORT	CCCCCCAGCAATCTGCGCGC	10798	apitd1-cort,cort,
DCDC2B	TTAATTCGGTCTGGTTCCTT	12487	dcdc2b,tmem234,
DCDC2B	GGTAACCCAGCCCTCTCCAA	12488	dcdc2b,tmem234,
EPHA2	CTACAATGTGCGCCGCACCG	16289	epha2,loc101927479,
GJA9	GAGCCGCTATTTAAGTGCCA	20153	gja9,gja9-mycbp,
GJA9	CAGAAATGTATGCTACGACC	20154	gja9,gja9-mycbp,
GJA9	CCAATCGCAGTTAGCTGGCA	20155	gja9,gja9-mycbp,
GTF2H2D	CACCTGTAATCCCAGCTACT	21553	c1orf86,cep104,cldn19,draxin,fam110d,hp1bp3,loc105376829,nol9,nudc,otud3,pik3cd,sepn1,slc25a34,smim12,tnfrsf9,zbtb8a,zmym1,
HMGB4	ACTCGACAAAGCCCGATACC	22698	c1orf94,hmgb4,
HNRNPCL1	CACTCTTGTTGTCAAGAAAT	22811	hnrnpcl1,hnrnpcl2,hnrnpcl3,hnrnpcl4,
hsa-mir-1254-1	GTGCCACTGTACTCCAGCCT	23466	loc105376896,pqlc2,psmb2,
hsa-mir-1273a	TCGCCCAGGCTGGAGTGCAG	23585	akr7l,bmp8a,c1orf109,dffa,fbxo44,loc105376835,loc105378653,man1c1,maneal,meaf6,nfyc-as1,slfnl1-as1,snip1,tnfrsf1b,zbtb8a,
hsa-mir-1273a	GCGCCACTGCACTCCAGCCT	23586	bmp8a,bmp8b,c1orf109,cdc42,cep104,cfap74,dffa,ece1,fbxo44,h6pd,hes2,loc101928303,loc101928391,loc105376858,loc105378666,nfyc,nol9,pafah2,phactr4,rab42,rhbdl2,rnf19b,slc35e2,slfnl1-as1,smim12,smpdl3b,
hsa-mir-1273d	TCACCCAGGCTGAAGTGCAG	23595	loc105376864,mir1273d,
hsa-mir-1302-5	CAGGCATGAGCCACTGTGCC	23763	htr1d,nipal3,snip1,
hsa-mir-200b	TCTTACTGGGCAGCATTGGA	24208	mir200a,mir200b,mir429,
...

"*_Ref_WrongMatch" - a list of genes with a unique perfect match to a reference sequence, but with a different sequence name.

SeedName	SeedSeq	SeedID	Matched gene Symbol
ABP1	GATCCAGCGCTGGACGTAGT	319	aoc1,
ACPL2	GAAACCGTATCACCCAAAAC	520	pxylp1,
ADC	GTACACGATGGTCTTGGAGG	886	azin2,
ADORA3	TCCGCAAGGCTGACCGCTCC	1018	tmigd3,
AGPHD1	ATCATGTTTCTGAAAGCCGC	1237	hykk,
AGXT2L1	CAACGACTTAGCCTTACGCC	1276	etnppl,
AGXT2L2	GCAGTACATGTACGATGAAC	1279	phykpl,
ALS2CR8	TTGACTCTTCACCGTCATTA	1729	carf,
ANAPC15	CCCACAGCTCCAAAGCATCG	1875	lrtomt,
ANLN	AACTCACTCACCTCCGTAAA	2168	KIAA0895,
AQPEP	GTTCCAGCTGGGACGCTAAC	2613	lvrn,
ATP4B	TCTCTCCAGGGGTAACCTTA	3636	znf510,
ATPBD4	CTCTATCGCCGAACCATAAG	3803	dph6,
AZI1	TCACCTTGCCATCCAATGCC	3933	cep131,
B3GNT1	CTGCCGAAAGCGCTCGTCGA	3980	B4GAT1,
BCMO1	GTCGAACCAATGGTTGTATC	4355	bco1,
BET3L	TAACTCAGCTAGCCGCCCCG	4437	trappc3l,
C10orf114	AATCGCTAGCTGCGCATCCG	5036	CASC10,
C10orf118	GCAGAACATCGAGTACCAAA	5039	ccdc186,
C10orf129	AAGTTCCCAAGACAGCGCTC	5053	acsm6,
C10orf137	ACTCTTTATGAGATCGTCTC	5057	edrf1,
...

"*_Ref_NoMatch" - a list of genes where there is no perfect match of the seed sequence to any sequence in the reference library.
But there may well be incomplete, shortened or error-containing matches.
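The assignment of a seed to one of the four result files can be summarized as a small decision function. The sketch below is an interpretation of the descriptions above, not SeedScan's code; the function classify_seed and the alias table passed to it are hypothetical.

```python
from typing import Dict, Iterable, Set


def classify_seed(seed_name: str, matched_names: Iterable[str],
                  aliases: Dict[str, Set[str]]) -> str:
    """Return the result-file category for one seed."""
    matched = {m.lower() for m in matched_names}
    if not matched:
        return "NoMatch"
    if len(matched) > 1:
        return "MultipleMatches"
    (reference,) = matched                      # exactly one reference name
    seed = seed_name.lower()
    same_gene = (seed == reference
                 or seed in aliases.get(reference, set())
                 or reference in aliases.get(seed, set()))
    return "UniqueMatch" if same_gene else "WrongMatch"
```

With an alias file supplied, a seed such as ABP1 matching only aoc1 would move from the wrong-match list to the unique-match list, provided the alias table links the two names.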

Sequence file (forward read)
Sequence file (reverse read) - optional
Sample Barcode file
Sample Barcode pairs (optional)
Sample Barcode start
Sample Barcode length
Variable Sample Barcode length
Allow 1 base exchange (in Sample Barcode)

Select Main menu | Utilities | Split Fastq

For each sample barcode pair - as defined in sample barcode and pairs file - individual FastQ files for forward and reverse sequences are created.
(Originalname_S01.FastQ ... Originalname_SXX.FastQ).

SeedScan analyzes the input files:

Sequences without BC1/BC2 or invalid pairs are skipped
Matching sequences (BC1 AND BC2 AND valid pairs) are written to the respective new FastQ file

Reading and writing of multiple files is time consuming.
To avoid excessive disk access, SeedScan buffers the output and writes the result files in 10 MByte chunks.
Nevertheless, splitting a FastQ file into 30 samples (i.e. 30 barcode pairs, forward and reverse reads, 250 bp) runs at ~25 kilo-sequences/s on a local spinning disk.
Running the same task across a network-mounted disk (1-Gigabit network) reduces the processing speed to ~1 kilo-sequences/s.

ClonTracer utilities

Here you may find some options to filter / analyse ClonTracer List files.

Agglomerate error barcodes

During a ClonTracer run a library of ClonTracer Barcodes is extracted from the analyzed sequences.
The library will contain:

all unique ClonTracer Barcodes
variants from ClonTracer barcodes created by sequencing errors:
- Base exchanges
- Insertions
- Deletions

Obviously, there will be:

a high probability for error sequences from highly abundant Barcodes
a high probability that very low abundant Barcodes are in reality error variants from the highly abundant ones.

At an expected sequencing error rate of 1%-5% per base we could expect 0.3 to 1.5 (i.e. up to about 2) sequencing errors within a 30 bp ClonTracer barcode.
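The estimate is simply barcode length times per-base error rate. The short sketch below reproduces it and, assuming independent errors at each base (a simplification), also gives the chance that a barcode carries at least one error.

```python
def expected_errors(length: int, error_rate: float) -> float:
    """Mean number of erroneous bases in a sequence of the given length."""
    return length * error_rate


def prob_at_least_one_error(length: int, error_rate: float) -> float:
    """Chance of one or more errors, assuming independent bases."""
    return 1.0 - (1.0 - error_rate) ** length


for rate in (0.01, 0.05):
    print(rate, expected_errors(30, rate),
          round(prob_at_least_one_error(30, rate), 2))
```

Under this assumption roughly a quarter of all reads of a 30 bp barcode contain an error at a 1% error rate, and well over half at 5%, which is why error variants of highly abundant barcodes are expected in every list.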

Thus we could apply a filtering strategy:

Use all highly abundant barcodes, e.g. those with count ≥1000.
Compare them with all low abundant barcodes, e.g. those with count ≤10.
If such a low abundant barcode is a variant of a highly abundant barcode,
just remove it from the list and add its counts to those of the mother barcode.
Use base exchange variants (1 or 2)
Use insertion/deletion variants (1 or 2 bases),
optionally even with base exchange variants on the insertion/deletion variants.
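The strategy above can be sketched for the simplest case, a single base exchange. This is a minimal illustration with hypothetical function names, not SeedScan's implementation; insertion and deletion variants are left out, and processing the mother barcodes in order of decreasing abundance is an assumption.

```python
from typing import Dict, Iterator


def substitutions(seq: str) -> Iterator[str]:
    """All sequences differing from seq by exactly one base exchange."""
    for i, base in enumerate(seq):
        for other in "ACGT":
            if other != base:
                yield seq[:i] + other + seq[i + 1:]


def agglomerate(counts: Dict[str, int], seed_abundance: int = 1000,
                error_abundance: int = 10) -> Dict[str, int]:
    """Merge low-abundance 1-exchange variants into their mother barcode."""
    merged = dict(counts)
    mothers = sorted((s for s, n in counts.items() if n >= seed_abundance),
                     key=lambda s: -counts[s])
    for mother in mothers:
        for variant in substitutions(mother):
            n = merged.get(variant)
            if n is not None and n <= error_abundance:
                merged[mother] += n      # add counts to the mother barcode
                del merged[variant]      # remove the error barcode
    return merged
```

Generating the variants of the few highly abundant barcodes and looking them up in a dictionary avoids comparing every low abundant barcode against every high abundant one. The total number of sequences is unchanged; only the number of distinct barcodes shrinks.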

Filtering a ClonTracer barcode list with 10 million sequences and 1 base exchange/insertion/deletion requires less than a second.
Filtering allowing 2 errors (2x be / 2x ins / 2x del / ins+be / del+be / ...) may require minutes.

Select SeedScan Main menu | Utilities | ClonTracer Utilities | Agglomerate error barcodes.

In the file selection dialog, select 1 or multiple ClonTracer Barcode lists.

A parameter dialog opens up:

Define the parameters:

Seed abundance	Minimum number of counts for the highly abundant barcodes
Error abundance	Maximum number of counts for the potential error barcodes
Max base exchanges	Max number of allowed base exchange errors
Max insertions and BE on inserted	Max number of allowed insertions, and optional number of base exchange errors on the "inserted" barcodes
Max deletions and BE on deleted	Max number of allowed deletions, and optional number of base exchange errors on the "deleted" barcodes

It may be a wise strategy to:

In the first round, capture the simple errors (1 exchange/insertion/deletion)
at high seed and low error abundance.
In the next rounds:
- lower seed abundance, increase error abundance, or
- increase number of allowed errors (base exchange/insertion/deletion), or
- allow additional base exchanges on insertions/deletions.
E.g. define Max Insertions = "1,1" to generate all insertion variants (~120 for a 30 BP barcode),
and on each single insertion variant, all 1 base exchange variants (~120*120 = 14400 variants).

Filtered CT-Barcode lists are created as new files like "OldFileName_new".
Results and statistics are shown in SeedScan's log window:

SeedScan - ClonTracer utilities - Agglomerate error sequences
====================================================================
Ver. 1.001a from 09.07.2018, C.Schwager@DKFZ.DE

Parsing 1 data files ...
#	Name	#-TC-BC	#-Seq
1	U:\User\Project AN4-ClonTrac\HiSeq4000_11120\180607_ST-K00207_0156_BHVLVMBBXX\AS-243465-LR-35510\fastq\Result_CTCounts_1.txt	650509	14116053

Filtering parameters:
Seed abundance:	1000
Error abundance:	10
Max Base Exchanges:	1
Max Insertions and BE on inserted:	1
Max Deletions and BE on deleted:	0

Processing file 1:	U:\User\Project AN4-ClonTrac\HiSeq4000_11120\180607_ST-K00207_0156_BHVLVMBBXX\AS-243465-LR-35510\fastq\Result_CTCounts_1.txt
	# of Base Exchanges:	1
	#Seeds:	369	# Merged BC:	12252	# Merged Seq	39144	ET:	0.031s
	# of Insertions:	1 and # of BE ont the inserted:	0
	#Seeds:	333	# Merged BC:	2483	# Merged Seq	8528	ET:	0.031s
	#Total Seeds:	702	# Total Merged BC:	14735	# Total Merged Seq	47672	ET:	0.078s
	REsultfile:	U:\User\Project AN4-ClonTrac\HiSeq4000_11120\180607_ST-K00207_0156_BHVLVMBBXX\AS-243465-LR-35510\fastq\Result_CTCounts_1.txt_new
	from 650509 CTBC,14116053 Seqs to 635774 CTBC,14116053 Seqs; 2.27%
Done.

Additionally, the Log is saved to a file "FirstListFileName_Agg_Err_Log.txt" in the folder where the list files are found.

Filter WSWS

ClonTracer barcodes are synthesized in such a way that hybridized ClonTracer-target complexes form an alternating series of

base pairs connected by three hydrogen bonds = Strong = S (C=G)
base pairs connected by two hydrogen bonds = Weak = W (A=T)

Consequently, extracted ClonTracer barcodes MUST reflect this pattern.
Extracted barcodes not showing the WSWSWSWS pattern are either synthesis or cloning artifacts or, most probably, sequencing errors.

It may be recommended to remove such barcodes, especially if multiple errors occur.
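Checking a barcode against the pattern is straightforward. The sketch below is an illustration with a hypothetical function name; that the pattern starts with a weak base is taken from the "WSWS" naming, and the parameter first allows the other phase in case the extracted barcode starts one base later.

```python
WEAK = set("AT")      # two hydrogen bonds
STRONG = set("CG")    # three hydrogen bonds


def is_wsws(barcode: str, first: str = "W") -> bool:
    """True if bases alternate weak/strong, starting with class `first`."""
    classes = (WEAK, STRONG) if first == "W" else (STRONG, WEAK)
    return all(base in classes[i % 2]
               for i, base in enumerate(barcode.upper()))


print(is_wsws("ACTGAC"), is_wsws("ACCGAC"))
```

Ambiguous bases such as "N" belong to neither class, so a barcode containing them fails the check.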

Build consensus matrix

This option lets you interactively build a filtered consensus matrix from all selected ClonTracer barcode lists.

You may define filters for:

Minimum number of barcode counts found in at least one sample for the respective barcode
Minimum number of samples with count > 0 for the respective barcode

Barcodes passing both criteria are added to the consensus matrix.
Samples which do not contain the particular CT-BC will get "0" for the respective data cell,
all others will get the count of the respective sample/barcode.
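The two filters and the zero filling can be sketched as follows. The function build_consensus is hypothetical; each sample is represented as a mapping from barcode sequence to count, which is an assumption about how the lists are held in memory.

```python
from typing import Dict, List


def build_consensus(samples: List[Dict[str, int]], min_count: int,
                    min_samples: int) -> Dict[str, List[int]]:
    """Return barcode -> one count per sample for barcodes passing both filters."""
    matrix: Dict[str, List[int]] = {}
    for barcode in sorted(set().union(*samples)):
        row = [sample.get(barcode, 0) for sample in samples]  # missing -> 0
        present = sum(1 for n in row if n > 0)
        if max(row) >= min_count and present >= min_samples:
            matrix[barcode] = row
    return matrix


samples = [{"AAA": 120, "CCC": 3},
           {"AAA": 0, "CCC": 4, "GGG": 500},
           {"CCC": 1}]
print(build_consensus(samples, min_count=100, min_samples=1))
```

Raising min_count keeps only barcodes that are abundant somewhere, while raising min_samples keeps only barcodes that are seen repeatedly; the two filters therefore select quite different sets.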

Select SeedScan main menu | Utilities | ClonTracer Utilities | Build consensus matrix.

Select at least two ClonTracer barcode list files you want to merge into a consensus matrix.

A parameter dialog opens up:

Adjust the parameters:

Min count	The minimum count found in at least one of the selected samples for the respective barcode
# > 0	Minimum number of samples with count > 0 for the respective barcode

The higher these filter parameters, the lower the number of passing barcodes,
but also the higher the probability that the filtered barcodes are useful for subsequent analysis and not just data noise.

Click the OK-button.

SeedScan counts the passing ClonTracer barcodes and shows the result in the Message log:

Now you may modify the filter parameters to see the results for more or less stringent filtering.

As soon as you are satisfied with the filter settings, click the Cancel-button.

Now the matrix is built and saved, applying the last selected filters.

Base composition models

Is there a difference in the ClonTracer barcode library between differently handled/treated samples?

One way to visualize the overall composition is to show the position-dependent base composition as a "Sequence logo":

for each base position a stack of 4 base characters is shown
the total height of a stack indicates the conservation of a base.
If only one base shows up at a position => the stack is highest,
If all 4 bases show up at the same proportion => the stack is lowest.
Width of a stack indicates proportion of bases vs. ambiguities.
If all sequences contain a base (A,C,G,T) at the respective position => stack is widest.
The more ambiguities (or gaps, missing bases), the narrower the stack.
Within a stack, the height of a base indicates its relative proportion at this sequence position.
Bases are sorted by their relative frequency: most frequent top,... least frequent bottom.

Sequence logos are directly extracted from ClonTracer barcode lists (containing hundreds of thousands of barcode sequences).
But you may also use other file types:

Tab delimited text files:
- first (header) line ignored
- one sequence per line
- each line tab delimited
- the i-th column contains a (DNA) sequence
e.g. the seed files (gecko library)
Multiple FASTA files

Sequence logos for multiple samples are generated and collected in an HTML file.

For each barcode list two sequence logos are generated:

CT-BC logo	is computed from all identified unique barcodes
Abundance count weighted logo	is computed by weighting each individual barcode by its abundance. Thus highly abundant barcodes may alter the overall base composition sequence model.
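Both models rest on the same per-position base composition table; they differ only in the weights. The sketch below illustrates this with hypothetical function names. The stack height is computed as the information content commonly used for sequence logos (2 bits minus the Shannon entropy), which is an assumption; SeedScan's exact scaling is not specified here.

```python
import math
from typing import Dict, List, Optional, Sequence


def base_composition(barcodes: Sequence[str],
                     weights: Optional[Sequence[float]] = None
                     ) -> List[Dict[str, float]]:
    """Relative A/C/G/T frequencies per position (ambiguities ignored)."""
    if weights is None:
        weights = [1.0] * len(barcodes)          # unweighted CT-BC model
    length = max(len(b) for b in barcodes)
    table = []
    for pos in range(length):
        totals = dict.fromkeys("ACGT", 0.0)
        for barcode, weight in zip(barcodes, weights):
            if pos < len(barcode) and barcode[pos] in totals:
                totals[barcode[pos]] += weight
        total = sum(totals.values())
        table.append({b: (n / total if total else 0.0)
                      for b, n in totals.items()})
    return table


def information_bits(freqs: Dict[str, float]) -> float:
    """Stack height: 2 bits if only one base occurs, 0 if all four are equal."""
    if not any(freqs.values()):
        return 0.0                               # no bases at this position
    return 2.0 + sum(p * math.log2(p) for p in freqs.values() if p > 0)
```

Passing the abundance of each barcode as weights yields the abundance count weighted model; omitting them yields the model over unique barcodes.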

Select SeedScan main menu | Utilities | ClonTracer Utilities | Base composition.

Select one or multiple ClonTracer barcode lists you want to analyze.

The resulting html file contains:

Relative base composition tables
Sequence logos

for both methods:

And there may be substantial differences between samples and methods.

Merge ClonTracer matrices

This tool may be used to merge consensus matrices built with SeedScan (see above).
The problem here is: result matrices from different ClonTracer analyses may contain different BC sequences.
Thus, matrices cannot be appended row-wise; they must be sorted to combine common CT-BCs, and "spacers" must be added for CT-BCs unique to individual matrices.

Although designed for ClonTracer result matrices, the tool may merge any kind of isomorphic data matrices.

SeedScan expects tab delimited text files with:

Same number of header rows
Same number of columns
Identical key column (here ClonTracer barcode sequence column), with unique keys
First header row contains the names of the data columns
Data cells should contain numbers
Non-numeric cells are converted to "0" for computation of row/column sums

Drag (multiple) data files from Windows Explorer into the Data files field.
Repeated drag/drop operations will just append the new files to the list of selected files.
Click the "X" button to empty the Data files field.
Click the "..." button to open a file selection dialog.
Define the key column containing the unique keys used for merging identical feature rows from the multiple data files.
The key column MUST be identical for all data files to merge.
Number of feature (row) annotation columns.
These columns will not be included in the merged table.
Header row number: These rows will not be included in the merged table,
apart from the 1st row, which is used for the column names in the merged matrix.

In a first step, SeedScan extracts all unique keys (ClonTracer barcode sequences) from all defined data files.
Next, matching data lines from all selected data matrices are placed into the merged matrix.
Missing records from a single data file are replaced with "0" cells.
The final result file contains the columns:

Line counter
ClonTracer Barcode sequence
Row sum
all count columns from the selected files

The final result file contains the rows:

Names of the data columns found in the individual data files (1st row)
Column sums
All count columns from the selected files
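The merge described in this section may be sketched as below. It is an illustration only: the function merge_matrices and the in-memory representation (column names plus a mapping from key to counts) are hypothetical, the labels "#", "Barcode", "Row sum" and "Column sum" are placeholders, and file parsing as well as the conversion of non-numeric cells are left out.

```python
from typing import Dict, List, Tuple

# One input matrix: (names of its data columns, key -> counts per column)
Matrix = Tuple[List[str], Dict[str, List[float]]]


def merge_matrices(matrices: List[Matrix]) -> List[List[object]]:
    """Merge on the key column; keys missing in a file get 0 cells."""
    header: List[object] = ["#", "Barcode", "Row sum"]
    for names, _ in matrices:
        header += names
    keys = sorted({key for _, rows in matrices for key in rows})
    body: List[List[object]] = []
    for number, key in enumerate(keys, start=1):
        counts: List[float] = []
        for names, rows in matrices:
            counts += rows.get(key, [0] * len(names))   # spacer for unique keys
        body.append([number, key, sum(counts)] + counts)
    sums = [sum(row[i] for row in body) for i in range(2, len(header))]
    return [header, ["", "Column sum"] + sums] + body


m1 = (["S1", "S2"], {"ACGT": [5, 1], "TTTT": [2, 0]})
m2 = (["S3"], {"ACGT": [7], "GGGG": [4]})
for row in merge_matrices([m1, m2]):
    print(row)
```

Collecting the union of all keys first, as described above, is what allows each input file to contribute a "0" spacer for barcodes it does not contain.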