TiCo input and output formats
Two types of input data are required for a TiCo prediction: a file with gene data and a file with the corresponding nucleotide sequence. The current version of the TiCo web interface provides the following input and output formats.
TiCo input format
GLIMMER predictions
At present, only the GLIMMER2.x format is supported for the postprocessing of gene predictions. The file may contain the whole GLIMMER output, but only the section Putative Genes is needed. TiCo searches the file for the line “Putative Genes:” and reads all ORFs from there on.
Note that a sufficient amount of input data is needed (at least 200 genes) to perform a reasonable clustering.
<id (should be a number)> <start> <stop> [<strand, ...>]
Putative Genes:
2 337 2796 [+1 L=2460]
3 2801 3730 [+2 L= 930]
5 3734 5017 [+2 L=1284]
6 5088 5234 [+3 L= 147] [Vote]
8 5720 5313 [-3 L= 408] [DelayedBy #10 L=21]
10 6459 5686 [-1 L= 774]
12 7959 6532 [-1 L=1428]
14 8175 9188 [+3 L=1014]
15 9303 9890 [+3 L= 588]
17 10494 9931 [-1 L= 564]
19 11356 10646 [-2 L= 711]
...
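The reading rule described above (skip everything up to the marker line, then read ORFs) can be sketched as follows. This is a minimal illustration and not TiCo's actual parser: the function names, the regular expression and the dictionary keys are assumptions made for this example, and the 200-gene threshold is taken from the note above.

```python
import re

# "<id> <start> <stop> [<strand><frame> L=<length>]"; any further bracketed
# comments such as "[Vote]" are ignored by this sketch.
ORF_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+\[([+-])(\d)\s+L=\s*(\d+)\]")

MIN_GENES = 200  # minimum suggested above for a reasonable clustering


def read_putative_genes(text):
    """Return the ORFs listed after the 'Putative Genes:' marker line."""
    orfs = []
    in_section = False
    for line in text.splitlines():
        if not in_section:
            # Everything before the marker line is skipped.
            in_section = line.strip() == "Putative Genes:"
            continue
        m = ORF_LINE.match(line)
        if m:
            orf_id, start, stop, strand, frame, length = m.groups()
            orfs.append({"id": int(orf_id), "start": int(start),
                         "stop": int(stop), "strand": strand,
                         "frame": int(frame), "length": int(length)})
    return orfs


def enough_for_clustering(orfs, minimum=MIN_GENES):
    """True if the input holds at least `minimum` genes."""
    return len(orfs) >= minimum


sample = """some GLIMMER header text
Putative Genes:
    2      337     2796  [+1 L=2460]
    6     5088     5234  [+3 L= 147] [Vote]
    8     5720     5313  [-3 L= 408] [DelayedBy #10 L=21]
"""
genes = read_putative_genes(sample)
print(len(genes), enough_for_clustering(genes))
```

On the reverse strand the start coordinate is larger than the stop coordinate, as in ORF 8 above; the sketch keeps both values exactly as they appear in the file.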
Simple Coord format
The Simple Coord format just gives the coordinates of the predicted ORFs with an id and the orientation. The input may also contain a score and a label as in the Simple Coord output of TiCo.
>id_left_right_strand[_score][#]
Example:
>2_337_2799_+
>3_2801_3733_+
>5_3734_5020_+
>6_5088_5237_+
>8_5310_5720_-
>10_5683_6459_-
>12_6529_7959_-
>14_8175_9191_+
>15_9303_9893_+
>17_9928_10494_-
>19_10643_11356_-
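A small sketch of how a Simple Coord line could be parsed, including the optional score and the optional "#" label. The regular expression and the function names are assumptions for illustration only. The conversion helper relies on an observation from the examples on this page, not on a documented rule: the Simple Coord coordinates appear to extend the GLIMMER coordinates by three bases at the stop end (for instance 337 2796 becomes 337_2799), which would be consistent with including the stop codon.

```python
import re

# >id_left_right_strand[_score][#]
COORD = re.compile(r"^>([^_\s]+)_(\d+)_(\d+)_([+-])(?:_(-?\d+(?:\.\d+)?))?(#)?$")


def parse_simple_coord(line):
    """Parse one Simple Coord line into a dict; raise ValueError if malformed."""
    m = COORD.match(line.strip())
    if m is None:
        raise ValueError(f"not a Simple Coord line: {line!r}")
    orf_id, left, right, strand, score, label = m.groups()
    return {
        "id": orf_id,
        "left": int(left),
        "right": int(right),
        "strand": strand,
        "score": float(score) if score is not None else None,
        "labelled": label is not None,  # True when the line ends with '#'
    }


def glimmer_to_simple_coord(orf_id, start, stop):
    """Convert GLIMMER start/stop to a Simple Coord line.

    Assumption inferred from the examples: the stop end moves by 3 bases.
    Genes that wrap around the sequence origin are not handled.
    """
    if start < stop:  # forward strand
        return f">{orf_id}_{start}_{stop + 3}_+"
    return f">{orf_id}_{stop - 3}_{start}_-"


print(parse_simple_coord(">2_337_2799_+"))
print(glimmer_to_simple_coord(8, 5720, 5313))
```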
Sequence in Fasta format
The genome sequence should be submitted in Fasta format as shown below. The first line may contain details identifying the organism, but may also be omitted. Both upper- and lowercase characters are accepted as nucleotide symbols. To be processed, the sequence may contain only valid nucleotide symbols according to the IUPAC standard (table of valid symbols). For the training, only the symbols A, C, G and T (upper and lower case) are considered; all other symbols are ignored.
>gi|6626251|gb|U00096.1|U00096 Escherichia coli K-12 MG1655 complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT
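Before submitting, a sequence file can be checked locally against the rules above. The sketch below is only an illustration: the function names are made up for this example, and the symbol set is the standard IUPAC nucleotide alphabet (A, C, G, T plus the ambiguity codes), which is an assumption since the exact table used by TiCo is not reproduced here (it may, for example, also accept U or gap characters).

```python
# Standard IUPAC nucleotide codes (assumed; TiCo's own table may differ).
IUPAC = set("ACGTRYSWKMBDHVN")
# Symbols that are considered for the training, per the text above.
TRAINING = set("ACGT")


def read_fasta(text):
    """Return (header, sequence) for a single-record Fasta text.

    The header is '' when the '>' line is omitted.
    """
    header, chunks = "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        else:
            chunks.append(line)
    return header, "".join(chunks)


def check_sequence(seq):
    """Return (invalid_symbols, number_of_symbols_used_for_training)."""
    upper = seq.upper()  # upper- and lowercase are treated alike
    invalid = sorted(set(upper) - IUPAC)
    usable = sum(1 for c in upper if c in TRAINING)
    return invalid, usable


header, seq = read_fasta(">example record\nAGCTnn\nacgt\n")
print(header, check_sequence(seq))
```

The sketch concatenates all sequence lines, so a file with several records would be merged into one sequence; it is meant for the single-genome case shown above.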
TiCo output format
The output of TiCo gives the coordinates of the putative genes with the start positions predicted by TiCo. Additionally, the shift of the relocated TIS relative to the initially predicted TIS is given. As an indicator of how well the predicted TIS fits the model calculated by TiCo, the PWM score is also given for each predicted TIS. The scores are not comparable between different predictions.
A TIS with a negative score means that no TIS candidate with a higher (positive) score was found for the respective ORF. We found that about 65-76% of the ORFs whose predicted TIS has a negative score are potential false positives (i.e. are not annotated). Therefore, ORFs for which only candidates with negative scores are found are labelled in the output (with a # in the GLIMMER-like output, with the tag weak tis in the GFF output).
GLIMMER-like output
The output is given in a GLIMMER-like format: it contains all predictions from the input file in the same format as GLIMMER, with two additional columns from the TiCo prediction. The first column after the GLIMMER output gives the PWM score, the second the shift of the start during reannotation. Additionally, genes with a negative score are labelled with a hash mark (#) at the end of the line.
<id> <start> <stop> [comments] <PWM score> <shift>
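The shift is given with respect to the strand of the gene. In the example below, a positive value corresponds to a reannotated start located downstream of the original start (gene 14 on the forward strand: 8175 to 8238, shift 63), and a negative value to a start relocated upstream (gene 8 on the reverse strand: 5720 to 5741, shift -21). If the value of the shift is 0, the start is unchanged from the original prediction.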
Putative genes:
2 337 2796 [+1 L=2460] 5.347931 0
3 2801 3730 [+2 L= 930] 11.448764 0
5 3734 5017 [+2 L=1284] 6.230648 0
6 5088 5234 [+3 L= 147] [Vote] 3.815619 0
8 5741 5313 [-3 L= 408] [DelayedBy #10 L=21] -0.111382 -21 #
10 6459 5686 [-1 L= 774] 0.234908 0
12 7959 6532 [-1 L=1428] 19.753130 0
14 8238 9188 [+3 L=1014] 19.169035 63
15 9306 9890 [+3 L= 588] 19.613488 3
17 10494 9931 [-1 L= 564] 4.670315 0
19 11356 10646 [-2 L= 711] 13.285624 0
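The following sketch reads one line of the GLIMMER-like output and recovers the originally predicted start from the reported shift. It is an illustration under stated assumptions, not part of TiCo: the function name and dictionary keys are invented here, the strand is taken from the first bracketed comment, and the rule for recovering the original start (subtract the shift on the forward strand, add it on the reverse strand) is inferred from the example rows above, where it reproduces the input coordinates.

```python
def parse_tico_line(line):
    """Parse '<id> <start> <stop> [comments] <PWM score> <shift> [#]'."""
    tokens = line.split()
    # A standalone trailing '#' marks a TIS with a negative score.
    labelled = tokens[-1] == "#"
    if labelled:
        tokens = tokens[:-1]
    orf_id, start, stop = (int(t) for t in tokens[:3])
    strand = tokens[3][1]  # first comment looks like '[+1' or '[-3'
    score, shift = float(tokens[-2]), int(tokens[-1])
    # Inferred from the examples: the shift is counted along the gene's strand.
    original_start = start - shift if strand == "+" else start + shift
    return {"id": orf_id, "start": start, "stop": stop, "strand": strand,
            "score": score, "shift": shift, "labelled": labelled,
            "original_start": original_start}


row8 = parse_tico_line(
    "8 5741 5313 [-3 L= 408] [DelayedBy #10 L=21] -0.111382 -21 #")
row14 = parse_tico_line("14 8238 9188 [+3 L=1014] 19.169035 63")
print(row8["original_start"], row14["original_start"])
```

Note that the "#10" inside the DelayedBy comment is not mistaken for the label, because only a separate "#" token at the end of the line is tested.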
Output in GFF
The output in GFF (General Feature Format) follows the specification of the Sanger Institute:
GFF specification
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
The output contains the original predictions and the reannotations made by TiCo. The original annotation is marked with the feature CDS. To mark genes whose start position was relocated by TiCo, we added the new feature REANNCDS. If you would like to visualize the entries with the program Artemis, add the following line to the Artemis configuration file ($ARTEMIS_HOME/etc/options):
colour_of_REANNCDS = 1
Every gene is annotated with the PWM score calculated by TiCo. Additionally, for a reannotated TIS the
number of base pairs relative to the initially predicted TIS is given in double quotes with the
keyword shift (e.g. "shift 63"). If a negative PWM score was calculated for a TIS, this is
denoted with the qualifier \note (e.g. \note="weak tis").
In exceptional cases the relocation is skipped because the new gene length would fall below the minimum gene length
set by the user. In this case the comment "reannotation skipped" is added, with the shift proposed by
TiCo in brackets (e.g. "reannotation skipped (141)").
An example GFF-output is shown below.
##gff-version 2
##Type DNA
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 337 2799 5.347931 +
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 2801 3733 11.448764 +
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 3734 5020 6.230648 +
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 5088 5237 3.815619 +
Escherichia_coli_K-12_complete_genome glimmer/tico REANNCDS 5310 5741 -0.111382 - shift -21 ; note "weak tis"
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 5310 5720 -0.111382 - note "weak tis"
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 5683 6459 0.234908 -
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 6529 7959 19.75313 -
Escherichia_coli_K-12_complete_genome glimmer/tico REANNCDS 8238 9191 19.169035 + shift 63 ;
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 8175 9191 19.169035 +
Escherichia_coli_K-12_complete_genome glimmer/tico REANNCDS 9306 9893 19.613488 + shift 3 ;
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 9303 9893 19.613488 +
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 9928 10494 4.670315 -
Escherichia_coli_K-12_complete_genome glimmer/tico CDS 10643 11356 13.285624 -
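A sketch of how the records in this GFF output could be read and the reannotated genes picked out. It rests on several assumptions that should be checked against a real output file: fields are split on any whitespace, the frame column is treated as optional because the example above shows none, and the shift and the weak tis tag are found by simple text matching in the attribute part. The function name and dictionary keys are invented for this example.

```python
import re

SHIFT = re.compile(r"shift\s+(-?\d+)")


def parse_gff_line(line):
    """Parse one record line of the TiCo GFF output into a dict."""
    fields = line.split(None, 7)
    seqname, source, feature, start, end, score, strand = fields[:7]
    rest = fields[7] if len(fields) > 7 else ""
    # Optional frame column ('.', '0', '1' or '2') before the attributes.
    frame = "."
    parts = rest.split(None, 1)
    if parts and parts[0] in {".", "0", "1", "2"}:
        frame = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
    m = SHIFT.search(rest)
    return {"seqname": seqname, "source": source, "feature": feature,
            "start": int(start), "end": int(end), "score": float(score),
            "strand": strand, "frame": frame,
            "shift": int(m.group(1)) if m else None,
            "weak_tis": "weak tis" in rest}


lines = [
    "##gff-version 2",
    'Escherichia_coli_K-12_complete_genome glimmer/tico REANNCDS 5310 5741 -0.111382 - shift -21 ; note "weak tis"',
    "Escherichia_coli_K-12_complete_genome glimmer/tico CDS 337 2799 5.347931 +",
]
# Lines starting with '##' are header pragmas, not records.
records = [parse_gff_line(l) for l in lines if not l.startswith("##")]
relocated = [r for r in records if r["feature"] == "REANNCDS"]
print(len(records), len(relocated))
```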
Output in Simple Coord format
The output in Simple Coord format provides the coordinates of the ORFs with the TIS as predicted by TiCo. Additionally the PWM score is given. For predicted TIS with a negative score, the label "#" is added.
>2_337_2799_+_5.347931
>3_2801_3733_+_11.448764
>5_3734_5020_+_6.230648
>6_5088_5237_+_3.815619
>8_5310_5741_-_-0.111382#
>10_5683_6459_-_0.234908
>12_6529_7959_-_19.753130
>14_8238_9191_+_19.169035
>15_9306_9893_+_19.613488
>17_9928_10494_-_4.670315
>19_10643_11356_-_13.285624
Visualization of the PWM
The PWM (position weight matrix) calculated by TiCo can be visualized with the program WeightsVis.