TiCo [Input and output formats]

TiCo input and output formats

A TiCo prediction needs two kinds of input data: a file with gene data and a file with the corresponding nucleotide sequence. The current version of the TiCo web interface supports the following input and output formats.

TiCo input format

TiCo output format


TiCo input format

GLIMMER predictions

At present, only the GLIMMER 2.x format is supported for the postprocessing of gene predictions. The file may contain the complete GLIMMER output, but only the section Putative Genes is needed. TiCo searches the file for the line “Putative Genes:” and reads all ORFs from there on.

Note that a sufficient amount of input data is needed (at least 200 genes) to perform a reasonable clustering.

<id (should be a number)> <start> <stop> [<strand, ...>]
Putative Genes:
    2      337     2796  [+1 L=2460]
    3     2801     3730  [+2 L= 930]
    5     3734     5017  [+2 L=1284]
    6     5088     5234  [+3 L= 147]  [Vote]
    8     5720     5313  [-3 L= 408]  [DelayedBy #10 L=21]
   10     6459     5686  [-1 L= 774]
   12     7959     6532  [-1 L=1428]
   14     8175     9188  [+3 L=1014]
   15     9303     9890  [+3 L= 588]
   17    10494     9931  [-1 L= 564]
   19    11356    10646  [-2 L= 711]
                ...
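The reading rule described above (skip everything up to the “Putative Genes:” line, then read one ORF per line) can be sketched as follows. This is a minimal illustration and not TiCo's own parser; the function name, the regular expression and the case-insensitive match of the section line are assumptions made for this example.

```python
import re

# One gene line: <id> <start> <stop> [<strand><frame> L=<length>] ...
# Anything after the first bracket (e.g. [Vote]) is ignored.
GENE_LINE = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+\[([+-])(\d)\s+L=\s*(\d+)\]")


def read_putative_genes(text):
    """Return (id, start, stop, strand) tuples listed after 'Putative Genes:'."""
    genes = []
    in_section = False
    for line in text.splitlines():
        if not in_section:
            # Assumption: the section line is matched case-insensitively.
            if line.strip().lower() == "putative genes:":
                in_section = True
            continue
        match = GENE_LINE.match(line)
        if match:
            gene_id, start, stop, strand = match.group(1, 2, 3, 4)
            genes.append((int(gene_id), int(start), int(stop), strand))
    return genes


sample = """Some GLIMMER header text
Putative Genes:
    2      337     2796  [+1 L=2460]
    8     5720     5313  [-3 L= 408]  [DelayedBy #10 L=21]
"""
genes = read_putative_genes(sample)
```

Counting the returned list is a simple way to check the 200-gene minimum before submitting a file.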

Simple Coord format

The Simple Coord format just gives the coordinates of the predicted ORFs with an id and the orientation. The input may also contain a score and a label as in the Simple Coord output of TiCo.

>id_left_right_strand[_score][#]

Example:

>2_337_2799_+
>3_2801_3733_+
>5_3734_5020_+
>6_5088_5237_+
>8_5310_5720_-
>10_5683_6459_-
>12_6529_7959_-
>14_8175_9191_+
>15_9303_9893_+
>17_9928_10494_-
>19_10643_11356_-
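As an illustration of the pattern `>id_left_right_strand[_score][#]`, the sketch below splits one Simple Coord line into its parts. The names `CoordEntry` and `parse_simple_coord` are hypothetical, and the regular expression is only one possible reading of the format shown above.

```python
import re
from typing import NamedTuple, Optional


class CoordEntry(NamedTuple):
    gene_id: str
    left: int
    right: int
    strand: str
    score: Optional[float]  # None when the line carries no score
    labelled: bool          # True when the line ends with '#'


COORD_LINE = re.compile(
    r"^>([^_]+)_(\d+)_(\d+)_([+-])(?:_(-?\d+(?:\.\d+)?))?(#)?\s*$"
)


def parse_simple_coord(line):
    """Parse one Simple Coord line; raise ValueError if it does not match."""
    match = COORD_LINE.match(line)
    if match is None:
        raise ValueError(f"not a Simple Coord line: {line!r}")
    gene_id, left, right, strand, score, label = match.groups()
    return CoordEntry(
        gene_id,
        int(left),
        int(right),
        strand,
        float(score) if score is not None else None,
        label == "#",
    )


plain = parse_simple_coord(">2_337_2799_+")
scored = parse_simple_coord(">8_5310_5741_-_-0.111382#")
```

The same function accepts both the plain input lines and lines with a score and label, since both optional parts are allowed by the pattern.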

Sequence in Fasta format

The genome sequence should be submitted in Fasta format as shown below. The first line may contain details identifying the organism, but it may also be omitted. Both upper- and lowercase characters are accepted as nucleotide symbols. To be processed, the sequence must contain only valid nucleotide symbols according to the IUPAC standard. For the training, only the symbols A, C, G and T (upper- and lowercase) are considered; all other symbols are ignored.

>gi|6626251|gb|U00096.1|U00096 Escherichia coli K-12 MG1655 complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT
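A sequence file can be checked against these rules before submission. The sketch below is a minimal example under stated assumptions: the set of valid symbols is the standard IUPAC nucleotide alphabet (which may differ in detail from the table used by TiCo), and the function names are hypothetical.

```python
# Assumption: standard IUPAC nucleotide codes, without gap characters.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
TRAINING_SYMBOLS = set("ACGT")


def read_fasta(text):
    """Return (header, sequence); the header is None if the first line is omitted."""
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None or parts:
                raise ValueError("only a single sequence is expected")
            header = line[1:]
        else:
            parts.append(line)
    sequence = "".join(parts).upper()  # lowercase input is accepted
    invalid = set(sequence) - IUPAC_NUCLEOTIDES
    if invalid:
        raise ValueError(f"invalid nucleotide symbols: {sorted(invalid)}")
    return header, sequence


def count_training_symbols(sequence):
    """Count the positions that are A, C, G or T."""
    return sum(1 for symbol in sequence.upper() if symbol in TRAINING_SYMBOLS)


header, sequence = read_fasta(">example organism\nacgtN\nRY\n")
usable = count_training_symbols(sequence)
```

Comparing `usable` with `len(sequence)` shows how much of the sequence would contribute to the training.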



TiCo output format

The output of TiCo gives the coordinates of the putative genes with the start positions predicted by TiCo. In addition, the shift of the relocated TIS (translation initiation site) relative to the initially predicted TIS is given. As an indicator of how well the predicted TIS fits the model calculated by TiCo, the PWM score is also given for each predicted TIS. The scores are not comparable between different predictions.

A TIS with a negative score means that no TIS candidate with a higher (positive) score was found for the respective ORF. In our evaluation, about 65-76% of the ORFs whose predicted TIS has a negative score are potential false positives (i.e. they are not annotated). Therefore, ORFs for which only candidates with negative scores are found are labelled in the output (in the GLIMMER-like output with a #, in the GFF output with the tag weak tis).

GLIMMER-like output

The output is written in a GLIMMER-like format: it contains all predictions from the input file in the same format as GLIMMER, with two additional columns from the TiCo prediction. The first column after the GLIMMER output gives the PWM score, the second the shift of the start during reannotation. In addition, genes with a negative score are labelled with a hash mark (#) at the end of the line.

<id> <start> <stop> [comments] <PWM score> <shift>

The shift is given relative to the strand of the gene. In the examples on this page, a positive value corresponds to a reannotated start located downstream of the original start (the gene becomes shorter), and a negative value to an upstream shift (the gene becomes longer). If the value of the shift is 0, the start is unchanged from the original prediction.

Putative genes:
    2      337     2796  [+1 L=2460]  5.347931 0
    3     2801     3730  [+2 L= 930]  11.448764 0
    5     3734     5017  [+2 L=1284]  6.230648 0
    6     5088     5234  [+3 L= 147] [Vote]  3.815619 0
    8     5741     5313  [-3 L= 408] [DelayedBy #10 L=21]  -0.111382 -21 #
   10     6459     5686  [-1 L= 774]  0.234908 0
   12     7959     6532  [-1 L=1428]  19.753130 0
   14     8238     9188  [+3 L=1014]  19.169035 63
   15     9306     9890  [+3 L= 588]  19.613488 3
   17    10494     9931  [-1 L= 564]  4.670315 0
   19    11356    10646  [-2 L= 711]  13.285624 0
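In the example rows above, the shift equals the distance from the original start to the reannotated start, counted in the reading direction of the gene (gene 14: 8175 + 63 = 8238 on the forward strand; gene 8: 5720 + 21 = 5741 for a shift of -21 on the reverse strand). The sketch below reads one output line and recovers the original start under that reading. The function names and the whitespace-based splitting are assumptions for illustration only.

```python
def parse_tico_glimmer_line(line):
    """Split one GLIMMER-like output line into its fields."""
    fields = line.split()
    weak = fields[-1] == "#"          # trailing hash mark: negative score
    if weak:
        fields = fields[:-1]
    gene_id, start, stop = (int(value) for value in fields[:3])
    return {
        "id": gene_id,
        "start": start,
        "stop": stop,
        "strand": fields[3][1],       # "[+1" -> "+", "[-3" -> "-"
        "score": float(fields[-2]),   # PWM score
        "shift": int(fields[-1]),     # shift of the start
        "weak": weak,
    }


def original_start(entry):
    """Recover the start before reannotation from the reported shift."""
    direction = 1 if entry["strand"] == "+" else -1
    return entry["start"] - direction * entry["shift"]


gene14 = parse_tico_glimmer_line(
    "   14     8238     9188  [+3 L=1014]  19.169035 63"
)
gene8 = parse_tico_glimmer_line(
    "    8     5741     5313  [-3 L= 408] [DelayedBy #10 L=21]  -0.111382 -21 #"
)
```

Applied to the two rows, `original_start` returns the start coordinates of the same genes in the input example.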

Output in GFF

The output in GFF (General Feature Format) follows the GFF specification of the Sanger Institute.

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

The output contains the original predictions and the reannotations made by TiCo. The original annotation is marked with the feature CDS. To mark genes whose start position was relocated by TiCo, we added the new feature REANNCDS. If you would like to visualize the entries with the program Artemis, add the following line to the Artemis configuration file ($ARTEMIS_HOME/etc/options):

colour_of_REANNCDS = 1

Every gene is listed with the PWM score calculated by TiCo. In addition, for a reannotated TIS the number of base pairs relative to the initially predicted TIS is given in double quotes with the keyword shift (e.g. "shift 63"). If a negative PWM score was calculated for a TIS, this is indicated with the qualifier \note (e.g. \note="weak tis").
In exceptional cases the relocation is skipped because the new gene length would fall below the minimum gene length set by the user. In this case the comment "reannotation skipped" is added, with the shift proposed by TiCo in brackets (e.g. "reannotation skipped (141)").

An example GFF-output is shown below.

##gff-version 2
##Type DNA
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      337    2799    5.347931    +
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      2801   3733    11.448764   +
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      3734   5020    6.230648    +
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      5088   5237    3.815619    +
Escherichia_coli_K-12_complete_genome   glimmer/tico  REANNCDS 5310   5741    -0.111382   -     shift -21 ; note "weak tis"
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      5310   5720    -0.111382   -    note "weak tis"
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      5683   6459    0.234908    -
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      6529   7959    19.75313    -
Escherichia_coli_K-12_complete_genome   glimmer/tico  REANNCDS 8238   9191    19.169035   +    shift 63 ;
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      8175   9191    19.169035   +
Escherichia_coli_K-12_complete_genome   glimmer/tico  REANNCDS 9306   9893    19.613488   +    shift 3 ;
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      9303   9893    19.613488   +
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      9928   10494   4.670315    -
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      10643  11356   13.285624   -
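The example rows can be read with a few lines of code. The sketch below assumes the layout shown in the example (whitespace-separated columns, no frame column, attributes separated by semicolons); files that contain the frame column or the "reannotation skipped" comment would need extra handling. The function names are hypothetical.

```python
def parse_gff_line(line):
    """Parse one feature line laid out like the example rows."""
    fields = line.split(None, 7)      # at most 7 columns plus the attributes
    seqname, source, feature, start, end, score, strand = fields[:7]
    attributes = {}
    if len(fields) == 8:
        for item in fields[7].split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition(" ")
                attributes[key] = value.strip().strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": float(score),
        "strand": strand,
        "attributes": attributes,
    }


def read_gff(text):
    """Parse all feature lines, skipping blank lines and '##' header lines."""
    return [
        parse_gff_line(line)
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


sample = """##gff-version 2
##Type DNA
Escherichia_coli_K-12_complete_genome   glimmer/tico  CDS      337    2799    5.347931    +
Escherichia_coli_K-12_complete_genome   glimmer/tico  REANNCDS 5310   5741    -0.111382   -     shift -21 ; note "weak tis"
Escherichia_coli_K-12_complete_genome   glimmer/tico  REANNCDS 8238   9191    19.169035   +    shift 63 ;
"""
records = read_gff(sample)
relocated = [r for r in records if r["feature"] == "REANNCDS"]
```

Filtering on the feature REANNCDS, as in the last line, selects only the genes whose start was relocated.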

Output in Simple Coord format

The output in Simple Coord format provides the coordinates of the ORFs with the TIS as predicted by TiCo. Additionally the PWM score is given. For predicted TIS with a negative score, the label "#" is added.

>2_337_2799_+_5.347931
>3_2801_3733_+_11.448764
>5_3734_5020_+_6.230648
>6_5088_5237_+_3.815619
>8_5310_5741_-_-0.111382#
>10_5683_6459_-_0.234908
>12_6529_7959_-_19.753130
>14_8238_9191_+_19.169035
>15_9306_9893_+_19.613488
>17_9928_10494_-_4.670315
>19_10643_11356_-_13.285624
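For completeness, lines of this form can be produced as sketched below. The six decimal places are an assumption taken from the example lines, and `format_simple_coord` is a hypothetical helper, not part of TiCo.

```python
def format_simple_coord(gene_id, left, right, strand, score):
    """Build one Simple Coord output line; a negative score gets the '#' label."""
    label = "#" if score < 0 else ""
    return f">{gene_id}_{left}_{right}_{strand}_{score:.6f}{label}"


lines = [
    format_simple_coord(2, 337, 2799, "+", 5.347931),
    format_simple_coord(8, 5310, 5741, "-", -0.111382),
    format_simple_coord(12, 6529, 7959, "-", 19.75313),
]
```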


Visualization of the PWM

The PWM (position weight matrix) calculated by TiCo can be visualized with the program WeightsVis.
