Datasets and Results

Download datasets (tarball, 12 MB)

Dataset                   Genes   Post processors
EcoGene                     854   63.2   90.3   92.0   81.9   94.2
                             72   54.2   83.3   86.1   75.0   84.7
P. aeruginosa              3281   58.7   83.6    3.6   67.7   85.2
R. solanacearum chr        3440   51.5   71.4    5.0   56.8   74.9
R. solanacearum plasm.     1676   48.9   66.2    6.0   55.5   70.1
B. pseudomallei chr1       3399   53.2   64.3    5.5   53.3   69.6
B. pseudomallei chr1       2329   48.9   67.5    4.7   52.1   67.0

For all post processors, predictions of the tool GLIMMER2.02 were used for initial annotation of the coding regions.

To evaluate performance, we compared the results of the post processors with the most reliable annotations available for the genomes of E. coli and B. subtilis. For E. coli (NC_000913), this is a set of 854 genes from the EcoGene database (Rudd, 2000) whose N-termini were verified by protein sequencing.
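As a rough illustration of this kind of comparison, the sketch below computes the percentage of reference genes whose predicted start coincides with the verified start. It assumes that the rate is the share of exactly matching start positions, that genes are identified by strand and stop coordinate, and that a gene missing from the predictions counts as a miss. The function name `start_accuracy` and the toy coordinates are hypothetical and are not taken from the evaluation itself.

```python
def start_accuracy(reference, predicted):
    """Return the percentage of reference genes with a correctly predicted start.

    Both arguments map a gene key, here (strand, stop coordinate), to a start
    coordinate. Genes absent from `predicted` are counted as misses (assumption).
    """
    if not reference:
        raise ValueError("reference set is empty")
    hits = sum(
        1 for key, start in reference.items() if predicted.get(key) == start
    )
    return 100.0 * hits / len(reference)


# Toy data (hypothetical coordinates, for illustration only).
reference = {
    ("+", 1200): 300,
    ("-", 5000): 5900,
    ("+", 9100): 8500,
    ("+", 12000): 11400,
}
predicted = {
    ("+", 1200): 300,   # correct start
    ("-", 5000): 5870,  # same gene, different start
    ("+", 9100): 8500,  # correct start
    # the gene ending at 12000 was not predicted at all
}

accuracy = start_accuracy(reference, predicted)
print(f"{accuracy:.1f}% of the reference starts were predicted correctly")
```

Keying genes by strand and stop coordinate is one common way to pair a prediction with its reference gene, since alternative start predictions for the same gene share the stop codon.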

For B. subtilis (NC_000964), we used four datasets for the performance comparison. The first set covers all non-y genes of the GenBank annotation. These genes are reported to be experimentally characterized and to have individually verified start sites (Yada et al., Hannenhalli et al.). The second set contains 58 genes confirmed by homology to the closely related organism B. halodurans. Not all start sites in this set of 58 have been verified experimentally. To evaluate the performance of TICO on short genes, we used a further set (Bsub-short) consisting of three subsets. The subsets contain 123, 72 and 51 genes shorter than 300 nucleotides (Besemer et al., 2001). The set of 123 genes comprises annotations confirmed by at least one significant BLAST homology (E-value < 10^(-4)). The set of 72 comprises annotations with at least two significant homologies, and the set of 51 comprises annotations with at least ten significant homologies.
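The selection criteria for the three Bsub-short subsets can be sketched as a simple filter. This is a minimal illustration, not the procedure used to build the published sets: the `Gene` record, the function names and the toy genes are hypothetical, and the E-value cutoff is assumed to be 10^(-4).

```python
from dataclasses import dataclass, field

MAX_LENGTH = 300        # genes must be shorter than 300 nucleotides
E_VALUE_CUTOFF = 1e-4   # assumed significance threshold for a BLAST hit


@dataclass
class Gene:
    name: str
    length: int                                      # length in nucleotides
    hit_evalues: list = field(default_factory=list)  # E-values of BLAST hits


def significant_hits(gene, cutoff=E_VALUE_CUTOFF):
    """Count the BLAST hits of a gene whose E-value is below the cutoff."""
    return sum(1 for evalue in gene.hit_evalues if evalue < cutoff)


def short_gene_subset(genes, min_hits):
    """Select short genes supported by at least `min_hits` significant hits."""
    return [
        g for g in genes
        if g.length < MAX_LENGTH and significant_hits(g) >= min_hits
    ]


# Toy genes (hypothetical names and values).
genes = [
    Gene("geneA", 150, [1e-10, 1e-6]),
    Gene("geneB", 240, [1e-5]),
    Gene("geneC", 290, [1e-3]),        # no significant hit
    Gene("geneD", 900, [1e-20] * 12),  # too long
    Gene("geneE", 210, [1e-8] * 10),
]

at_least_one = [g.name for g in short_gene_subset(genes, 1)]
at_least_two = [g.name for g in short_gene_subset(genes, 2)]
at_least_ten = [g.name for g in short_gene_subset(genes, 10)]
print(at_least_one, at_least_two, at_least_ten)
```

Because each stricter threshold is applied to the same pool of short genes, the subsets are nested: every gene with ten significant homologies also belongs to the sets requiring one or two.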

Note that most of the starts in the B. subtilis dataset are not experimentally confirmed. Also note that the rates obtained on the small B. subtilis datasets containing 58, 123, 72 and 51 annotated genes are not as representative as the rates on the larger dataset in terms of statistical significance.
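To give a feeling for how dataset size affects the reliability of a rate, the following sketch computes the binomial standard error of a percentage for the set sizes mentioned above. It assumes that each gene is an independent trial, which is a simplification, and the 90% rate is an arbitrary illustrative value, not a result reported here. The function name `standard_error_percent` is hypothetical.

```python
import math


def standard_error_percent(rate_percent, n):
    """Binomial standard error, in percentage points, of a rate measured on n genes."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = rate_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


ILLUSTRATIVE_RATE = 90.0  # arbitrary example value, not a measured result

for n in (51, 58, 72, 123, 854):
    se = standard_error_percent(ILLUSTRATIVE_RATE, n)
    print(f"n = {n:4d}: standard error of about {se:.1f} percentage points")
```

The standard error shrinks with the square root of the number of genes, so a rate measured on a few dozen genes is several times less certain than one measured on several hundred.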

The verified datasets described above were downloaded from the Center of Theoretical Biology (CTB) at Peking University.

The performance of TiCo on high-G+C genomes has been evaluated on Pseudomonas aeruginosa (GC: 66.6%) (Stover et al.), Ralstonia solanacearum (GC: 67.04% and 66.86%) (Salanoubat et al.) and Burkholderia pseudomallei (GC: 67.7% and 68.56%) (Holden et al.). The Pseudomonas data were downloaded from the Pseudomonas aeruginosa Community Annotation Project (PseudoCAP). The Ralstonia and Burkholderia data were downloaded from NCBI.


