Classification of Candidate TIS Sequences
Our method analyzes candidate TIS sequences obtained from the flanking regions of potential start codons. We implemented a clustering algorithm that performs an unsupervised classification of these sequences into strong-TIS and weak-TIS categories. As potential TIS locations we consider the positions of all admissible start codons within a specified search range around the initial TIS predicted by a conventional gene finder. In addition, a potential start codon must share the reading frame of the associated gene, and no in-frame stop codon may occur between the candidate start and the annotated stop. For the initial classification we treat each TIS predicted by the gene finder as a strong TIS and all other candidates within the same search range as weak TIS. Each of the two classes is represented by an inhomogeneous second-order probability model whose probabilities are estimated from the position-dependent trinucleotide occurrences.
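The candidate selection rules can be illustrated with a minimal sketch. It assumes a forward-strand gene, 0-based coordinates, and the start codons ATG, GTG and TTG; the text only speaks of "admissible start codons", so this codon set, the function name candidate_starts and all parameter names are hypothetical.

```python
# Assumed codon sets; the source does not list the admissible start codons.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(seq, predicted_start, stop_pos, search_range):
    """Return in-frame candidate start positions on the forward strand.

    seq             -- DNA string (upper case, forward strand)
    predicted_start -- 0-based index of the gene finder's start codon
    stop_pos        -- 0-based index of the annotated stop codon
    search_range    -- number of nucleotides searched on each side
    """
    lo = max(0, predicted_start - search_range)
    hi = min(stop_pos - 3, predicted_start + search_range)
    # First position >= lo that shares the reading frame of the gene.
    first = lo + (predicted_start - lo) % 3
    candidates = []
    for pos in range(first, hi + 1, 3):
        if seq[pos:pos + 3] not in START_CODONS:
            continue
        # Reject candidates with an in-frame stop before the annotated stop.
        codons = (seq[i:i + 3] for i in range(pos + 3, stop_pos, 3))
        if any(c in STOP_CODONS for c in codons):
            continue
        candidates.append(pos)
    return candidates
```

For example, with the toy sequence "GTGAAAATGCCCTTGGGGTAA", a predicted start at index 6, the stop codon at index 18 and a search range of 9, the function returns the in-frame starts at indices 0, 6 and 12.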
Starting with the above initial classification we iterate the following two successive steps:
Inhomogeneous second-order probability models for the strong and weak categories are estimated from all strong and weak TIS sequences, respectively. We apply positional smoothing to the trinucleotide probabilities using a discretized Gaussian density function. The smoothing parameter sigma can be set through our web interface or adjusted by an automated routine (see the help page). The TIS sequences are extracted as the flanking regions of the potential start codons, with a specified number of upstream and downstream positions. Finally, a second-order positional weight matrix (PWM) is built from the smoothed probabilities by subtracting the logarithms of the position-specific weak-model probabilities from those of the strong model.
The PWM is used to score all TIS candidates. Within each gene-specific search range, the candidate with the highest positive score is classified as strong TIS; all other candidates in that range are classified as weak TIS.
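The two steps can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: smoothing is applied to position-specific trinucleotide counts before normalization, a pseudocount of 0.5 avoids zero probabilities, the Gaussian kernel is truncated at three standard deviations, and a gene without any positively scoring candidate receives no strong TIS. All function names and default values are hypothetical, and the sequences are assumed to have equal length and to contain only A, C, G and T.

```python
import math
from itertools import product

TRINUCS = ["".join(p) for p in product("ACGT", repeat=3)]


def gaussian_kernel(sigma, radius):
    """Discretized Gaussian weights for offsets -radius..radius, summing to 1."""
    w = [math.exp(-0.5 * (d / sigma) ** 2) for d in range(-radius, radius + 1)]
    total = sum(w)
    return [x / total for x in w]


def estimate_model(seqs, length, sigma=1.0, pseudo=0.5):
    """Per position: log P(third base | two preceding bases)."""
    n_pos = length - 2
    counts = [dict.fromkeys(TRINUCS, 0.0) for _ in range(n_pos)]
    for s in seqs:
        for i in range(n_pos):
            counts[i][s[i:i + 3]] += 1.0
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    model = []
    for i in range(n_pos):
        # Positional smoothing: weighted sum over neighbouring positions
        # (the kernel is simply cut off at the sequence ends).
        smoothed = dict.fromkeys(TRINUCS, 0.0)
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n_pos:
                for t in TRINUCS:
                    smoothed[t] += w * counts[j][t]
        logp = {}
        for t in TRINUCS:
            context = sum(smoothed[t[:2] + x] for x in "ACGT")
            logp[t] = math.log((smoothed[t] + pseudo) / (context + 4 * pseudo))
        model.append(logp)
    return model


def build_pwm(strong_seqs, weak_seqs, length, sigma=1.0):
    """Log-odds matrix: strong-model log probability minus weak-model one."""
    strong = estimate_model(strong_seqs, length, sigma)
    weak = estimate_model(weak_seqs, length, sigma)
    return [{t: s[t] - w[t] for t in TRINUCS} for s, w in zip(strong, weak)]


def score(pwm, seq):
    """Sum the log-odds of the trinucleotide found at each position."""
    return sum(col[seq[i:i + 3]] for i, col in enumerate(pwm))


def classify(pwm, candidates_by_gene):
    """Map each gene to the index of its strong candidate, or None."""
    strong = {}
    for gene, seqs in candidates_by_gene.items():
        scores = [score(pwm, s) for s in seqs]
        best = max(range(len(seqs)), key=scores.__getitem__)
        # Assumption: only a positive score qualifies as strong TIS.
        strong[gene] = best if scores[best] > 0 else None
    return strong
```

Conditioning on the two preceding bases is one common way to realize a second-order model; the text itself only states that the probabilities derive from position-dependent trinucleotide occurrences.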
The two steps of estimation and classification are iterated until the classification no longer changes or a maximum of 20 iterations has been reached. The candidates with the maximum score in their respective search ranges are taken as the final TIS predictions of the algorithm.
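The stopping rule can be written as a short generic loop. In this sketch the estimation and classification steps are passed in as hypothetical callables named estimate and classify, and the classification is assumed to be stored as a mapping from each gene to the index of its strong candidate; only the limit of 20 iterations is taken from the text.

```python
def iterate_classification(initial_labels, estimate, classify, max_iter=20):
    """Alternate estimation and classification until the labels stabilize.

    initial_labels -- mapping gene -> index of its strong candidate
    estimate       -- hypothetical callable: labels -> scoring model
    classify       -- hypothetical callable: model -> new labels
    Returns the final labels and the number of iterations performed.
    """
    labels = dict(initial_labels)
    for iteration in range(1, max_iter + 1):
        model = estimate(labels)
        new_labels = classify(model)
        if new_labels == labels:
            # Classification unchanged: the procedure has converged.
            return labels, iteration
        labels = new_labels
    # Iteration limit reached without convergence.
    return labels, max_iter
```

Returning the iteration count alongside the labels makes it easy to tell convergence apart from termination at the iteration limit.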
References to the corresponding publications can be found in the introduction.