Computational analysis and experimental validation of gene predictions in Toxoplasma gondii.

Dybas JM, Madrid-Aliste CJ, Che FY, Nieves E, Rykunov D, Angeletti RH, Weiss LM, Kim K, Fiser A - PLoS ONE (2008)

Bottom Line: The proteomics survey identified 609 proteins that are unique to Toxoplasma as compared to any known species including other Apicomplexan. Through this experimental and computational exercise we benchmarked gene prediction methods and observed false negative rates of 31 to 43%. This study not only provides the largest proteomics exploration of the T. gondii proteome, but illustrates how high throughput proteomics experiments can elucidate correct gene structures in genomes.

Affiliation: Biodefense Proteomics Research Center, Albert Einstein College of Medicine, Bronx, New York, United States of America.

ABSTRACT

Background: Toxoplasma gondii is an obligate intracellular protozoan that infects 20 to 90% of the population. It can cause both acute and chronic infections, many of which are asymptomatic, and, in immunocompromised hosts, can cause fatal infection due to reactivation from an asymptomatic chronic infection. An essential step towards understanding molecular mechanisms controlling transitions between the various life stages and identifying candidate drug targets is to accurately characterize the T. gondii proteome.

Methodology/principal findings: We have explored the proteome of T. gondii tachyzoites with high throughput proteomics experiments and by comparison to publicly available cDNA sequence data. Mass spectrometry analysis validated 2,477 gene coding regions with 6,438 possible alternative gene predictions; approximately one third of the T. gondii proteome. The proteomics survey identified 609 proteins that are unique to Toxoplasma as compared to any known species including other Apicomplexan. Computational analysis identified 787 cases of possible gene duplication events and located at least 6,089 gene coding regions. Commonly used gene prediction algorithms produce very disparate sets of protein sequences, with pairwise overlaps ranging from 1.4% to 12%. Through this experimental and computational exercise we benchmarked gene prediction methods and observed false negative rates of 31 to 43%.

Conclusions/significance: This study not only provides the largest proteomics exploration of the T. gondii proteome, but illustrates how high throughput proteomics experiments can elucidate correct gene structures in genomes.


pone-0003899-g001: Frequency distribution of sequence lengths (amino acid residues per sequence) for TigrScan (dark blue), TwinScan (green), Glimmer (red), Release4 (light blue), and NR (grey) sequences. The frequencies are normalized to the total number of sequences in each dataset. The large graph shows the distributions for sequence lengths of 0–3,000 amino acids. The inset shows the same distributions but for sequence lengths of 0–500 amino acids.

Mentions: The majority of the sequences in the combined dataset (94%) are shorter than 2,000 residues, though some sequences are as long as 14,514 residues. The averages, ranges, and distributions of the sequence lengths already suggest substantial differences among the protein datasets (Table 2, Fig. 1). Glimmer-predicted sequences (1,077 residues average length) are, on average, much longer than the TigrScan, TwinScan, Release4, or NR sequences (681, 614, 719, and 510 residues average length, respectively). Furthermore, the maximum sequence length for TigrScan (9,696 residues) is substantially shorter than the maximum lengths of the TwinScan, Glimmer, Release4, or NR sequences (13,936, 14,514, 11,862, and 12,269 residues, respectively). The NR sequences have the shortest average length (510 residues), and their distribution peaks at approximately 200–300 residues. The median lengths of the datasets follow the same ordering as the average lengths: the Glimmer sequences had the highest median length (695), followed by Release4 (472), TigrScan (457), TwinScan (395), and the NR sequences (341). The shorter average and median lengths of the NR sequences are not necessarily an inherent characteristic of T. gondii proteins but could be an artifact, because the proteins studied so far are likely not a representative sample of the T. gondii proteome. Rather, the predominance of shorter sequences could be a consequence of the greater ease of studying these proteins experimentally, especially in the past.
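
The kind of comparison described above (average, median, and maximum length per dataset, plus frequencies normalized to each dataset's total number of sequences, as in Fig. 1) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the 100-residue bin width, and the toy length lists are all assumptions made here for the example, and the numbers are invented rather than taken from the study.

```python
from collections import Counter
from statistics import mean, median


def length_summary(lengths):
    """Return the mean, median, and maximum of a list of sequence lengths."""
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}


def normalized_histogram(lengths, bin_width=100):
    """Fraction of sequences per length bin, keyed by the bin's lower edge.

    Dividing by the dataset size makes datasets of different sizes comparable.
    The bin width is an arbitrary choice for this sketch.
    """
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    total = len(lengths)
    return {start: counts[start] / total for start in sorted(counts)}


# Toy lengths, invented for illustration only (not data from the study).
toy_datasets = {
    "predictor_A": [120, 250, 260, 480, 950, 3100],
    "curated_B": [90, 210, 230, 280, 300, 640],
}

for name, lengths in toy_datasets.items():
    print(name, length_summary(lengths))
    print(name, normalized_histogram(lengths))
```

A single very long sequence pulls the mean well above the median, which is one reason to report both statistics when comparing gene prediction sets.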

