Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data.

Matthews BB, Dos Santos G, Crosby MA, Emmert DB, St Pierre SE, Gramates LS, Zhou P, Schroeder AJ, Falls K, Strelets V, Russo SM, Gelbart WM, FlyBase Consorti - G3 (Bethesda) (2015)

Bottom Line: FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.


Affiliation: Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138 bmatthew@morgan.harvard.edu.


Changes to the FlyBase gene model annotation set in the era of high-throughput data. The gene model annotation set of FlyBase version FB2010_01 (R5.24), the last version to predate FlyBase incorporation of high-throughput data, was compared to that of FlyBase version FB2014_06 (R6.03) to determine the degree of change over the course of 26 annotation updates. Gene model annotations common to both sets (white) were identified. Gene model annotations specific to R6.03 were then examined to identify those derived from R5.24-specific gene model annotations through gene merge/split or reclassification (yellow). The remaining R5.24-specific annotations were classified as “withdrawn” (red), and the R6.03-specific annotations were classified as “new” (blue). The number of gene models within each category of status change is shown for protein-coding genes (A). For 13,003 of the 13,112 protein-coding genes common to both R5.24 and R6.03 (excluding nine complex cases), we also examined the degree to which the number of associated transcripts changed: a measure of gene model complexity. For these genes, the numbers of associated transcripts that persisted (white), were deleted (red), or were added (blue) at some point between R5.24 and R6.03 are shown (B). Changes to the set of annotated pseudogenes (C) and non-coding RNA genes (D) are also shown.

Mentions: From R5.24 to R6.03, the total number of protein-coding genes increased from 13,808 to 13,918, an increase of only 110 genes (∼0.8%). This modest net change masks a higher degree of flux in protein-coding gene annotations that becomes apparent when individual annotations are tracked between the two releases (Figure 1A, File S1). In this interval, 92 protein-coding genes were deleted, 447 were created, and 604 were split, merged, or reclassified to yield 359 new coding genes. Protein-coding gene annotations that persisted from R5.24 to R6.03 also experienced a large degree of flux. These shared gene models are identified by stable annotation IDs (Table S2), but the number of associated transcripts can change substantially. For the 13,112 genes that persisted between R5.24 and R6.03, 1903 transcript isoforms were deleted and 9566 new transcript isoforms were added (Figure 1B, File S2); new transcripts added to existing protein-coding gene models account for most of the 39% increase in the number of protein-coding transcripts. These changes are largely due to the flood of exon junction data, which led to a significant increase in the number of alternatively spliced transcripts per gene, despite our conservative approach to creating new transcript isoforms. This effect on the predicted proteome is also reflected in the number of unique predicted polypeptides, which increased by 20% from R5.24 to R6.03 (Table 1). The availability of high-throughput data, particularly the RNA-Seq coverage data and the TSS data, also made possible the annotation of UTRs in cases where sparse or no cDNA/EST support had previously precluded any such annotation. As such, the number of mRNA annotations lacking 5′ and/or 3′ UTRs is dramatically reduced (five-fold) in R6.03 compared to R5.24 (Table S3, File S2).
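The gene counts quoted above can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not part of the FlyBase pipeline. It assumes that the 604 split, merged, or reclassified genes are counted on the R5.24 side and the 359 resulting genes on the R6.03 side, which is one reading of the text. The constant and function names are hypothetical.

```python
# Counts quoted in the text for protein-coding genes (R5.24 -> R6.03).
R5_24_TOTAL = 13_808
R6_03_TOTAL = 13_918
PERSISTED = 13_112         # genes common to both releases
DELETED = 92               # R5.24 genes withdrawn
CREATED = 447              # R6.03 genes newly created
RESTRUCTURED_OLD = 604     # assumption: R5.24 genes split, merged, or reclassified
RESTRUCTURED_NEW = 359     # assumption: R6.03 genes resulting from those changes

# Transcript isoform changes within the persisting genes.
TRANSCRIPTS_DELETED = 1_903
TRANSCRIPTS_ADDED = 9_566


def reconcile():
    """Rebuild each release total from its categories and derive net changes."""
    r5_sum = PERSISTED + DELETED + RESTRUCTURED_OLD
    r6_sum = PERSISTED + CREATED + RESTRUCTURED_NEW
    net_genes = r6_sum - r5_sum
    return {
        "r5_24_sum": r5_sum,
        "r6_03_sum": r6_sum,
        "net_gene_change": net_genes,
        "net_gene_change_pct": round(100 * net_genes / r5_sum, 1),
        # R5.24 genes that did not survive unchanged, despite the small net change.
        "r5_24_genes_changed": DELETED + RESTRUCTURED_OLD,
        "net_transcript_change": TRANSCRIPTS_ADDED - TRANSCRIPTS_DELETED,
    }


if __name__ == "__main__":
    summary = reconcile()
    assert summary["r5_24_sum"] == R5_24_TOTAL
    assert summary["r6_03_sum"] == R6_03_TOTAL
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Under this reading the category counts add up to the reported totals for both releases, and the net change of 110 genes (about 0.8%) sits alongside 696 R5.24 gene models that were withdrawn or restructured.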

