GOParGenPy: a high throughput method to generate gene ontology data matrices.

Kumar AA, Holm L, Toronen P - BMC Bioinformatics (2013)

Bottom Line: GOParGenPy can use any available version of the GO structure and allows the user to select the source of GO annotation. GO structure selection is critical for analysis, as we show that GO classes have rapid turnover between different GO structure releases. GOParGenPy is an easy-to-use software tool which can generate sparse or full binary matrices from GO-annotated gene sets.


Affiliation: Institute of Biotechnology, University of Helsinki, PO Box 56, (Viikinkaari 5), Helsinki 00014, Finland. ajay.kumar@helsinki.fi

ABSTRACT

Background: Gene Ontology (GO) is a popular standard in the annotation of gene products and provides information related to genes across all species. The structure of GO is dynamic and is updated on a daily basis. However, the popular existing methods use outdated versions of GO. Moreover, these tools are slow to process large datasets consisting of more than 20,000 genes.

Results: We have developed GOParGenPy, a platform-independent software tool that generates a binary data matrix showing the GO class membership, including parental classes, of a set of GO-annotated genes. GOParGenPy is at least an order of magnitude faster than popular tools for Gene Ontology analysis, and it can handle larger datasets than the existing tools. It can use any available version of the GO structure and allows the user to select the source of GO annotation. GO structure selection is critical for analysis, as we show that GO classes have rapid turnover between different GO structure releases.

Conclusions: GOParGenPy is an easy-to-use software tool which can generate sparse or full binary matrices from GO-annotated gene sets. The resulting binary matrix can then be used in any analysis environment and with any analysis method.
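The abstract describes a binary matrix of GO class membership that includes parental classes. The idea can be sketched as follows. This is only an illustration and not GOParGenPy's actual code: the toy ontology, the gene names, and the function names `ancestors` and `membership_matrix` are all hypothetical, and a real GO structure would be read from a GO release file rather than typed in by hand.

```python
def ancestors(term, parents):
    """Return every class reachable from `term` via parent links (excluding `term`)."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def membership_matrix(annotations, parents):
    """Build a full binary gene-by-class matrix, propagating each annotation to parents."""
    expanded = {}
    for gene, terms in annotations.items():
        classes_for_gene = set(terms)
        for term in terms:
            classes_for_gene |= ancestors(term, parents)
        expanded[gene] = classes_for_gene

    genes = sorted(expanded)                              # row names
    classes = sorted(set().union(*expanded.values()))     # column names
    matrix = [[1 if c in expanded[g] else 0 for c in classes] for g in genes]
    return genes, classes, matrix


# Hypothetical toy ontology: child class -> list of direct parent classes.
parents = {"leaf1": ["mid"], "leaf2": ["mid", "root"], "mid": ["root"]}
# Hypothetical direct annotations: gene -> annotated classes.
annotations = {"geneA": ["leaf1"], "geneB": ["leaf2"], "geneC": ["root"]}

genes, classes, matrix = membership_matrix(annotations, parents)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

In this toy case only classes that appear in the annotations, or are parents of such classes, become columns, which mirrors the description of the reported GO classes.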


Figure 2: Generation of a sparse matrix with toy data. The figure shows how a normal full matrix is converted to a sparse matrix. For each non-zero entry in the original matrix, the sparse matrix stores three values: the row index, the column index and the value in the cell. GOParGenPy also reports the column and row names in separate files. Here the name tables also include indexing values for clarity.

Mentions: Finally, the user can specify whether a sparse or full binary matrix is generated, with genes as row names and GO classes as column names. The reported GO classes are those occurring in the input annotation file and their parent nodes. Selecting the sparse matrix option is highly recommended, as the package is intended for large datasets (>20,000 GO-annotated genes). Sparse matrices are memory-efficient representations of matrices in which most of the values are zero. This is the case with GO data matrices, as a large proportion of GO classes have fewer than one percent of genes as members, and non-members are given the value zero. We use a sparse matrix representation with three columns. These columns hold the row number and column number of each non-zero value and the value in the cell. Figure 2 demonstrates this process.
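The three-column representation described above can be illustrated with a small sketch. It is not taken from GOParGenPy itself: the function names, the toy gene and class labels, and the choice of 1-based indices are assumptions made for the example, since the excerpt does not state how the indices are numbered.

```python
def full_to_sparse(matrix):
    """Return (row, column, value) triplets for every non-zero cell.

    Indices are 1-based here; this numbering is an assumption of the sketch.
    """
    triplets = []
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:
                triplets.append((i, j, value))
    return triplets


def sparse_to_full(triplets, n_rows, n_cols):
    """Rebuild the full matrix from triplets; all unlisted cells are zero."""
    full = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triplets:
        full[i - 1][j - 1] = value
    return full


# Hypothetical toy data: rows are genes, columns are GO classes.
row_names = ["geneA", "geneB", "geneC"]
col_names = ["class1", "class2", "class3", "class4"]
full = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

triplets = full_to_sparse(full)
for i, j, value in triplets:
    # Names are kept in separate tables and looked up by index.
    print(i, j, value, row_names[i - 1], col_names[j - 1])
```

Only three of the twelve cells are stored in this toy case. The saving grows with the proportion of zeros, which is why the sparse option suits GO matrices where most classes contain very few genes. Note that a gene with no non-zero cells (`geneC` here) leaves no triplet, so the separate name tables are needed to recover the matrix dimensions.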

