Simcluster: clustering enumeration gene expression data on the simplex space.

Vêncio RZ, Varuzza L, de B Pereira CA, Brentani H, Shmulevich I - BMC Bioinformatics (2007)

Bottom Line: These properties are not present on regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool.


Affiliation: Institute for Systems Biology, 1441 North 34th street, Seattle, WA 98103-8904, USA. rvencio@gmail.com

ABSTRACT

Background: Transcript enumeration methods such as SAGE, MPSS, and sequencing-by-synthesis EST "digital northern" are important high-throughput techniques for digital gene expression measurement. Like other counting or voting processes, these measurements constitute compositional data, which exhibit properties particular to the simplex space, where the sum of the components is constrained. These properties are not present on regular Euclidean spaces, on which hybridization-based microarray data is often modeled. Therefore, pattern recognition methods commonly used for microarray data analysis may be non-informative for data generated by transcript enumeration techniques, since they ignore certain fundamental properties of the simplex space.

Results: Here we present a software tool, Simcluster, designed to perform clustering analysis for data on the simplex space. We present Simcluster as a stand-alone command-line C package and as a user-friendly on-line tool. Both versions are available at: http://xerad.systemsbiology.net/simcluster.

Conclusion: Simcluster is designed in accordance with a well-established mathematical framework for compositional data analysis, which provides principled procedures for dealing with the simplex space, and is thus applicable in a number of contexts, including enumeration-based gene expression data.
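
To make the contrast between the simplex and Euclidean space concrete, the following minimal sketch compares plain Euclidean distance on raw tag counts with the Aitchison distance, a standard distance in compositional data analysis. It is an illustration under stated assumptions, not Simcluster's actual implementation: the function names and the three toy libraries are hypothetical, and every count is assumed to be strictly positive (zero counts would need separate handling before taking logarithms).

```python
import math

def closure(counts):
    """Rescale counts so they sum to 1, i.e. project them onto the simplex."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(composition):
    """Centered log-ratio transform: log of each part minus the mean log."""
    logs = [math.log(x) for x in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed compositions."""
    return math.dist(clr(closure(x)), clr(closure(y)))

# Hypothetical tag counts for three libraries (assumed, not from the paper).
lib_a = [120, 30, 50]
lib_b = [1200, 300, 500]  # same composition as lib_a, sequenced 10x deeper
lib_c = [60, 90, 50]      # a genuinely different composition

print("Euclidean a-b:", math.dist(lib_a, lib_b))
print("Aitchison a-b:", aitchison_distance(lib_a, lib_b))
print("Aitchison a-c:", aitchison_distance(lib_a, lib_c))
```

In this sketch the two libraries with identical relative abundances are far apart in raw Euclidean terms only because of sequencing depth, while their Aitchison distance is zero up to rounding.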


Figure 4: Clustering of simulated data using Euclidean distance. Transcript enumeration data produced by the simulation of a virtual transcriptome according to the Affymetrix expression levels. Sample size n = 100,000,000. Method: Euclidean distance with average linkage agglomerative hierarchical clustering.
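
The method named in the caption, agglomerative hierarchical clustering with average linkage on Euclidean distances, can be sketched in a few lines. This is a naive illustrative version, not the code used to produce the figure: the function name `average_linkage` and the four toy points are assumptions, and the quadratic search over cluster pairs is only suitable for small inputs.

```python
import math
from itertools import combinations

def average_linkage(points):
    """Naive average-linkage clustering on Euclidean distances.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def avg_dist(a, b):
        # Mean of all pairwise distances between members of a and b.
        pair_sum = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return pair_sum / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((a, b, avg_dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical two-dimensional profiles for four libraries.
libraries = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]]
merges = average_linkage(libraries)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The merge history is what a dendrogram draws: each tuple is one join, and the distance gives the height of that join.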

Mentions: The total number of virtually sequenced tags, N, for each sample is drawn from a Poisson distribution, N ~ Poisson(n), to create a realistic virtual sequencing library. For example, the simulation for n = 1,000,000 virtually sequenced tags assigned N = 1,001,794 to the LPS-120 library, N = 998,382 to the CPG-40 library, and so on. The same process is repeated for increasing n, from 100,000 to 100,000,000. Since n ≪ T for all n considered, multinomial sampling is used and its mean is taken for each library, according to the assumed "infinite urn" model. The results for the largest simulation are shown in Figures 3, 4, 5, 6; all generated data and the individual results for each n are available as Additional File 3.
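
The sampling scheme described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' simulation code: the tag names and abundances are hypothetical, n is kept small so the example runs quickly, and the Poisson draw uses a simple arrival-counting method chosen here because Python's standard library has no Poisson sampler. The sketch returns both one multinomial draw and the multinomial mean N·p, the latter being the quantity the text says is taken for each library.

```python
import random
from collections import Counter

def poisson_draw(mean, rng):
    """Draw from Poisson(mean) by counting unit-rate arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_library(abundances, n, rng):
    """Simulate one virtual sequencing library.

    abundances: mapping of tag -> relative expression level (hypothetical).
    n: nominal library size; the realized size is N ~ Poisson(n).
    """
    tags = list(abundances)
    total = sum(abundances.values())
    probs = [abundances[t] / total for t in tags]
    N = poisson_draw(n, rng)
    # One multinomial draw: N tags sampled with replacement ("infinite urn").
    counts = Counter(rng.choices(tags, weights=probs, k=N))
    # Mean of the multinomial for this realized library size.
    expected = {t: N * p for t, p in zip(tags, probs)}
    return N, counts, expected

rng = random.Random(0)
abundances = {"tagA": 500.0, "tagB": 300.0, "tagC": 200.0}  # assumed values
N, counts, expected = simulate_library(abundances, 10_000, rng)
print("realized library size N:", N)
print("sampled counts:", dict(counts))
print("multinomial mean:", expected)
```

Sampling with replacement is a reasonable stand-in for sequencing only when the library is much smaller than the pool it is drawn from, which is the n ≪ T condition stated in the text.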

