Limits...
Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data.

Aflitos SA, Severing E, Sanchez-Perez G, Peters S, de Jong H, de Ridder D - BMC Bioinformatics (2015)

Bottom Line: We successfully simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy at supra-species level and 78% accuracy at the species level.CNIDARIA allows for fast, resource-efficient comparison and identification of both raw and assembled genome and transcriptome data.This can help answer both fundamental (e.g. in phylogeny, ecological diversity analysis) and practical questions (e.g. sequencing quality control, primer design).

View Article: PubMed Central - PubMed

Affiliation: Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands. sauloalves.aflitos@wur.nl.

ABSTRACT

Background: Identification of biological specimens is a requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but generally do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances.

Results: We present Cnidaria, a practical tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances. We successfully simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy at supra-species level and 78% accuracy at the species level.

Conclusion: CNIDARIA allows for fast, resource-efficient comparison and identification of both raw and assembled genome and transcriptome data. This can help answer both fundamental (e.g. in phylogeny, ecological diversity analysis) and practical questions (e.g. sequencing quality control, primer design).

No MeSH data available.


1-nearest-neighbour analysis for species and supra-species levels at each taxonomic level for CNIDARIA and REFERENCEFREE using 21-mers and Jaccard distance. Supra-species level analysis contains 30 samples (Additional file 5: Table S5) from 8 genus, 7 families, 7 orders, 4 phylum and 3 kingdoms. Species level analysis contains 33 samples (Additional file 5: Table S5) from 11 species of the Solanum clade. Classification reports the Leave-One-Out Cross-Validation error estimate (LOOCV) for 21-mers. Error bars indicate the minimum and maximum performance found across the 71 distance metrics tested
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
getmorefigures.php?uid=PMC4630969&req=5

Fig2: 1-nearest-neighbour analysis for species and supra-species levels at each taxonomic level for CNIDARIA and REFERENCEFREE using 21-mers and Jaccard distance. Supra-species level analysis contains 30 samples (Additional file 5: Table S5) from 8 genus, 7 families, 7 orders, 4 phylum and 3 kingdoms. Species level analysis contains 33 samples (Additional file 5: Table S5) from 11 species of the Solanum clade. Classification reports the Leave-One-Out Cross-Validation error estimate (LOOCV) for 21-mers. Error bars indicate the minimum and maximum performance found across the 71 distance metrics tested

Mentions: To verify the accuracy of the clustering of the samples, we used the 1-nearest neighbour algorithm on 30 samples for supra-species level analysis (8 genus, 7 families, 7 orders, 4 phylum and 3 kingdoms, described in Additional file 5: Table S5) and on 33 samples for species level analysis (11 species of the Solanum clade, described in Additional file 6: Table S6). The 1-nearest neighbour classifier reports the percentage of samples for which the sample with the smallest distance belongs to the same rank at each phylogenetic level (species, genus, family, order, phylum and kingdom). We report the percentage of samples correctly classified in Fig. 2 and Additional file 7: Table S7.Fig. 2


Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data.

Aflitos SA, Severing E, Sanchez-Perez G, Peters S, de Jong H, de Ridder D - BMC Bioinformatics (2015)

1-nearest-neighbour analysis for species and supra-species levels at each taxonomic level for CNIDARIA and REFERENCEFREE using 21-mers and Jaccard distance. Supra-species level analysis contains 30 samples (Additional file 5: Table S5) from 8 genus, 7 families, 7 orders, 4 phylum and 3 kingdoms. Species level analysis contains 33 samples (Additional file 5: Table S5) from 11 species of the Solanum clade. Classification reports the Leave-One-Out Cross-Validation error estimate (LOOCV) for 21-mers. Error bars indicate the minimum and maximum performance found across the 71 distance metrics tested
© Copyright Policy - OpenAccess
Related In: Results  -  Collection

License 1 - License 2
Show All Figures
getmorefigures.php?uid=PMC4630969&req=5

Fig2: 1-nearest-neighbour analysis for species and supra-species levels at each taxonomic level for CNIDARIA and REFERENCEFREE using 21-mers and Jaccard distance. Supra-species level analysis contains 30 samples (Additional file 5: Table S5) from 8 genus, 7 families, 7 orders, 4 phylum and 3 kingdoms. Species level analysis contains 33 samples (Additional file 5: Table S5) from 11 species of the Solanum clade. Classification reports the Leave-One-Out Cross-Validation error estimate (LOOCV) for 21-mers. Error bars indicate the minimum and maximum performance found across the 71 distance metrics tested
Mentions: To verify the accuracy of the clustering of the samples, we used the 1-nearest neighbour algorithm on 30 samples for supra-species level analysis (8 genus, 7 families, 7 orders, 4 phylum and 3 kingdoms, described in Additional file 5: Table S5) and on 33 samples for species level analysis (11 species of the Solanum clade, described in Additional file 6: Table S6). The 1-nearest neighbour classifier reports the percentage of samples for which the sample with the smallest distance belongs to the same rank at each phylogenetic level (species, genus, family, order, phylum and kingdom). We report the percentage of samples correctly classified in Fig. 2 and Additional file 7: Table S7.Fig. 2

Bottom Line: We successfully simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy at supra-species level and 78% accuracy at the species level.CNIDARIA allows for fast, resource-efficient comparison and identification of both raw and assembled genome and transcriptome data.This can help answer both fundamental (e.g. in phylogeny, ecological diversity analysis) and practical questions (e.g. sequencing quality control, primer design).

View Article: PubMed Central - PubMed

Affiliation: Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands. sauloalves.aflitos@wur.nl.

ABSTRACT

Background: Identification of biological specimens is a requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but generally do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances.

Results: We present Cnidaria, a practical tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances. We successfully simultaneously clustered 169 genomic and transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy at supra-species level and 78% accuracy at the species level.

Conclusion: CNIDARIA allows for fast, resource-efficient comparison and identification of both raw and assembled genome and transcriptome data. This can help answer both fundamental (e.g. in phylogeny, ecological diversity analysis) and practical questions (e.g. sequencing quality control, primer design).

No MeSH data available.