Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains


ABSTRACT

Comparing microbial sequencing data is critical to understanding the dynamics of microbial communities. Alignment-based tools for analyzing metagenomic datasets require reference sequences and read alignments. Existing alignment-free dissimilarity approaches model the background sequences with a Fixed Order Markov Chain (FOMC) and have yielded promising results for comparing microbial communities. However, in an FOMC the number of parameters grows exponentially with the order of the Markov Chain (MC). At a fixed high order, the parameters may not be estimated accurately because sequencing depth is limited. In this study, we investigate an alternative to FOMC: modeling background sequences in metatranscriptomic data with the data-driven Variable Length Markov Chain (VLMC). The VLMC, originally designed for long sequences, was extended to high-throughput sequencing reads, and strategies to estimate the corresponding parameters were developed. Because the number of parameters in a VLMC is flexible, it avoids estimating the vast number of parameters of a high-order MC under limited sequencing depth. Unlike FOMC, where the order is selected manually, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC in modeling the background sequences of transcriptomic and metatranscriptomic samples. A software pipeline is available at https://d2vlmc.codeplex.com.
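The exponential growth mentioned in the abstract can be made concrete with a short count. The sketch below is illustrative only and is not taken from the authors' pipeline: it assumes a four-letter nucleotide alphabet, counts free transition probabilities as (number of contexts) × (alphabet size − 1), and uses hypothetical function names and a made-up set of VLMC contexts.

```python
# Illustrative parameter counts; function names and the example
# context set are assumptions, not part of the published software.

def fomc_free_parameters(order: int, alphabet_size: int = 4) -> int:
    """Free transition probabilities of a fixed-order Markov chain.

    An order-k chain has alphabet_size**k contexts, and each context
    carries (alphabet_size - 1) free probabilities (they sum to one).
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    return alphabet_size ** order * (alphabet_size - 1)


def vlmc_free_parameters(contexts: set[str], alphabet_size: int = 4) -> int:
    """Free parameters of a variable-length chain that keeps only the
    given contexts, which may have different lengths."""
    return len(contexts) * (alphabet_size - 1)


if __name__ == "__main__":
    for k in (0, 2, 4, 6, 8, 10):
        print(f"FOMC order {k:2d}: {fomc_free_parameters(k)} free parameters")

    # Hypothetical pruned context set: long contexts are kept only where
    # the data support them, so the count no longer depends on 4**k.
    kept = {"A", "C", "G", "TA", "TC", "TG", "TTACG"}
    print(f"VLMC with {len(kept)} contexts: {vlmc_free_parameters(kept)} free parameters")
```

Under these assumptions the fixed-order count is tied to the chosen order, whereas the variable-length count depends only on how many contexts survive pruning, which is the flexibility the abstract refers to.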



Figure 4: The reference and clustering trees based on various models for different depths of metatranscriptomic marine samples in Experiment 4. (a) Reference tree of metatranscriptomic samples from different ocean depths. (b) The best clustering tree with the VLMC background sequence model. (c) The best clustering tree when using FOMC and lp-norm measures.


Mentions: Table 3 shows the triples distance between the reference tree and the clustering trees derived with different dissimilarity measures and background sequence models. With the VLMC background sequence model, two of the dissimilarity measures recover the reference tree. The best results from the VLMC and FOMC background sequence models both show clear separation between the metagenomic and metatranscriptomic groups, as shown in Fig. 4b and Fig. 4c, respectively. Under both background sequence models, samples from the same depth are clustered first, then the samples from the photic zone (25 m, 75 m and 125 m) are merged, and finally the samples from the mesopelagic zone (500 m) are joined. However, under the FOMC background sequence model, the metatranscriptomic samples from 25 m and 125 m are clustered first, which is inconsistent with the depth gradient. In contrast, the VLMC background sequence model clusters the metagenomic and metatranscriptomic samples as expected, joining 25 m and 75 m first and then 125 m.
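As a rough illustration of how a triples distance compares a clustering tree with a reference tree, the sketch below counts the leaf triples whose rooted topology differs between two trees. It is a minimal sketch under stated assumptions: trees are written as nested Python tuples, the raw (unnormalized) count is returned, the two toy trees only mimic the depth ordering described above and are not the trees from Fig. 4, and the exact definition used in the paper may differ.

```python
from itertools import combinations

# Trees are nested tuples; any non-tuple value is a leaf label.
# All names and example trees here are hypothetical.

def leaf_paths(tree, path=(), out=None):
    """Map each leaf label to the tuple of child indices from the root."""
    if out is None:
        out = {}
    if isinstance(tree, tuple):
        for i, child in enumerate(tree):
            leaf_paths(child, path + (i,), out)
    else:
        out[tree] = path
    return out


def lca_depth(p, q):
    """Depth of the lowest common ancestor: length of the shared path prefix."""
    depth = 0
    for a, b in zip(p, q):
        if a != b:
            break
        depth += 1
    return depth


def triple_topology(paths, a, b, c):
    """Return the pair grouped together in the triple, or None if unresolved."""
    depths = {
        frozenset(pair): lca_depth(paths[pair[0]], paths[pair[1]])
        for pair in ((a, b), (a, c), (b, c))
    }
    deepest = max(depths.values())
    winners = [pair for pair, d in depths.items() if d == deepest]
    return winners[0] if len(winners) == 1 else None


def triples_distance(tree1, tree2):
    """Number of leaf triples whose rooted topology differs between the trees."""
    p1, p2 = leaf_paths(tree1), leaf_paths(tree2)
    if set(p1) != set(p2):
        raise ValueError("trees must have the same leaf labels")
    return sum(
        triple_topology(p1, *t) != triple_topology(p2, *t)
        for t in combinations(sorted(p1), 3)
    )


# Toy trees: the first follows the depth gradient, the second joins
# 25 m and 125 m first, as described for the FOMC result.
gradient_tree = ((("25m", "75m"), "125m"), "500m")
swapped_tree = ((("25m", "125m"), "75m"), "500m")

if __name__ == "__main__":
    print(triples_distance(gradient_tree, gradient_tree))
    print(triples_distance(gradient_tree, swapped_tree))
```

In this toy case only the triple {25 m, 75 m, 125 m} changes topology, so the two trees differ in one of four triples, while identical trees have distance zero.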

