Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains

View Article: PubMed Central - PubMed

ABSTRACT

Comparing microbial sequencing data is critical to understanding the dynamics of microbial communities. Alignment-based tools for analyzing metagenomic datasets require reference sequences and read alignments. Existing alignment-free dissimilarity approaches model the background sequences with a Fixed Order Markov Chain (FOMC) and have yielded promising results for the comparison of microbial communities. However, in an FOMC the number of parameters grows exponentially with the order of the Markov Chain (MC). Under a fixed high order, the parameters might not be estimated accurately because sequencing depth is limited. In this study, we investigate an alternative to FOMC: modeling background sequences in metatranscriptomic data with the data-driven Variable Length Markov Chain (VLMC). The VLMC, originally designed for long sequences, was extended to high-throughput sequencing reads, and strategies to estimate the corresponding parameters were developed. The flexible number of parameters in a VLMC avoids estimating the vast number of parameters of a high-order MC under limited sequencing depth. Unlike FOMC, where the order is selected manually, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC in modeling the background sequences of transcriptomic and metatranscriptomic samples. A software pipeline is available at https://d2vlmc.codeplex.com.
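
To make the exponential growth concrete, the following minimal sketch counts the free transition probabilities of a fixed-order Markov chain. It assumes a four-letter nucleotide alphabet; the function name `fomc_free_parameters` is illustrative and not part of the published pipeline.

```python
def fomc_free_parameters(order: int, alphabet_size: int = 4) -> int:
    """Free transition probabilities of a fixed-order Markov chain.

    Each of the alphabet_size**order contexts has a distribution over
    alphabet_size symbols, of which alphabet_size - 1 are free because
    the probabilities must sum to one.
    """
    return (alphabet_size - 1) * alphabet_size ** order


# Print the parameter count for a few orders (nucleotide alphabet assumed).
for k in (0, 1, 3, 6, 10):
    print(k, fomc_free_parameters(k))
```

Every additional order multiplies the count by four, so at a fixed sequencing depth the number of observations available per parameter shrinks quickly.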



© Copyright Policy - open-access

Fig. 9: Probability density distributions of 22 Marine Microbial Eukaryotes RNA-Seq data.

The value of K is determined by minimizing AICR(K). However, no simple analytical formula relates K to AICR(K), which makes it challenging to find the optimal K for all sequencing data. To solve this problem, we developed the following heuristic approach. In our study, a branch is pruned when its Kullback-Leibler divergence is less than the threshold K; K is therefore meaningful only when it lies within the range of the observed Kullback-Leibler divergence values. For the experiment on 22 Marine Microbial Eukaryotes, the probability density distribution of the Kullback-Leibler divergence is shown in Fig. 9. For most tuples, the Kullback-Leibler divergence is between 100 and 500. Optimal results are generally obtained with K set near the peak and the two inflection points to its right (points A, B and C in Fig. 9). Hence, we perform a local search only around these three points for the K that minimizes the loss function AICR(K).
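
The heuristic described above can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the divergence is assumed to be the count-weighted Kullback-Leibler statistic commonly used for VLMC pruning, the density is a simple Gaussian kernel estimate, and the loss is any caller-supplied function standing in for AICR(K), which is not defined in this excerpt. All function names, the bandwidth, and the search radius are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def weighted_kl(child: Dict[str, int], parent: Dict[str, int]) -> float:
    """Count-weighted KL divergence of a child context from its parent.

    Assumption: values are next-symbol counts and the parent's counts
    include the child's, so the parent probability is never zero where
    the child's is positive.
    """
    n_child, n_parent = sum(child.values()), sum(parent.values())
    kl = 0.0
    for sym, c in child.items():
        if c > 0:
            p, q = c / n_child, parent[sym] / n_parent
            kl += p * math.log(p / q)
    return n_child * kl


def candidate_thresholds(values: Sequence[float], n_grid: int = 200) -> List[float]:
    """Return the density peak and up to two inflection points to its right."""
    lo, hi = min(values), max(values)
    bandwidth = (hi - lo) / 20 or 1.0  # hypothetical default bandwidth
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    # Gaussian kernel density estimate evaluated on the grid.
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))
    dens = [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]
    peak = max(range(n_grid), key=dens.__getitem__)
    # Discrete second derivative; second[j] belongs to grid[j + 1].
    second = [dens[i - 1] - 2 * dens[i] + dens[i + 1] for i in range(1, n_grid - 1)]
    inflections = [
        grid[j + 1]
        for j in range(max(peak, 1), len(second))
        if second[j - 1] * second[j] < 0  # curvature changes sign
    ]
    return [grid[peak]] + inflections[:2]


def local_search(
    candidates: Sequence[float],
    loss: Callable[[float], float],
    radius: float = 50.0,
    steps: int = 5,
) -> Tuple[float, float]:
    """Evaluate the loss on a small grid around each candidate K."""
    best_k, best_loss = float("nan"), math.inf
    for c in candidates:
        for j in range(-steps, steps + 1):
            k = c + radius * j / steps
            if k <= 0:
                continue  # a threshold must be positive to prune anything
            value = loss(k)
            if value < best_loss:
                best_k, best_loss = k, value
    return best_k, best_loss


# Hypothetical divergences and a stand-in loss; replace with real AICR(K).
divergences = [100 + (i * 37) % 400 for i in range(200)]
candidates = candidate_thresholds(divergences)
best_k, best_loss = local_search(candidates, lambda k: (k - 300.0) ** 2)
print(candidates, best_k, best_loss)
```

Restricting the search to a few windows keeps the number of loss evaluations small, which matters if each evaluation of the loss requires re-pruning the tree for a new K.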

