Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome.

Li W, Freudenberg J, Miramontes P - BMC Bioinformatics (2014)

Bottom Line: We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.


Affiliation: The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA. wtli2012@gmail.com.

ABSTRACT

Background: The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of aligning reads to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read length on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp.
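As a rough illustration of the quantity discussed above, the following Python sketch counts k-mers in a toy sequence and reports the share that are not singletons. It is only a sketch under stated assumptions: the function names are hypothetical, only the forward strand is counted, and the proportion is taken over k-mer positions rather than distinct k-mer types. The abstract does not specify these details, and a whole-genome analysis would need a dedicated k-mer counter rather than an in-memory dictionary.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer on the forward strand of seq."""
    seq = seq.upper()
    # range() is empty when k is longer than the sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def non_singleton_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer positions whose k-mer occurs more than once.

    Assumption: the proportion is taken over positions, not over
    distinct k-mer types.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


if __name__ == "__main__":
    toy = "ACGTACGT"  # toy sequence, not genomic data
    for k in (2, 4, 8):
        print(k, non_singleton_fraction(toy, k))
```

On a real reference one would expect this fraction to fall as k grows, which is the trend the paper quantifies.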

Results: We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank-frequency plots exhibit a concave Zipf's curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications.
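A piecewise power-law fit of the kind described above can be sketched as two straight-line fits in log-log space, one on each side of a breakpoint. The code below is a minimal sketch, not the authors' procedure: the function names are hypothetical, the breakpoint is supplied by hand, and the data are synthetic values generated from arbitrary exponents (0.8 and 0.3) purely so the fit can be checked. None of these numbers come from the paper.

```python
import numpy as np


def fit_power_law(k, p):
    """Least-squares fit of p = c * k**(-a) in log-log space; returns (a, c)."""
    slope, intercept = np.polyfit(np.log(k), np.log(p), 1)
    return float(-slope), float(np.exp(intercept))


def fit_piecewise(k, p, breakpoint):
    """Fit separate power laws for k <= breakpoint and k > breakpoint."""
    k = np.asarray(k, dtype=float)
    p = np.asarray(p, dtype=float)
    low = k <= breakpoint
    return fit_power_law(k[low], p[low]), fit_power_law(k[~low], p[~low])


# Synthetic, noise-free illustration data (NOT values from the paper):
# exponent 0.8 up to k = 200, exponent 0.3 beyond, continuous at k = 200.
k = np.array([20, 30, 50, 100, 150, 200, 300, 500, 700, 1000], dtype=float)
p = np.where(k <= 200,
             2.0 * k ** -0.8,
             2.0 * 200.0 ** -0.8 * (k / 200.0) ** -0.3)

(a_low, c_low), (a_high, c_high) = fit_piecewise(k, p, breakpoint=200)
print(round(a_low, 3), round(a_high, 3))
```

A smaller fitted exponent in the upper range corresponds to the slower decay, and hence the smaller mappability gain, that the abstract reports for longer reads.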

Conclusion: Read mappability, as measured by the proportion of singletons, increases steadily up to a length scale of around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.



Figure 5: Fitting rank-frequency distributions of k-mers at k = 30, 50, 150, 500 using three functions. Red: quadratic logarithmic function (log(f) ∼ log(r) + (log(r))^2, where f is the frequency of a k-mer type, r is the rank of a k-mer type, and the ∼ symbol represents linear regression); blue: reverse Beta rank function (log(r) ∼ log(f) + log(max(f) + 1 − f)); orange: Weibull function (log(f) ∼ log(log((max(r) + 1)/r))).

Mentions: For rank-frequency distributions (RFDs) deviating from Zipf’s law, functions with two parameters may be used to account for the concave or convex shape of the curve in log-log scale [42]. We found that the quadratic logarithmic function, but not the Weibull function, fits the RFDs well (Figure 5). The Beta rank function usually exhibits an “S” shape [45], whereas the RFD in Figure 4 shows a “Z” shape. This motivated us to use a novel reverse Beta function to fit the data (Figure 5). The “Z”-shaped log-log RFD means that, if the power-law function is taken as the default relationship between frequency and rank, frequencies of the intermediately ranked k-mers decrease faster than those in the two tails. The “S”-shaped log-log RFD implies the opposite.
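Each of the three functions in the Figure 5 caption is a linear regression on transformed ranks and frequencies, so all three can be fitted with ordinary least squares. The sketch below assumes that reading of the caption; the helper names (`_ols`, `fit_rfd`) are hypothetical, and the input is a synthetic frequency list generated from an exact quadratic logarithmic curve with arbitrary coefficients, not k-mer data from the paper.

```python
import numpy as np


def _ols(y, *columns):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    X = np.column_stack([np.ones_like(y), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, float(r2)


def fit_rfd(freqs):
    """Fit the three regressions named in the Figure 5 caption."""
    f = np.sort(np.asarray(freqs, dtype=float))[::-1]  # rank 1 = most frequent
    r = np.arange(1, len(f) + 1, dtype=float)
    log_f, log_r = np.log(f), np.log(r)
    return {
        # log(f) ~ log(r) + (log(r))^2
        "quadratic_log": _ols(log_f, log_r, log_r ** 2),
        # log(r) ~ log(f) + log(max(f) + 1 - f)
        "reverse_beta": _ols(log_r, log_f, np.log(f.max() + 1 - f)),
        # log(f) ~ log(log((max(r) + 1) / r))
        "weibull": _ols(log_f, np.log(np.log((r.max() + 1) / r))),
    }


# Synthetic frequencies (NOT from the paper), generated from
# log(f) = 10 - 0.5*log(r) - 0.05*(log(r))^2 for ranks 1..200.
ranks = np.arange(1, 201, dtype=float)
f_synth = np.exp(10 - 0.5 * np.log(ranks) - 0.05 * np.log(ranks) ** 2)

fits = fit_rfd(f_synth)
for name, (coef, r2) in fits.items():
    print(name, np.round(coef, 3), round(r2, 4))
```

Note that the reverse Beta fit regresses log(r) on functions of f, so its R^2 is measured on a different response variable than the other two and is not directly comparable with them.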

