Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm.

Elhaik E, Graur D, Josić K, Landan G - Nucleic Acids Res. (2010)

Bottom Line: However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real or not. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas D(JS) failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remainder is a mixture of many short compositionally homogeneous domains and relatively few long ones.


Affiliation: McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA. eelhaik1@jhmi.edu

ABSTRACT
It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen-Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real or not. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and by testing the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, D(JS), using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas D(JS) failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remainder is a mixture of many short compositionally homogeneous domains and relatively few long ones.
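
A single step of Jensen-Shannon-based segmentation can be sketched as follows. This is a generic illustration in Python, not the implementation of IsoPlotter or D(JS): it assumes a two-state GC/AT alphabet scored per nucleotide, and the function names `entropy` and `best_split` are hypothetical. A recursive algorithm would apply this split repeatedly and rely on a halting criterion to decide whether each candidate boundary is kept.

```python
import math

def entropy(gc_fraction):
    """Shannon entropy (in bits) of a two-state GC/AT composition."""
    p = gc_fraction
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def best_split(seq):
    """Return (cut, d_js): the cut point that maximizes the Jensen-Shannon
    divergence between seq[:cut] and seq[cut:], and that divergence.
    Returns (None, 0.0) when no cut yields a positive divergence."""
    n = len(seq)
    if n < 2:
        return None, 0.0
    is_gc = [1 if base in "GCgc" else 0 for base in seq]
    total_gc = sum(is_gc)
    h_whole = entropy(total_gc / n)
    best_cut, best_d = None, 0.0
    left_gc = 0
    for cut in range(1, n):
        left_gc += is_gc[cut - 1]          # GC count in seq[:cut]
        right_gc = total_gc - left_gc      # GC count in seq[cut:]
        # Entropy of the whole minus the length-weighted entropies of the parts.
        d = (h_whole
             - (cut / n) * entropy(left_gc / cut)
             - ((n - cut) / n) * entropy(right_gc / (n - cut)))
        if d > best_d:
            best_cut, best_d = cut, d
    return best_cut, best_d
```

For a toy input such as `"A" * 50 + "G" * 50`, the best cut falls at position 50, where the two halves have no compositional overlap.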

Figure 3: Cumulative distributions of the statistic, calculated for 10 000 sequences of σGC = 1% and different lengths L. The thresholds (Dt) are obtained from the top 0.01% of each cumulative distribution and are marked for three sequence lengths.

Mentions: Reducing this bias in IsoPlotter required modeling how the statistic depends on a segment's length and on its GC content standard deviation σGC. We generated 10 000 sequences for each of 13 length parameters L ranging from 1 kb to 1 Mb and for each of five GC content standard deviation parameters σGC ranging from 1% to 10%. For each simulation setting, we calculated the statistic by allowing IsoPlotter to partition the sequence once and obtained the threshold (Dt) from the top 0.01% of the cumulative distribution (Figure 3). A log-scale plot of Dt as a function of sequence length and variability reveals a near-perfect linear relationship (Figure 4). A log-scale linear regression on the length (L) and GC content dispersion (σGC) fits the empirical data extremely well (r² = 0.995, P < 10⁻¹⁶); the fitted coefficients are given in Equation 6.
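
The calibration procedure described above can be sketched in two steps: take the threshold from the upper tail of the simulated distribution, then regress the thresholds on length and dispersion in log space. The Python below is a minimal sketch under stated assumptions, not the authors' code. The function names are hypothetical, the tail convention (smallest value within the top fraction) is one of several reasonable choices, and the model form log(Dt) = a + b·log(L) + c·log(σGC) is an assumption, since the fitted equation itself is not reproduced here. Natural logarithms are used; a different base changes only the intercept.

```python
import math

def threshold_from_null(stat_values, top_fraction=1e-4):
    """Smallest value lying within the top `top_fraction` of the simulated
    statistics (top 0.01% by default; assumed convention)."""
    ordered = sorted(stat_values)
    n_top = max(1, round(len(ordered) * top_fraction))
    return ordered[-n_top]

def fit_log_linear(lengths, sigmas, thresholds):
    """Least-squares fit of log(Dt) = a + b*log(L) + c*log(sigma_GC).
    Returns (a, b, c). The model form is an assumption."""
    rows = [(1.0, math.log(L), math.log(s)) for L, s in zip(lengths, sigmas)]
    y = [math.log(d) for d in thresholds]
    # Augmented normal equations: (X^T X | X^T y).
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * p for x, p in zip(A[r], A[col])]
    return tuple(A[i][3] / A[i][i] for i in range(3))
```

In use, `threshold_from_null` would be called once per simulation setting (one value of L and one of σGC), and the resulting thresholds passed to `fit_log_linear` together with their settings.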

