Array-based genotyping in S.cerevisiae using semi-supervised clustering.

Bourgon R, Mancera E, Brozzi A, Steinmetz LM, Huber W - Bioinformatics (2009)

Bottom Line: The resulting data permit the identification of loci at which genetic variation is associated with quantitative traits, or fine mapping of meiotic recombination, which is a key determinant of genetic diversity among individuals. We also demonstrate that oligonucleotide probe response depends significantly on genomic background, even when the probe's specific target sequence is unchanged. As a result, supervised classifiers trained on reference strains may not generalize well to diverged strains; ssG's semi-supervised approach, on the other hand, adapts automatically.


Affiliation: EMBL, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. bourgon@ebi.ac.uk

ABSTRACT

Motivation: Microarrays provide an accurate and cost-effective method for genotyping large numbers of individuals at high resolution. The resulting data permit the identification of loci at which genetic variation is associated with quantitative traits, or fine mapping of meiotic recombination, which is a key determinant of genetic diversity among individuals. Several issues inherent to short oligonucleotide arrays -- cross-hybridization, or variability in probe response to target -- have the potential to produce genotyping errors. There is a need for improved statistical methods for array-based genotyping.

Results: We developed ssGenotyping (ssG), a multivariate, semi-supervised approach for using microarrays to genotype haploid individuals at thousands of polymorphic sites. Using a meiotic recombination dataset, we show that ssG is more accurate than existing supervised classification methods, and that it produces denser marker coverage. The ssG algorithm is able to fit probe-specific affinity differences and to detect and filter spurious signal, permitting high-confidence genotyping at nucleotide resolution. We also demonstrate that oligonucleotide probe response depends significantly on genomic background, even when the probe's specific target sequence is unchanged. As a result, supervised classifiers trained on reference strains may not generalize well to diverged strains; ssG's semi-supervised approach, on the other hand, adapts automatically.



Figure 3: Parental-only (dashed) versus semi-supervised (solid) fitted distributions for two probe sets, with projection into the first two PCs for visualization. Contours represent ±2 SDs in each PC. Clusters are well separated in (A) but not (B). For both probe sets, assuming Σ1 = Σ2 is not justified for either fit type. The parental-only fits do not describe segregant behavior well, particularly for (A). For (B), the supervised parental-only fits exaggerate the degree of separation between the classes—potentially leading to retention of an error-prone probe set.

Mentions: In the Gaussian mixture case, the M step of the EM algorithm (which maximizes an estimate of the conditional expectation of the log likelihood) only requires estimates of $P(Y_i = g \mid X_i)$ for all $i \in S$. To initialize these conditional probabilities (hereafter denoted $p_{ig}$), we applied a simple clustering algorithm (k-means, with the two clusters seeded with parental observations) to the combined parental and segregant data, and then set each $p^{(0)}_{ig}$ to either 0 or 1, depending on the outcome of this clustering. (Alternatively, one could begin with the E step and initialize the $\hat{\mu}_g$ and $\hat{\Sigma}_g$ using the parental data; this strategy produced identical results.)

Defining $p_{\cdot g} \equiv \sum_i p_{ig}$, it is straightforward to show that the M step's objective function is maximized in $\mu_g$ by

$$\hat{\mu}_g = \frac{1}{p_{\cdot g}} \sum_i p_{ig} X_i, \tag{2}$$

i.e. by a weighted average of the observations, with weights determined by (estimated) conditional probability of membership in class $g$. The objective function is maximized in $\Sigma_g$ by

$$\hat{\Sigma}_g = \frac{1}{p_{\cdot g}} \sum_i p_{ig} \, (X_i - \hat{\mu}_g)(X_i - \hat{\mu}_g)^{\top}, \tag{3}$$

i.e. by a weighted version of the standard empirical covariance estimate. In the meiotic recombination context, it is natural to assume that, for a given polymorphism, a segregant is equally likely to inherit either of the two parental alleles, so we fixed $\pi_1 = \pi_2 = 0.5$. In other contexts, $\pi_1$ and $\pi_2$ can easily be estimated. To begin the next iteration, we updated $p_{ig}$ for all $i \in S$, by

$$p_{ig} = \frac{\pi_g \, \varphi(X_i; \hat{\mu}_g, \hat{\Sigma}_g)}{\pi_1 \, \varphi(X_i; \hat{\mu}_1, \hat{\Sigma}_1) + \pi_2 \, \varphi(X_i; \hat{\mu}_2, \hat{\Sigma}_2)}, \tag{4}$$

and continued until a convergence criterion was met. Here, $\varphi(\cdot\,; \mu, \Sigma)$ denotes the density of a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. We also define to be . Final assignment of genotype for the segregants was then obtained by comparing $p_{i1}$ and $p_{i2}$. This is analogous to (1), although the two distributions are now multivariate, and the parameter estimates are derived from a combination of the parental and offspring data rather than from parental data alone. The contrast between the two fit types (semi-supervised versus supervised parental-only) can be substantial, as illustrated in Figure 3.
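The iteration described above can be illustrated with a minimal NumPy sketch. This is not the authors' ssGenotyping implementation. It makes several assumptions that the text does not state: the function names (`log_mvn_pdf`, `seeded_kmeans`, `semi_supervised_em`) are hypothetical, the k-means seeds are the parental class means, parental membership probabilities are held at their known 0/1 values throughout, and the convergence criterion is a small change in the segregants' mixture log likelihood.

```python
import numpy as np

def log_mvn_pdf(X, mu, Sigma):
    """Log density of N(mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, (X - mu).T)        # whitened residuals, shape (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=0))

def seeded_kmeans(X, seeds, n_iter=20):
    """Two-cluster k-means whose centers start at the given seed points."""
    centers = np.array(seeds, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for g in range(2):
            if np.any(labels == g):
                centers[g] = X[labels == g].mean(axis=0)
    return labels

def semi_supervised_em(X_par, y_par, X_seg, pi=(0.5, 0.5), tol=1e-6, max_iter=200):
    """Two-class Gaussian mixture fit to parental plus segregant data.

    y_par holds the known parental classes (0 or 1). Returns the fitted
    means, covariances, and the segregants' membership probabilities.
    """
    y_par = np.asarray(y_par)
    X = np.vstack([X_par, X_seg])
    n_par = len(X_par)

    # Initialization: hard 0/1 memberships from seeded k-means on all data.
    seeds = [X_par[y_par == g].mean(axis=0) for g in range(2)]  # assumed seeding
    labels = seeded_kmeans(X, seeds)
    labels[:n_par] = y_par                    # assumption: parental labels stay fixed
    p = np.zeros((len(X), 2))
    p[np.arange(len(X)), labels] = 1.0

    log_pi = np.log(pi)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # M step: weighted mean (2) and weighted covariance (3) per class.
        mus, Sigmas = [], []
        for g in range(2):
            w = p[:, g]
            mu = w @ X / w.sum()
            diff = X - mu
            mus.append(mu)
            Sigmas.append((w[:, None] * diff).T @ diff / w.sum())

        # E step: update (4) for segregants only, computed in log space.
        log_joint = np.column_stack(
            [log_pi[g] + log_mvn_pdf(X_seg, mus[g], Sigmas[g]) for g in range(2)]
        )
        log_norm = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        p[n_par:] = np.exp(log_joint - log_norm[:, None])

        ll = log_norm.sum()                   # assumed convergence criterion
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return np.array(mus), np.array(Sigmas), p[n_par:]
```

With the returned probabilities, a genotype call for each segregant follows from comparing the two columns, for example `calls = p_seg.argmax(axis=1)`. Each class covariance is estimated separately, so the sketch does not impose Σ1 = Σ2.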

