Probabilistic estimation of microarray data reliability and underlying gene expression.
Bottom Line:
On a set of known genes it performs better than the t-test despite the crude discretization into only two expression levels. The consistency indicators, i.e. the error probabilities, correlate well with variations in the biological material and thus prove efficient. Already at two discrete expression levels in each sample, it gives a good explanation of the data and is comparable to standard techniques.
Affiliation: Complex Systems Division, Department of Theoretical Physics, University of Lund, Sölvegatan 14A, SE-223 62 Lund, Sweden. sven@thep.lu.se
ABSTRACT
Background: The availability of high-throughput methods for measuring mRNA concentrations makes the reliability of conclusions drawn from the data, and global quality control of samples and hybridization, important issues. We address these issues with an information-theoretic approach, applied to discretized expression values in replicated gene expression data.
Results: Our approach yields a quantitative measure of two important parameter classes. First, the probability P(σ|S) that a gene is in the biological state σ in a certain variety, given its observed expression S in the samples of that variety. Second, sample-specific error probabilities, which serve as consistency indicators of the measured samples of each variety. The method and its limitations are tested on gene expression data for developing murine B-cells, with a t-test used as reference. On a set of known genes it performs better than the t-test despite the crude discretization into only two expression levels. The consistency indicators, i.e. the error probabilities, correlate well with variations in the biological material and thus prove efficient.
Conclusions: The proposed method is effective in determining differential gene expression and sample reliability in replicated microarray data. Already at two discrete expression levels in each sample, it gives a good explanation of the data and is comparable to standard techniques.
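The quantity P(σ|S) described in the abstract can be illustrated with a small Bayesian calculation. The sketch below is not the authors' implementation: it assumes independent samples, one flip (error) probability per sample, and three states (σ0, σ1 and a varying state σr, here assumed to show either level with probability 1/2). The function name `posterior`, its arguments and the numerical values are hypothetical.

```python
from math import prod


def posterior(observed, error_probs, priors):
    """Posterior over the states '0', '1' and 'r' for one gene.

    observed    : discretized expression levels (0 or 1), one per sample
    error_probs : per-sample probability that the sample reports the wrong level
    priors      : prior weights for the states '0', '1' and 'r'
    """
    def likelihood(level):
        # Assumption: samples are independent given the true level.
        return prod((1 - e) if s == level else e
                    for s, e in zip(observed, error_probs))

    joint = {
        "0": priors["0"] * likelihood(0),
        "1": priors["1"] * likelihood(1),
        # Assumption: a varying gene shows either level with probability 1/2.
        "r": priors["r"] * 0.5 ** len(observed),
    }
    total = sum(joint.values())
    return {state: weight / total for state, weight in joint.items()}


# Hypothetical example: three replicate samples that all report "high".
post = posterior([1, 1, 1], [0.1, 0.1, 0.1], {"0": 0.4, "1": 0.4, "r": 0.2})
```

With consistent observations and small error probabilities, most of the posterior weight falls on σ1; a sample with a large error probability contributes little evidence either way, which is the sense in which the error probabilities act as consistency indicators.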
Mentions: Figure 2 shows the error in the estimation of the parameters describing the underlying distribution. Even for fully correlated patterns, the estimation error is less than 20% of the correct values. The estimate of the probability for biologically varying genes is somewhat worse: for fully correlated patterns the error is almost 50%. For real data one can, however, expect a much smaller correlation. The average error in the estimates of the error probabilities, as seen in Figure 3, shows the expected behavior: the average error grows with the correlation for the uncorrelated samples, while the estimate for the correlated observations is almost unaffected. Intuitively, the model compensates for the correlation by increasing P(σ1) and P(σ0) as well as the error probabilities, and by lowering P(σr). For correlation factors above 0.50, this compensation effect makes the model worse at explaining the data. This can be seen in the sum P(σ0) + P(σr) + P(σ1), which initially drops from almost 1 to 0.99 as the correlation factor rises from 0 to 0.50, and then from 0.99 to 0.96 for correlation factors in the range 0.50 to 0.98 (data not shown). We hence conclude that it is reasonable not to impose the condition P(σ0) + P(σ1) + P(σr) = 1 in the model, as this sum indicates whether samples are strongly correlated in genes whose expression varies around the discretization threshold.
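To get a feel for what correlated samples look like at the level of discretized data, one can generate a synthetic pair of samples for varying genes. The excerpt does not define the correlation factor, so the mixing rule below (the second sample copies the first with probability `correlation`, and is otherwise drawn independently) is an assumption made for illustration; `simulate_pair` and `agreement` are hypothetical names.

```python
import random


def simulate_pair(n_genes, correlation, rng):
    """Discretized levels of n_genes varying genes in two samples.

    Assumption: with probability `correlation` the second sample copies the
    first; otherwise it is drawn independently and uniformly from {0, 1}.
    """
    first, second = [], []
    for _ in range(n_genes):
        a = rng.randint(0, 1)
        b = a if rng.random() < correlation else rng.randint(0, 1)
        first.append(a)
        second.append(b)
    return first, second


def agreement(first, second):
    """Fraction of genes on which the two samples report the same level."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for c in (0.0, 0.5, 0.98):
    s1, s2 = simulate_pair(20000, c, rng)
    print(c, round(agreement(s1, s2), 3))
```

Under this mixing rule the expected agreement between the two samples is (1 + c)/2 for a correlation factor c. A model that assumes independent samples would read the excess agreement as evidence for a fixed state σ0 or σ1, which is consistent with the compensation effect described above.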