Simulation of microarray data with realistic characteristics.

Nykter M, Aho T, Ahdesmäki M, Ruusuvuori P, Lehmussola A, Yli-Harja O - BMC Bioinformatics (2006)

Bottom Line: It includes several error models that have been proposed earlier, and it can be used with different types of input data. The model can be used to simulate both spotted two-channel and oligonucleotide-based single-channel microarrays. All this makes the model a valuable tool, for example, in the validation of data analysis algorithms.


Affiliation: Institute of Signal Processing, Tampere University of Technology, Tampere, Finland. matti.nykter@tut.fi

ABSTRACT

Background: Microarray technologies have become common tools in biological research. As a result, a need for effective computational methods for data analysis has emerged. Numerous different algorithms have been proposed for analyzing the data. However, an objective evaluation of the proposed algorithms is not possible due to the lack of biological ground truth information. To overcome this fundamental problem, the use of simulated microarray data for algorithm validation has been proposed.

Results: We present a microarray simulation model which can be used to validate different kinds of data analysis algorithms. The proposed model is unique in the sense that it includes all the steps that affect the quality of real microarray data. These steps include simulating the biological ground truth data, applying biological and measurement-technology-specific error models, and finally simulating the microarray slide manufacturing and hybridization. After all these steps are taken into account, the simulated data has realistic biological and statistical characteristics. The applicability of the proposed model is demonstrated by several examples.

Conclusion: The proposed microarray simulation model is modular and can be used in different kinds of applications. It includes several error models that have been proposed earlier, and it can be used with different types of input data. The model can be used to simulate both spotted two-channel and oligonucleotide-based single-channel microarrays. All this makes the model a valuable tool, for example, in the validation of data analysis algorithms.


Figure 9: Scatter plots of the simulated data. MA-plot is shown for (a) noise free simulated data, (b) noise free data extracted from slide, and (c) realistic simulated data. Lowess fit is shown over each scatter plot to illustrate the trends in the data. All data points extracted from slide are scaled to [0,1] interval, thus the scale of the x-axis is different to the simulated noise free data.

Mentions: After the error model is applied, we generate the slide images. As an example, we show two slide images in Figure 8, generated at time instants 10 and 200 minutes, corresponding to the time scale in Figures 6 and 7. At the earlier time instant, only one spot shows a difference in expression; this spot corresponds to the gene that was knocked out. At the later time instant, the effect of the knockout has spread through the network, and many genes show a change in their expression: spot colors have changed from yellow to red or green. In this example the amount of biological and measurement error is kept to a minimum in order to highlight the spreading effect of the knockout. Adding more measurement and biological error would introduce more expression changes already at the 10-minute time instant. To illustrate that the simulated data has properties similar to real microarray data, we show scatter plots of the simulated data. For this example the ground truth data is drawn from a predetermined distribution. A common assumption is that the ground truth expression values follow an exponential distribution $I = \lambda e^{-\lambda x}$ [7,15]. We draw 10000 expression values from this distribution, with $\lambda = 1/3000$. As we are interested in evaluating the quality of the data, we do not introduce any differentially expressed genes, but simulate a self-versus-self experiment as explained in [15]. Red and green intensities $I_R$ and $I_G$ are drawn from a normal distribution $N(I, \alpha I)$, where $\alpha = 0.1$ and $I$ is a realization from the exponential distribution. Next, the final red and green intensities $I_R$ and $I_G$ are transformed with a parametric function of $x$ (formula missing in this copy), where $x \in \{I_R, I_G\}$, with parameters $a_0 = 1.04$, $a_1 = 0.5$ for $I_R$ and $a_0 = 0.95$, $a_1 = -0.2$ for $I_G$. This is a simplified version of the ground truth data generation proposed in [15]. A scatter plot of the data is shown in Figure 9(a).
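The sampling steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation. It assumes that the second argument of $N(I, \alpha I)$ is a standard deviation, and it uses the conventional MA-plot definitions $M = \log_2(I_R / I_G)$ and $A = \tfrac{1}{2}\log_2(I_R I_G)$, which the text does not spell out. The final transformation with parameters $a_0$ and $a_1$ is left out because its formula is not given here. The function names `simulate_self_vs_self` and `ma_values` are hypothetical.

```python
import math
import random


def simulate_self_vs_self(n=10000, lam=1 / 3000, alpha=0.1, seed=0):
    """Draw n (red, green) intensity pairs for a self-versus-self experiment.

    Assumption: N(I, alpha*I) is read as mean I and standard deviation alpha*I.
    """
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n:
        i = rng.expovariate(lam)        # ground truth intensity, mean 1/lam
        r = rng.gauss(i, alpha * i)     # red channel realization
        g = rng.gauss(i, alpha * i)     # green channel realization
        if r > 0 and g > 0:             # keep only values with a defined logarithm
            pairs.append((r, g))
    return pairs


def ma_values(pairs):
    """Return the M (log ratio) and A (mean log intensity) lists for an MA-plot."""
    m = [math.log2(r / g) for r, g in pairs]
    a = [0.5 * math.log2(r * g) for r, g in pairs]
    return m, a


pairs = simulate_self_vs_self()
m, a = ma_values(pairs)
mean_m = sum(m) / len(m)
print(f"spots: {len(pairs)}, mean M: {mean_m:.4f}")
```

Because no differentially expressed genes are introduced, the M values in this sketch should scatter around zero across the whole range of A. Any trend in a real MA-plot would come from steps that are not modelled here.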
To illustrate the effect of reading the spot values from a simulated slide image, we simulate a microarray without adding any biological or measurement errors to the ground truth data. Only slide manufacturing and hybridization errors are introduced. A scatter plot of the noise-free data extracted from a simulated slide image is shown in Figure 9(b). It can be observed that the extraction of the data alone introduces some errors.

