Adaptive informatics for multifactorial and high-content biological data.

Millard BL, Niepel M, Menden MP, Muhlich JL, Sorger PK - Nat. Methods (2011)

Bottom Line: Here we describe an adaptive approach to managing experimental data based on semantically typed data hypercubes (SDCubes) that combine hierarchical data format 5 (HDF5) and extensible markup language (XML) file types. We demonstrate the application of SDCube-based storage using ImageRail, a software package for high-throughput microscopy. We applied ImageRail to collect and analyze drug dose-response landscapes in human cell lines at single-cell resolution.

Affiliation: Center for Cell Decision Processes, Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, USA.

ABSTRACT
Whereas genomic data are universally machine-readable, data from imaging, multiplex biochemistry, flow cytometry and other cell- and tissue-based assays usually reside in loosely organized files of poorly documented provenance. This arises because the relational databases used in genomic research are difficult to adapt to rapidly evolving experimental designs, data formats and analytic algorithms. Here we describe an adaptive approach to managing experimental data based on semantically typed data hypercubes (SDCubes) that combine hierarchical data format 5 (HDF5) and extensible markup language (XML) file types. We demonstrate the application of SDCube-based storage using ImageRail, a software package for high-throughput microscopy. Experimental design and its day-to-day evolution, not rigid standards, determine how ImageRail data are organized in SDCubes. We applied ImageRail to collect and analyze drug dose-response landscapes in human cell lines at single-cell resolution.

Figure 1: Challenges in management of multi-dimensional data. (a) The schematic illustrates that experimental design is the key determinant of data dimensionality. Different experimental protocols yield data with different numbers and types of dimensions; in the case of single-cell pharmacology, these include the selection of drug, dose, time and type of assay and the number of cells analyzed (represented by axes whose length represents magnitude). (b) High-throughput immunofluorescence experiments managed by ImageRail have a hierarchy describable in an entity-relationship model, as illustrated in the schematic (above). The graph (below) shows the amount of data in terabytes (black line) during successive steps in the image processing workflow as compared to the number of individual data entries (red line).

Mentions: It is widely accepted that biomedical data should be machine-readable and web-accessible. Relational database management systems (RDBMS) [1,2] have proven highly effective with sequence data that are string-based, invariant in organization and interpretable without knowledge of the experiments, instruments or algorithms used to gather them. It has proven more difficult to manage data arising from complex biochemical measurements, imaging, flow cytometry and phenotypic assays of cells and tissues. The interpretation of these data, which are often unstructured (e.g., images), is critically dependent on experimental context, and this context changes frequently. The difficulty in developing satisfactory database solutions for “high-content” data is widely ascribed to insufficient standardization or poor implementation [3], but we believe the problem is more fundamental: it reflects the impossibility of fully specifying complex experimental designs a priori. Flexible and creative design is the essence of good experimental science, and since design determines data structure (the number of time points, repeats, conditions, etc.), structures frequently change (Fig. 1a). To accommodate these changes, database schemas must be reconfigured frequently, a complex and time-consuming task. Thus, most experimental data reside in unlinked, loosely annotated spreadsheets that are easily fragmented or lost [4,5]. When data scope and complexity demand a more capable repository, a new database is often created ad hoc.
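
The idea that experimental design, rather than a fixed schema, determines data structure can be illustrated with a small sketch. This is not ImageRail's code or the SDCube file format: it is a toy model that uses only the Python standard library, with a plain dictionary standing in for the HDF5 numeric store and a small XML descriptor standing in for the semantic typing. The function names (`build_cube`, `shape_from_descriptor`), the XML element names and the drug, dose and time values are all hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET


def build_cube(design):
    """Build a toy data cube from an experimental design.

    `design` maps a dimension name to its list of levels. Returns an XML
    descriptor of the dimensions (a stand-in for semantic typing) and a
    dict with one empty measurement list per combination of levels
    (a stand-in for the hierarchical numeric store).
    """
    root = ET.Element("cube")
    for name, levels in design.items():
        dim = ET.SubElement(root, "dimension", name=name, size=str(len(levels)))
        for level in levels:
            ET.SubElement(dim, "level").text = str(level)
    # One storage slot per point in the design space.
    data = {key: [] for key in itertools.product(*design.values())}
    return ET.tostring(root, encoding="unicode"), data


def shape_from_descriptor(descriptor_xml):
    """Recover dimension names and sizes from the descriptor alone."""
    root = ET.fromstring(descriptor_xml)
    return {d.get("name"): int(d.get("size")) for d in root.findall("dimension")}


# Hypothetical first design: two drugs at three doses.
design_v1 = {"drug": ["drugA", "drugB"], "dose_uM": [0.1, 1.0, 10.0]}
xml_v1, data_v1 = build_cube(design_v1)

# The design evolves: a time axis is added. No schema migration is needed,
# because the descriptor is regenerated from the design itself.
design_v2 = dict(design_v1, time_h=[1, 24])
xml_v2, data_v2 = build_cube(design_v2)

print(shape_from_descriptor(xml_v1))
print(shape_from_descriptor(xml_v2))
```

Because the descriptor travels with the data, a reader can discover the dimensionality of each dataset without consulting a central schema, which is the property the text contrasts with frequent reconfiguration of a relational database.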

