Ultra-Structure database design methodology for managing systems biology data and analyses.

Maier CW, Long JG, Hemminger BM, Giddings MC - BMC Bioinformatics (2009)

Bottom Line: We find Ultra-Structure offers substantial benefits for biological information systems, the largest being the integration of diverse information sources into a common framework. It also enables us to readily incorporate new data types, sources, and domain knowledge with no change to the database structure or associated computer code. Ultra-Structure may be a significant step towards solving the hard problem of data management and integration in the systems biology era.

Affiliation: Department of Microbiology and Immunology, UNC Chapel Hill, NC, USA. maier@med.unc.edu

ABSTRACT

Background: Modern, high-throughput biological experiments generate copious, heterogeneous, interconnected data sets. Research is dynamic, with frequently changing protocols, techniques, instruments, and file formats. Because of these factors, systems designed to manage and integrate modern biological data sets often end up as large, unwieldy databases that become difficult to maintain or evolve. The novel rule-based approach of the Ultra-Structure design methodology presents a potential solution to this problem. By representing both data and processes as formal rules within a database, an Ultra-Structure system constitutes a flexible framework that enables users to explicitly store domain knowledge in both a machine- and human-readable form. End users themselves can change the system's capabilities without programmer intervention, simply by altering database contents; no computer code or schemas need be modified. This provides flexibility in adapting to change, and allows integration of disparate, heterogeneous data sets within a small core set of database tables, facilitating joint analysis and visualization without becoming unwieldy. Here, we examine the application of Ultra-Structure to our ongoing research program for the integration of large proteomic and genomic data sets (proteogenomic mapping).

Results: We transitioned our proteogenomic mapping information system from a traditional entity-relationship design to one based on Ultra-Structure. Our system integrates tandem mass spectrum data, genomic annotation sets, and spectrum/peptide mappings, all within a small, general framework implemented within a standard relational database system. General software procedures driven by user-modifiable rules can perform tasks such as logical deduction and location-based computations. The system is not tied specifically to proteogenomic research, but is rather designed to accommodate virtually any kind of biological research.

Conclusion: We find Ultra-Structure offers substantial benefits for biological information systems, the largest being the integration of diverse information sources into a common framework. This facilitates systems biology research by integrating data from disparate high-throughput techniques. It also enables us to readily incorporate new data types, sources, and domain knowledge with no change to the database structure or associated computer code. Ultra-Structure may be a significant step towards solving the hard problem of data management and integration in the systems biology era.

Figure 9: Location-finding interface. This interface for querying Locations in our Ultra-Structure system is generated based on the contents of the rulebase and user input. Here, the user has a region of the genome that has been mapped as the source of a peptide detected by MS/MS analysis, and she wishes to see what other annotations contain this mapped region. Initially, only panel A is shown. The region the user has is the "query location"; in the top selection box, the user indicates that it comes from the 18th draft of the human genome. With the drop-down box labeled "Target Context", the user indicates that she is interested in other annotations from the same draft. Finally, based on the relationships that the system knows can exist between the given Query Context and Target Context (defined by rules), the user selects a Relationship; here, she chooses "overlap-contained-by" to indicate that she wants all Locations that fully contain her query location. Once the user selects the Query Context, the interface in Panel B appears, requesting details pertinent to a Location of the chosen Context. Since the user has a genomic location, the system asks for details such as "Base Sequence" (e.g., a chromosome) and "Strand". When the user enters this information and presses the "Fetch BioEntities" button, the system reads the rules that govern the "overlap-contained-by" Relationship between Locations from the 18th draft of the human genome, finds all the Locations that satisfy that Relationship, and returns a list of all BioEntities that are associated with those Locations. In this scenario, six such BioEntities are returned, along with detailed Location information. The two rows indicated by the arrows represent annotated introns (suggesting heretofore unknown translational activity), while the other four represent annotations that themselves contain the introns (and, by extension, the user's query Location).

Mentions: The animation procedure takes as input a Location, a Location Context (indicating the kind of Locations to search for), and a Relationship. The Location acts as an anchor for the search, which will retrieve all Locations in the given context such that the given Relationship holds. The animation procedure uses this information to generate a search key for rules in the Location Relationship ruleform (Figure 8b) that define what conditions must be true to satisfy the user's query. Figure 8b shows the rules that would be selected for a particular genomic Location overlap Relationship. The considerations of these rules, defining the conditions that exist between query Location and retrieved Location, are used to dynamically generate an SQL statement that retrieves the requested Locations. These are then used to query another ruleform to find BioEntities that are located at these Locations; this information is then displayed to the user in a web browser interface (Figure 9).
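
The rule-driven lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the table names (location_relationship_rule, location, bioentity_location), their columns, the inclusive coordinates, and the "hg18" context label are all hypothetical, and SQLite stands in for whatever relational database the real system uses. Each rule row holds one condition between a candidate Location and the query Location; the procedure reads the rows selected by the search key and assembles them into a single parameterized SQL statement.

```python
import sqlite3

# Whitelists keep rule contents from injecting arbitrary SQL (an added
# safeguard for this sketch, not something the source describes).
COLUMNS = {"base_sequence", "strand", "start_pos", "end_pos"}
OPERATORS = {"=", "<", "<=", ">", ">="}


def build_location_query(conn, query_context, target_context, relationship):
    """Turn the rule rows matching the search key into one SELECT statement."""
    rules = conn.execute(
        "SELECT target_column, operator, query_field "
        "FROM location_relationship_rule "
        "WHERE query_context = ? AND target_context = ? AND relationship = ? "
        "ORDER BY rowid",
        (query_context, target_context, relationship),
    ).fetchall()
    if not rules:
        raise LookupError("no rules define this relationship")
    clauses = ["context = :target_context"]
    for target_column, operator, query_field in rules:
        if (target_column not in COLUMNS or query_field not in COLUMNS
                or operator not in OPERATORS):
            raise ValueError("rule uses an unknown column or operator")
        # Candidate column on the left, query Location value bound on the right.
        clauses.append(f"{target_column} {operator} :{query_field}")
    return "SELECT id FROM location WHERE " + " AND ".join(clauses)


def find_bioentities(conn, query_location, query_context, target_context,
                     relationship):
    """Find matching Locations, then the BioEntities located at them."""
    sql = build_location_query(conn, query_context, target_context, relationship)
    params = dict(query_location, target_context=target_context)
    location_ids = [row[0] for row in conn.execute(sql, params)]
    if not location_ids:
        return []
    marks = ",".join("?" * len(location_ids))
    rows = conn.execute(
        "SELECT DISTINCT name FROM bioentity_location "
        f"WHERE location_id IN ({marks}) ORDER BY name",
        location_ids,
    )
    return [row[0] for row in rows]


def demo_connection():
    """Build a tiny in-memory database with hypothetical rules and data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE location_relationship_rule (
            query_context TEXT, target_context TEXT, relationship TEXT,
            target_column TEXT, operator TEXT, query_field TEXT);
        CREATE TABLE location (
            id INTEGER PRIMARY KEY, context TEXT, base_sequence TEXT,
            strand TEXT, start_pos INTEGER, end_pos INTEGER);
        CREATE TABLE bioentity_location (name TEXT, location_id INTEGER);
    """)
    key = ("hg18", "hg18", "overlap-contained-by")
    conn.executemany(
        "INSERT INTO location_relationship_rule VALUES (?, ?, ?, ?, ?, ?)",
        [key + ("base_sequence", "=", "base_sequence"),
         key + ("strand", "=", "strand"),
         key + ("start_pos", "<=", "start_pos"),   # candidate starts at or before
         key + ("end_pos", ">=", "end_pos")],      # candidate ends at or after
    )
    conn.executemany(
        "INSERT INTO location VALUES (?, ?, ?, ?, ?, ?)",
        [(1, "hg18", "chr1", "+", 50, 300), (2, "hg18", "chr1", "+", 120, 180),
         (3, "hg18", "chr1", "-", 50, 300), (4, "hg18", "chr2", "+", 50, 300),
         (5, "hg18", "chr1", "+", 100, 200)],
    )
    conn.executemany(
        "INSERT INTO bioentity_location VALUES (?, ?)",
        [("gene_A", 1), ("exon_B", 2), ("gene_C", 3), ("gene_D", 4),
         ("intron_X", 5)],
    )
    return conn
```

With the demo data, calling find_bioentities(demo_connection(), {"base_sequence": "chr1", "strand": "+", "start_pos": 100, "end_pos": 200}, "hg18", "hg18", "overlap-contained-by") returns the two made-up BioEntities whose Locations fully contain the query region. The point of the design is that supporting a new Relationship only requires inserting more rows into the rule table; neither function changes.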

