An XML transfer schema for exchange of genomic and genetic mapping data: implementation as a web service in a Taverna workflow.

Paterson T, Law A - BMC Bioinformatics (2009)



Affiliation: Division of Genetics and Genomics, The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Edinburgh, UK. trevor.paterson@roslin.ed.ac.uk

ABSTRACT

Background: Genomic analysis, particularly for less well-characterized organisms, is greatly assisted by performing comparative analyses between different types of genome maps and across species boundaries. Various providers publish a plethora of online resources collating genome mapping data from a multitude of species. Datasources range in scale and scope from small bespoke resources for particular organisms, through larger web resources containing data from multiple species, to large-scale bioinformatics resources providing access to data derived from genome projects for model and non-model organisms. The heterogeneity of information held in these resources reflects both the technologies used to generate the data and the target users of each resource. Currently, there is no common information exchange standard or protocol to enable access to and integration of these disparate resources. Consequently, data integration and comparison must be performed in an ad hoc manner.

Results: We have developed a simple generic XML schema (GenomicMappingData.xsd - GMD) to allow export and exchange of mapping data in a common lightweight XML document format. This schema represents the various types of data objects commonly described across mapping datasources and provides a mechanism for recording relationships between data objects. The schema is sufficiently generic to allow representation of any map type (for example genetic linkage maps, radiation hybrid maps, sequence maps and physical maps). It also provides mechanisms for recording data provenance and for cross-referencing external datasources (including, for example, ENSEMBL, PubMed and GenBank). The schema is extensible via the inclusion of additional datatypes, which can be achieved by importing further schemas, e.g. a schema defining relationship types. We have built demonstration web services that export data from our ArkDB database according to the GMD schema, facilitating the integration of data retrieval into Taverna workflows.

Conclusion: The data exchange standard we present here provides a useful generic format for transfer and integration of genomic and genetic mapping data. The extensibility of our schema allows for inclusion of additional data and provides a mechanism for typing mapping objects via third party standards. Web services retrieving GMD-compliant mapping data demonstrate that use of this exchange standard provides a practical mechanism for achieving data integration, by facilitating syntactically and semantically-controlled access to the data.

Figure 3: Taverna Workflow 2. Graphical representation of a workflow using ArkDB, PubMed and GenBank web services, captured from Taverna Workbench v1.7.0.0. An ArkDB accession number for a Marker (which could itself be retrieved from another ArkDB web service) is passed as a parameter to the ArkDB fetchMarkerByAccession service. The result returned from this web service is again a DataSet document conforming to the GenomicMappingData schema. The Taverna XMLSplitter parses this document to retrieve the DataSet element and the ExternalReferences element. The ExternalReferences element is then parsed by the Taverna 'XPath from Text' Java widget to retrieve the SourceIDs for any PubMed and GenBank external references (xrefs) in the result document. (An XPath expression that retrieves all GenBank IDs from the document is '//gmd:ExternalReference/gmd:DataSource/@idref [.="GenBank"]/../../gmd:SourceID'.) These IDs are used to parameterize requests to PubMed and GenBank web services, which retrieve the source records for these xrefs via the Taverna NCBI Java widgets ('Get PubMed XML by PMID' and 'Get Nucleotide GBSeq XML'). In this case the Workflow outputs the original request document, which details all of the information held in ArkDB about this Marker (its Mappings, and any Relationships to other objects such as Markers and Sequences), together with documents for any PubMed citations or GenBank sequences referenced in the result document. The Workflow document is available as additional file 6: demoWorkFlow2.xml, and the resulting data document is available as additional file 3: demo2result.xml.
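
The XPath step in the caption can be illustrated outside Taverna. The following is a minimal sketch in Python, not the authors' implementation. The element and attribute names (DataSet, ExternalReferences, ExternalReference, DataSource, idref, SourceID) are taken from the caption and its XPath expression; the namespace URI, the sample document and the placeholder IDs are assumptions made for illustration. Because Python's standard xml.etree.ElementTree supports only a subset of XPath, the attribute test is written in Python rather than as a predicate on '@idref'.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the real GMD namespace is not given in the text.
NS = {"gmd": "http://example.org/GenomicMappingData"}

# Invented sample document. Only the element and attribute names come from
# the XPath expression in the caption; the IDs are placeholders.
SAMPLE = """\
<gmd:DataSet xmlns:gmd="http://example.org/GenomicMappingData">
  <gmd:ExternalReferences>
    <gmd:ExternalReference>
      <gmd:DataSource idref="GenBank"/>
      <gmd:SourceID>GB_PLACEHOLDER_1</gmd:SourceID>
    </gmd:ExternalReference>
    <gmd:ExternalReference>
      <gmd:DataSource idref="PubMed"/>
      <gmd:SourceID>PM_PLACEHOLDER_1</gmd:SourceID>
    </gmd:ExternalReference>
    <gmd:ExternalReference>
      <gmd:DataSource idref="PubMed"/>
      <gmd:SourceID>PM_PLACEHOLDER_2</gmd:SourceID>
    </gmd:ExternalReference>
  </gmd:ExternalReferences>
</gmd:DataSet>
"""


def source_ids(xml_text, source):
    """Return the SourceID text of every ExternalReference whose
    DataSource child has idref equal to `source`, in document order."""
    root = ET.fromstring(xml_text)
    ids = []
    for xref in root.iterfind(".//gmd:ExternalReference", NS):
        data_source = xref.find("gmd:DataSource", NS)
        source_id = xref.find("gmd:SourceID", NS)
        if data_source is None or source_id is None:
            continue  # skip incomplete references
        if data_source.get("idref") == source:
            ids.append((source_id.text or "").strip())
    return ids
```

Calling source_ids(SAMPLE, "GenBank") selects the same nodes that the caption's expression would select in a full XPath 1.0 engine, under the assumed document layout.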

Mentions: The use of additional widgets to join third-party datasources is demonstrated in Figure 3. The Workflow document is available as additional file 6: demoWorkFlow2.xml, and the resulting data document is available as additional file 3: demo2result.xml. In this workflow, details of a single marker are retrieved and any external references to PubMed or GenBank Sequence resources are parsed using the Taverna XPath widget. The XML records for these are then retrieved from NCBI using Taverna widgets that connect to NCBI web services. The results of running this workflow with the input markerAccession = 'ARKMKR00023953' include two PubMed documents, a GenBank Sequence document and the details of mappings of this marker on four maps in ArkDB. Because Taverna is able to 'scavenge' (import) publicly available web services, in particular any service described by a WSDL document, workflows can access and therefore integrate data held in a wide range of bioinformatics resources. If further genomic mapping resources provided WSDL-accessible data – formatted according to our proposed schema – then semantically meaningful data integration across resource and species boundaries would be possible using the tools currently available in Taverna.
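
The second half of this workflow (group the external references by datasource, then request each record from NCBI) might be sketched as follows. This is an illustration under stated assumptions, not the Taverna workflow itself: the NCBI E-utilities EFetch endpoint is swapped in for the Taverna NCBI widgets, the namespace URI and sample IDs are invented placeholders, and the ArkDB fetchMarkerByAccession call is not reproduced because its service address is not given here. The sketch only builds the request URLs, so it runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical namespace URI: the real GMD namespace is not given in the text.
NS = {"gmd": "http://example.org/GenomicMappingData"}

# NCBI E-utilities EFetch endpoint, used here in place of the Taverna widgets.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# EFetch parameters per datasource name as it appears in DataSource/@idref.
EFETCH_PARAMS = {
    "PubMed": {"db": "pubmed", "retmode": "xml"},
    "GenBank": {"db": "nuccore", "rettype": "gb", "retmode": "xml"},
}

# Invented result document with placeholder IDs (not real ArkDB output).
SAMPLE = """\
<gmd:DataSet xmlns:gmd="http://example.org/GenomicMappingData">
  <gmd:ExternalReferences>
    <gmd:ExternalReference>
      <gmd:DataSource idref="PubMed"/>
      <gmd:SourceID>12345678</gmd:SourceID>
    </gmd:ExternalReference>
    <gmd:ExternalReference>
      <gmd:DataSource idref="GenBank"/>
      <gmd:SourceID>AB000001</gmd:SourceID>
    </gmd:ExternalReference>
    <gmd:ExternalReference>
      <gmd:DataSource idref="ENSEMBL"/>
      <gmd:SourceID>ENS_PLACEHOLDER</gmd:SourceID>
    </gmd:ExternalReference>
  </gmd:ExternalReferences>
</gmd:DataSet>
"""


def xrefs_by_source(dataset_xml):
    """Map each datasource name to its list of SourceIDs, in document order."""
    grouped = {}
    root = ET.fromstring(dataset_xml)
    for xref in root.iterfind(".//gmd:ExternalReference", NS):
        ds = xref.find("gmd:DataSource", NS)
        sid = xref.find("gmd:SourceID", NS)
        if ds is not None and sid is not None and sid.text:
            grouped.setdefault(ds.get("idref"), []).append(sid.text.strip())
    return grouped


def efetch_urls(dataset_xml):
    """Build one EFetch URL per PubMed or GenBank xref; other sources are skipped."""
    urls = []
    for source, ids in xrefs_by_source(dataset_xml).items():
        params = EFETCH_PARAMS.get(source)
        if params is None:
            continue  # no retrieval rule for this datasource
        for source_id in ids:
            urls.append(EFETCH + "?" + urlencode({**params, "id": source_id}))
    return urls
```

Each URL returned by efetch_urls(SAMPLE) could then be opened with urllib.request.urlopen to retrieve the XML record, which corresponds to the retrieval step performed by the NCBI widgets in the workflow. Keeping the datasource rules in a lookup table means a further resource can be added without changing the parsing code.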

