An XML transfer schema for exchange of genomic and genetic mapping data: implementation as a web service in a Taverna workflow.

Paterson T, Law A - BMC Bioinformatics (2009)

Bottom Line: The data exchange standard we present here provides a useful generic format for transfer and integration of genomic and genetic mapping data. The extensibility of our schema allows for inclusion of additional data and provides a mechanism for typing mapping objects via third party standards. Web services retrieving GMD-compliant mapping data demonstrate that use of this exchange standard provides a practical mechanism for achieving data integration, by facilitating syntactically and semantically-controlled access to the data.


Affiliation: Division of Genetics and Genomics, The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Edinburgh, UK. trevor.paterson@roslin.ed.ac.uk

ABSTRACT

Background: Genomic analysis, particularly for less well-characterized organisms, is greatly assisted by performing comparative analyses between different types of genome maps and across species boundaries. Various providers publish a plethora of on-line resources collating genome mapping data from a multitude of species. Datasources range in scale and scope from small bespoke resources for particular organisms, through larger web-resources containing data from multiple species, to large-scale bioinformatics resources providing access to data derived from genome projects for model and non-model organisms. The heterogeneity of information held in these resources reflects both the technologies used to generate the data and the target users of each resource. Currently there is no common information exchange standard or protocol to enable access and integration of these disparate resources. Consequently data integration and comparison must be performed in an ad hoc manner.

Results: We have developed a simple generic XML schema (GenomicMappingData.xsd - GMD) to allow export and exchange of mapping data in a common lightweight XML document format. This schema represents the various types of data objects commonly described across mapping datasources and provides a mechanism for recording relationships between data objects. The schema is sufficiently generic to allow representation of any map type (for example genetic linkage maps, radiation hybrid maps, sequence maps and physical maps). It also provides mechanisms for recording data provenance and for cross-referencing external datasources (including, for example, ENSEMBL, PubMed and GenBank). The schema is extensible via the inclusion of additional datatypes, which can be achieved by importing further schemas, e.g. a schema defining relationship types. We have built demonstration web services that export data from our ArkDB database according to the GMD schema, facilitating the integration of data retrieval into Taverna workflows.

Conclusion: The data exchange standard we present here provides a useful generic format for transfer and integration of genomic and genetic mapping data. The extensibility of our schema allows for inclusion of additional data and provides a mechanism for typing mapping objects via third party standards. Web services retrieving GMD-compliant mapping data demonstrate that use of this exchange standard provides a practical mechanism for achieving data integration, by facilitating syntactically and semantically-controlled access to the data.
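
To illustrate the kind of document exchange described in the Results, the following is a minimal Python sketch of reading a mapping dataset in which Maps, Markers, Mappings and external cross references are separate, linked objects. The element and attribute names (DataSet, Map, Marker, Mapping, CrossReference, accession, position) and all sample values are hypothetical and are not taken from GenomicMappingData.xsd; only the general idea of linking objects by accession follows the abstract.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every element name, attribute name and value here is
# illustrative only and is NOT taken from GenomicMappingData.xsd.
SAMPLE = """
<DataSet source="ExampleDB">
  <Map accession="MAP1" type="linkage" species="Sheep" chromosome="Y"/>
  <Marker accession="MRK1" name="marker-one">
    <CrossReference database="GenBank" id="PLACEHOLDER_ID"/>
  </Marker>
  <Marker accession="MRK2" name="marker-two"/>
  <Mapping map="MAP1" marker="MRK1" position="12.5"/>
  <Mapping map="MAP1" marker="MRK2" position="30.0"/>
</DataSet>
"""


def markers_on_map(xml_text, map_accession):
    """Return (marker accession, position, cross references) for one map."""
    root = ET.fromstring(xml_text)
    # Index markers by accession so each Mapping can be resolved to its Marker.
    markers = {m.get("accession"): m for m in root.iter("Marker")}
    result = []
    for mapping in root.iter("Mapping"):
        if mapping.get("map") != map_accession:
            continue
        marker = markers[mapping.get("marker")]
        xrefs = [(x.get("database"), x.get("id"))
                 for x in marker.iter("CrossReference")]
        result.append((marker.get("accession"),
                       float(mapping.get("position")),
                       xrefs))
    return result


if __name__ == "__main__":
    for row in markers_on_map(SAMPLE, "MAP1"):
        print(row)
```

Keeping Mappings as separate elements that point at Maps and Markers by accession is one plausible way to record relationships between data objects; the real schema may organise this differently.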


Figure 4: Taverna Workflow 3. Summary representation of a Workflow that uses ArkDB, DDBJ and GenBank web services to find potential sequence homologies to Sheep Chromosome Y, captured from Taverna Workbench v1.7.0.0. (The actual Workflow is more detailed and uses a temporary file to reduce the number of calls to the external services by removing GenBank sequence ID redundancy; this diagram is available as additional file 7: fig5.pdf, and the Workflow document as additional files 8, 9 and 10: demoWorkFlow3.xml, appendtoFile.xml, listFromFile.xml.) The input parameters 'Sheep' and 'Y' are used to retrieve all Maps for this Chromosome from ArkDB. The Map accession numbers are then used to retrieve full details of all these Maps, including all the Mappings of Markers, and the Marker accession numbers are used to retrieve full details of all the Markers on these Maps. Where these Markers are associated with a sequence, the FASTA-formatted sequences are retrieved from GenBank and used to perform a BlastN search using DDBJ services. The outputs are all GMD DataSet documents for the Map and Marker details, the FASTA sequences from GenBank, and the sequence similarities found at DDBJ.

Mentions: Figure 4 demonstrates a more useful workflow created in Taverna, which is capable of finding sequence similarities to any chromosome represented in ArkDB. (The actual workflow is more detailed and uses a temporary file to reduce the number of calls to the external services by removing GenBank sequence ID redundancy; this diagram is available as additional file 7: fig5.pdf, and the Workflow document as additional files 8, 9 and 10: demoWorkFlow3.xml, appendtoFile.xml, listFromFile.xml.) In this case the full details for all Maps of a named Species Chromosome are retrieved, together with full details of all the Markers, and any referenced Sequences associated with these Markers are fetched from GenBank. These sequences are then used to perform a BlastN search against a sequence database (here we use a simple BLAST service provided by DDBJ [25]), which can potentially find cross-species homologies to markers mapped to the input chromosome. When the workflow is run with the input parameters 'Sheep' and 'Y', 5 maps are found in ArkDB, each with a small number of Markers. Four of these Markers are associated with sequences retrieved from GenBank, and BlastN comparisons are then performed upon these sequences. Further additions to the workflow could parse out the details of sequences of interest to retrieve possible links to homologous, and potentially orthologous, regions of chromosomes of other species. Many web services can return XML-formatted documents with associated XSD schemas, which facilitates this parsing and chaining of services (e.g. EBI's SOAP services [26]).
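
The chain of service calls described above (Maps, then Markers, then sequences, then BlastN) can be sketched in Python as follows. This is only an illustration under stated assumptions: the five service functions are hypothetical callables passed in by the caller and do not correspond to real ArkDB, GenBank or DDBJ client APIs, and the stub data is made up. The sketch removes redundant sequence IDs with an in-memory set, whereas the actual workflow is described as using a temporary file for the same purpose.

```python
from typing import Callable, Dict, List


def run_homology_workflow(
    species: str,
    chromosome: str,
    find_maps: Callable[[str, str], List[str]],         # hypothetical map lookup
    map_markers: Callable[[str], List[str]],            # hypothetical map details
    marker_sequence_ids: Callable[[str], List[str]],    # hypothetical marker details
    fetch_fasta: Callable[[str], str],                  # hypothetical sequence fetch
    blastn: Callable[[str], List[str]],                 # hypothetical BlastN call
) -> Dict[str, List[str]]:
    """Chain the services and return BlastN hits keyed by sequence ID."""
    seen = set()
    unique_ids = []
    for map_acc in find_maps(species, chromosome):
        for marker_acc in map_markers(map_acc):
            for seq_id in marker_sequence_ids(marker_acc):
                # A marker can sit on several maps; query each sequence once.
                if seq_id not in seen:
                    seen.add(seq_id)
                    unique_ids.append(seq_id)
    return {seq_id: blastn(fetch_fasta(seq_id)) for seq_id in unique_ids}


# Stub services with made-up data, standing in for the remote calls.
MAPS = {("Sheep", "Y"): ["map-a", "map-b"]}
MARKERS = {"map-a": ["m1", "m2"], "map-b": ["m2", "m3"]}
SEQS = {"m1": ["seq-1"], "m2": ["seq-1", "seq-2"], "m3": []}
fetch_calls = []


def stub_fetch(seq_id: str) -> str:
    fetch_calls.append(seq_id)  # record how often the "remote" service is hit
    return f">{seq_id}\nACGT"


def stub_blast(fasta: str) -> List[str]:
    # Echo the FASTA header back as a fake hit.
    return ["hit-for-" + fasta.splitlines()[0][1:]]


hits = run_homology_workflow(
    "Sheep", "Y",
    lambda s, c: MAPS.get((s, c), []),
    MARKERS.__getitem__,
    SEQS.__getitem__,
    stub_fetch,
    stub_blast,
)
print(hits)
```

Passing the services in as functions keeps the chaining logic separate from any particular web service client, so the same loop could be pointed at real clients or at stubs like these.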

