Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse.

Soranno PA, Bissell EG, Cheruvelil KS, Christel ST, Collins SM, Fergus CE, Filstrup CT, Lapierre JF, Lottig NR, Oliver SK, Scott CE, Smith NJ, Stopyak S, Yuan S, Bremigan MT, Downing JA, Gries C, Henry EN, Skaff NK, Stanley EH, Stow CA, Tan PN, Wagner T, Webster KE - Gigascience (2015)

Bottom Line: Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.


Affiliation: Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI 48824 USA.

ABSTRACT
Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database.
Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

Figure 6: Database schema for LAGOS including the two main modules: LAGOSGEO (green box) and LAGOSLIMNO (blue box). The component that links the two modules is the ‘aggregated lakes’ table (LAGOS lakes) that has the unique identifier and spatial location for all 50,000 lakes. LAGOSGEO data are stored in horizontal tables that are all linked back to the spatial extents for which they are calculated and ultimately linked to each of the 50,000 individual lakes. The LAGOSGEO data include information for each lake, calculated at a range of different spatial extents that the lake is located within (such as its watershed, its HUC 12, or its state). Each green box identifies a theme of data, the number of metrics that are calculated for that theme, and the number of years over which the data are sampled. LAGOSLIMNO data are stored in vertical tables that are also all linked back to the aggregated lakes table. The ‘limno values’ table and associated tables (in blue) include the values from the ecosystem-level datasets for water quality; each value also has other tables linked to it that describe features of that data value such as the water depth at which it was taken, the flags associated with it, and other metadata at the data value level. The ‘program-level’ tables (in purple) include information about the program responsible for collecting the data. Finally, the ‘source lakes’ table and associated tables include information about each lake where available. Note that a single source can have multiple programs that represent different datasets provided to LAGOS.
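
To make the distinction between horizontal and vertical tables in the caption concrete, the following is a minimal sketch using Python's built-in sqlite3 module. It is not the LAGOS schema: every table name, column name, and value here (lagos_lakes, geo_landuse, limno_values, pct_forest, and so on) is a hypothetical stand-in chosen for illustration, and the geospatial table is keyed directly to a lake rather than to an intermediate spatial extent such as a watershed or HUC 12, which is a simplification of what the caption describes.

```python
import sqlite3


def build_sketch():
    """Create a toy in-memory database with one wide and one long table.

    All names and values are hypothetical; this is not the LAGOS schema.
    """
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        -- Linking table: one row per lake, with identifier and location.
        CREATE TABLE lagos_lakes (
            lake_id INTEGER PRIMARY KEY,
            lat REAL,
            lon REAL
        );
        -- 'Horizontal' (wide) table: one row per lake, one column per metric.
        CREATE TABLE geo_landuse (
            lake_id INTEGER REFERENCES lagos_lakes(lake_id),
            pct_forest REAL,
            pct_agriculture REAL
        );
        -- 'Vertical' (long) table: one row per measured value, with
        -- value-level metadata (sample depth, flag) stored alongside it.
        CREATE TABLE limno_values (
            lake_id INTEGER REFERENCES lagos_lakes(lake_id),
            variable TEXT,
            value REAL,
            sample_depth_m REAL,
            flag TEXT
        );
        """
    )
    # Made-up rows for a single illustrative lake.
    con.execute("INSERT INTO lagos_lakes VALUES (1, 45.0, -89.5)")
    con.execute("INSERT INTO geo_landuse VALUES (1, 62.0, 10.5)")
    con.executemany(
        "INSERT INTO limno_values VALUES (?, ?, ?, ?, ?)",
        [
            (1, "total_phosphorus", 18.0, 0.5, None),
            (1, "chlorophyll_a", 4.2, 0.5, None),
        ],
    )
    return con


def lake_profile(con, lake_id):
    """Join both table styles through the shared lake identifier."""
    return con.execute(
        """
        SELECT v.variable, v.value, g.pct_forest
        FROM limno_values AS v
        JOIN lagos_lakes AS l ON l.lake_id = v.lake_id
        JOIN geo_landuse AS g ON g.lake_id = l.lake_id
        WHERE l.lake_id = ?
        ORDER BY v.variable
        """,
        (lake_id,),
    ).fetchall()
```

Calling lake_profile(build_sketch(), 1) returns one row per water quality value, each paired with the lake's land use metric. In a layout like this, a new water quality variable becomes additional rows in the long table, whereas a new geospatial metric becomes an additional column in a wide table.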

Mentions: A description of the major components and data themes that are integrated to create LAGOS. P is phosphorus, N is nitrogen, C is carbon. Further detail is provided in Figures 5 and 6.

