SNPpy--database management for SNP data from genome wide association studies.

Mitha F, Herodotou H, Borisov N, Jiang C, Yoder J, Owzar K - PLoS ONE (2011)

Bottom Line: To this end, SNPpy enables the user to filter the data and output the results as standardized GWAS file formats. It does low-level and flexible data validation, including validation of patient data. SNPpy is a practical and extensible solution for investigators who seek to deploy central management of their GWAS data.


Affiliation: Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina, United States of America. faheem@faheem.info

ABSTRACT

Background: We describe SNPpy, a hybrid script database system using the Python SQLAlchemy library coupled with the PostgreSQL database to manage genotype data from Genome-Wide Association Studies (GWAS). This system makes it possible to merge study data with HapMap data and merge across studies for meta-analyses, including data filtering based on the values of phenotype and Single-Nucleotide Polymorphism (SNP) data. SNPpy and its dependencies are open source software.

Results: The current version of SNPpy offers utility functions to import genotype and annotation data from two commercial platforms. We use these to import data from two GWAS studies and the HapMap Project. We then export these individual datasets to standard data format files that can be imported into statistical software for downstream analyses.

Conclusions: By leveraging the power of relational databases, SNPpy offers integrated management and manipulation of genotype and phenotype data from GWAS studies. The analysis of these studies requires merging across GWAS datasets as well as patient and marker selection. To this end, SNPpy enables the user to filter the data and output the results as standardized GWAS file formats. It does low level and flexible data validation, including validation of patient data. SNPpy is a practical and extensible solution for investigators who seek to deploy central management of their GWAS data.

Figure 1: Workflow Chart. This figure shows the data workflow. First, the genotypic and phenotypic data are loaded into the database. The data are then exported from the database as standard format files, with an optional filtering and/or merging step. Finally, the output files are further analyzed using third-party tools.

Mentions: Consider the task of converting GWAS data, consisting of phenotype and genotype data, into standard format files like PED/MAP or TPED/TFAM, or merging data across different GWAS datasets. An approach based on manipulating these files directly requires the software to accommodate different source data formats, like those used by Affymetrix and Illumina. Such functionality would thus depend on both the source data format and the output data format, so the number of functions required is of the order of $m \times n$, where $m$ and $n$ are the number of source formats and the number of output formats respectively. In contrast, our system breaks this task down into two distinct parts. The first part involves loading data from various files into the database, a task which is commonly referred to as Extract-Transform-Load (ETL). The second part involves exporting data from the relational database via Structured Query Language (SQL) queries to standard format data files for use in further analyses. This process is illustrated in Figure 1. For simplicity, assume one database layout/schema with possible minor variations caused by differences in source formats. Then we can write a single ETL function for the import of each commercial or ad-hoc source format. The database export functions need not depend on the source format, since the database layout is sufficiently independent of the source format. Therefore the data export for each output format can be implemented as a single SQL query. Thus the number of functions required here is of the order of $m + n$, which simplifies the task.
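The two-part design described above can be illustrated with a minimal sketch. It is not SNPpy's actual schema or API: the table layout, the two "vendor" source formats, and all function names are hypothetical, and the standard-library sqlite3 module stands in for the PostgreSQL and SQLAlchemy stack so that the example is self-contained. Two loaders and two exporters (2 + 2 functions) cover all 2 × 2 combinations of source and output format.

```python
import sqlite3

# Hypothetical common schema: one row per SNP, one row per genotype call.
SCHEMA = """
CREATE TABLE snp (name TEXT PRIMARY KEY, chrom TEXT, pos INTEGER);
CREATE TABLE genotype (sample TEXT, snp TEXT REFERENCES snp(name),
                       a1 TEXT, a2 TEXT);
"""

def load_vendor_a(conn, rows):
    """ETL for a made-up format: tuples (sample, snp, chrom, pos, 'A/G')."""
    for sample, snp, chrom, pos, call in rows:
        a1, a2 = call.split("/")
        conn.execute("INSERT OR IGNORE INTO snp VALUES (?, ?, ?)",
                     (snp, chrom, pos))
        conn.execute("INSERT INTO genotype VALUES (?, ?, ?, ?)",
                     (sample, snp, a1, a2))

def load_vendor_b(conn, rows):
    """ETL for a second made-up format: dicts with a two-letter call ('AG')."""
    for r in rows:
        conn.execute("INSERT OR IGNORE INTO snp VALUES (?, ?, ?)",
                     (r["snp"], r["chrom"], r["pos"]))
        conn.execute("INSERT INTO genotype VALUES (?, ?, ?, ?)",
                     (r["sample"], r["snp"], r["call"][0], r["call"][1]))

def export_map(conn):
    """MAP-style lines: chromosome, SNP id, genetic distance (0), position."""
    query = "SELECT chrom, name, 0, pos FROM snp ORDER BY chrom, pos"
    return [" ".join(str(v) for v in row) for row in conn.execute(query)]

def export_tped(conn):
    """TPED-style lines: one line per SNP, allele pairs ordered by sample."""
    query = """SELECT s.chrom, s.name, s.pos, g.a1, g.a2
               FROM snp s JOIN genotype g ON g.snp = s.name
               ORDER BY s.chrom, s.pos, g.sample"""
    lines = {}  # insertion-ordered: keeps the SQL ordering of SNPs
    for chrom, name, pos, a1, a2 in conn.execute(query):
        lines.setdefault((chrom, name, pos), []).extend([a1, a2])
    return [f"{c} {n} 0 {p} " + " ".join(alleles)
            for (c, n, p), alleles in lines.items()]

def build_demo():
    """Load toy data from both source formats into one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    load_vendor_a(conn, [("S1", "rs1", "1", 100, "A/G"),
                         ("S1", "rs2", "1", 200, "C/C")])
    load_vendor_b(conn, [
        {"sample": "S2", "snp": "rs1", "chrom": "1", "pos": 100, "call": "AA"},
        {"sample": "S2", "snp": "rs2", "chrom": "1", "pos": 200, "call": "CT"},
    ])
    return conn

if __name__ == "__main__":
    conn = build_demo()
    print("\n".join(export_map(conn)))
    print("\n".join(export_tped(conn)))
```

Neither exporter knows which loader produced a given row, so supporting a further source format in this sketch would mean adding one loader, and a further output format one exporter. A real TPED export would also need a matching TFAM file listing the samples in the same order, which is omitted here.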

