The HIV mutation browser: a resource for human immunodeficiency virus mutagenesis and polymorphism data.

Davey NE, Satagopam VP, Santiago-Mozos S, Villacorta-Martin C, Bharat TA, Schneider R, Briggs JA - PLoS Comput. Biol. (2014)

Bottom Line: The results of this research effort are recorded in the scientific literature, but it is difficult for virologists to find them rapidly. Accordingly, the HIV research community would benefit from a resource cataloguing the available HIV mutation literature. The current release of the HIV mutation browser describes the phenotypes of 7,608 unique mutations at 2,520 sites in the HIV proteome, resulting from the analysis of 120,899 papers.


Affiliation: Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany; Department of Physiology and Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, California, United States of America.

ABSTRACT
A huge research effort has been invested over many years to determine the phenotypes of natural or artificial mutations in HIV proteins; interpretation of mutation phenotypes is an invaluable source of new knowledge. The results of this research effort are recorded in the scientific literature, but it is difficult for virologists to find them rapidly. Manually locating data on phenotypic variation within the approximately 270,000 available HIV-related research articles, or the further 1,500 articles that are published each month, is a daunting task. Accordingly, the HIV research community would benefit from a resource cataloguing the available HIV mutation literature. We have applied computational text-mining techniques to parse and map mutagenesis and polymorphism information from the HIV literature, have enriched the data with ancillary information and have developed a public, web-based interface through which it can be intuitively explored: the HIV mutation browser. The current release of the HIV mutation browser describes the phenotypes of 7,608 unique mutations at 2,520 sites in the HIV proteome, resulting from the analysis of 120,899 papers. The mutation information for each protein is organised in a residue-centric manner and each residue is linked to the relevant experimental literature. The importance of HIV as a global health burden warrants extensive effort to maximise the efficiency of HIV research. The HIV mutation browser provides a valuable new resource for the research community. The HIV mutation browser is available at: http://hivmut.org.


Figure 4 (pcbi-1003951-g004): Example illustrations of issues associated with parsing and mapping mutation data. (A) Representative simple and complex examples of sentences recognised by the templates used to perform the mutation text-mining of articles (see Table S1 for the complete list). The information of interest is coloured, with the wildtype residue in blue and the mutated residue in green, and the positions of the mutations are underlined. (B) Illustration of the distinct numbering schemes for different chains of the same protein. The peptide sequence shown is a short region (497–527) of the HIV Envelope glycoprotein gp160 (Env) overlapping the site cleaved by the host furin to produce the Surface protein gp120 and Transmembrane protein gp41 chains. The cleavage site is denoted by a grey triangle. The numbering above the sequence defines the position relative to the start of the gp160 protein and the numbering below the sequence defines the position relative to the start of the gp41 chain. (C) Three example sentences from the HIV literature in which each article uses a different nomenclature or numbering scheme (blue) to describe the same mutation at the same site, Valine at residue 513 in the gp160 protein. Each article also refers to the protein by the chain name, gp41, rather than the name of the unprocessed protein, Envelope glycoprotein gp160 (Env), used for mapping in the HIV mutation resource. One example sentence refers to the gp41 chain while utilising the numbering for the unprocessed protein.
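The two numbering schemes described in panel (B) differ by a fixed offset, so a position can be moved between them with simple arithmetic. The following is a minimal sketch and not the mapping code used by the HIV mutation browser. It assumes that the gp41 chain begins at gp160 residue 512, as in the commonly used HXB2 reference numbering; that offset, the constant and the function names are illustrative assumptions and are not taken from the article.

```python
# Assumption for illustration only: gp41 starts at gp160 residue 512
# (furin cleavage after residue 511), as in HXB2 reference numbering.
GP41_START_IN_GP160 = 512


def gp41_to_gp160(position, chain_start=GP41_START_IN_GP160):
    """Convert a position numbered from the start of gp41 to gp160 numbering."""
    if position < 1:
        raise ValueError("chain positions are numbered from 1")
    return position + chain_start - 1


def gp160_to_gp41(position, chain_start=GP41_START_IN_GP160):
    """Convert a gp160 position to numbering relative to the start of gp41."""
    if position < chain_start:
        raise ValueError("position lies before the start of the gp41 chain")
    return position - chain_start + 1


# Under the assumed offset, gp160 residue 513 is the second residue of gp41.
print(gp160_to_gp41(513))
print(gp41_to_gp160(2))
```

A sentence that names gp41 but uses gp160 numbering, as in panel (C), cannot be resolved by arithmetic alone; a mapper would also need to check which scheme places the stated wildtype residue at the stated position.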


Mentions: There is no globally applied nomenclature for describing directed mutations in mutagenesis experiments or polymorphisms [5], [11], [12]. A set of templates defining the phrases and shorthand widely used to describe such mutations and polymorphisms was created based on the work of Caporaso et al. [5] (Figure 4A; see Table S1 for the full list). Each article in the HIV literature dataset was converted to plain text and scanned using this set of templates. Each template captures three pieces of information: the position in the protein that has been mutated, the amino acid present at that position in the wildtype sequence of the isolate, and the non-wildtype amino acid to which the residue has been mutated. For example, consider the sentence “As reported previously, S52A and S56A mutations of Vpu had no effect on virus release” [13]. S52A and S56A refer to the experimental mutagenesis of a serine to an alanine at positions 52 and 56, respectively, of the Vpu protein.
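The simplest kind of template, the one-letter shorthand seen in S52A, can be approximated with a regular expression. The sketch below is illustrative only: the pattern and the function name are assumptions, and it covers a single shorthand form rather than the full template list in Table S1.

```python
import re

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical template: wildtype residue, position, mutant residue ("S52A").
POINT_MUTATION = re.compile(
    rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b"
)


def find_point_mutations(text):
    """Return (wildtype, position, mutant) tuples for shorthand such as 'S52A'."""
    return [
        (wildtype, int(position), mutant)
        for wildtype, position, mutant in POINT_MUTATION.findall(text)
    ]


sentence = (
    "As reported previously, S52A and S56A mutations of Vpu "
    "had no effect on virus release"
)
print(find_point_mutations(sentence))
# [('S', 52, 'A'), ('S', 56, 'A')]
```

A pattern this permissive will also match unrelated tokens that happen to have the same shape, such as some reagent or cell line names, so a practical pipeline would need further checks, for example confirming that the wildtype residue is present at that position in the protein sequence.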

