UniProt: a hub for protein information.
Bottom Line: We present a new website that has been designed using a user-experience design process. We have introduced an annotation score for all entries in UniProt to represent the relative amount of knowledge available about each protein. These scores will be helpful in identifying which proteins are the best characterized and most informative for comparative analysis.
Mentions: The number of protein sequences in UniProt continues to rise at an accelerating pace. As shown in Figure 1, over 12 releases from June 2013 to August 2014, the number of sequences in UniProt doubled from 40.4 to 80.7 million records, consistent with exponential growth. The UniRef databases, which cluster sequences at 100%, 90% and 50% identity, illustrate the levels of sequence redundancy. From 2004 to 2014, the relative reduction in database size went from 5%/42%/70% to 54%/73%/88% for UniRef100, UniRef90 and UniRef50, respectively. Most of the growth in sequences is due to the increased submission of complete genomes to the nucleotide sequence databases (4). The majority of these genomes are derived from whole genome shotgun studies, with bacterial genomes accounting for 80% of the data. In addition, there is an increase in submissions of multiple genomes for strains of the same organism or closely related species. While this wealth of protein information presents our users with new opportunities for proteome-wide analysis and interpretation, it also creates challenges in capturing, searching, preserving and presenting proteome data to the scientific community. To make room for new sequences, we have extended our accession number format from 6 to 10 characters. Details of the new format are available at www.uniprot.org/help/accession_numbers. To cope with the challenges in tracking and displaying proteomes, we have reorganized our handling and display of proteomes and introduced a new proteome identifier that uniquely identifies the set of proteins corresponding to a single assembly of a completely sequenced genome. For users who prefer to use a single best-annotated proteome from a particular taxonomic group for their analysis, UniProt selects a reference proteome.
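As an illustration of the extended accession number format, the following Python sketch checks whether a string is a well-formed 6- or 10-character accession. The regular expression follows the pattern given on the help page cited above, as understood here; it should be verified against that page before use. The function name `is_valid_accession` is hypothetical and is not part of any UniProt tool.

```python
import re

# Pattern for UniProt accession numbers, as described on the
# accession_numbers help page (assumption: transcribed correctly here).
# First alternative: classic 6-character accessions starting with O, P or Q.
# Second alternative: 6-character accessions starting with other letters,
# optionally extended by a second 4-character block to 10 characters.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_valid_accession(acc: str) -> bool:
    """Return True if the whole string matches the accession pattern."""
    return ACCESSION_RE.fullmatch(acc) is not None


if __name__ == "__main__":
    for candidate in ["P12345", "A0A022YWF9", "A0A022YW", "p12345"]:
        print(candidate, is_valid_accession(candidate))
```

The check is purely syntactic: a string that passes is well formed, but it need not correspond to an existing UniProt entry.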
UniProt reference proteomes are derived via consultation with the research community or computationally determined from proteome clusters (5), where the reference proteome is selected from the cluster by an algorithm that considers the best overall annotation score. Reference proteomes have been chosen to provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity found within UniProt (Figure 2). Currently, 2290 reference proteomes have been selected. They are the focus of both manual and automatic annotation, aiming to provide the best annotated protein sets for the selected species. They include model organisms and other proteomes of interest to biomedical and biotechnological research. The protein sets for these species are generated in collaboration with the INSDC, Ensembl and RefSeq reference genomes. An example of a reference proteome, with its new proteome information page and proteome identifier, can be found at http://www.uniprot.org/proteomes/UP000000803.
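The computational selection step described above can be pictured with a minimal Python sketch. It assumes that each proteome in a cluster already carries a single overall annotation score and simply picks the highest-scoring member. The `Proteome` class, the `select_reference` function, the placeholder identifiers and the scores are all hypothetical; the actual UniProt algorithm may weigh additional criteria.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Proteome:
    identifier: str          # placeholder ID, not a real UniProt proteome
    annotation_score: float  # hypothetical overall annotation score


def select_reference(cluster: Sequence[Proteome]) -> Proteome:
    """Pick the proteome with the highest overall annotation score.

    On ties, max() keeps the first proteome encountered in the cluster.
    """
    if not cluster:
        raise ValueError("cannot select a reference from an empty cluster")
    return max(cluster, key=lambda p: p.annotation_score)


if __name__ == "__main__":
    # Made-up cluster of three closely related proteomes.
    cluster = [
        Proteome("EXAMPLE-1", 2.4),
        Proteome("EXAMPLE-2", 3.9),
        Proteome("EXAMPLE-3", 3.1),
    ]
    print(select_reference(cluster).identifier)
```

Reference proteomes chosen via consultation with the research community would bypass this step entirely, so the sketch covers only the cluster-based route.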