UniProt release 2010_12

Published November 30, 2010


Fishing for new mutations in the human exome

Understanding the role of genetic variants in human health and disease is crucial in modern biology and medicine. The International HapMap Project and, more recently, the 1000 Genomes Project are progressively unveiling the map of human genome variation at the scale of the human population, generating a flood of interesting data. Smaller research projects focused on disease-causing mutations also contribute through the development of fruitful new approaches. One of the current trends in both large- and small-scale projects is exome sequencing. The rationale is that the clear majority of allelic variants known to underlie Mendelian disorders disrupt protein-coding sequences. Restricting sequencing to exons reduces the amount of sequence to be analysed to 2-5% of the whole genome, thus saving time and money, while still allowing the identification of missense and nonsense mutations, of small insertions and deletions (indels), and of splice donor and acceptor site variants. By definition, exome sequencing does not permit the discovery of mutations in non-coding, regulatory or intronic genomic regions, even though such mutations are known to affect disease.

The exome sequencing strategy is proving to be quite effective, as it has recently been used to pinpoint several genes whose mutations are associated with diseases, including DHODH involved in postaxial acrofacial dysostosis (Ng et al., 2010), WDR62 in severe cerebral cortical malformations (Bilguvar et al., 2010) and MLL2 in Kabuki syndrome (Ng et al., 2010).

The annotation of single amino acid polymorphisms (SAPs) has always been a priority in UniProtKB/Swiss-Prot, including not only ‘neutral’ polymorphisms, resulting from normal variations among individuals, but also disease-associated mutations. Thus missense SAPs identified by the exome-sequencing strategy have been quickly annotated and integrated in the ‘Sequence annotation (Features)’ section of their respective entries (Q02127, O43379 and O14686). The associated phenotypes are described in the ‘General annotation (Comments)’ section in ‘Involvement in disease’ (Q02127, O43379 and O14686).

Over the years, we have developed a defined format to describe SAPs in the ‘Sequence annotation (Features)’ section, including dbSNP accession numbers, when they exist, and links to bibliographic references. Disease-causing mutations are tagged, whenever possible, with the official abbreviation of the phenotype provided by the OMIM database. In addition to missense mutations, in-frame indels are also reported (P35453, P02730 or P33897). When it is not possible to represent the whole variation landscape for a given protein within the UniProtKB entry, we try to provide cross-references to specialized resources (see for instance the ‘Web resources’ section in the human p53 entry). Our annotation effort does not include the representation of mutations that cause major changes to a protein sequence, such as frameshift mutations or variations at splice sites, as their deleterious effects on protein function are usually obvious.

Close to 63,000 human SAPs are currently stored in UniProtKB/Swiss-Prot and about 30% of them are reported as disease-associated in the literature. SAPs selected from this pool are mapped to reference nucleotide sequences from RefSeq and LRG, following the guidelines established by the Human Genome Variation Society for sequence variant designation, and submitted to dbSNP (see for instance dbSNP/Swiss-Prot variant rs121908210). Thanks to a close collaboration with Ensembl, all human variants stored in UniProtKB and characterized by a dbSNP accession number (or submitted to dbSNP) can also be accessed from the Ensembl database and viewed in the context of their nucleotide sequence (see variant rs1269215 stored in UniProtKB entry Q9BVK8). Our ultimate goal is to spread information about protein variations to the broadest possible audience.

UniProtKB news

Line length limit

Historically, UniProtKB flat file entries were formatted so that no line exceeded 75 characters. This limit served two purposes: it allowed entries to be displayed nicely on small screens and to be processed by programs that had memory limitations. Since then, computers have become more powerful and most programs have been adapted accordingly. UniProt has already made a few exceptions to the line length limit for data that cannot be wrapped, such as URLs or DOIs, or where wrapping does not increase readability, such as protein names and a few cross-references to other databases. For the latter in particular, we have a growing amount of additional information to incorporate. We will continue to wrap lines at 75 characters where this helps readability, but will allow longer lines where necessary. The new upper limit is 255 characters per line, as some users still depend on software with this limitation.
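
Users who maintain their own flat file parsers may want to check whether their input stays within the new limit. The following is a minimal sketch in Python, not a UniProt tool: the function name `overlong_lines` is hypothetical, and it assumes that the limit is counted in characters, excluding the line terminator.

```python
def overlong_lines(text, limit=255):
    """Return (line_number, length) pairs for lines longer than `limit`.

    Assumption: length is counted in characters, without the newline.
    Line numbers are 1-based.
    """
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


if __name__ == "__main__":
    # Illustrative input only: one short line and one artificially long line.
    sample = "ID   TEST_ENTRY\nDE   " + "X" * 300 + "\n"
    for number, length in overlong_lines(sample):
        print(f"line {number}: {length} characters")
```

A parser that previously allocated a fixed 75-character buffer could use a check of this kind to confirm that 255 characters is a sufficient buffer size for the files it reads.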

Changes to cross-references to RefSeq

We have introduced an additional field to the cross-reference (DR line in the flat file) to the NCBI Reference Sequences database to show the RefSeq nucleotide accession number.

The format of the explicit links in the flat file is:

DR   RefSeq; RefSeq protein accession number; RefSeq nucleotide accession number.

Example: P00816

Previous format in the flat file:

DR   RefSeq; AP_000992.1; -.
DR   RefSeq; NP_414874.1; -.

New format:

DR   RefSeq; AP_000992.1; AC_000091.1.
DR   RefSeq; NP_414874.1; NC_000913.2.
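
For readers who parse the flat file themselves, the following Python sketch shows one way to read both the previous and the new form of the RefSeq DR line. It is an illustration under stated assumptions, not an official parser: the function name `parse_refseq_dr` is hypothetical, and the code assumes that fields are separated by semicolons, that the line ends with a full stop, and that a dash marks a missing nucleotide accession number, as in the examples above.

```python
def parse_refseq_dr(line):
    """Parse a 'DR   RefSeq; ...' line into (protein_acc, nucleotide_acc).

    Returns None for lines that are not RefSeq cross-references.
    A nucleotide field of '-' (previous format) is returned as None.
    """
    if not line.startswith("DR"):
        return None
    body = line[2:].strip()
    # Drop the terminating full stop; accession versions such as '.1'
    # sit inside the fields and are not affected.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 3 or fields[0] != "RefSeq":
        return None
    protein, nucleotide = fields[1], fields[2]
    return protein, (None if nucleotide == "-" else nucleotide)


if __name__ == "__main__":
    print(parse_refseq_dr("DR   RefSeq; NP_414874.1; -."))
    print(parse_refseq_dr("DR   RefSeq; NP_414874.1; NC_000913.2."))
```

Splitting on whitespace after the `DR` line code, rather than on a fixed column, keeps the sketch tolerant of variation in the number of spaces.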

Changes to keywords

New keywords:

Changes in subcellular location controlled vocabulary

New subcellular locations:

Changes in the controlled vocabulary for PTMs

New term for the feature key ‘Modified residue’ (‘MOD_RES’ in the flat file):
  • 3’-nitrotyrosine