Skip Header

You are using a version of browser that may not display all the features of this website. Please consider upgrading your browser.

UniProt release 5.0

Published May 10, 2005

Headlines

UniProtKB/Swiss-Prot major release (47.0)

Release 47.0 of UniProtKB/Swiss-Prot contains 181'571 sequence entries, comprising 65'742'349 amino acids abstracted from 128'438 references. 11'531 sequences have been added since release 46, the sequence data of 841 existing entries has been updated and the annotations of 166'572 entries have been revised. This represents an increase of 6%.

Many improvements were carried out in the last 3 months. In particular, we have introduced 5 new feature keys in order to better describe different types of regions in a protein sequence, and we added an additional qualifier to our cross-references to nucleotide sequence databases, the molecule type.

All the recent changes to the Swiss-Prot format are described in detail in the continuously updated document:

UniProt Knowledgebase release 5.0 includes Swiss-Prot release 47.0 and TrEMBL release 30.0.

Full statistics and release notes

UniProtKB News

Format change in the DR line

The DR (Database cross-Reference) lines are used as pointers to information in external data resources that is related to UniProtKB entries. Until now, the format of a DR line was:

DR   DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER[; TERTIARY_IDENTIFIER].

We have introduced a forth identifier, changing the DR line format to:

DR   DATABASE_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER[; TERTIARY_IDENTIFIER][; QUATERNARY_IDENTIFIER].

The database cross-references to EMBL is affected by this modification (see below).

Changes in the cross-references to EMBL

The biological source of the molecule has been added as quaternary identifier to the cross-reference (DR line) of the EMBL database.

Former format:

DR   EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER.

New format:

DR   EMBL; ACCESSION_NUMBER; PROTEIN_ID; STATUS_IDENTIFIER; MOLECULE_TYPE.

The molecule type is controlled vocabulary and currently includes:

  • Genomic_DNA
  • Genomic_RNA
  • mRNA
  • Unassigned_DNA
  • Unassigned_RNA
  • Other_DNA
  • Other_RNA
  • -

Examples:

DR   EMBL; M68939; AAA26107.1; -; Genomic_DNA.
DR   EMBL; U56386; AAB72034.1; -; mRNA.

Cross-references to PANTHER

Cross-references have been added to PANTHER, which stands for Protein ANalysis THrough Evolutionary Relationships, a classification system that was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. Proteins have been classified according to families and subfamilies, molecular functions, biological processes and pathways. PANTHER is available at https://panther.appliedbiosystems.com/.

The format of the explicit links in the flat file is:

Resource abbreviation PANTHER
Resource identifier PANTHER's unique identifier for a protein family or sub-family.
Optional information 1 PANTHER's entry name for a protein family or sub-family.
Optional information 2 Number of domains found, which is generally 1, rarely 2 for the fusion of identical domains/proteins.
Example
O59826:
DR   PANTHER; PTHR11732:SF69; KCNAB_channel; 1.

New feature (FT) keys and redefinition of existing FT keys

The feature keys DOMAIN and SITE were used to describe distinct types of regions in a protein sequence and we found this situation unsatisfactory. We therefore redefined these two feature keys and introduced 5 new ones.

Redefinition of the feature keys DOMAIN and SITE:

  • DOMAIN - Extent of a domain, which is defined as a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold. Example:
    FT   DOMAIN       37     94       Ig-like.
  • SITE - Any interesting single amino-acid site on the sequence, that is not defined by another feature key. It can also apply to an amino acid bond which is represented by the positions of the two flanking amino acids. Example:
    FT   SITE        176    177       Cleavage (by trypsin) (Probable).

Description of the 5 new feature keys:

  • COILED - Extent of a coiled-coil region. Example:
    FT   COILED      100    165       Potential.
  • COMPBIAS - Extent of a compositionally biased region. Example:
    FT   COMPBIAS     43     57       Pro/Thr-rich.
  • MOTIF - Short (<=20 amino acids) sequence motif of biological interest. Example:
    FT   MOTIF       153    154       Di-leucine motif.
  • REGION - Extent of a region of interest in the sequence. Example:
    FT   REGION       23     54       Binds to CD4.
  • TOPO_DOM - Topological domain. Example:
    FT   TOPO_DOM    139    182       Cytoplasmic (Potential).

The introduction of these new feature keys allows to establish a clear sorting order for feature tables. The following order is used:

1. Molecule processing
     * INIT_MET, SIGNAL, PROPEP, TRANSIT, CHAIN, PEPTIDE
2. Regions
     * TOPO_DOM, TRANSMEM
     * DOMAIN, REPEAT
     * CA_BIND, ZN_FING, DNA_BIND, NP_BIND
     * REGION
     * COILED
     * MOTIF
     * COMPBIAS
3. Sites
     * ACT_SITE
     * METAL
     * BINDING
     * SITE
4. Amino acid modifications (pre and PTM)
     * SE_CYS
     * MOD_RES
     * LIPID
     * CARBOHYD
     * DISULFID
     * CROSSLNK
5. Natural variations
     * VARSPLIC
     * VARIANT
6. Experimental info
     * MUTAGEN
     * UNSURE
     * CONFLICT
     * NON_CONS
     * NON_TER
7. Secondary structure
      * HELIX, TURN, STRAND

Keys of equal priority (listed on one line above) are ordered according to sequence positions.