UniProt release 2014_09

Published October 1, 2014

Headline

Small is beautiful (and useful)

In large-scale studies, small proteins tend to be overlooked. They are difficult to predict using software tools and they often escape detection by mass spectrometry. When cDNA sequences are submitted, short coding sequences (CDS) are only rarely annotated and hence do not appear in any protein database, including UniProtKB/TrEMBL or GenPept, and their nucleotide sequences can be tagged as ‘non-coding RNAs’. In UniProtKB/Swiss-Prot, we are aware of the problem, but we are often reluctant to annotate uncharacterized small ORFs, for fear of introducing imaginary sequences into a database we wish to be as reliable as possible. That is why we are thrilled when new data become available that allow us to fill the gap.

This happened a few months ago, with the publication of two articles that brought the ‘noncoding transcript’ AK092578 into the spotlight. Pauli et al. were investigating inductive events during early embryogenesis in zebrafish. In order to find new signaling peptides, they sequenced RNAs extracted from embryos at different developmental stages and combined this approach with ribosome profiling to select the transcripts most likely to be translated. This led to the discovery of 399 novel coding genes, 28 of which contained a signal peptide but no transmembrane domain, making them good candidates for signaling proteins. Pauli et al. focused their attention on one of them, apela, which they called toddler. It is encoded by AK092578, until then considered to be a noncoding transcript. A few weeks earlier, Chng et al. had already published the identification of the same protein, which they named elabela.

Apela is highly conserved among vertebrates; this conservation is particularly striking in the 30-amino-acid-long mature peptide, the last 13 residues being nearly invariant in all vertebrate species studied. Apela is expressed in the zygote, with a peak during gastrulation, and becomes undetectable by 4 days post-fertilization. Its disruption leads to a dramatic phenotype, including small or absent hearts, posterior accumulation of blood cells, malformed pharyngeal endoderm, and abnormal left-right positioning and formation of the liver. Most mutant embryos eventually die between 5 and 7 days of development. Interestingly, this phenotype is reminiscent of that observed for apelin receptor (aplnr) deficiency.

The pathway leading to aplnr activation that could explain the observed mutant phenotype remained unsolved for several years. Indeed, aplnr disruption in zebrafish demonstrated that aplnr was required prior to the onset of gastrulation for proper cardiac morphogenesis, but its known ligand, apln, was not expressed until midgastrulation, too late to play a role in such an early event. Along the same lines, it had been reported that Aplnr mutant animals were not born in the expected Mendelian ratio, and many showed cardiovascular developmental defects, while Apln-deficient mice were viable, fertile, and showed normal development. Taken together, these observations suggested that Aplnr might have yet another ligand, expressed very early in embryonic development. The newly discovered apela protein seemed to fulfill the conditions and, using different strategies, both groups convincingly showed that apela is indeed aplnr’s first ligand.

Human, mouse and zebrafish Apela orthologs have been updated accordingly and these entries are now available.

UniProtKB news

Evidence in the UniProtKB flat file format

The evidence for annotations in UniProtKB entries has been available for several years in the XML and RDF representations of the data. We have now added this information to the text format (also known as the flat file format) as well.

Representation of evidence

This section describes how evidence is represented, independently of the context in which it can be found.

An individual evidence description consists of a mandatory evidence type, represented by a code from the Evidence Codes Ontology (ECO), and, where applicable, the source of the data. The source is usually another database record, represented by the database name and record identifier. For publications that are not in PubMed, we instead indicate the corresponding UniProtKB reference number.

Examples:

  • An evidence type without source: {type}, e.g.
    {ECO:0000305}
    {ECO:0000250}
    {ECO:0000255}
    
  • An evidence type with source: {type|source}, e.g.
    {ECO:0000269|PubMed:10433554}
    {ECO:0000303|Ref.6}
    {ECO:0000305|PubMed:16683188} 
    {ECO:0000250|UniProtKB:Q8WUF5}
    {ECO:0000312|EMBL:BAG16761.1}
    {ECO:0000313|EMBL:BAG16761.1}
    {ECO:0000255|HAMAP-Rule:MF_00205}
    {ECO:0000256|HAMAP-Rule:MF_00205}
    {ECO:0000244|PDB:1K83}
    {ECO:0000213|PDB:1K83}
    
  • Several evidence attributions: {type|source, type|source, ...}, e.g.
    {ECO:0000269|PubMed:10433554, ECO:0000303|Ref.6}
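
The notation above can be read mechanically. The following Python sketch shows one possible way to split an evidence block into ECO codes and optional sources. The names `Evidence` and `parse_evidence` are illustrative only and are not part of any UniProt tool, and the sketch assumes that a source never contains a comma.

```python
import re
from typing import List, NamedTuple, Optional


class Evidence(NamedTuple):
    eco: str                # evidence type, e.g. "ECO:0000269"
    source: Optional[str]   # e.g. "PubMed:10433554", "Ref.6", or None


def parse_evidence(block: str) -> List[Evidence]:
    """Parse one '{type}' or '{type|source, type|source, ...}' block."""
    block = block.strip()
    if not (block.startswith("{") and block.endswith("}")):
        raise ValueError(f"not an evidence block: {block!r}")
    items = []
    # Assumption: attributions are separated by commas and a source
    # itself never contains a comma.
    for part in block[1:-1].split(","):
        eco, _, source = part.strip().partition("|")
        if not re.fullmatch(r"ECO:\d{7}", eco):
            raise ValueError(f"unexpected evidence type: {eco!r}")
        items.append(Evidence(eco, source or None))
    return items


if __name__ == "__main__":
    for item in parse_evidence("{ECO:0000269|PubMed:10433554, ECO:0000303|Ref.6}"):
        print(item.eco, item.source)
```

Keeping the source as an opaque string leaves the interpretation of `Ref.6` versus a database record to the caller.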
    

Change of the representation of different line and annotation types

This section describes in which line and annotation types evidence may be found and where it is placed. We use here the symbolic representation {evidence} as a placeholder for all evidence representations that are described in the previous section.

DE lines

Evidence may be found at the end of subcategory fields, e.g.

DE   RecName: Full=Palmitoyl-protein thioesterase-dolichyl pyrophosphate phosphatase fusion 1 {evidence};
DE   Contains:
DE     RecName: Full=Palmitoyl-protein thioesterase {evidence};
DE              Short=PPT {evidence};
DE              EC=3.1.2.22 {evidence};
DE     AltName: Full=Palmitoyl-protein hydrolase {evidence};
DE   Contains:
DE     RecName: Full=Dolichyldiphosphatase {evidence};
DE              EC=3.6.1.43 {evidence};
DE     AltName: Full=Dolichyl pyrophosphate phosphatase {evidence};
DE   Flags: Precursor;

GN lines

Evidence may be found after each gene designation, e.g.

GN   Name=cysA1 {evidence}; Synonyms=cysA {evidence};
GN   OrderedLocusNames=Rv3117 {evidence}, MT3199 {evidence};
GN   ORFNames=MTCY164.27 {evidence};
GN   and
GN   Name=cysA2 {evidence}; OrderedLocusNames=Rv0815c {evidence}, MT0837
GN   {evidence}; ORFNames=MTV043.07c {evidence};

OG lines

Evidence may be found after an organelle or plasmid, e.g.

OG   Mitochondrion {evidence}.
OG   Plasmid pWR100 {evidence}, Plasmid pINV_F6_M1382 {evidence}, and
OG   Plasmid pCP301 {evidence}.

OX lines

Evidence may be found after the taxonomy identifier, e.g.

OX   NCBI_TaxID=9606 {evidence};

RN lines

Evidence may be found after the reference number, e.g.

RN   [1] {evidence}

RC lines

Evidence may be found after each value, e.g.

RC   STRAIN=C57BL/6J {evidence}, and DBA/2J {evidence}; TISSUE=Brain
RC   {evidence};

KW lines

Evidence may be found after each keyword, e.g.

KW   ATP-binding {evidence}; Cell cycle {evidence}; Cell division {evidence};
KW   DNA replication {evidence};

CC lines

The evidence location depends on the annotation type.

Unstructured annotations:

Evidence may initially be found at the end of the annotations because this is how they have historically been attributed, e.g.

CC   -!- FUNCTION: Possesses kinase activity. May be involved in
CC       trafficking and/or processing of RNA. {evidence}.

At a later time, we intend to start attributing evidence at a more fine-grained level by placing it after the sentences or paragraphs to which it applies, e.g.

CC   -!- FUNCTION: Possesses kinase activity. {evidence}. May be involved
CC       in trafficking and/or processing of RNA. {evidence}.
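
Because evidence may follow either the whole annotation or individual sentences, a reader of the flat file has to cope with both layouts. The Python sketch below is one possible approach: it cuts the already joined comment text after each evidence block. The function name `split_comment` is hypothetical, and the ECO codes substituted for the `{evidence}` placeholder are borrowed from the examples in the previous section purely for illustration.

```python
import re
from typing import List, Tuple

# Lazily collect text up to the next '{...}' block and its optional full stop.
SEGMENT_RE = re.compile(r"\s*(.*?)\s*\{([^{}]*)\}\.?", re.DOTALL)


def split_comment(text: str) -> List[Tuple[str, List[str]]]:
    """Return (text, evidence attributions) pairs for one unstructured comment."""
    segments = []
    pos = 0
    for match in SEGMENT_RE.finditer(text):
        attributions = [item.strip() for item in match.group(2).split(",")]
        segments.append((match.group(1), attributions))
        pos = match.end()
    rest = text[pos:].strip()
    if rest:
        # Trailing text that carries no evidence block at all.
        segments.append((rest, []))
    return segments


if __name__ == "__main__":
    comment = ("Possesses kinase activity. {ECO:0000269|PubMed:10433554}. "
               "May be involved in trafficking and/or processing of RNA. "
               "{ECO:0000305}.")
    for sentence, evidence in split_comment(comment):
        print(sentence, evidence)
```

The same function handles the historical layout, where a single evidence block closes the whole annotation, by returning one segment.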

Structured annotations:

ALTERNATIVE PRODUCTS:

Evidence may be found after the values of the Name= and Synonyms= fields. It may also be found in Comment= and Note= fields, where it is placed as in unstructured annotations, e.g.

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=13;
CC         Comment=Additional isoforms seem to exist. {evidence};
CC       Name=1 {evidence}; Synonyms=LST1/A {evidence};
CC         IsoId=O00453-1; Sequence=Displayed;
..
CC       Name=12;
CC         IsoId=O00453-12; Sequence=VSP_047367;
CC         Note=No experimental confirmation available. {evidence};

BIOPHYSICOCHEMICAL PROPERTIES:

In the structured subtopics Absorption and Kinetic parameters, evidence may be found at the end of the Abs(max)=, KM= and Vmax= fields. It may also be found in Note= fields and in the unstructured subtopics pH dependence, Redox potential and Temperature dependence, where it is placed as in unstructured annotations, e.g.

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       Absorption:
CC         Abs(max)=465 nm {evidence};
CC         Note=The above maximum is for the oxidized form. Shows a maximal
CC         peak at 330 nm in the reduced form. These absorption peaks are
CC         for the tryptophylquinone cofactor. {evidence};
CC       Kinetic parameters:
CC         KM=5.4 uM for tyramine {evidence};
CC         Vmax=17 umol/min/mg enzyme {evidence};
CC         Note=The enzyme is substrate inhibited at high substrate
CC         concentrations (Ki=1.08 mM for tyramine). {evidence};

CC   -!- BIOPHYSICOCHEMICAL PROPERTIES:
CC       pH dependence:
CC         Optimum pH is 7-8 for ATPase activity. Is more active at pH 8 to
CC         10 than at pH 5.5. {evidence};
CC       Temperature dependence:
CC         Optimum temperature is 80 degrees Celsius for ATPase activity.
CC         {evidence};

RNA EDITING:

Evidence may be found after the modified positions as well as in the optional Note= field, where it is placed as in unstructured annotations, e.g.

CC   -!- RNA EDITING: Modified_positions=207 {evidence}; Note=Partially
CC       edited. Target of Adar. {evidence};

(Please note that we have taken this occasion to make an additional small format change to this annotation type: we have replaced the full stop at the end of the annotation with a semicolon, to be consistent with other structured annotation types that consist of a list of Field=Value; items.)

MASS SPECTROMETRY:

In MASS SPECTROMETRY annotations the same evidence applies to all fields (incl. the optional Note= field) and all evidence attributions are thus displayed in a separate field instead of adding them at the end of each field. A new Evidence= field has replaced the previously existing Source= field, e.g.

CC   -!- MASS SPECTROMETRY: Mass=2189.4; Method=Electrospray; Range=167-
CC       186; Note=Monophosphorylated.; Evidence={evidence};
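
Annotations that consist of a list of Field=Value; items, such as the one above, lend themselves to a simple field-by-field reading. The following Python sketch illustrates this under stated assumptions: the function name `parse_structured_comment` is hypothetical, field values are assumed not to contain a semicolon, a line that ends in a hyphen is assumed to continue on the next line without a space (as in `Range=167-` followed by `186`), and the ECO code stands in for the `{evidence}` placeholder.

```python
from typing import Dict, List, Tuple


def parse_structured_comment(lines: List[str]) -> Tuple[str, Dict[str, str]]:
    """Return the topic and a Field -> Value mapping for one CC annotation."""
    payload = ""
    for line in lines:
        chunk = line[2:].strip()            # drop the 'CC' line code
        if chunk.startswith("-!-"):
            chunk = chunk[3:].strip()       # drop the annotation marker
        if not payload or payload.endswith("-"):
            payload += chunk                # value wrapped after a hyphen
        else:
            payload += " " + chunk
    topic, _, body = payload.partition(":")
    fields: Dict[str, str] = {}
    # Assumption: no field value contains a semicolon.
    for item in body.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:
            fields[key] = value
    return topic, fields


if __name__ == "__main__":
    example = [
        "CC   -!- MASS SPECTROMETRY: Mass=2189.4; Method=Electrospray; Range=167-",
        "CC       186; Note=Monophosphorylated.; Evidence={ECO:0000269|PubMed:10433554};",
    ]
    topic, fields = parse_structured_comment(example)
    print(topic, fields["Range"], fields["Evidence"])
```

Since the Evidence= field applies to all other fields of the annotation, a caller can attach its value to the record as a whole rather than to individual fields.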

SEQUENCE CAUTION:

In SEQUENCE CAUTION annotations the same evidence applies to all fields (incl. the optional Note= field) and all evidence is thus displayed in a separate new Evidence= field instead of being added at the end of each field, e.g.

CC   -!- SEQUENCE CAUTION:
CC       Sequence=AAL25396.1; Type=Miscellaneous discrepancy; Note=Intron retention.; Evidence={evidence};
CC       Sequence=ABF70206.1; Type=Miscellaneous discrepancy; Note=Intron retention.; Evidence={evidence};
CC       Sequence=CAA32567.1; Type=Erroneous gene model prediction; Evidence={evidence};
CC       Sequence=CAA32568.1; Type=Erroneous gene model prediction; Evidence={evidence};

SUBCELLULAR LOCATION:

Evidence may be found at the same places where previously the non-experimental qualifiers By similarity, Probable and Potential were displayed (see Syntax modification of the ‘Subcellular location’ subtopic) as well as in the optional Note= field where it is placed as in unstructured annotations, e.g.

CC   -!- SUBCELLULAR LOCATION: Golgi apparatus, trans-Golgi network
CC       membrane {evidence}; Multi-pass membrane protein {evidence}.
CC       Note=Predominantly found in the trans-Golgi network (TGN). Not
CC       redistributed to the plasma membrane in response to elevated
CC       copper levels. {evidence}.
CC   -!- SUBCELLULAR LOCATION: Isoform 2: Cytoplasm {evidence}.
CC   -!- SUBCELLULAR LOCATION: WND/140 kDa: Mitochondrion {evidence}.

DISEASE:

Evidence may be found at the end of the disease description as well as in the optional Note= field, where it is placed as in unstructured annotations, e.g.

CC   -!- DISEASE: Sarcoidosis 1 (SS1) [MIM:181000]: An idiopathic,
CC       systemic, inflammatory disease characterized by the formation of
CC       immune granulomas in involved organs. Granulomas predominantly
CC       invade the lungs and the lymphatic system, but also skin, liver,
CC       spleen, eyes and other organs may be involved. {evidence}.
CC       Note=Disease susceptibility is associated with variations
CC       affecting the gene represented in this entry. {evidence}.

FT lines

Evidence may be found at the end of the feature description, e.g.

FT   VARIANT     341    341       P -> L (in AH2; strongly reduced
FT                                activity). {evidence}.
FT                                /FTId=VAR_065665.
FT   CONFLICT     52     53       RT -> KI (in Ref. 8; AAD14329).
FT                                {evidence}.
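
Since the evidence sits at the end of the feature description, which may wrap over several lines, it can only be separated once the description has been reassembled. The Python sketch below shows one possible way to do this. It is not an official parser: the function name `parse_ft_lines` is hypothetical, the ECO codes replace the `{evidence}` placeholder for illustration, and the only layout assumption is the one visible in the example above, namely that a new feature starts with its key directly after the `FT   ` prefix while continuation lines are indented further.

```python
import re
from typing import Dict, List

# An evidence block, with its optional full stop, at the very end of a description.
TRAILING_EVIDENCE = re.compile(r"\s*\{([^{}]*)\}\.?\s*$")


def parse_ft_lines(lines: List[str]) -> List[Dict[str, object]]:
    """Collect features with their description and evidence attributions."""
    features: List[Dict[str, object]] = []
    for line in lines:
        body = line[5:]                      # drop the 'FT   ' prefix
        if body[:1] != " ":                  # key in first column: new feature
            key, start, end, *rest = body.split(None, 3)
            features.append({"key": key, "from": start, "to": end,
                             "description": rest[0] if rest else "",
                             "evidence": []})
        else:
            text = body.strip()
            if text.startswith("/FTId="):
                features[-1]["ftid"] = text[len("/FTId="):].rstrip(".")
            else:
                features[-1]["description"] += " " + text
    for feature in features:
        match = TRAILING_EVIDENCE.search(feature["description"])
        if match:
            feature["evidence"] = [e.strip() for e in match.group(1).split(",")]
            feature["description"] = feature["description"][:match.start()]
    return features


if __name__ == "__main__":
    example = [
        "FT   VARIANT     341    341       P -> L (in AH2; strongly reduced",
        "FT                                activity). {ECO:0000269|PubMed:10433554}.",
        "FT                                /FTId=VAR_065665.",
    ]
    for feature in parse_ft_lines(example):
        print(feature["key"], feature["description"], feature["evidence"])
```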

Changes to the controlled vocabulary of human diseases

New diseases:

Modified diseases:

Deleted disease:

  • Glycogen storage disease 14

Changes to keywords

New keyword:

Modified keyword: