Tutorial - PIR Site Rules (PIRSR)

Table of Contents

Introduction
Feature types
Rule Curation
Reference

Introduction

The Protein Information Resource (PIR) has developed a computational method for annotating functional residues in uncharacterized proteins. The method uses position-specific, site-conditional template annotation rules (PIR Site Rules, or PIRSRs for short) [1], which are manually curated and defined by structural biologists on the basis of known structural and experimental data.

Each rule specifies a set of test conditions that a candidate uncharacterized protein must pass:

Match a whole-protein HMM.
Come from an organism that falls within a specified taxonomic scope.
Match a site-specific profile HMM.
Match functionally and structurally characterized residues of a manually curated template protein sequence.

Positive matches trigger the appropriate annotation for active site residues, binding site residues, modified residues, or other functionally important amino acids. This process has generated high-quality annotations for protein sequences in UniProtKB/TrEMBL [2] (automatically annotated but not reviewed). An example PIRSR can be viewed at http://pir.georgetown.edu/cgi-bin/pirrule?id=PIRSR017689-50. PIR Site Rules are written in the UniRule flat file format (.uru) [3].

Feature types

As of release 2017_10, PIR Site Rules (903 in total) support 16 types of functional site annotation, listed below with descriptions taken from [4]:

ACT_SITE: Amino acid(s) involved in the activity of an enzyme.
BINDING: Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
CARBOHYD: Glycosylation site.
CHAIN: Extent of a polypeptide chain in the mature protein.
CROSSLNK: Posttranslationally formed amino acid bonds.
DISULFID: Disulfide bond.
DNA_BIND: Extent of a DNA-binding region.
LIPID: Covalent binding of a lipid moiety.
METAL: Binding site for a metal ion.
MOD_RES: Posttranslational modification of a residue.
MOTIF: Short (up to 20 amino acids) sequence motif of biological interest.
NP_BIND: Extent of a nucleotide phosphate-binding region.
PROPEP: Extent of a propeptide.
REGION: Extent of a region of interest in the sequence.
SITE: Any interesting single amino-acid site on the sequence, that is not defined by another feature key.
ZN_FING: Extent of a zinc finger region.

These are collected from template protein annotations and specified in the rule. Other related UniProtKB annotations, such as keywords (KW) and comments (CC), are likewise collected from the template protein annotations and specified in the rule. Keywords provide information that can be used to generate indexes of the sequence entries based on functional, structural, or other categories. Comments are free-text remarks on the protein entry. PIRSR comments support 5 topics, listed below with descriptions taken from [4]:

COFACTOR: Description of any non-protein substance required by an enzyme for its catalytic activity.
PTM: Description of any chemical alteration of a polypeptide (proteolytic cleavage, amino acid modifications including crosslinks). This topic complements information given in the feature table or indicates polypeptide modifications for which position-specific data is not available.
SIMILARITY: Description of the similarity (sequence or structural) of a protein with other proteins.
SUBCELLULAR LOCATION: Description of the subcellular location of the chain/peptide/isoform.
SUBUNIT: Description of the quaternary structure of a protein and any kind of interactions with other proteins or protein complexes; except for receptor-ligand interactions, which are described in the topic FUNCTION.

Rule Curation

PIRSRs are manually curated and defined by structural biologists on the basis of known structural and experimental data. The overall curation workflow is shown below. Internally, we have built a web-based user interface to facilitate the curation efforts.

Curated homeomorphic protein families (PIRSF)

PIRSRs are defined starting from curated PIRSF families that contain at least one protein with a known 3D structure and experimentally verified site information in the published scientific literature. PIRSF is a whole-protein classification system that provides comprehensive, non-overlapping clustering of UniProtKB sequences into a hierarchy that reflects their evolutionary relationships. One of the proteins with a known 3D structure and experimentally verified site information is selected as the template protein.

Build site-specific profile HMM

A set of UniProtKB/Swiss-Prot [2] (manually annotated and reviewed by human experts) proteins in a given PIRSF homeomorphic protein family, including the template protein, is used to create a multiple sequence alignment. The curator manually examines and edits the alignment to identify conserved sites as candidate site features. The finalized multiple sequence alignment is then used to build a site-specific profile HMM (SRHMM) with HMMER3 [5].
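As a rough illustration of the "identify conserved sites" step, the sketch below scans a toy alignment for columns in which one residue dominates. It is not the curators' actual procedure, which is manual; the function name `conserved_columns`, the `min_fraction` threshold, and the toy sequences are all assumptions made for this example.

```python
from collections import Counter

def conserved_columns(alignment, min_fraction=1.0, gap="-"):
    """Return (1-based column, residue) pairs for conserved alignment columns.

    A column counts as conserved when its most common residue occurs in at
    least min_fraction of all sequences; gap characters never count as a match.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        counts.pop(gap, None)          # ignore gaps when picking the top residue
        if not counts:
            continue                   # column contains only gaps
        residue, n = counts.most_common(1)[0]
        if n / len(alignment) >= min_fraction:
            hits.append((col + 1, residue))
    return hits

# Toy alignment (hypothetical sequences, not real family members).
toy_alignment = [
    "MK-HCDE",
    "MRAHCDE",
    "MKAHSDE",
]
print(conserved_columns(toy_alignment))
```

With the default threshold only fully conserved columns are reported. Lowering `min_fraction` would surface partially conserved columns, which a curator would still need to judge against the structural and experimental evidence.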

Select site feature annotations

Feature information about the candidate sites is derived from the annotations of the chosen template protein, specifically from the FT (features; see Feature types for details), CC (comments), and KW (keywords) annotation fields of its UniProtKB/Swiss-Prot entry. Appropriate syntax and controlled vocabulary are used for site description and evidence attribution.

Specify match conditions

A set of match conditions is defined in the rule. All of them must be met before the annotations in the rule can be predicted for a target protein sequence:

Family HMM: The target protein sequence must match the PIRSF/InterPro family HMM specified in the rule. This is defined as the "trigger" condition in the rule.
Taxonomic scope: The rule can be applied only to proteins from organisms within a given taxon, which is defined as Kingdom/sub-taxon in the "scope" section of the rule.
Site HMM: The family HMM may not be suitable as a discriminator for a particular site of interest, so the target protein must also match the site-specific profile HMM defined in the rule. This is defined as the "feature group" condition in the rule.
Site residue: The target and template protein sequences are aligned to the site-specific profile HMM. Target residues that match those defined in the rule are eligible for prediction. This is defined as the "feature table" condition in the rule.
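The four conditions above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the PIR implementation: `SiteRule` is a hypothetical, simplified stand-in for a rule (it is not the .uru format), the HMM matches are assumed to have been computed elsewhere and passed in as sets of identifiers, the pairwise alignment strings stand in for the alignment to the site-specific HMM, and each site is evaluated independently, which the source does not specify.

```python
from dataclasses import dataclass

@dataclass
class SiteRule:
    # Hypothetical, simplified rule representation (not the .uru format).
    family_hmm: str   # "trigger" condition
    scope: str        # taxonomic scope
    site_hmm: str     # "feature group" condition
    sites: dict       # template position -> (expected residue, feature key)

def template_to_target(template_aln, target_aln, gap="-"):
    """Map 1-based template positions to (target position, target residue)
    for alignment columns where both sequences have a residue."""
    mapping = {}
    t_pos = q_pos = 0
    for t_char, q_char in zip(template_aln, target_aln):
        if t_char != gap:
            t_pos += 1
        if q_char != gap:
            q_pos += 1
        if t_char != gap and q_char != gap:
            mapping[t_pos] = (q_pos, q_char)
    return mapping

def apply_rule(rule, family_hits, lineage, site_hits, template_aln, target_aln):
    """Return predicted (feature key, target position) pairs, or [] if a
    family, scope, or site-HMM condition fails."""
    if rule.family_hmm not in family_hits:   # family HMM ("trigger")
        return []
    if rule.scope not in lineage:            # taxonomic scope
        return []
    if rule.site_hmm not in site_hits:       # site HMM ("feature group")
        return []
    mapping = template_to_target(template_aln, target_aln)
    predictions = []
    for t_pos, (residue, key) in sorted(rule.sites.items()):
        hit = mapping.get(t_pos)             # site residue ("feature table")
        if hit and hit[1] == residue:
            predictions.append((key, hit[0]))
    return predictions

# All identifiers and sequences below are made up for illustration.
rule = SiteRule("FAM_X", "Bacteria", "SITE_X_1",
                {3: ("H", "ACT_SITE"), 4: ("C", "METAL")})
template_aln = "MKH-CDE"
target_aln = "M-HACDE"
result = apply_rule(rule, {"FAM_X"}, ["Bacteria", "Proteobacteria"],
                    {"SITE_X_1"}, template_aln, target_aln)
print(result)
```

In the toy alignment the template histidine at position 3 lines up with target position 2, and the template cysteine at position 4 lines up with target position 4, so both features are transferred at the target's own coordinates.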

Test Prediction

The curated PIRSR is applied to the UniProtKB/Swiss-Prot entries of the same PIRSF/InterPro family to count the True Positives (TPs: annotations that exist in the Swiss-Prot entries and are predicted by the rule) and the False Positives (FPs: annotations that do not exist in the Swiss-Prot entries but are predicted by the rule), and to calculate the precision and confidence as defined in formula 1 of [6]. The prediction algorithm is shown below.
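As a side illustration of the TP/FP bookkeeping described above (not the prediction algorithm itself), the sketch below compares predicted annotations with curated ones and computes precision. It assumes the conventional definition precision = TP / (TP + FP); the exact formula and the confidence measure from [6] are not reproduced here. The function names and the accession/feature/position tuples are hypothetical.

```python
def count_tp_fp(predicted, curated):
    """Count predictions that are (TP) or are not (FP) present in the curated set."""
    predicted, curated = set(predicted), set(curated)
    tp = len(predicted & curated)
    fp = len(predicted - curated)
    return tp, fp

def precision(true_positives, false_positives):
    """Conventional precision TP / (TP + FP); None when nothing was predicted."""
    total = true_positives + false_positives
    if total == 0:
        return None
    return true_positives / total

# Hypothetical (entry, feature key, position) annotations.
predicted = {("P1", "ACT_SITE", 57), ("P1", "METAL", 102), ("P2", "ACT_SITE", 60)}
curated = {("P1", "ACT_SITE", 57), ("P2", "ACT_SITE", 60), ("P2", "METAL", 105)}

tp, fp = count_tp_fp(predicted, curated)
print(tp, fp, precision(tp, fp))
```

Here two of the three predictions are backed by curated annotations, so the rule would score 2 TPs and 1 FP on this toy data.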

The rule is then refined iteratively based on the statistics of the test prediction run. Once it is ready, it is put into production to annotate UniProtKB/TrEMBL entries. An example production rule is shown below.

Reference

[1] Vasudevan S, Vinayaka CR, Natale DA, Huang H, Kahsay RY, Wu CH. Structure-guided rule-based annotation of protein functional sites in UniProt knowledgebase. Methods Mol Biol. 2011;694:91-105. doi: 10.1007/978-1-60761-977-2_7.
[2] The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucl. Acids Res. doi: 10.1093/nar/gkw1099. First published online: November 29, 2016.
[3] Boeckmann B, Gattiker A, de Castro E. UniRule: Format of rules for automated protein annotation in UniProtKB/Swiss-Prot. ftp://ftp.expasy.org/databases/prosite/unirule.pdf. Accessed on September 27, 2017.
[4] UniProt Consortium. UniProt Knowledgebase User Manual. http://web.expasy.org/docs/userman.html. Accessed on September 26, 2017.
[5] Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013;41:e121.
[6] Kretschmann E, Fleischmann W, Apweiler R. Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics. 2001 Oct;17(10):920-6.