The advent of high-throughput next-generation sequencing and de novo genome and transcriptome assembly has made new computational methods for annotating uncharacterized proteins indispensable. Because of its technical difficulty and expense, experimental characterization of protein function cannot be scaled to large numbers of proteins. Computational methods have therefore moved to the forefront of protein function prediction, and many have been proposed over the past decades. Most of them focus on "global" annotation, such as molecular function, biological process, domain, or family. Methods that provide fine-grained "local" annotation of functional sites at the level of individual amino acids are relatively rare.
We developed a computational method that predicts functional residues in uncharacterized proteins using position-specific, conditional template annotation rules (PIR Site Rules, PIRSRs) derived from known protein structural and experimental data curated by structural biologists. Each rule specifies a set of test conditions that a candidate uncharacterized protein must pass. Positive matches trigger the appropriate annotation for active site residues, binding site residues, modified residues, or other functionally important amino acids. PIR Site Rules, site hidden Markov models, template sequences, and the functional site prediction workflow have been used to produce high-quality annotations for computationally analyzed UniProt Knowledgebase protein sequences (UniProtKB/TrEMBL). To make our work widely usable for people interested in protein functional site prediction, we streamlined the workflow and developed a standalone Java software package named PIRSitePredict.
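The idea of a rule as a set of test conditions that a candidate must pass can be illustrated with a minimal Java sketch. It is not the PIRSitePredict implementation: the class names, the fields, and the assumption that an earlier alignment step supplies a map from template positions to target positions are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: class and field names are hypothetical,
// not taken from the PIRSitePredict code base.
final class SiteRuleSketch {

    /** One test condition: a template position and the residues accepted there. */
    static final class Condition {
        final int templatePos;        // 1-based position in the template sequence
        final String allowedResidues; // e.g. "ST" accepts serine or threonine
        final String siteType;        // annotation to emit, e.g. "Active site"

        Condition(int templatePos, String allowedResidues, String siteType) {
            this.templatePos = templatePos;
            this.allowedResidues = allowedResidues;
            this.siteType = siteType;
        }
    }

    /** A functional site predicted on the target protein. */
    static final class Site {
        final int targetPos;   // 1-based position in the target sequence
        final char residue;
        final String siteType;

        Site(int targetPos, char residue, String siteType) {
            this.targetPos = targetPos;
            this.residue = residue;
            this.siteType = siteType;
        }
    }

    /**
     * Returns the predicted sites if every condition passes, or an empty list
     * as soon as any condition fails. templateToTarget maps 1-based template
     * positions to 1-based target positions (assumed to come from an alignment).
     */
    static List<Site> apply(List<Condition> rule, String target,
                            Map<Integer, Integer> templateToTarget) {
        List<Site> sites = new ArrayList<>();
        for (Condition c : rule) {
            Integer pos = templateToTarget.get(c.templatePos);
            if (pos == null || pos < 1 || pos > target.length()) {
                return new ArrayList<>(); // position not aligned: rule fails
            }
            char residue = target.charAt(pos - 1);
            if (c.allowedResidues.indexOf(residue) < 0) {
                return new ArrayList<>(); // unexpected residue: rule fails
            }
            sites.add(new Site(pos, residue, c.siteType));
        }
        return sites;
    }

    public static void main(String[] args) {
        List<Condition> rule = Arrays.asList(new Condition(3, "ST", "Active site"));
        Map<Integer, Integer> alignment = new HashMap<>();
        alignment.put(3, 3); // template position 3 aligns to target position 3
        for (Site s : apply(rule, "MKSAH", alignment)) {
            System.out.println(s.siteType + " at " + s.residue + s.targetPos);
        }
    }
}
```

In this sketch a rule either passes as a whole or yields no annotation, which mirrors the statement that positive matches trigger the annotation; a real rule may carry further conditions that are not modeled here.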
To make the software easy to use, the system is designed to take the XML output of InterProScan, which is widely used by the annotation community. Users first run their protein or genomic sequences through InterProScan. PIRSitePredict then takes the InterProScan results in XML format, together with the related organism information (kingdom/sub-taxon), as inputs and applies the curated PIRSRs, site hidden Markov models (SRHMMs), and template protein sequences to predict functional sites for the protein sequences that match InterPro signatures. A template protein is a representative member of a protein family that has a 3D structure and experimental evidence for its functional sites and modifications. Prediction results are available in three formats: TSV (tab-separated values), XML, and GFF3. PIRSitePredict is provided both as an online prediction service and as a downloadable standalone software package. The online prediction service is a web application built with Spring MVC 4, Thymeleaf, Bootstrap, and jQuery. The standalone software package is a Java command-line application.
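As an illustration of the GFF3 result format mentioned above, the following Java sketch writes one predicted site as a nine-column, tab-separated GFF3 feature line. The column layout follows the GFF3 specification; the sequence identifier, source name, feature type, and attribute value are placeholders assumed for the example and may differ from what PIRSitePredict actually emits.

```java
// Illustrative sketch only: the source, type, and attribute values are
// hypothetical and may differ from what PIRSitePredict actually writes.
final class Gff3Sketch {

    /** Formats one predicted site as a nine-column, tab-separated GFF3 feature line. */
    static String toGff3Line(String seqId, String source, String type,
                             int start, int end, String note) {
        return String.join("\t",
                seqId,                   // column 1: sequence identifier
                source,                  // column 2: program that produced the feature
                type,                    // column 3: feature type
                Integer.toString(start), // column 4: 1-based start
                Integer.toString(end),   // column 5: 1-based end (inclusive)
                ".",                     // column 6: score (not set)
                ".",                     // column 7: strand (not applicable here)
                ".",                     // column 8: phase (only used for CDS features)
                "Note=" + note);         // column 9: attributes as tag=value pairs
    }

    public static void main(String[] args) {
        System.out.println("##gff-version 3"); // required GFF3 header line
        System.out.println(toGff3Line("seq1", "PIRSitePredict", "active_site", 3, 3, "example"));
    }
}
```

A single-residue site has identical start and end coordinates. The sketch does not escape reserved characters (tab, `;`, `=`, `&`, `,`) in the attribute value, which a complete writer would need to percent-encode.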