Sequence database setup: UniProt proteomes
Overview
A UniProt complete proteome consists of the set of proteins thought to be expressed by an organism whose genome has been completely sequenced. A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.
UniProtKB is a collaboration between the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource.
First, find the Proteome ID for your proteome of interest. For example, go to http://www.uniprot.org/proteomes/ and search for rice by name or by taxonomy ID. The Proteome ID for Oryza sativa subsp. japonica is UP000059680.
In Database Manager, create a new custom definition, as follows:
- Fasta or New database; Create New
- Use pre-defined template; UniProt_proteome_template
- Create
- Download from remote URL; Next
- Set up download URL
- Paste the following into the FASTA file URL field, replacing the proteome ID with the one for your proteome of interest
http://www.uniprot.org/uniprot/?query=proteome:UP000059680&format=fasta&compress=no&include=yes
- Save; Start downloading
- Activate
The complete configuration for the rice proteome in Database Manager would look similar to this (except for the URL, which is in an outdated format).
Once configured, you can enable automatic updating by clicking on the database name, then choosing Edit schedule.
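The FASTA file URL used in the steps above differs between proteomes only in the Proteome ID, so it can be assembled programmatically. The following is a minimal sketch in Python, not part of Database Manager. The function name proteome_fasta_url is hypothetical, and the URL pattern is copied from the rice example above, so it may not match the format UniProt currently serves.

```python
from urllib.parse import urlencode

# Base address taken from the rice example in this page (may be outdated).
BASE_URL = "http://www.uniprot.org/uniprot/"


def proteome_fasta_url(proteome_id: str) -> str:
    """Build the FASTA download URL for a Proteome ID (hypothetical helper)."""
    params = [
        ("query", f"proteome:{proteome_id}"),
        ("format", "fasta"),
        ("compress", "no"),
        ("include", "yes"),  # same value as in the rice example URL
    ]
    # safe=":" keeps the colon in "proteome:UP..." unescaped, as in the example.
    return BASE_URL + "?" + urlencode(params, safe=":")


rice_url = proteome_fasta_url("UP000059680")
print(rice_url)
```

The printed string is what would be pasted into the FASTA file URL field for the rice proteome.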
Download
- Locate the proteome for your organism of interest by searching by name or by taxonomy ID at http://www.uniprot.org/proteomes/
- Click on the Proteome ID link
- Click on the Download button and choose All protein entries, Fasta (Canonical and isoform), compressed
Taxonomy
Taxonomy is not required for a single-organism database.
Parse Rules
When a single entry is expanded into entries for multiple isoforms, the isoform entries all share the same ID, so the accession (AC) must be used as the unique identifier.
>sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4
AC from Fasta title: ">..|\([^|]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
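To see what the two rules extract, they can be tried against the example title line outside Mascot. This is a rough sketch in Python, assuming that the Mascot rule syntax treats | as a literal character and \( \) as the capture group, as the rules above suggest. In Python's re module the same patterns are written with ( ) for the group and \| for the literal bar. The helper name apply_rule is hypothetical.

```python
import re

# Python spellings of the two parse rules above (an assumption about how
# the Mascot syntax maps onto Python's regular expression syntax).
AC_RULE = re.compile(r">..\|([^|]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")

# Example title line from this page.
title = (
    ">sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 "
    "OS=Oryza sativa subsp. japonica GN=4CL4"
)


def apply_rule(rule, fasta_title):
    """Return the first captured group, or None if the rule does not match."""
    match = rule.match(fasta_title)
    return match.group(1) if match else None


accession = apply_rule(AC_RULE, title)
description = apply_rule(DESCRIPTION_RULE, title)
print(accession)    # Q67W82-2
print(description)  # text after the first space in the title
```

The accession rule skips the two-character database code and the first bar, then captures everything up to the next bar, so the isoform suffix (-2) is kept and the identifier stays unique.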
Configuration (Mascot 2.3 and earlier)
A Fasta file containing canonical and isoform sequences for the rice proteome was downloaded to /usr/local/mascot/sequence/rice_proteome/current and renamed to rice_proteome_20120414.fasta.
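The renamed file in this example ends in what appears to be the download date. A small sketch of generating such a name in Python follows, assuming the suffix is the date in YYYYMMDD form. The function name stamped_fasta_name is hypothetical.

```python
from datetime import date


def stamped_fasta_name(prefix: str, downloaded: date) -> str:
    """Return a Fasta file name with a YYYYMMDD date suffix (assumed convention)."""
    return f"{prefix}_{downloaded:%Y%m%d}.fasta"


example_name = stamped_fasta_name("rice_proteome", date(2012, 4, 14))
print(example_name)  # rice_proteome_20120414.fasta
```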
Full text for individual entries can be retrieved across the web from UniProt:
Host: www.uniprot.org
Port: 80
Path: /uniprot/#ACCESSION#.txt
Parse rule: RULE_23 "\(.*\)"
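The host, port and path settings combine into a single address once an accession replaces the #ACCESSION# placeholder. The sketch below, in Python, shows that substitution under two assumptions: that the placeholder is replaced by the accession verbatim, and that the request uses plain http. The function name full_text_url is hypothetical, and the sketch only builds the address without contacting the server.

```python
# Settings copied from the configuration above.
HOST = "www.uniprot.org"
PORT = 80
PATH = "/uniprot/#ACCESSION#.txt"


def full_text_url(accession: str, host: str = HOST, port: int = PORT,
                  path: str = PATH) -> str:
    """Substitute the accession into the path and build the full address."""
    resolved_path = path.replace("#ACCESSION#", accession)
    return f"http://{host}:{port}{resolved_path}"


print(full_text_url("Q67W82-2"))
```

The parse rule "\(.*\)" then captures the whole of the returned text, so no further trimming of the response is configured.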
Always test a new definition before applying the changes to mascot.dat.