RAP-DB

FAQ

What is the difference between RAP-DB and the MSU Rice Genome Annotation Project (RGAP)?

Currently, both RAP-DB and MSU RGAP 7 use the same genome sequence, “Os-Nipponbare-Reference-IRGSP-1.0”, as a reference. However, the genome annotations are not exactly the same because RAP-DB and MSU use different annotation pipelines.

What are the numbers (-00, -01, etc.) following a hyphen in a transcript ID?

We assign isoform numbers (-01, -02, etc.) to each transcript supported by evidence from rice full-length cDNAs (FLcDNAs). The isoform number (-00) is assigned to a representative transcript (the one with the longest protein-coding sequence) that is predicted by ab initio gene prediction or supported only by non-Oryza (wheat, maize, barley, etc.) transcripts. For details, please see Sakai et al., 2013, DOI:10.1093/pcp/pcs183.

In some cases, you may find transcripts numbered (-01, -02, etc.) without any evidence from rice FLcDNAs. This is because we change the isoform number from (-00) to (-01, -02, etc.) once the transcript has been confirmed by literature-based manual curation.
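As a rough illustration of the numbering rule above, the following Python sketch splits a transcript ID at its last hyphen and checks whether the isoform number is 00. The function names and the example ID (Os01t0100100-01) are illustrative assumptions, not part of any RAP-DB tool. Note that, as explained above, a number other than 00 does not by itself guarantee rice FLcDNA evidence.

```python
def split_transcript_id(transcript_id):
    """Split a transcript ID into (base ID, isoform number) at the last hyphen.

    Hypothetical helper: it only assumes the ID ends with "-" plus digits.
    """
    base, sep, isoform = transcript_id.rpartition("-")
    if not sep or not base or not isoform.isdigit():
        raise ValueError(f"no isoform number found in {transcript_id!r}")
    return base, isoform


def has_isoform_00(transcript_id):
    """Return True if the isoform number is 00 (prediction-only or
    non-Oryza-supported representative transcript, per the FAQ)."""
    return split_transcript_id(transcript_id)[1] == "00"


if __name__ == "__main__":
    # Example IDs are illustrative only.
    for tid in ("Os01t0100100-01", "Os01t0100100-00"):
        print(tid, split_transcript_id(tid), has_isoform_00(tid))
```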

Why are there many sequences that do not start with "M" in the protein sequence?

Our annotation pipeline first determines transcript structures by mapping full-length cDNA and mRNA sequences to the reference genome. Coding sequences (CDSs) are then determined based on sequence similarity (homology) to known protein sequences in UniProt. In this step, if we find only a partial coding region for a transcript, we search for a methionine (M) or a stop codon and extend the CDS toward the 5’- or 3’-end of the transcript. As a result, the pipeline sometimes produces incomplete protein-coding genes that do not start with M (ATG) or do not end with a stop codon. Truncated mRNA sequences also cause incomplete CDS predictions.

Why are there some CDS sequences whose lengths are not divisible by three? How can I get the protein sequence in that case?

In our CDS prediction pipeline, the remaining 1 or 2 bases at the 3’-end of a transcript are assigned as CDS. Such transcripts are truncated and do not have a stop codon. To get the amino acid sequence of a RAP-DB transcript, simply translate from the 5’-end of the CDS and discard the remaining 1 or 2 bases at the 3’-end of the CDS.
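The translation procedure described above can be sketched in Python as follows. This is a minimal example using only the standard library and the standard genetic code. The function name translate_cds is hypothetical, and two choices are assumptions made for this sketch rather than RAP-DB behavior: translation stops at the first in-frame stop codon, and codons containing ambiguous bases are written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS from its 5'-end, discarding 1 or 2 leftover 3' bases."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop the trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # assumption: X for ambiguous codons
        if aa == "*":  # assumption: stop at the first in-frame stop codon
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # 8 bases: two full codons (ATG, GCT) plus 2 leftover bases that are discarded.
    print(translate_cds("ATGGCTAA"))
```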

How can I use IRGSP-1.0 to study the organelle transcriptome?

The genome annotations for the mitochondrial (Mt) and plastid (Pt) chromosomes shown in the genome browsers (GBrowse and JBrowse) can be downloaded from the download page.

Where can I find the gene description or functional annotation for each transcript?

We provide the full gene annotation for each transcript in tab-separated values (TSV) format on the download page (Gene annotation information in tab-delimited text format). For example, you can get the “description” (3rd column), the “GO annotations” (10th column), and so on.
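As a sketch of how the two columns mentioned above could be read, the following Python example pulls the 3rd and 10th columns out of a tab-delimited stream using the standard csv module. The function name, the assumption that the first column holds the identifier, and the sample row are all hypothetical. If the downloaded file begins with a header line, that line is returned like any other row and should be skipped by the caller.

```python
import csv
import io

DESCRIPTION_COL = 2  # 3rd column (zero-based index)
GO_COL = 9           # 10th column (zero-based index)


def read_annotations(handle):
    """Yield (first column, description, GO annotations) for each row.

    Assumption: the first column is the identifier of the entry.
    Rows with fewer than 10 columns are skipped.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= GO_COL:
            continue
        yield row[0], row[DESCRIPTION_COL], row[GO_COL]


if __name__ == "__main__":
    # Made-up sample row with 10 tab-separated columns, for illustration only.
    sample = "\t".join(
        ["ID-1", "c2", "Example description", "c4", "c5",
         "c6", "c7", "c8", "c9", "GO:0000000"]
    ) + "\n"
    for record in read_annotations(io.StringIO(sample)):
        print(record)
```

For a downloaded file, the same function can be called on a handle opened with open(path, newline="").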

Which annotation file can I use for RNA-Seq analysis?

We provide “Gene structure (only exon) information” in GTF format for both representative and predicted transcripts. These GTF files are available on the download page and can be used for RNA-Seq analysis with HISAT and StringTie.

Which file can I use to align RNA-seq data to the rice reference transcriptome instead of the genome?

You can download all transcript sequences from the download page (Transcript sequences (CDS + UTRs) in FASTA format). These sequences can be used as a reference transcriptome in RNA-seq analysis.

How can I get the reference sequence of a specific region in JBrowse?

1. Select "Reference sequence" > "IRGSP-1.0" from "Available Tracks" on the left column.
2. Select "Set highlight" from the top "View" menu.
3. Enter the coordinates in the "Location" box and highlight the area.
4. Click ▼ at the right end of the title of the IRGSP-1.0 track.
5. Select "Save track data" and get the sequence of the highlighted region.

If you want to highlight a region more directly, click the highlighter icon on the right side of the search window and select the desired region with your mouse.

Inconsistency between cDNA and genomic sequences

A cDNA should be a copy of (part of) the genomic DNA, but the two sequences are not always identical, for reasons such as sequencing errors and polymorphisms. cDNA sequences are in general error-prone because of experimental artefacts. Therefore, when a cDNA sequence differs from the corresponding genomic sequence, we use the genomic one. As a result, even when a RAP locus was identified from a full-length cDNA, the nucleotide sequence displayed in RAP-DB is not necessarily identical to that of the cDNA.

Rice gene nomenclature

Once the International Rice Genome Sequencing Project (IRGSP) completed the genome sequencing of Oryza sativa L. ssp. japonica cultivar Nipponbare, deciphering all of the genic regions in the genome was the anticipated next step. Systematic locus identifiers were assigned to the RAP loci on the IRGSP genome assembly. An ID (OsXXg#######) consists of the species name (Os for Oryza sativa), a two-digit chromosome number, the type of identifier (g for genes), and a seven-digit number that indicates the sequential order of loci on a chromosome. This nomenclature was proposed by the Committee on Gene Symbolization, Nomenclature and Linkage in the First and Second Rice Annotation Project Meetings, modified on the basis of intensive discussions, and then approved by IRGSP/RAP. The RAP annotations with the Os identifiers were submitted to DDBJ/EMBL/GenBank under the accession numbers AP008207-AP008218 and are used in RefSeq. The Os code is also used in UniProtKB/Swiss-Prot as a locus identifier. For details, see the following paper.

Susan R. McCouch and CGSNL "Gene nomenclature system for rice" Rice (2008)

Please note that the MSU rice database (formerly known as TIGR osa1) employs a similar system (LOC_Os IDs), but its IDs differ from our Os identifiers. The IDs can be converted on the ID Converter page. A table of the LOC_Os IDs corresponding to the Os IDs is available on the data download page.
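The locus identifier layout described above (OsXXg#######) can be checked and split with a short Python sketch like the one below. The regular expression follows only the format stated in this section, the function name is hypothetical, and the example IDs are illustrative. LOC_Os IDs from MSU are deliberately rejected, since they belong to a different system.

```python
import re

# "Os" + two-digit chromosome number + "g" + seven-digit locus number.
LOCUS_ID_PATTERN = re.compile(r"^Os(\d{2})g(\d{7})$")


def parse_locus_id(locus_id):
    """Return (chromosome number, locus number) for an OsXXg####### ID."""
    match = LOCUS_ID_PATTERN.match(locus_id)
    if match is None:
        raise ValueError(f"not an OsXXg####### locus ID: {locus_id!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # Illustrative IDs only.
    print(parse_locus_id("Os01g0100100"))
    try:
        parse_locus_id("LOC_Os01g01010")
    except ValueError as err:
        print(err)
```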

Related web resources: