What is the difference between RAP-DB and MSU Rice Genome Annotation Project (RGAP)?

Currently, both RAP-DB and MSU RGAP 7 use the same reference genome sequence, “Os-Nipponbare-Reference-IRGSP-1.0”. However, the genome annotations are not identical because RAP-DB and MSU use different annotation pipelines.

What are the numbers (-00, -01, etc.) following a hyphen in a transcript ID?

We assign isoform numbers (-01, -02, etc.) to transcripts supported by evidence from rice full-length cDNAs (FLcDNAs). The isoform number -00 is assigned to a representative transcript (the one with the longest protein-coding sequence) that is predicted by ab initio gene prediction or supported only by non-Oryza (wheat, maize, barley, etc.) transcripts. For details, please see Sakai et al., 2013, DOI:10.1093/pcp/pcs183.

In some cases, you may find transcripts numbered -01, -02, etc. without any evidence from rice FLcDNAs. This is because we change the isoform number from -00 to -01, -02, etc. once the transcript has been confirmed by literature-based manual curation.
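The isoform suffix can be read off a transcript ID programmatically. The following is a minimal sketch in Python, assuming only that the isoform number is the numeric part after the last hyphen; the function names and the example ID "Os01t0100100-01" are illustrative and not part of any RAP-DB tool.

```python
def split_transcript_id(transcript_id):
    """Split a transcript ID at its last hyphen into (base ID, isoform number).

    Assumption: the isoform number is the numeric suffix after the last hyphen.
    """
    base, sep, suffix = transcript_id.rpartition("-")
    if not sep or not suffix.isdigit():
        raise ValueError(f"no numeric isoform suffix in {transcript_id!r}")
    return base, int(suffix)


def is_isoform_00(transcript_id):
    """Return True when the transcript carries the isoform number -00."""
    return split_transcript_id(transcript_id)[1] == 0


# Illustrative ID only; substitute IDs from your own data.
base_id, isoform = split_transcript_id("Os01t0100100-01")
print(base_id, isoform)
```

Note that a non-zero isoform number does not by itself prove rice FLcDNA support, because manually curated transcripts are also renumbered from -00.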

Why are there many sequences that do not start with "M" in the protein sequence?

Our annotation pipeline first determines transcript structures by mapping full-length cDNA and mRNA sequences to the reference genome. Coding sequences (CDSs) are then determined based on sequence similarity (homology) to known protein sequences in UniProt. In this step, if we find only a partial coding region for a transcript, we search for a methionine (M) or a stop codon and extend the CDS toward the 5’- or 3’-end of the transcript. As a result, in some cases, incomplete protein-coding genes are generated that do not start with an M (ATG) or do not end with a stop codon. Truncated mRNA sequences also cause incomplete CDS predictions.
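To flag such incomplete predictions in your own analysis, you can test each CDS for a start codon and a terminal stop codon. This is a minimal Python sketch, assuming the CDS is given as a DNA string and that a complete CDS includes its stop codon; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def cds_completeness(cds):
    """Return (has_start, has_stop) for a CDS given as a DNA string.

    has_start: the CDS begins with ATG.
    has_stop:  the CDS length is a multiple of three and the last
               codon is a stop codon.
    """
    seq = cds.upper()
    has_start = seq.startswith("ATG")
    has_stop = len(seq) % 3 == 0 and seq[-3:] in STOP_CODONS
    return has_start, has_stop


# A CDS that is complete at both ends, and one truncated at both ends.
print(cds_completeness("ATGAAATAA"))
print(cds_completeness("GCTAAATA"))
```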

Why are there some CDS sequences that are not divisible by three? How can I get the protein sequence in that case?

In our CDS prediction pipeline, the remaining 1 or 2 bases at the 3’-end of a transcript are assigned to the CDS. To get the amino acid sequences of such RAP-DB transcripts, simply translate from the 5’-end of the CDS and discard the remaining 1 or 2 bases at the 3’-end. These transcripts are truncated and do not have a stop codon.
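The translation described above can be sketched in a few lines of Python. This is an illustration only, assuming the standard genetic code and a CDS supplied as a DNA string; the function name is hypothetical, and codons containing ambiguous bases are written as "X".

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS from its 5'-end, discarding 1 or 2 trailing bases."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop the incomplete final codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # "X" for codons with ambiguous bases
        for i in range(0, usable, 3)
    )


# 11 bases: three full codons (ATG GCT AAA) plus two leftover bases (GC).
print(translate_cds("ATGGCTAAAGC"))  # MAK
```

A stop codon, when present, is written as "*" and can be stripped from the end of the result if it is not wanted.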

How can I use IRGSP-1.0 to study the organelle transcriptome?

The genome annotations for the Mt and Pt chromosomes that are shown in the genome browsers (GBrowse and JBrowse) can be downloaded from the download page.

Where can I find the gene description or functional annotation for each transcript?

We provide the gene annotation for every transcript in tab-separated values (TSV) format on the download page (Gene annotation information in tab-delimited text format). For example, you can get the “description” (3rd column), “GO annotations” (10th column), etc.
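The columns mentioned above can be pulled out with Python's csv module. This is a sketch under stated assumptions: the file name "annotation.tsv" is a placeholder for the file you downloaded, the first column is assumed to hold the identifier, and any header line is passed through as an ordinary row unless it starts with "#".

```python
import csv


def read_annotations(lines):
    """Yield (first column, description, GO annotations) for each TSV row.

    `lines` is any iterable of text lines, such as an open file.
    The description is the 3rd column and GO annotations the 10th.
    Assumption: the first column holds the identifier.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip comment lines and rows too short to have a 10th column.
        if len(row) < 10 or row[0].startswith("#"):
            continue
        yield row[0], row[2], row[9]


def load_annotations(path):
    """Read a TSV file into a list of (identifier, description, GO) tuples."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(read_annotations(handle))


# Usage (placeholder file name):
# for identifier, description, go_terms in load_annotations("annotation.tsv"):
#     print(identifier, description, go_terms)
```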

Which file can I use to align RNA-seq data to the rice reference transcriptome instead of the genome?

You can download all transcript sequences from the download page (Transcript sequences (CDS + UTRs) in FASTA format). These sequences can be used as a reference transcriptome in RNA-seq analysis.