------------------------------------------------------------
          The Rice Annotation Project Release 2
------------------------------------------------------------

CONTENTS

IRGSP_masked:
    The IRGSP genome sequence (build 4) masked by RepeatMasker with
    TIGR's repeat data. Repeat regions are written in lower-case
    letters.

Assembly Information:
    List of BAC/PAC clones and gap position information.

Repeats:
    Repeat information created by using RepeatMasker with the TIGR
    Oryza Repeat Database.

GFF Files:
    RAP annotation release 2 in the GFF format.

    rep.gff:
        Representative transcripts selected in each locus. When a
        locus contained multiple transcripts, one of them was chosen
        as the representative.
    all.gff:
        All the transcripts, including the representatives. Note that
        the ORFs of non-representative transcripts were not curated.
    prediction.gff:
        Protein-coding regions predicted by ab initio methods but not
        supported by cDNAs. These are not included in all.gff.
        Predictions that are supported by cDNAs are included in
        rep.gff and all.gff.

    We suggest using rep.gff if you want a curated, unambiguous
    dataset of transcripts and ORFs.

    For details of the methods of cDNA mapping, representative
    selection, and gene prediction, see the following paper:
    http://www.genome.org/cgi/content/full/17/2/175

All sequences:
    All of the sequences with evidence of expression. This data set
    includes alternative splicing variants.

    all_nuc.fa:
        Nucleotide sequences of both protein-coding and
        non-protein-coding genes (N = 213,864)
    all_orf.fa:
        Amino acid sequences (N = 208,550)
    all_orf_nuc.fa:
        Nucleotide sequences of protein-coding genes (N = 208,550)

Representative sequences:
    Representative sequences selected in each locus.

    rep_nuc.fa: N = 31,439
    rep_orf.fa/rep_orf_nuc.fa: N = 30,192

[IMPORTANT NOTE 1]
    These data sets contain sequences created from the 5'/3'-end
    sequences of full-length cDNA clones, with the gaps between the
    two ends filled in using genomic sequences.
    Therefore, they may contain intronic sequences.

[IMPORTANT NOTE 2]
    More than 2,000 cDNAs remained unmapped to the genome for some
    reason. They do not show up in the regular GBrowse window, but
    can be found with our keyword search function. They are listed
    under their INSDC accession numbers, because we could not assign
    the Os code, which is based on the genomic positions of genes.

[IMPORTANT NOTE 3]
    A cDNA should be a copy of (part of) the genomic DNA, but the two
    sequences are not always identical, for reasons such as
    sequencing errors and polymorphisms. cDNA sequences are in
    general error-prone because of experimental artefacts, so when a
    cDNA sequence differed from the corresponding genomic sequence,
    we decided to use the genomic one. Therefore, even though a RAP
    locus was identified by a full-length cDNA, the nucleotide
    sequence displayed in the RAP-DB is not necessarily identical to
    that of the cDNA.

Predicted sequences:
    Protein-coding genes predicted by ab initio methods but with no
    evidence of expressed transcripts. N = 22,022

RNA:
    rRNA and tRNA information in the GFF files.

    rRNA.gff:
        rRNA detected by RepeatMasker with the TIGR Repeat Database.
        (N = 781)
    tRNA.gff:
        tRNA predicted by tRNAscan-SE ver. 1.23. (N = 746)

Table of the Os code and MSU's LOC_Os:
    A list of the identifiers of RAP and MSU loci. This list contains
    predicted genes without transcription evidence. Note that 2,050
    unmapped genes are not included, because the Os code was assigned
    only to those mapped to the genome.
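
Example (unofficial sketch): checking a download against this README
    The sequence counts (N) and the lower-case repeat masking described
    in this file can be checked after download. The Python sketch below
    is not part of the RAP distribution. It assumes the FASTA files are
    uncompressed plain text in one directory and are named exactly as
    listed in this README. The function names (count_fasta_records,
    soft_masked_fraction, check_counts) are hypothetical helpers made up
    for this example. This README does not give a file name for the
    masked genome, so the path passed to soft_masked_fraction is up to
    you.

```python
from pathlib import Path

# Record counts as stated in this README (file name -> N).
EXPECTED_COUNTS = {
    "all_nuc.fa": 213864,
    "all_orf.fa": 208550,
    "all_orf_nuc.fa": 208550,
    "rep_nuc.fa": 31439,
    "rep_orf.fa": 30192,
    "rep_orf_nuc.fa": 30192,
}


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def soft_masked_fraction(path):
    """Return the fraction of sequence letters written in lower case.

    In the masked genome, lower-case letters mark repeat regions, so
    this approximates the repeat-masked share of the sequence.
    """
    lower = 0
    total = 0
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip header lines
            seq = line.strip()
            total += len(seq)
            lower += sum(1 for ch in seq if ch.islower())
    return lower / total if total else 0.0


def check_counts(directory="."):
    """Map each README file found in `directory` to (counted, expected)."""
    results = {}
    for name, expected in EXPECTED_COUNTS.items():
        path = Path(directory) / name
        if path.exists():
            results[name] = (count_fasta_records(path), expected)
    return results


if __name__ == "__main__":
    for name, (found, expected) in check_counts().items():
        status = "OK" if found == expected else "MISMATCH"
        print(f"{name}: found {found}, README says {expected} [{status}]")
```

    Files that are missing from the directory are skipped without an
    error, so the check can be run on a partial download. A mismatch
    may simply mean that the file comes from a different release.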