1 Goal of Riken mouse genome encyclopedia project
The goal of our project is to establish a mouse genome encyclopedia and to develop a series of original technologies to achieve it. The Riken mouse encyclopedia is also the platform for a second goal: to draw genome-wide pictures of gene cascades that can account for the mechanisms connecting genes to phenotypes. The encyclopedia will consist of five components: the non-redundant mouse full-length cDNA clone bank, the mouse full-length cDNA sequence bank, the chromosomal locations, expression profiles and protein–protein interactions, covering as many genes as can be collected. We have been developing the full-length cDNA technologies and high-speed sequencing technologies to analyze these materials. The purpose of our project is not only the analysis of the sequence of full-length cDNAs but also the development of a new approach for research based on the encyclopedia. We would like to develop new post-sequence technologies and systems which can maximize the usefulness of the encyclopedia by utilizing full-length cDNA clones.
2 History and current status of Riken mouse genome encyclopedia project
Since 1995, the Riken genome exploration research group has been developing a series of new systems to construct full-length cDNA libraries [1–3] and a high-speed sequencing system named RISA [4]. Using these systems, approximately 1 000 000 mouse full-length cDNA clones were isolated from 163 independent tissues at different stages from different organs, and 3′ end sequence data classified these clones into 128 000 clusters. The full sequences of 60 000 representative clones were determined, resulting in 36 000 unique sequences. We are also producing data on expression profiles and protein–protein interactions [5–8]. The integrated database, which includes not only full-length cDNA sequences but also mapping, expression profile and protein–protein interaction data for all these genes, is very useful for the analysis of gene functions, for supporting a positional candidate approach, and for gene network analysis to connect genes to phenotypes.
3 Achievements
3.1 Development of a system to prepare the mouse genome encyclopedia
To establish the mouse genome encyclopedia, two series of technologies are needed: a method for constructing high-quality full-length cDNAs, and the RISA (Riken Integrated Sequence Analysis) system, a high-speed sequencing system.
At present, our Riken full-length cDNA method consists of four key technologies: an elongation method for first strand cDNA synthesis [3], a selection method for eliminating partial cDNA [1,2], a normalization and subtraction method [9] to avoid redundancy in subsequent sequencing efforts, and a new cloning vector [10]. The following technologies also play an important role: a protocol for making libraries from small amounts of tissue, removal of the poly-A stretch, and mixed- and tagged-cDNA libraries.
We also have developed a large-scale plasmid preparation system [11,12], a transcriptional sequencing reaction system [13–15], and a high-speed 384-capillary sequencer that should enable analysis of 40 000 samples/day [4]. The most important point is that all of these technologies have been incorporated into a single system to achieve our goal.
3.2 Full-length cDNA library technology
Our group is approaching this project with originally developed technologies. The Riken full-length cDNA method consists of four key technologies as described below.
3.2.1 Elongation method
The basic principle for the elongation method is that the first strand cDNA synthesis should be undertaken at high temperature because the main cause of partial cDNA synthesis is the secondary structure of the mRNA. The elongation method is based on trehalose-mediated thermostabilization, thermoactivation and thermoprotection of reverse transcriptase [3]. In addition to this development, we also discovered that trehalose has some effect, which may be large, on half of the enzymes generally used.
3.2.2 Selection method
The selection method is named ‘the Cap trapper method’ [1,2]. This method employs chemical modification of the Cap site to oxidize the diol structure, which is specific for non-redundant cDNA. The resulting dialdehyde structure reacts with a hydrazide, so the diol group can be biotinylated with biotin hydrazide. After first strand synthesis, the Cap site is labeled by the above-mentioned chemical modification, and RNaseI is used to cleave the single-strand RNA but not the DNA-RNA hybrid. In mRNA paired with partial cDNA, the single-strand RNA is exposed and can be attacked by RNaseI. Therefore, RNaseI treatment removes the biotin group from mRNA with partial cDNA, and only the biotin label at the Cap site of mRNA with full-length first strand cDNA remains. Subsequently, full-length cDNA can be collected using avidin beads.
3.2.3 Normalization-subtraction method
Highly expressed genes and already-collected genes reduce the efficiency of collection of novel cDNAs by one-pass sequencing. To eliminate them, we developed a reiterative normalization and subtraction method utilizing biotinylated RNAs as drivers [9]. Currently, an amplified library subtraction system is being introduced in order to rescue, from existing cDNA libraries, the cDNAs that have not yet been classified by one-pass sequencing.
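The effect of highly expressed genes on one-pass sequencing can be illustrated with a toy calculation. The sketch below is not the normalization procedure itself. It assumes that clones are picked independently with probability proportional to transcript abundance, and the abundance values and the function name `expected_distinct` are hypothetical.

```python
def expected_distinct(abundances, picks):
    """Expected number of distinct genes seen after `picks` random clone picks.

    Toy model (an assumption, not the authors' method): each pick draws gene i
    with probability abundances[i] / sum(abundances), independently.
    """
    total = sum(abundances)
    return sum(1 - (1 - a / total) ** picks for a in abundances)


# Hypothetical library of 100 genes: one very abundant transcript, 99 rare ones.
skewed = [1000] + [1] * 99
# The same 100 genes after an idealized normalization (equal abundance).
normalized = [1] * 100

if __name__ == "__main__":
    print(round(expected_distinct(skewed, 100), 1))
    print(round(expected_distinct(normalized, 100), 1))
```

Under these assumptions, the same 100 sequencing reads recover far more distinct genes from the normalized library than from the skewed one, which is the motivation for normalizing before one-pass sequencing.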
3.2.4 Removal of poly-A tails from FL-cDNAs
Poly-stretches of nucleotides, such as the poly(A) tail in cDNAs, are known to interfere with the processivity of DNA/RNA polymerase in the sequencing reaction, resulting in reduced read lengths and lower rates of successful sequences. To overcome this problem, we developed a new method to remove poly(A) tails from cDNAs using Type II restriction enzymes [16]. We also removed the G-stretch previously used for priming the second strand by a new linker adapter strategy that introduces, with high efficiency, a sequence to prime the second strand cDNA [17].
3.2.5 Development of host/vectors for FL-cDNA libraries
Two kinds of cloning vectors were developed [10]: lambda FLC1 and FLC2. Lambda FLC1 can clone a wide range of cDNAs from 0.5 to over 13 kb long and shows a slight size preference for long cDNA clones [18]. We could routinely prepare cDNA libraries that, once bulk-excised into plasmids, showed average sizes of 2 to 3 kbp. Lambda FLC2 is a modified lambda FLC1, which contains att sites flanking the cloning sites. The att sites allow easy transfer of a cDNA insert into other vectors for expression and other functional studies with lambda recombinase. This system should facilitate the functional analyses of cDNAs in post-sequence research.
3.3 High throughput sequencing system
We established and expanded our large-scale sequencing system. It comprises FL-cDNA library construction, an E. coli picking system, a plasmid preparation system, a sequencing reaction system, a sequencing system, and the management of samples and data. The current sequencing capacity is 40 000 samples per day. All samples are tracked to avoid confusion of IDs (ID errors), and quality control checking is routinely done.
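To give a feel for what this capacity means, the following back-of-the-envelope sketch converts a number of samples into days of machine time. It assumes one sequencing read per sample and uninterrupted operation at full capacity, which is an idealization. The function name `days_at_capacity` is hypothetical.

```python
import math


def days_at_capacity(samples, capacity_per_day=40_000):
    """Whole days needed to process `samples` at the stated daily capacity.

    Assumes one read per sample and no downtime (an idealization).
    """
    return math.ceil(samples / capacity_per_day)


if __name__ == "__main__":
    # One 3'-end read for each of the approximately 1 000 000 clones.
    print(days_at_capacity(1_000_000))
```

Full-length sequencing requires several reads per clone, so the same estimate for the full-sequencing phase would scale with the number of reads needed per clone.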
3.3.1 Plasmid preparation system
We have designed and introduced three instruments: an instrument for medium distribution and E. coli inoculation, a harvester of E. coli culture solution, and a plasmid extractor [13]. These instruments can process 40 000 plasmid extractions per day and are now being optimized to achieve constant yield and quality for sequencing templates.
3.3.2 Transcriptional sequencing (TS)
To develop the TS system [14–16], we developed a mutated RNA polymerase, which can incorporate 3′ dNTPs preferentially and uniformly, and fluorescent 3′ dNTP dye terminators.
3.3.3 Development of 384-capillary sequencer (RISA sequencer)
We completed the development of the first version of a 384-capillary sequencer (RISA 2) [4] at the end of 1996. Shimadzu has marketed it commercially since November 1999.
3.4 Data management system
We have developed many programs to analyze sequence data produced in our high-throughput sequencing system [19], such as a set of tools automatically classifying cDNA clones based on the 3′-end sequences and tools for automatically registering sequence data in the encyclopedia database [20].
We have also established an assembly and primer design system. The assembly system can handle three kinds of sequence data produced by different kinds of sequencers in a uniform style. If a gap remains, primers for primer-walking sequencing can be easily designed. Publicly available sequences, such as EST data, can also be utilized to fill the gaps.
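The primer-walking step can be sketched as follows. This is a minimal illustration and not the primer design system described above: it simply scans back from the 3′ end of an assembled contig for the first window whose GC content lies in a chosen range. The function names and all parameter values (primer length, search distance, GC bounds) are assumptions made for the example.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pick_walking_primer(contig, primer_len=20, max_back=100,
                        gc_min=0.4, gc_max=0.6):
    """Return (start, primer) for the window nearest the contig's 3' end
    that has no ambiguous bases and an acceptable GC fraction, or None.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    contig = contig.upper()
    last_start = len(contig) - primer_len
    if last_start < 0:
        return None  # contig is shorter than the requested primer
    first_start = max(0, last_start - max_back)
    for start in range(last_start, first_start - 1, -1):
        window = contig[start:start + primer_len]
        if "N" in window:
            continue  # skip windows containing uncalled bases
        if gc_min <= gc_fraction(window) <= gc_max:
            return start, window
    return None


if __name__ == "__main__":
    print(pick_walking_primer("ATGC" * 10))
```

A primer chosen this way reads from the end of the existing contig into the gap. A real design system would also check melting temperature, self-complementarity and uniqueness within the clone.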
We have a database system based on Sybase DBMS to manage most information derived from our sequencing system, tissue sources for FL-cDNA libraries, clone ID, 3′-end sequences, full-length sequences, and clustering information. This database is updated daily. The summarized information can be viewed with Web browsers in a user-friendly manner.
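The kind of record such a database holds can be sketched with a small in-memory example. The code below uses Python's built-in `sqlite3` module as a stand-in for the Sybase DBMS. The table name, column names and query are hypothetical and only mirror the items listed above (tissue source, clone ID, 3′-end sequence, full-length sequence and cluster).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one hypothetical `clone` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE clone (
            clone_id        TEXT PRIMARY KEY,
            tissue_source   TEXT NOT NULL,
            three_prime_seq TEXT,
            full_length_seq TEXT,      -- NULL until fully sequenced
            cluster_id      INTEGER
        )
        """
    )
    return conn


def clusters_without_full_sequence(conn):
    """Return IDs of clusters in which no clone has a full-length sequence."""
    rows = conn.execute(
        """
        SELECT cluster_id
        FROM clone
        GROUP BY cluster_id
        HAVING SUM(full_length_seq IS NOT NULL) = 0
        ORDER BY cluster_id
        """
    ).fetchall()
    return [row[0] for row in rows]
```

A query of this shape is one way to list clusters that still need a representative clone sent to full sequencing.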
3.5 Data accumulation
The mouse genome encyclopedia is prepared in five phases. In Phase I, we constructed mouse full-length cDNA libraries using as many tissues as could be collected and clustered these clones by end sequencing, to produce a non-redundant full-length cDNA set. In Phase II, the full sequences of the rearrayed clones from the non-redundant cDNA library have been determined. In Phase III, the chromosomal locations of all full-length cDNAs have been identified by in-silico hybridization to the human and mouse genome sequences [21,22]. Phase IV is producing a basic database of gene expression in the body and during development from embryo to adult [5,6,8]. Finally, Phase V is the step to produce the protein–protein interactions, based on the biggest advantage of full-length cDNA: it can express the whole structure of a protein [7,23], whereas partial cDNA (expressed sequence tag; EST) and genomic sequence cannot.
3.6 Data production
3.6.1 cDNA libraries
Almost all cDNA libraries, constructed from over 163 tissues and cells for the first volume of the Riken Mouse Genome Encyclopedia, were normalized and/or subtracted.
3.6.2 One million cDNA clones and their 3′-end sequences
By clustering of the 3′-end sequences, a total of 128 000 clusters of cDNAs have been obtained. However, this classification includes some overlap or redundancy, because of various forms of splicing, alternative poly adenylation sites, some internal priming, and clustering limitation due to sequencing fidelity and other factors.
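The idea of grouping clones by their 3′-end sequences, and the way sequencing errors limit it, can be shown with a toy example. The sketch below is not the clustering tool used in the project. It is a simple greedy scheme that compares reads position by position without gaps, and the function names and the identity threshold are assumptions.

```python
def identity(a, b):
    """Ungapped fraction of matching positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(reads, min_identity=0.9):
    """Assign each read to the first cluster whose representative it matches.

    Returns a list of cluster indices, one per read. A read that matches no
    existing representative starts a new cluster and becomes its representative.
    """
    representatives = []
    assignment = []
    for read in reads:
        for idx, rep in enumerate(representatives):
            if identity(read, rep) >= min_identity:
                assignment.append(idx)
                break
        else:
            representatives.append(read)
            assignment.append(len(representatives) - 1)
    return assignment
```

With a strict threshold, reads of the same transcript that differ by a few sequencing errors fall into separate clusters, and transcripts with alternative polyadenylation sites have different 3′ ends altogether. Both effects inflate the cluster count in the way described above.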
3.6.3 Full sequences of cDNA clones derived from non-redundant FL-cDNA set
Representative clones of clusters that are based on 3′-end sequences of cDNAs were rearrayed and used for the full-sequencing phase. Apparently novel genes, as estimated by comparison of 3′-end sequences with public DNA databases, were given high priority for full sequencing. So far, about 60 000 full sequences from the 128 000 clusters have been determined. The 60 000 sequences still contained redundancy and were clustered into 36 000 completely independent unique sequences.
3.7 Functional annotation of FL-cDNAs
In order to annotate the function of the 60 177 full-length Riken mouse cDNAs sequenced at RIKEN, we held the FANTOM (Functional Annotation Of Mouse) meeting [24] from 28th August to 8th September to establish an international standard of annotation. This annotation activity covers not only functional information itself, but also many other informative data, such as supplemental descriptions of gene function (gene symbol and its synonyms), functional classification (Gene Ontology and TIGR EGAD), chromosomal localization (from genetic mapping and physical mapping if available), expression specificity (organ localization and sub-cellular localization of cDNA), and mutation information (disease and knockout mouse information) [18,25].
The Riken mouse genome encyclopedia is, together with the human, the most detailed transcriptome described in any organism to date. Analysis of these cDNAs extends known gene families and identifies new ones.
3.8 DNA microarrays
3.8.1 Construction of high-throughput arrayer
A new arrayer has been constructed, having two arms, each of which holds a pin head, and a large stage on which 96 microarrays can be prepared simultaneously. When a 16-pin head is adopted, 96 microarrays of 30K cDNAs can be prepared in 100 h. A 48-pin head device allowing a faster mode is also available. The maximum performance is expected to be 96 microarrays of 30K cDNAs in 16 h.
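The figures above can be turned into spotting rates with a short calculation. The sketch assumes that "30K cDNAs" means 30 000 spots per microarray and that the spotting rate per pin is constant. The function name `arrayer_estimate` and the returned keys are hypothetical.

```python
def arrayer_estimate(arrays=96, spots_per_array=30_000,
                     hours_16_pin=100, hours_48_pin_stated=16):
    """Derive spotting rates from the stated run times.

    Assumes 30K cDNAs means 30 000 spots per array and a constant
    per-pin rate when moving from 16 to 48 pins.
    """
    total_spots = arrays * spots_per_array
    rate_16 = total_spots / hours_16_pin          # spots per hour, 16-pin head
    per_pin = rate_16 / 16                        # spots per hour per pin
    hours_48_same_rate = total_spots / (per_pin * 48)
    # Extra per-pin speed-up needed to reach the stated 16 h run time.
    extra_speedup = hours_48_same_rate / hours_48_pin_stated
    return {
        "spots_per_hour_16_pin": rate_16,
        "spots_per_hour_per_pin": per_pin,
        "hours_48_pin_same_rate": hours_48_same_rate,
        "extra_speedup_for_stated_time": extra_speedup,
    }


if __name__ == "__main__":
    for name, value in arrayer_estimate().items():
        print(name, round(value, 2))
```

Under these assumptions, tripling the pin count alone would bring the run to roughly 33 h, so the expected 16 h figure implies that the faster mode also roughly doubles the per-pin spotting rate.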
3.8.2 19K RIKEN microarray to 40K RIKEN microarray
We have established 19K microarrays of Riken cDNAs. Expression profiles of various tissues and several developmental stages have been investigated [6]. Microarray data are analyzed and stored in our expression database, READ [8]. 19 000 genes were analyzed in 20 tissues. This database serves as a fundamental resource for functional research on each cDNA and each gene cascade. The size of the database is being expanded to 40K cDNAs in 20 tissues.
3.9 Development of a protein–protein interaction analysis system
To uncover the function of each gene in a systematic genome-wide approach, a protein–protein interaction (PPI) panel covering all genes [7] is very important. PPIs play pivotal roles in the network of cellular biological processes, and they are also potential targets for drug development. However, it does not seem easy to establish an entire PPI panel in mouse, because the estimated total number of mouse genes (100 000) is far larger than those of budding yeast (∼6000) and C. elegans ().
To address this difficulty, we have developed a high-throughput PPI assay system that consists of a PCR-mediated sample preparation and a modified mammalian two-hybrid method. In the pilot study, the system achieved the examination of more than 106 combinations per day. We have also developed a method for selecting assay samples that allows us to find significantly interacting combinations efficiently, based on the demonstration that two genes co-expressed in the same tissues at the same stages preferentially interact with each other. These two key developments pave the way for completing a rough draft of an entire protein–protein interaction panel in mouse within a few years.
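The scale of the problem follows from simple counting. The sketch below assumes that every unordered pair of distinct genes is tested once, which is a simplification of a two-hybrid screen. The gene numbers are the estimates quoted above, and the function name `pairwise_combinations` is hypothetical.

```python
from math import comb


def pairwise_combinations(n_genes):
    """Number of unordered pairs of distinct genes, n * (n - 1) / 2."""
    return comb(n_genes, 2)


if __name__ == "__main__":
    mouse = pairwise_combinations(100_000)
    yeast = pairwise_combinations(6_000)
    print(mouse)                    # 4999950000
    print(yeast)                    # 17997000
    print(round(mouse / yeast, 1))
```

Because the number of pairs grows with the square of the gene count, a roughly 17-fold difference in gene number becomes a difference of more than two orders of magnitude in the number of combinations to test.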
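The co-expression-based selection of assay samples can be illustrated as follows. This is a minimal sketch and not the selection method used in the project: it scores co-expression by the Pearson correlation of expression profiles across tissues, which is an assumption, and the gene names, profiles and threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def coexpressed_pairs(profiles, min_r=0.8):
    """Return gene pairs whose expression profiles correlate at or above min_r.

    `profiles` maps a gene name to its expression values across the same
    ordered set of tissues or stages.
    """
    pairs = []
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= min_r:
            pairs.append((g1, g2))
    return pairs


if __name__ == "__main__":
    demo = {
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.0, 4.0, 6.0, 8.5],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    print(coexpressed_pairs(demo))
```

Restricting the two-hybrid assay to pairs that pass such a filter reduces the number of combinations to test while, by the observation stated above, enriching for interacting pairs.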
4 Application of cDNA system to other projects
Arabidopsis thaliana full-length cDNA Project
To determine the chromosomal locations and expression profiles of this plant model organism, our group is collaborating with the Plant Molecular Biology Laboratory using our cDNA cloning technique [26,27]. At the moment, 115 000 cDNA clones have been constructed based on 3′-end sequences of cDNAs [28]. Full sequences of 15 000 clones at 99.99% accuracy will be finished very soon as an international collaboration. All of these clones were mapped onto the Arabidopsis thaliana genome sequence [29].
5 Future plan
Our final goal is to establish a system for genome-wide understanding of biological phenomena at the molecular level, particularly in the medical field. In order to achieve this goal, the first step is to collect data on all full-length cDNAs, their primary structures, and expression sites. Also important is the chromosomal mapping of the cDNAs at the sequence level in order to connect the gene and the phenotype. We have started developing a system to establish such an encyclopedia using a full-length cDNA system and the RISA sequencing system. To overcome possible resource problems, we chose mouse materials from inbred, congenic and knockout strains, which are available with no limitation for the preparation of tissues such as samples at very early embryonic stages and fertilized eggs. Human full-length cDNA sequences can be predicted in silico by homology search against our mouse full-length cDNAs [21,29]. This enhances the significance of our strategy of choosing mouse cDNA as a target.
We have begun collecting mouse full-length cDNAs and are finding our approach an extremely powerful one for analyzing and explaining why certain genes cause a phenotype. To connect gene(s) and phenotype(s), our encyclopedia is very useful for identifying candidate gene(s) responsible for the phenotype(s) in the positional candidate approach. The cDNA microarrays are also useful for selecting a set of genes, which are transcriptionally regulated downstream of the target gene, using mutant and normal tissues. We plan to continue development of the mouse genome encyclopedia and the technologies to establish it and make it widely useful in order to enable the depiction of genome-wide maps from gene(s) to phenotype.
Acknowledgements
This study has been supported by Research Grant for the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government to Y.H.