1 Introduction
Although microarray techniques have been available for several years and large amounts of data have been gathered, major breakthroughs are still to come. Heterogeneity in technology, platforms and computing options may be partly to blame for the delay [1], but the lack of a well-thought-out exchange infrastructure represents the major hurdle. So far, most microarray data have been published on individual web sites. These resources are usually of limited value because their annotation is insufficient in both quantity and quality. By preventing cross-platform analysis and mining, these limitations make it almost impossible to fully exploit the data accumulated so far. The complexity and size of the data produced by microarray technology have therefore created a need for guidelines that make data exchangeable. To this end, two standards have been devised: the first addresses the structure of the data, and the second the amount of information required to annotate a microarray experiment. ArrayExpress aims to provide a public repository that implements these standards and supplies the infrastructure needed to support microarray data exchange and interpretation.
2 Structuring the microarray data
Exchanging data requires common standards to describe, structure and format data in a way that can be implemented irrespective of technical choices. The MAGE–OM object model is a platform-independent data model capable of describing the intrinsic complexity of a microarray-based experiment. The MAGE–ML language, an XML-derived language, and its related Document Type Definition (DTD) have been generated from the MAGE–OM object model [2]. These three elements have now been granted the status of bioscience standards by the Object Management Group (OMG) and are gaining broader acceptance among the most prominent industrial and academic players in the field. ArrayExpress and its environment are the first functional implementation of the MAGE–OM object model to allow data submission in MAGE–ML format.
3 The challenge of data annotation
In addition to defining standards for data structure and modeling, the huge challenge of annotation has to be addressed, in both quantity and quality, to ensure complete data compatibility and reusability. The MIAME requirements (Minimum Information About a Microarray Experiment) [3] have been developed to tackle the question of how much information must be supplied. For every critical element of a microarray-based experiment, the standard defines the information that must be provided by anyone willing to share the results of their work.
Where quality of annotation is concerned, a critical issue is the need for machine-processable descriptions. For information to be processed automatically, consistent annotation is paramount if mining agents are to work efficiently; synonyms and free text should therefore be avoided. To this end, an effort has been made to develop field-specific ontologies, which capture knowledge, and controlled vocabularies for efficient annotation of microarray experiments. Among these, the Biomaterial Ontology, established by the MGED society, provides a standard way of annotating the biological samples from which mRNA is extracted and used in microarray experiments. The ontology itself relates and cross-references to several controlled vocabulary projects and annotation databases, thus taking advantage of existing efforts. The MGED Biomaterial Ontology is available at http://www.cbil.upenn.edu/Ontology/MGEDontology.html.
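To illustrate why synonyms and free text hinder automated processing, the following minimal Python sketch maps free-text sample descriptions onto a controlled vocabulary and flags values it cannot resolve. The vocabulary entries, the synonym table and the function names (`normalize_term`, `annotate`) are hypothetical and are not taken from the MGED ontology; the sketch only shows the general idea under those assumptions.

```python
# Hypothetical controlled vocabulary: canonical term -> accepted synonyms.
# These entries are illustrative only and are NOT taken from the MGED ontology.
CONTROLLED_VOCABULARY = {
    "Homo sapiens": {"human", "h. sapiens", "homo sapiens"},
    "Mus musculus": {"mouse", "m. musculus", "mus musculus"},
    "liver": {"liver", "hepatic tissue"},
}

# Reverse index so that each synonym resolves to exactly one canonical term.
SYNONYM_INDEX = {
    synonym: canonical
    for canonical, synonyms in CONTROLLED_VOCABULARY.items()
    for synonym in synonyms
}


def normalize_term(free_text):
    """Return the canonical term for a free-text value, or None if unknown."""
    # Lower-case and collapse whitespace before looking the value up.
    key = " ".join(free_text.lower().split())
    return SYNONYM_INDEX.get(key)


def annotate(sample):
    """Split a sample's annotations into normalized terms and rejected values."""
    accepted, rejected = {}, {}
    for field, value in sample.items():
        canonical = normalize_term(value)
        if canonical is None:
            rejected[field] = value  # needs attention from a curator
        else:
            accepted[field] = canonical
    return accepted, rejected
```

In such a scheme, two submitters who write "Human" and "H. sapiens" end up with the same machine-readable annotation, while unrecognized values are set aside for review rather than silently stored.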
4 Submission routes to ArrayExpress
Based on the experience gained from the sequencing projects [4], submission procedures have been devised to suit different submitters' needs. MAGE–ML pipelines have been tailored for institutions involved in high-throughput projects (e.g., The Sanger Center, TIGR, Affymetrix) and for microarray computing projects such as BASE [5]. For smaller-scale projects, or for those with limited bioinformatics support, MIAMExpress, a MIAME-compliant web-based tool for submission and annotation, is available. MIAMExpress can be used as a submission tool once all experiments are completed or, alternatively, on a daily basis as an electronic lab book. It provides a simple and robust way of submitting experiments, protocols and arrays while ensuring appropriate formatting and annotation. The tool takes care of the complexity of conversion to the MAGE–ML format, so that researchers using MIAMExpress are one click away from submission. MIAMExpress is implemented as Perl CGI scripts and stores the data temporarily in a MySQL database. This transient storage has two purposes: (1) to store pending submissions and (2) to enable quality control of annotation and structure by the microarray curation team. Throughout the submission process, submitters are assisted and guided by the curation team, which can be reached at arraysubs@ebi.ac.uk. Finally, MIAMExpress can also be set up as a standalone tool and is available as open source from http://www.sourceforge.net/.
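The two purposes of the transient storage, holding pending submissions and supporting curation, can be sketched as a small status workflow. The example below is only an illustration: it uses Python with an in-memory SQLite database in place of the Perl CGI scripts and MySQL database mentioned above, and the table layout, status values and function names are assumptions rather than the actual MIAMExpress schema.

```python
import sqlite3


def open_store():
    """Create an in-memory store with a hypothetical 'submission' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submission ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def submit(conn, title):
    """Record a new submission; it stays 'pending' until curated."""
    cur = conn.execute("INSERT INTO submission (title) VALUES (?)", (title,))
    conn.commit()
    return cur.lastrowid


def curate(conn, submission_id, approved):
    """Record the curator's decision (status names are assumed)."""
    status = "approved" if approved else "returned"
    conn.execute(
        "UPDATE submission SET status = ? WHERE id = ?",
        (status, submission_id),
    )
    conn.commit()


def pending(conn):
    """List submissions still awaiting quality control, oldest first."""
    rows = conn.execute(
        "SELECT id, title FROM submission WHERE status = 'pending' ORDER BY id"
    )
    return rows.fetchall()
```

Under these assumptions, only submissions marked as approved would go on to the MAGE–ML conversion step, while returned ones would go back to the submitter for correction.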
5 Accessing and mining the data
ArrayExpress data can be viewed through a dedicated query form (http://www.ebi.ac.uk/arrayexpress). All submission types can be queried by accession number. Type-specific query fields (for Experiment, Array and Protocol) allow case-insensitive searches on, for example, authors, experimental factor, experiment type and species. Results are displayed as short summaries containing links to the different objects. From there, the numerical data corresponding to gene expression levels are made available as a tab-separated file. These data can then be passed to ExpressionProfiler (http://www.ebi.ac.uk/microarray/ExpressionProfiler/ep.html), the EBI online analysis tool, for further analysis and visualization [6]. Finally, MAGE–ML documents can be downloaded as a compressed file directly from the result interface. Note that queries based on sequence or gene identifiers are not yet supported, and further work is needed to implement them. The task is complex: it requires the integration of a broad variety of resources from within the EBI and from other institutions, as well as the development of a dedicated data warehouse for ArrayExpress. The MAGE–ML formatted content of the ArrayExpress database is available on request from arraysubs@ebi.ac.uk.
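For readers who wish to process a downloaded tab-separated file locally, the following Python sketch shows one way to load it and to run a case-insensitive search over a list of records. The file layout (a header row, an identifier in the first column, one numeric column per hybridization), the field names and the example values are assumptions made for illustration; they do not describe the actual ArrayExpress export format or query implementation.

```python
import csv
import io


def read_expression_table(handle):
    """Parse a tab-separated table into {row_id: {column_name: float}}.

    Assumed layout (not specified by the text): a header row, an identifier
    in the first column, and one numeric column per hybridization.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        row_id, values = row[0], row[1:]
        table[row_id] = {
            name: float(value) for name, value in zip(header[1:], values)
        }
    return table


def search(records, field, text):
    """Case-insensitive substring match on one field of a list of dicts."""
    needle = text.lower()
    return [r for r in records if needle in str(r.get(field, "")).lower()]


# Invented example data, for illustration only.
EXAMPLE_TSV = "reporter\thyb1\thyb2\nR001\t1.5\t-0.25\nR002\t0.0\t2.0\n"
EXAMPLE_EXPERIMENTS = [
    {"accession": "X-1", "species": "Homo sapiens"},
    {"accession": "X-2", "species": "Mus musculus"},
]
```

A call such as `read_expression_table(io.StringIO(EXAMPLE_TSV))` returns a nested dictionary keyed by reporter identifier, and `search(EXAMPLE_EXPERIMENTS, "species", "HOMO")` matches the first record regardless of letter case.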
6 ArrayExpress future
Even though ArrayExpress is now fully functional, allowing submission, query and export to analysis tools, it is still under development and does not yet take advantage of the full power of the MAGE object model. Work is therefore ongoing to enhance the query capabilities, especially those related to genes and reporters, which should enable cross-platform comparison and assessment of reporter reliability. Queries based on ontology annotations are also scheduled to be added to the query functionality. To achieve microarray data exchange, interconnection with other microarray databases, such as the Gene Expression Omnibus (GEO) at the NCBI [7] and the CIBEX project at the DDBJ, has to be implemented. This requires devising a MAGE–ML export function. A variant of that export function could be used to transfer MAGE–ML files to private databases for local assessment. In addition to these software-related efforts, we are actively working with different centers and consortia to generate high-quality MIAME-compliant data. Examples include the International Genomics Consortium (IGC) [8], which intends to profile thousands of tumor samples and deposit the data in ArrayExpress, and the ILSI Toxicogenomics projects.
Acknowledgements
The ArrayExpress project is funded by EMBL, by a grant from the European Commission (TEMBLOR), and by a Toxicogenomics database grant from ILSI. Initial funding was provided by Incyte, and we particularly thank Lee Grower.