1 Introduction
Genetic variation is a fundamental reason why humans differ from one another and why some individuals are more susceptible than others to diseases such as cancer, and there is growing interest in investigating the mechanisms and phenotypic consequences of genetic variation. A sharp decrease in the cost of DNA sequencing has enabled the sequencing of numerous human genomes and their mining using “Big Data” analytical approaches for unraveling molecular disease processes [1]. One example of such studies is the ongoing Pan-Cancer Analysis of Whole Genomes (PCAWG) project (https://dcc.icgc.org/pcawg), a forerunner project in Big Data analytics of patient genomes, which aims to integrate data from somatic and germline whole genomes, DNA methylomes, transcriptomes, and clinical data from more than 2800 cancer patients, amounting to nearly a petabyte of sequencing data [1]. The objective of PCAWG is to unravel commonalities and distinguishing factors between cancer types and subtypes at the molecular level, to facilitate the molecular classification of malignancies with impact on diagnostics and treatment, and to uncover causalities linking genotype, environment, and phenotype. The resource developed through PCAWG will enable standardized analysis of cancer genomes and associated datasets to obtain insights into molecular disease processes relevant to cancer.
The objective of this paper is to briefly review methodologies for analyzing disease processes in cancer genomes, with a specific focus on complex DNA rearrangements. Additionally, we provide an outlook on upcoming Big Data efforts – with one example being the PCAWG project – which will facilitate the understanding of basic processes as well as translational research (Box 1).
2 Cancer genomes can evolve through catastrophes: the molecular process of chromothripsis
Cancer genome sequencing has enabled new insights into how tumors evolve, and has led to the remarkable finding that cancer is not merely driven by stepwise alterations but can arise in conjunction with bursts of mutational events [2]. One particularly remarkable example of this is chromothripsis, a molecular process first described by Stephens et al. in 2011 based on cancer genome analysis, which can scar individual chromosome arms, or one or several whole chromosomes, when localized chromosome shattering and repair occur in a one-off catastrophe [3]. Rearrangement patterns associated with chromothripsis occur in approximately 2–3% of human cancers [3], and TP53 germline mutations are linked with the occurrence of chromothripsis in pediatric medulloblastoma [4]. Chromothripsis is also abundant in other cancers, such as bladder cancer [5], breast cancer [6], melanoma [7], and bone cancers [3]. While the prevalence of chromothripsis in diverse cancer cell lines and cancer genomes [3,4,8] suggests a crucial role of chromothripsis in cancer development, the reproducible inference of this process has remained challenging, because it requires that cataclysmic one-off rearrangements be distinguished from localized genetic lesions that occur in a stepwise fashion. We recently devised a set of conceptual criteria for the inference of complex DNA rearrangements suitable for rigorous statistical analyses, which included previously established [3] as well as novel criteria: clustering of breakpoints, regularity of oscillating copy-number states, interspersed loss and retention of heterozygosity, prevalence of rearrangements affecting a specific haplotype, randomness of DNA segment order and fragment joins, as well as the ability to walk the derivative chromosome [8]. These criteria attempt to reject the alternative hypothesis that DNA rearrangements have occurred in a stepwise (progressive) fashion.
Further refinement of these criteria allows chromothripsis events to be inferred in conjunction with additional stepwise patterns of DNA alteration [9], and collectively, these criteria have begun to be used regularly to operationally define chromothripsis events based on cancer genome sequencing data.
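Two of the criteria listed above, clustering of breakpoints and regularity of oscillating copy-number states, lend themselves to a simple numerical illustration. The following is a minimal sketch and not the published procedure: it assumes that breakpoints scattered independently along a chromosome produce roughly exponentially distributed inter-breakpoint distances, and it reports only a Kolmogorov-Smirnov distance from that expectation (no p-value, since the rate is estimated from the same data). All function names and the input layout (a list of breakpoint coordinates and a list of per-segment copy-number states) are hypothetical.

```python
import math


def inter_breakpoint_distances(positions):
    """Return distances between consecutive breakpoints after sorting."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def ks_statistic_exponential(distances):
    """Kolmogorov-Smirnov distance between the empirical distribution of
    `distances` and an exponential distribution with the same mean.

    Assumption: larger values indicate a stronger departure from
    independently scattered breakpoints (for example, tight clustering).
    """
    n = len(distances)
    if n == 0:
        raise ValueError("at least two breakpoints are required")
    mean = sum(distances) / n
    if mean <= 0:
        raise ValueError("distances must have a positive mean")
    d_max = 0.0
    for i, d in enumerate(sorted(distances)):
        cdf = 1.0 - math.exp(-d / mean)
        # Compare the model CDF with the empirical CDF just before
        # and just after the step at this observation.
        d_max = max(d_max, abs((i + 1) / n - cdf), abs(cdf - i / n))
    return d_max


def copy_number_oscillation(states):
    """Return (number of distinct copy-number states, number of switches
    between adjacent segments) along a chromosome."""
    switches = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return len(set(states)), switches


# Hypothetical toy input: clustered breakpoints and two alternating states.
breakpoints = [1_000, 1_200, 1_250, 1_900, 2_050, 9_000_000]
print(ks_statistic_exponential(inter_breakpoint_distances(breakpoints)))
print(copy_number_oscillation([2, 1, 2, 1, 2]))
```

Many switches between only two or three copy-number states, together with a large departure from the exponential expectation, would be consistent with the pattern described above, whereas stepwise rearrangements would be expected to accumulate a larger number of distinct states. Thresholds for either quantity are deliberately left out, because the text does not specify them.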
3 Methods for mechanistic dissection of catastrophic DNA rearrangement processes
Although cellular catastrophes occurring during key stages of the cell cycle were proposed to play roles in initiating catastrophic DNA rearrangements [10], the mechanistic basis of chromothripsis had not been studied until recently, largely due to a lack of suitable cell-based model systems. To investigate the underlying mechanism(s) of chromothripsis events, recent studies have described cell-based models for characterizing chromothripsis, which include a methodology termed CAST (Complex Alterations After Selection And Transformation) developed by our group [11]. CAST (Fig. 1) is based on an untransformed model cell line (hTERT-RPE-1), the application of genetic or chemical perturbations, the selection of DNA alterations conferring a growth advantage by soft agar colony formation, and screening for extensive copy-number alterations using low-coverage DNA sequencing, followed by in-depth characterization of complex rearrangements by long-range paired-end sequencing [11].
Using this in vitro system, we were able to reproducibly generate chromothripsis events in RPE-1 cells upon global DNA damage by doxorubicin. In addition, we performed in-depth characterization of cell lines harboring chromothripsis alterations that were generated by CAST, and we applied the above-mentioned criteria to obtain additional insights into the process [11]. By doing so, we were able to demonstrate that telomeric stress can initiate chromothripsis events in vitro (when using an siRNA against the shelterin complex component TRF2), consistent with previous genomic analyses suggesting that chromothripsis can be initiated from dicentric chromosomes and breakage-fusion-bridge cycles [9]. The link between telomere crisis and complex DNA rearrangements was further substantiated by recent reports providing compelling evidence for telomeric stress being an initiating factor for complex rearrangements both in vitro, mediated by the TREX-1 endonuclease following telomere crisis [12], and in vivo in medulloblastomas, ependymomas, glioblastomas, and chronic lymphocytic leukemias [13]. Another important implication derived from this in vitro system was the association of chromothripsis with an increase in ploidy, either in the form of tetraploidy (four chromosome sets) or hyperploidy (presence of extra chromosomes in the form of incomplete sets). This is an interesting finding considering previous studies suggesting an increase in genomic instability in response to tetraploidy [14]. In the context of chromothripsis, by analyzing primary SHH-medulloblastoma tumor genomes we were able to uncover that an increase in ploidy also predisposes to complex genomic rearrangements in vivo. In these tumors, we found that hyperploidy is not only strongly associated with chromothripsis, but is also plausibly the initiating factor for complex SRs [11].
Another attractive model for chromothripsis is based on the presumption that chromosomes confined in micronuclei can undergo a catastrophic shattering process in a one-off event [15], a model which recently was significantly substantiated through single-cell sequencing data providing compelling evidence for chromothripsis [16]. It is important to note in this regard that there is also a plausible connection between telomere crises and micronuclei formation, whereby upon telomere loss sister chromatid fusion events may mediate the formation of dicentric chromosomes, which in turn may result in micronuclei formation as a consequence of chromosome segregation defects [11].
In summary, due to recent advances in chromothripsis research, we are now beginning to understand the cellular processes and factors that might be involved in instigating such dramatic rearrangements. However, it still remains to be seen to what extent a misregulation of these factors is reflected in real tumors and which cellular process actually operates in vivo during tumorigenesis generating chromothripsis events.
4 Big data: opportunity for genomics and cancer research
Investigations of molecular disease processes such as chromothripsis events can be significantly leveraged by access to large datasets that facilitate correlative studies to pursue hypothesis- and data-driven research. Thanks to improvements in sequencing technology, the volume of genomic data submitted to public archives is now well into the multipetabyte range (1 petabyte is 10^15 bytes) [1], and is beginning to match scientific dataset sizes that we previously only knew from physics. A large portion of the genome sequencing data recently generated is from human patients, with cancer genomics playing a forerunner role thanks to recent large-scale funding initiatives (including consortia projects by the International Cancer Genome Consortium [ICGC] for example; see www.icgc.org). Within the ICGC, groups from 17 countries have amassed a data set in excess of two petabytes – equivalent to the capacity of roughly 20,000 smartphones – in just 5 years [1] (Box 2).
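The scale of these figures can be made concrete with a short back-of-the-envelope calculation. The sketch below checks the smartphone comparison from the text and estimates how long it would take to copy the ICGC data set over a network link. The link speed of 1 Gbit/s is an assumption chosen purely for illustration, and the function name is hypothetical.

```python
PETABYTE = 10**15  # bytes, as defined in the text
GIGABYTE = 10**9   # bytes

icgc_bytes = 2 * PETABYTE  # "in excess of two petabytes"
smartphones = 20_000       # comparison used in the text

# Implied storage per smartphone, in gigabytes.
per_phone_gb = icgc_bytes / smartphones / GIGABYTE


def transfer_days(n_bytes, gigabits_per_second):
    """Days needed to move n_bytes over a link of the given sustained speed.

    Assumption: the link runs at full speed without interruption,
    which real transfers rarely achieve.
    """
    bytes_per_second = gigabits_per_second * 1e9 / 8
    return n_bytes / bytes_per_second / 86_400


print(per_phone_gb)                  # gigabytes per smartphone
print(transfer_days(icgc_bytes, 1))  # days at an assumed 1 Gbit/s
```

Under these assumptions, each smartphone in the comparison holds 100 GB, and a single copy of the data set would occupy a 1 Gbit/s link for roughly half a year, which illustrates why moving the analysis to the data can be more practical than moving the data to each analyst.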
Analysis of this treasure trove of data has the potential to enable a range of analyses of relevance to systems biology, computational biology, and biomedicine, which so far could not be undertaken. Cancer genome analyses benefit from large sample sizes, since common cancers are typically classified into numerous subtypes which characteristically show different patterns of mutation. In order to learn about cancer biology and treatment options for patients, and also to uncover relationships between molecular data and clinical data, large sets of patient genomes as well as associated datasets (e.g. epigenomes and transcriptomes) are needed. Furthermore, since major initiatives such as the Cancer Genome Atlas (TCGA) project based in the USA used mainly exome sequencing (i.e., sequencing of the 1–2% of the human genome that comprises protein-coding genes), much less is currently known about mutations in non-coding regions that contain most gene regulatory information – in spite of initial success stories showing that these regions can indeed be very relevant to cancer [17–19]. Additionally, since previous cancer studies focused largely on uncovering somatic DNA variation, relatively little is known about how germline DNA variants may influence somatic mutations, although ∼10% of all cancers likely have a hereditary cause [20].
There is by now wide agreement in the human genomics research community that the challenges of accessing Big Data sets are limiting scientists’ ability to do research, and especially to replicate and build on previous work. One main objective of a novel major international initiative, the PCAWG study, is to identify common patterns of mutation in more than 2800 cancer whole genomes from the International Cancer Genome Consortium. PCAWG is exploring the nature and consequences of somatic and germline variations in both coding and non-coding regions, with particular emphasis on cis-regulatory sites, non-coding RNAs, and large-scale structural alterations [1].
5 Challenges of Big Data analyses
A key determinant for, and perhaps one of the biggest benefits of, cloud computing is the rapid scalability of computational analyses (also often referred to as elasticity) – i.e. the ability to scale up vastly when need increases or down if resources are not being used. Several researchers can work in parallel, sharing their data and methods with ease by performing their analyses within cloud-based virtual computers that can be accessed from desktop computers. Thus, the analysis of a big genome data set that might previously have taken months can be executed in days or weeks. Especially in public/commercial clouds, the computing capabilities available for provisioning often appear to be unlimited (e.g. for workloads with exceptionally high peak-time requirements), which enables applications in the life sciences that may not be feasible or practical at all within a single or a few localized data centers. We have recently developed recommendations for the development of national clouds [21]. In our view, an optimal model for the European life science research community is what is termed a hybrid cloud, which combines academic localized data centers, provisioning cloud computing to the community and keeping some of the most highly valuable life science datasets locally, with public/commercial cloud computing (e.g. using European commercial partners through the Helix Nebula public-private partnership for cloud computing; www.helix-nebula.eu), providing vast scalability on demand to enable new research applications. Another key advantage of a hybrid cloud is that such a cloud model does not require centralized planning, and could (as long as it is based on agreed standards and frameworks) be built around different funding sources, governance structures and organizational models, to facilitate the (data-heavy) future of collaborative basic and translational research in Europe.
6 Conclusion and future
The field of genomics is rapidly changing with the accessibility of new technologies that make human genome sequencing a regular tool employed in molecular biology and genetics laboratories, in biomedical research, and in the clinic. The data emerging from these efforts will be of tremendous value not only for human health, but also for molecular research to uncover the basic underpinnings of disease processes such as chromothripsis and to make advances in translational cancer research. Projects such as PCAWG will reveal some of the challenges of working with such sets of “Big Data”. More work will be needed to realize collaborative processing of such datasets in different areas of the life sciences, and we propose a hybrid cloud model of collaborative and shared data processing – e.g. to compare crucial data across projects/cohorts to advance the life sciences.
Disclosure of interest
The authors declare that they have no competing interest.
Acknowledgements
We thank the technical working group of the Pan-Cancer Analysis of Whole Genomes (PCAWG) initiative, especially Lincoln Stein, Christina Yung, and Brian O’Connor, for their crucial efforts in compiling the PCAWG donor samples (summarized in Table 1), and thank the members of the Steering committee of PCAWG for facilitating this international initiative.
Table 1. Data types and data set size of the Pan-Cancer Analysis of Whole Genomes (PCAWG) project (before filtering low-quality samples).
Data types | Number of samples (number of repositories holding at least 50%) |
Whole genome data | 2834 (2) |
RNA-Seq | 1367 (1) |
DNA Methyl-Seq | 467 (1) |