1000 Genomes Project reaches new frontiers in human genetics

27 Oct 2010

The 1000 Genomes Project, a major international collaboration to build a more detailed map of human genetic variation and genetic association with diseases, has completed its pilot phase.

The original human genome project only gave relatively crude detail of the human genome and no indication of the variation between humans.

Since then, technological advances have reduced the cost and resources needed to map a genome many-fold, from over $1bn to just tens of thousands of dollars, enabling many research teams to study genetic associations with diseases. Over 1000 regions on the genome have now been associated with traits such as disease susceptibility, response to medication or physical characteristics.

However, recent research has highlighted important gaps in the databases that contain all this genetic information. To fill the gaps, the 1000 Genomes Project has undertaken a thorough and systematic investigation of genetic variation between individuals and populations.

The results of the pilot phase are now published in the journal Nature [1] and freely available through the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and the US National Center for Biotechnology Information (NCBI).

Launched in 2008, the Project first conducted three pilot studies led by Paul Flicek, to determine the best strategy for characterising more than 95% of the genetic variants that can be found in 1% or more of three different geographic population groups (Europeans, East Asians and West Africans).

Disease researchers will use the catalog, which is being developed over the next two years, to study the contribution of genetic variation to illness. In addition to distributing the results on the Project’s own web sites, the pilot data set is available via the Amazon Web Services (AWS) computing cloud to enable anyone to access this unprecedentedly large data set, even if they do not have the capacity to download it locally.

A previous public project, the International HapMap Project, provided an initial database of over 3 million human DNA variants present in 270 DNA samples. Information and methods developed by the HapMap Project fuelled a first generation of so-called “Genome Wide Association Studies” (abbreviated GWAS) that have localized over 600 novel genetic risk factors for common diseases such as diabetes, heart attack, inflammatory bowel disease, breast cancer, schizophrenia, and other disorders. These studies were limited by technology, however, to studying a subset of more common DNA variants (those with frequency greater than 5-10%).

The 1000 Genomes Project exploits next-generation DNA sequencing technologies to develop a much more complete database, one that goes much lower in frequency and is extended to more human populations. This database will contain all forms of variation: single letter changes (termed SNPs), small insertions and deletions (termed “indels”) and large changes in the structure and copy number of chromosomes (termed “copy number variations”). This integrated map is a novel contribution, as previous studies have focused exclusively on one form of DNA variation (even though each of our genomes contains all of these kinds of variation).

“The increased resolution of the 1000 Genomes map will provide researchers with far more detailed sequence information beyond common variants, including millions of less-common and rare variants”, said Elaine Mardis, PhD, co-director of the Washington University Genome Center and member of the project steering committee. “Researchers who have found regions of the genome associated with disease will be able to look at this data to see an almost complete set of genetic variants in those regions that might contribute directly to disease.”

The project partners, working in nine different centres, plan to sequence the genomes of more than 2500 people from five large population groups by the project’s completion in 2012.

Considering that one person’s genome contains around 3 billion DNA base pairs, that’s a lot of data. In this pilot phase alone, a total of 4.9 terabases of DNA sequence were generated (1 terabase is 1000 gigabases, about the size of 300 human genomes).

"The amount of information delivered by this first stage of the project is remarkable," said Richard Durbin of the Sanger Institute in the UK. "In less than two years, we identified 15 million single-letter changes, 1 million small deletions or insertions and 20,000 larger variants. The majority of these variants — around 8 million — had never been seen before. This is the largest catalogue of its kind, and having it in the public domain will help maximise the efficiency of human genetics research."

Collecting, storing and analysing this data would be impossible without highly sophisticated computing resources — specialised software and huge amounts of processing power and data storage capacity.

Thanks to innovations in DNA sequencing technology, genomic data is being generated at rates previously unimaginable to life scientists. This poses significant challenges not only for storing and moving the information among different partners, but also for its analysis. The EBI group developed a robust new computing platform and several software innovations that made this pioneering project possible, and will also pave the way for other sequencing projects on an even larger scale.

"Having a systematic catalogue of human variation changes the way we can study human genetics, much in the same way as having a catalogue of human genes did," said Dr Flicek. "Among other things, it also gives us a platform for analysing the connections between genes and an individual’s disease risks." The results of the collaboration extend well beyond the scope of the 1000 Genomes Project, he said, and represent the beginning of a new era in human genetics using genome-wide sequencing.

"This work shows the power of very recent advances in sequencing to generate maps of genetic variation that bridge different scales," added Jan Korbel from EMBL in Heidelberg, Germany, who helped analyse the larger variants. "It’s an exciting first step, which paves the way for looking at the relationship between genetic variations and diseases like cancer."

Uses of the project

The uses of Project data will be many. All of the variants described in the pilot study can now be tested for their association with any given disease or trait (e.g. susceptibility to addictive behaviour such as smoking). Indeed, the data are already being used to inform a number of medical studies. The results of the pilot study offer a much deeper, more uniform picture of human genetic variation than was previously available, and offer new insights into functional variation, genetic association and natural selection in humans.

One clear use is to track down the causal mutations underlying initial localizations from GWAS. A second is making it possible to test less common DNA variants for contributions to disease. And a third is to help identify rare mutations that cause strongly inherited diseases: in studies aiming to find such rare mutations, it is very helpful to have a complete database of common variants that can be screened out to focus attention on those mutations that are unique to an individual or family.

But before such uses could be realized, many technical and analytical challenges had to be overcome. These were the focus of the pilot projects.

Pilot projects — testing essential aspects of project feasibility

The first pilot project involved sequencing the genomes of six people (two nuclear families each with two parents and a daughter) at high coverage. Each sample was sequenced an average of 20-60 times, and using a variety of sequencing technologies. Previous “personal genomes” were each based on only a single sequencing method, and thus were limited to what that method could detect.

By using multiple methods, the Project has uncovered not only a more complete picture of DNA variation in these individuals, but also learned about the strengths and limitations of each of the current technologies. These data also served as a comparison group for the genome sequences analyzed in the other pilot projects.

The six genomes were sequenced by academic centers in China, Germany, the UK, and the US, as well as by three companies, using platforms from the companies: 454 Life Sciences, a Roche company; Applied Biosystems, an Applera Corp. business; and Illumina Inc. All of the platforms were able to sequence 85-90 percent of a genome and produce high-quality data.

The second pilot project sequenced the genomes of 179 people at low coverage — an average of three passes of the genome. Although sequencing costs are dropping, it is still very expensive to sequence the genomes of hundreds of people deeply enough to find all of the genetic variants in each genome accurately.

An alternative approach is to sequence many genomes at light coverage, and then combine the data from many people to discover genetic variants that they share. The results of the pilot project confirmed that this strategy is effective and will allow the project to meet its goal of discovering sequence variants that are shared with other people.
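The reasoning behind the low-coverage strategy can be illustrated with a rough calculation. The sketch below is not the Project's analysis method; it is a deliberately simplified model that assumes reads at a site follow a Poisson distribution, that reads fall evenly on a person's two chromosome copies, that there are no sequencing errors, and that a variant counts as "discovered" once at least two reads in the pooled sample carry it. The function names and the two-read threshold are illustrative assumptions.

```python
from math import comb, exp, factorial


def poisson_tail(mean, min_count):
    """Probability that a Poisson(mean) variable is at least min_count."""
    below = sum(exp(-mean) * mean**i / factorial(i) for i in range(min_count))
    return 1.0 - below


def detection_probability(n_people, allele_freq, mean_depth, min_reads=2):
    """Chance that a variant of a given frequency is seen in pooled reads.

    Assumed model (hypothetical, for illustration only):
    - each person carries two chromosome copies, each bearing the variant
      independently with probability allele_freq;
    - each chromosome copy receives Poisson(mean_depth / 2) reads;
    - the variant is discovered if min_reads or more reads carry it
      anywhere in the pooled sample.
    """
    chromosomes = 2 * n_people
    total = 0.0
    for k in range(chromosomes + 1):
        # Probability that exactly k chromosome copies carry the variant.
        p_k = (comb(chromosomes, k)
               * allele_freq**k
               * (1 - allele_freq)**(chromosomes - k))
        # Reads carrying the variant, summed over those k copies.
        total += p_k * poisson_tail(k * mean_depth / 2, min_reads)
    return total


if __name__ == "__main__":
    # One person versus 179 people, both at an average of three passes,
    # for a variant present on 1% of chromosomes.
    print("1 person:  ", detection_probability(1, 0.01, 3.0))
    print("179 people:", detection_probability(179, 0.01, 3.0))
```

Under these assumptions, a variant carried on 1% of chromosomes is very unlikely to be found by sequencing one person at three passes, but is found most of the time when reads from 179 people are combined, which is the intuition behind pooling lightly sequenced genomes.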

The third pilot project involved sequencing the coding regions, called exons, of 1,000 genes in about 700 people to explore how best to obtain a detailed catalog in the approximately 2% of the genome that is composed of protein-coding genes.

This pilot project provided an unprecedented sample size from which to learn about the patterns of rare variation in the human population.

Data analysis and access — the first major release of biomedical data on the Amazon Web Services Cloud.

The amount of data produced by the 1000 Genomes Project is unprecedented in biomedical research. Currently, the total size of the datasets is over 50 terabytes, or 50,000 gigabytes. That corresponds to almost eight trillion DNA base pairs, or eight terabases, of sequence data. Early in the project, merely copying the vast quantities of data between the European Bioinformatics Institute (EBI) in the U.K. and the National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine, consumed large fractions of both groups' capacity on the Internet for several days.

Researchers can freely access the 1000 Genomes Project pilot data through the 1000 Genomes website, www.1000genomes.org.

Researchers can download the data from NCBI at:

For many researchers and institutions, especially those who lack the computer and analytical power to study such a massive data set, an economical option is being tested to access and analyze the pilot data.

The pilot datasets of the 1000 Genomes Project (7.3TB of data) are available as a public dataset through Amazon Web Services (AWS) and integrated into the company’s Elastic Compute Cloud (Amazon EC2) and Simple Storage Service (S3). As new data become available and usage of these data increases on AWS, it is anticipated that additional data sets will be made available in AWS.

The cost to researchers for computing through Amazon EC2 can be counted in tens of dollars per day, compared to the hundreds of thousands of dollars it would cost to purchase the computer infrastructure needed to download and analyze this amount of data locally. Because 1000 Genomes Project data are publicly available from EBI and NCBI, other companies that provide similar computing services are also free to download and provide the data to their clients.

Reference

[1] The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Published online in Nature on 28 October 2010. DOI: 10.1038/nature09534.