The Next Generation of DNA Sequencing
In 2001, an international team of researchers led by the NIH announced the completion of the Human Genome Project. Sequencing all 3.2 billion base pairs of our genetic code had cost roughly three billion dollars and taken almost a decade. Now, just fourteen years later, we can sequence a human genome in four days, at a cost of about twelve thousand dollars.
This rapid advance was fueled by the development of a single disruptive technology called massively parallel DNA sequencing. A company called Illumina (San Diego, CA) has emerged as the market leader. Their approach first breaks up the long DNA molecules into short fragments (a process called fragmentation), and then loads them to be sequenced simultaneously on a high-density microscopic array called a flowcell. Going from tissue sample to DNA sequence essentially involves four steps:
- Isolation of DNA from a tissue sample. This is usually a blood sample, but could also be a small piece of a tumor, a skin punch, or other tissues of interest.
- Creating a sequencing library. The long molecules are sheared into fragments of a few hundred base pairs by a sonicator, an instrument that uses high-frequency sound waves to break DNA. At each end, we attach (by ligation) a sequencing adapter that allows DNA polymerase to attach and do its thing.
- Loading the library onto the flowcell. The flowcell contains millions of tiny wells, each of which will host a single DNA fragment. Usually, that fragment is duplicated (by PCR) in the well so that there are many identical copies to boost the “signal strength” in the next step.
- Sequencing by DNA synthesis. A complementary DNA strand for each fragment is synthesized one base at a time, in repeated reactions called cycles. Each cycle adds one base to the growing DNA strand. A very sensitive (and expensive) camera records which base (one of the nucleotides A, C, G, or T) was incorporated.
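To make the last step concrete, here is a toy sketch of sequencing by synthesis in Python. It is a simplification under stated assumptions, not a model of the real chemistry or of any instrument's software: the function names (synthesize, paired_end_reads) and the ten-base example fragment are invented for illustration, and the "camera" is just a list that records each incorporated base.

```python
# Toy model of sequencing by synthesis (illustrative only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def synthesize(template, cycles):
    """Build a complementary strand along `template`, one base per cycle.

    The template is given 5' to 3', so the polymerase walks it from the
    far (3') end. Each cycle incorporates one complementary base, and the
    recorded bases form the read.
    """
    read = []
    for base in reversed(template):
        if len(read) == cycles:
            break
        read.append(COMPLEMENT[base])  # the "camera" records this base
    return "".join(read)


def paired_end_reads(fragment, cycles):
    """Read `cycles` bases from each end of a double-stranded fragment."""
    bottom_strand = reverse_complement(fragment)
    read1 = synthesize(bottom_strand, cycles)  # reads into the fragment from one end
    read2 = synthesize(fragment, cycles)       # reads in from the other end
    return read1, read2


if __name__ == "__main__":
    print(paired_end_reads("AACCGGTTAC", 4))
```

With a real library the fragment would be a few hundred bases long and cycles would be one hundred fifty, so the two reads cover the two ends of the fragment and leave the middle unread.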
After one hundred fifty cycles, we have “read” one hundred fifty bases from each end of the DNA fragment. This may not seem like much, given the size of a human genome, but each flowcell produces millions of these short reads for a given sample. On average, we sequence each base in the genome about thirty times, from thirty different unique fragments.
It might seem counter-intuitive that we can sequence one genome so many times in a single experiment. True, almost every cell in the body has just two copies of each chromosome (one from mom, and one from dad), but the DNA sample we use comes from a tissue sample that contains thousands of cells. All of those DNA molecules are randomly fragmented when we create a library (step 2 above), which means that any position in the genome is represented on numerous different fragments.
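The coverage figure is simple arithmetic, and a short Python sketch shows how it works out. The numbers come from the text above (a 3.2 billion base genome, one hundred fifty bases read from each end of a fragment, thirty-fold coverage). The function names are made up for this example, and the calculation is idealized: it assumes every read is usable and that reads are spread evenly across the genome, which real experiments only approximate.

```python
# Back-of-the-envelope coverage arithmetic (idealized).
GENOME_SIZE = 3_200_000_000   # base pairs in the human genome
READ_LENGTH = 150             # bases read per end (one per cycle)
READS_PER_FRAGMENT = 2        # one read from each end
TARGET_COVERAGE = 30          # average number of times each base is sequenced


def fragments_needed(genome_size, read_length, reads_per_fragment, coverage):
    """Number of fragments to sequence to reach the target average coverage."""
    bases_needed = genome_size * coverage
    bases_per_fragment = read_length * reads_per_fragment
    # Integer ceiling division, so we never fall short of the target.
    return (bases_needed + bases_per_fragment - 1) // bases_per_fragment


def mean_coverage(n_fragments, genome_size, read_length, reads_per_fragment):
    """Average number of reads covering each base of the genome."""
    return n_fragments * read_length * reads_per_fragment / genome_size


if __name__ == "__main__":
    n = fragments_needed(GENOME_SIZE, READ_LENGTH, READS_PER_FRAGMENT, TARGET_COVERAGE)
    print(f"{n:,} fragments")
    print(mean_coverage(n, GENOME_SIZE, READ_LENGTH, READS_PER_FRAGMENT))
```

Under these assumptions, thirty-fold coverage takes about 320 million fragments, each read from both ends.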
The Computational Challenge
Once all of the short sequencing reads are generated, the lab work is essentially complete. Next, we turn to computers and software to help us assemble and make sense of the genetic data. The first step is to identify, for each sequencing read, the region of the genome from which it came. Here, we benefit from knowing the sequence of the human genome already. Rather than trying to reconstruct the entire genome sequence from scratch, we can use the short DNA sequence like a search query. Once we know where it came from, we line up the read sequence to the reference and compare them one base at a time.
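The "search query" idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not how production alignment software works: it indexes every short word (k-mer) of a made-up reference string, looks up the first k bases of the read to find candidate locations, and then compares read and reference one base at a time. The names build_index and align_read, the tiny reference, and the choice of k = 4 are all assumptions for the example. Real aligners also handle insertions, deletions, and reads from the opposite strand.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-base word in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k):
    """Place a read on the reference and list the bases that differ.

    Returns (position, mismatches) for the candidate with the fewest
    mismatches, or None if the read's first k bases are not in the index.
    Each mismatch is (reference_position, reference_base, read_base).
    """
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = [
            (pos + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if best is None or len(mismatches) < len(best[1]):
            best = (pos, mismatches)
    return best


if __name__ == "__main__":
    reference = "TTGACCGATTACAGGCTAACGT"   # made-up reference sequence
    index = build_index(reference, k=4)
    read = "GATTACATGC"                    # matches position 6 with one difference
    print(align_read(read, reference, index, k=4))
```

In this example the read lands at position 6 of the reference and differs at a single base, which is exactly the kind of difference the next step has to interpret.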
Typically, 99.9% of the sequenced bases will match the known reference. The other 0.1% might represent either sequencing errors, or genetic variants in the individual’s genome. Each of us has around three million such variations, which are most commonly a single base substitution, but can also be insertions or deletions of bases or even large-scale structural rearrangements.
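One way to picture how errors are separated from true variants is to stack up all the reads covering a position and count the bases they report. A random sequencing error tends to show up in one read out of thirty, while an inherited variant shows up in about half of them (one chromosome copy) or nearly all of them (both copies). The Python sketch below illustrates that idea only. The function name call_site and both thresholds (a minimum depth of ten reads, a minimum non-reference fraction of twenty percent) are arbitrary assumptions for this example, not values from any real variant caller.

```python
from collections import Counter


def call_site(ref_base, observed_bases, min_depth=10, min_alt_fraction=0.2):
    """Classify one genome position from the bases seen in overlapping reads.

    The thresholds are illustrative assumptions, not recommended settings.
    """
    depth = len(observed_bases)
    if depth < min_depth:
        return "no call"  # too few reads to say anything

    counts = Counter(observed_bases)
    alt_counts = {base: n for base, n in counts.items() if base != ref_base}
    if not alt_counts:
        return "reference"

    # Take the most common non-reference base as the candidate variant.
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    if alt_n / depth >= min_alt_fraction:
        return f"variant {ref_base}>{alt_base}"
    return "reference"  # a rare mismatch is treated as a sequencing error


if __name__ == "__main__":
    print(call_site("A", "A" * 29 + "G"))       # one stray G in thirty reads
    print(call_site("A", "A" * 15 + "G" * 15))  # half the reads show G
    print(call_site("A", "AAG"))                # only three reads
```

This is also why thirty-fold coverage matters: with a single read per position, an error and a variant would look identical.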
Applications of Next-Generation DNA Sequencing
This “next-generation” sequencing technology makes it possible to sequence entire genomes quickly and at a reasonable cost. Rapid, inexpensive genome sequencing provides many avenues of important research. For example, sequencing can be used:
- In cancer treatment, to compare the genomes of a patient and his or her tumor. This not only reveals the mutations that caused the disease, but may identify possible drug targets for personalized cancer therapy.
- In genetic research, to uncover the genetic architecture of inherited diseases and possibly find ways to treat them.
- In agriculture, to identify the genetic variation underlying favorable traits like drought resistance, pest resistance, and better yields.
- In forensics, to rapidly identify human/animal remains, match DNA from crime scenes to suspects, etc.
- In archaeology, to learn about human history, migration, and speciation from the clues left in ancient DNA samples.
These are just a few of the potential applications for the current state of next-generation DNA sequencing technology. But the evolution of that technology is still under way.
Next-Next Generation Sequencing Technologies
One of my favorite emerging technologies for DNA sequencing is made by a company in the United Kingdom called Oxford Nanopore Technologies. Their technology relies on feeding a single molecule of DNA through a very tiny hole (a nanopore) and inferring the sequence from fluctuations in the electrical current flowing through it. They’ve developed a prototype instrument called the MinION that’s about the size of a thumb drive and plugs into the USB port of your computer. I’ve got to get me one of those, mostly so I can walk around with a DNA sequencer in my pocket.
Another company called Pacific Biosciences has developed a technique to sequence single DNA molecules several thousand base pairs at a time (in contrast to the one hundred fifty base pairs per read of the approach described above). This is advantageous in certain regions of the human genome that contain highly variable sequences, such as the “human leukocyte antigen” (HLA) region on chromosome six, which is important for matching organ donors to recipients. Very long reads also help us improve the accuracy of the human genome reference in “repetitive” regions of the genome that are difficult to sequence with short read technologies.
One of the major near-term goals for DNA sequencing is to get it into the clinic, where genetic information could be used to improve patient diagnosis, prognosis, and treatment. There are a number of practical and ethical hurdles that must be overcome to do this. First, we need to establish that next-generation sequencing can provide consistent, accurate results on par with current genetic tests. Given the random processes on which the technology relies, this may be one of the most difficult tests to pass.
There are ethical concerns, too. Because of the discovery power of sequencing, it’s also important to establish guidelines about the results that may be returned to a patient after sequencing. Genome sequencing may uncover predisposition to late-onset diseases (like Huntington’s disease) that are unrelated to the reason a patient is in the hospital. Routine DNA sequencing of multiple family members may also reveal unexpected results, such as “non-paternity events” (you can guess what those are) or two parents who turn out to be second cousins. Whether or not to return such “secondary findings” is an area of contentious debate among researchers and clinicians.
The Future of DNA Sequencing
It’s impossible to look at a technology like DNA sequencing and not speculate about the possibilities for the near or distant future. In the short term, I think that DNA sequencing will become a routine part of healthcare for many people, especially those affected by genetic diseases. As our knowledge of the human genome grows, this information will have more and more predictive power, too. In other words, based on an individual’s genome sequence along with clinical and environmental information, it should become possible to estimate his or her lifetime risk for various diseases like cancer, heart disease, diabetes, and Alzheimer’s disease. These risks will probably be reported in terms of probabilities: twenty-one percent chance of stroke, fifteen percent chance of prostate cancer, etc. It might resemble that baby-is-born scene from Gattaca, though I doubt we’ll be able to deliver the information in seconds, or predict the manner of a person’s death. On the bright side, legislation like the Genetic Information Nondiscrimination Act (GINA) in the United States will hopefully prevent employers and other organizations from discriminating against people based on their genetic makeup (as they do in the movie).
In many cases, these assessments will only be informational in nature. Even if lifestyle changes could reduce the risk of a certain disease, I wonder whether that will be enough motivation for many people to do them. Just look at how many people use tobacco products in spite of their scientifically proven links to cancer, heart disease, and other problems.
What if we could instead use genetic information to correct “defects” that are likely to have a negative impact on an individual’s health? For example, there are more than three thousand genetic disorders for which the responsible gene is known. In theory, if we knew the mutations that gave individuals a severe disorder, correcting that genetic defect might be the only way to prevent the disease. Numerous “gene therapy” clinical trials are under way to accomplish this by, say, using an engineered virus to deliver a fully functioning copy of a gene to individuals who are sick because they don’t have one.
Even better would be to permanently correct the genetic variant in a patient’s genome. A new technique that enables precise “genome editing” in living organisms—called CRISPR/Cas9—might one day provide this capability. Yet that would also be a tremendous responsibility to undertake, even for healthcare providers. It’s messing with nature, and not everyone is going to be on board with that. Case in point: earlier this year, to the alarm and distress of the larger biomedical research community, a group in China used CRISPR/Cas9 to modify the genome of human embryos as a “proof-of-principle” experiment.
Although their success rate was abysmal, and their work widely criticized by the rest of the research community, the Chinese group brought forth a discussion that we need to have about the capabilities of genetic technology and what it means for our future.
Dan Koboldt is a genetics researcher who has co-authored more than sixty publications in Nature, Science, The New England Journal of Medicine, and other journals. Every fall, he disappears into Missouri's dense hardwood forests to pursue whitetail deer with bow and arrow. He lives with his wife and three children in St. Louis, where the deer take their revenge by eating all of the plants in his backyard.
Dan also writes fantasy and science fiction. His debut novel The Rogue Retrieval, about a Vegas magician who infiltrates a medieval world, will be published by Harper Voyager on January 19th, 2016.