Six Quirks of the Human Genome
Since the discovery of the DNA double helix by Watson and Crick in 1953, our knowledge of the genetic code has grown exponentially. Advances in technology have driven much of that discovery, including the rise of next-generation DNA sequencing instruments that I wrote about in issue #108. One of the best parts about working as a genetics researcher is that the human genome continues to surprise us in new and puzzling ways.
Where are the Genes?
The human genome comprises 3.2 billion base pairs spread across 24 distinct chromosomes. In 1990, as the Human Genome Project (HGP) was just ramping up, the NIH estimated that it might contain 100,000 protein-coding genes. Five years into it, researchers revised that down to around 50,000. When the HGP published its draft sequence in 2001, the estimate was 30,000. Now, with the genomes of thousands of individuals in hand, we know that the number of protein-coding genes is fewer than 20,000, and that they occupy about 1.5% of the genome. This is a shockingly low number, given how complex we claim to be as organisms. The average car has 30,000 different parts (according to Toyota), so how can we have only 20,000 genes?
Unlike the parts in a car, however, a lot of our genes serve multiple purposes. Even though the number of genes seems limited, many of them are transcribed in different ways, incorporating different combinations of exons (the blocks of sequence present in mature RNA transcripts), and starting or stopping at different places. Thus, while we might have fewer than 20,000 genes, they produce hundreds of thousands of unique RNA transcripts that are translated into proteins.
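To put those percentages in concrete terms, here is a quick back-of-the-envelope sketch in Python. It uses only the round figures quoted above, and the per-gene average is a rough illustration rather than a measured value.

```python
# Back-of-the-envelope arithmetic using the round figures quoted in the text.
GENOME_SIZE_BP = 3_200_000_000   # about 3.2 billion base pairs
CODING_FRACTION = 0.015          # about 1.5% of the genome codes for protein
GENE_COUNT = 20_000              # rounded upper bound on protein-coding genes

# Number of bases that actually code for protein.
coding_bp = round(GENOME_SIZE_BP * CODING_FRACTION)

# Rough average amount of coding sequence per gene (illustrative only).
avg_coding_bp_per_gene = coding_bp / GENE_COUNT

print(f"Protein-coding bases: about {coding_bp:,}")
print(f"Average coding sequence per gene: about {avg_coding_bp_per_gene:,.0f} bp")
```

In other words, the protein-coding portion of a 3.2-billion-base genome adds up to roughly 48 million bases.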
Much of the 98.5% of DNA that does not code for proteins is nevertheless important, since the regulation of which genes are active at which times (and in which tissues) is like conducting a large, elaborate symphony. Many of those bases are transcribed into things like ribosomal RNA and transfer RNA, which aid in protein synthesis. Also, non-coding elements like transcription factor binding sites, enhancers, and repressors play critical roles in gene regulation. All of these elements conspire to help those 20,000 genes make a very complex organism.
The Virus Invaders
There are hundreds of different species of viruses that infect humans. Virtually all of them reproduce by exploiting our own cellular machinery, but one class does so by integrating its viral genome into ours. Retroviruses, as they’re called, count HIV and HTLV (human T-lymphotropic virus) among their members, and are infamous for causing latent and often incurable infections. The integrated viral genome has a distinctive pattern, which essentially boils down to three gene regions (gag, pol, and env) sandwiched between two repetitive structures called long terminal repeats.
The gag and env regions encode the virus capsid—basically its body—and protective envelope. When a retrovirus invades a host cell, the pol region encodes the enzymes it needs to reproduce and integrate its genetic code into the host genome. Then, it can produce more viruses that go out and infect other cells.
Retroviral genome structure. Most retroviral genomes contain three gene regions (Gag, Pol, and Env) flanked by repetitive sequences called long terminal repeats (LTRs). Credit: Dan Koboldt, 2015.
If the idea of a virus integrating its genome into yours makes you uncomfortable, you won’t like this next part. The structure of an integrated retrovirus is somewhat distinctive, and when we finished the sequence of the human genome, we noticed thousands of them already in it. Human endogenous retroviruses, or “fossil viruses,” make up about 8% of our 3.2 billion base pairs. And many of them are active in our cells, producing transcripts that are made into proteins. They’re not infectious because they don’t produce a complete virus particle, and some of them probably serve a symbiotic function.
The Gene for Speed
Skeletal muscles are made up of long, cylindrical cells called muscle fibers. They come in two types: slow-twitch fibers, which are the most energy-efficient, and fast-twitch fibers, which generate more force. There’s a protein, called alpha-actinin-3, that’s present only in fast-twitch fibers. Back in 2003, researchers found that a mutation in ACTN3, the gene encoding alpha-actinin-3, was associated with athletic performance. This mutation disrupts the function of alpha-actinin-3, and it’s present in 20-50% of people worldwide.
If you have one mutated copy, you still have a working copy of ACTN3, because we have two copies of most genes. But if you inherited two mutated copies, you won’t have any alpha-actinin-3. This does not seem to have a negative impact on health, but don’t count on making it to the Olympics. It turns out that most world-class athletes, particularly sprinters, have at least one working copy of ACTN3. Many have two.
Interestingly, the frequency of the ACTN3 mutation differs considerably among world populations. It’s more common in European and Asian populations (50-80%) than in African populations (15-20%). Given this disparity, it’s tempting to theorize that a genetic advantage explains why countries with considerable African heritage (such as Jamaica) often produce world-class sprinters. Most experts agree, however, that the genetic influence is only a small part of what’s required to make an Olympic gold medalist. After all, the final heat of the 100m dash pits numerous athletes against one another, most of whom will have alpha-actinin-3 in their muscles. Only one of them can win it, and odds are his name will be Usain Bolt.
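The difference between carrying one mutated copy and carrying two can be sketched with the standard Hardy-Weinberg proportions from population genetics. This is a minimal illustration, assuming random mating and no selection; the allele frequencies below are hypothetical values chosen for the example, not the population figures quoted above.

```python
def genotype_frequencies(q):
    """Hardy-Weinberg genotype proportions for a mutation with allele frequency q.

    Returns a tuple: (no mutated copies, one mutated copy, two mutated copies).
    Assumes random mating and no selection, which is a simplification.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    p = 1.0 - q  # frequency of the working version of the gene
    return p * p, 2 * p * q, q * q


# Hypothetical allele frequencies, for illustration only.
for q in (0.1, 0.4):
    none, one, two = genotype_frequencies(q)
    print(f"q = {q:.1f}: {none:.0%} no mutation, {one:.0%} one copy, {two:.0%} two copies")
```

Because the two-copy fraction scales with the square of the allele frequency, a mutation that is only moderately more common in one population leaves a much larger share of that population with no working copy at all.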
The Circular Chromosome from Mom
When scientists talk about the genome, we typically refer to the 23 pairs of chromosomes: 22 pairs of autosomes plus the sex chromosomes, X and Y. Each chromosome is a single long molecule of DNA that, although it’s wound around protein complexes called nucleosomes, is essentially linear. But there’s another chromosome in all of our cells that often gets less attention: the mitochondrial chromosome. Mitochondria, as you might remember, are the energy-producing organelles of a cell. They contain a small circular chromosome that encompasses about 16,500 base pairs. It includes 37 genes (13 of them protein-coding), most of which are vital for mitochondrial function.
The mitochondrial chromosome. The human mitochondrial genome is around 16,500 base pairs long and contains 37 genes, 13 of which code for proteins. Credit: "Mitochondrial DNA en" by Shanel.
Most cells contain numerous mitochondria, meaning that there might be hundreds or thousands of mitochondrial chromosomes, compared to just two copies of each autosome. Because there are so many copies, the mitochondrial genome ends up sequenced at far greater depth than any of the other chromosomes. We simply can’t NOT sequence it, because there are so many copies relative to the rest of the genome.
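A rough sketch of why this happens: if sequencing reads are sampled evenly from all the DNA in a cell, the depth at any locus scales with how many copies of it are present. The function name, the 30x nuclear depth, and the 1,000 copies per cell below are all hypothetical choices for illustration, and real copy numbers vary widely by tissue.

```python
def expected_mito_depth(nuclear_depth, mito_copies_per_cell, nuclear_copies_per_cell=2):
    """Estimate mitochondrial sequencing depth from nuclear depth.

    Assumes reads are sampled uniformly from all DNA in the cell, so depth
    is proportional to copy number. This ignores library-prep biases.
    """
    depth_per_copy = nuclear_depth / nuclear_copies_per_cell
    return depth_per_copy * mito_copies_per_cell


# Hypothetical example: 30x nuclear depth, 1,000 mitochondrial chromosomes per cell.
depth = expected_mito_depth(nuclear_depth=30, mito_copies_per_cell=1000)
print(f"Expected mitochondrial depth: about {depth:,.0f}x")
```

Under these assumptions, a routine 30x genome would cover the mitochondrial chromosome thousands of times over without anyone trying.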
Another unique aspect of mitochondrial DNA is that, in most animals, it’s inherited almost exclusively from the mother. There’s a practical reason for this: an egg contains about 200,000 mitochondrial chromosomes, and the average sperm has about 5. Recent evidence suggests that paternal mitochondria aren’t just outnumbered, but are selectively destroyed upon fertilization. No matter the cause, maternal inheritance has made mitochondrial DNA a useful tool for tracing maternal lineages through human history (analogous to the way that the Y-chromosome is used to trace paternal lineages).
Like nuclear DNA, mitochondrial DNA is susceptible to mutation and DNA damage, especially in the presence of free radicals. This is important because mutations in mitochondrial DNA can disrupt essential components of the cell’s energy-production machinery, and have been linked to several disorders with an age-related component, such as deafness. Some researchers have speculated that mitochondrial DNA might play an important role in the process of ageing. Wouldn’t it be fascinating if this teeny-tiny chromosome were the key to reversing that process?
You Didn’t Need Those Genes, Did You?
If you compare the genomes of two unrelated people, you’ll find about three million differences between them, mostly in the form of single-base changes called SNPs (pronounced “snips”). Small insertions and deletions (called indels) are the second most common class of genetic variation. There are also large structural alterations to DNA—but until recently, these were thought to be pretty rare.
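To make the idea of single-base differences concrete, here is a toy sketch in Python. The two short sequences are invented for illustration, and real comparisons rely on alignment tools that also handle insertions and deletions. The final calculation uses the round numbers from this article: about three million differences in a 3.2-billion-base genome.

```python
def count_single_base_differences(seq_a, seq_b):
    """Count positions where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Invented toy sequences, not real genomic data.
person_a = "ACGTTAGCCA"
person_b = "ACGTCAGCTA"
toy_differences = count_single_base_differences(person_a, person_b)
print(f"Toy example: {toy_differences} single-base differences")

# Scale of the real thing, using the article's round figures.
fraction_different = 3_000_000 / 3_200_000_000
print(f"Two unrelated genomes differ at roughly {fraction_different:.2%} of positions")
```

Seen this way, three million differences sounds like a lot, but it amounts to about one base in every thousand.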
In 2006, researchers working on the map of human genetic variation (not the sequence itself, but how we differ from one another) were puzzled by inheritance patterns in certain regions of the genome that seemed to violate so-called Mendelian rules of inheritance. The underlying cause turned out to be large genomic deletions, some spanning thousands of bases. Dozens or hundreds of genes had been deleted. And yet, the individuals involved in these genetic studies were all healthy.
Since then, we’ve uncovered thousands of deletions, duplications, inversions, and more complex rearrangements in the genome that seem to segregate in human populations just like SNPs and indels. Many of them seem to have no obvious effect on the individual, though some diseases have been linked to structural variants. Cancer in particular seems to rely on rearranging the genomes of healthy cells to grow and divide unchecked.
Compared to SNPs and small indels, structural variants are far less prevalent in the human genome. Given their size, however, they might ultimately affect a bigger part of it than any other form of genetic variation.
Sickle Cell Disease Versus Malaria
Sickle cell disease (SCD) is an inherited disorder caused by abnormal hemoglobin, a protein in red blood cells that takes up oxygen in the lungs and carries it to the rest of the body. In healthy people, red blood cells are disc shaped (like a donut without a hole), which allows them to slide through large and small blood vessels without any traffic jams. In people with sickle cell anemia (the most common form of SCD), red blood cells are often shaped like a sickle: a curved blade, like a smaller version of the scythe the Grim Reaper carries, that looks like a crescent moon.
This is not a great shape for things that need to slide through narrow blood vessels. Sickled cells can jam up and cause blockages, and because they’re not as flexible, they tend to burst apart. That’s why the red blood cells of SCD patients live for only 10 to 20 days, compared to 90 to 120 days in a healthy person.
Sickle cell disease. In patients with two hemoglobin mutations, red blood cells are sickle-shaped rather than disc-shaped, which can block blood vessels. Credit: "Sickle Cell Anemia" by BruceBlaus.
SCD is a recessive genetic disease, meaning that patients inherit one defective copy of the hemoglobin gene from each parent. The mutations that cause severe recessive disease are usually quite rare due to natural selection. But SCD is somewhat common, and as you’ve probably heard, it occurs most often in people with African ancestry (affecting about 1 in 500 births). It turns out that, although inheriting two copies of mutated hemoglobin is bad news, inheriting only one mutated copy protects against malaria.
Malaria isn’t a genetic disease; it’s caused by a mosquito-borne parasite. But 90% of the 200+ million annual cases of malaria occur in sub-Saharan Africa, where it’s a major cause of mortality. This explains why mutations in the hemoglobin gene are most common in people with African heritage: if you live in a place where malaria is common, a protective mutation offers a major advantage. The effect is so strong that it counteracts the natural selection that would normally remove disease-causing mutations from the population. This phenomenon is called balancing selection, and it’s one of the reasons that we haven’t all evolved into superheroes.
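The tug-of-war behind balancing selection can be sketched with a textbook one-gene model of heterozygote advantage. This is a simplified illustration, not a model of real populations: the two fitness costs below are hypothetical numbers chosen for the example, and the function names are my own. In this model, carriers of one mutated copy have the highest fitness, and the mutation settles at a frequency of s / (s + t), where s is the cost of having no protective copy and t is the cost of having two.

```python
def next_generation(q, s, t):
    """Advance the sickle allele frequency q by one generation of selection.

    s: fitness cost of carrying no sickle allele (more vulnerable to malaria)
    t: fitness cost of carrying two sickle alleles (sickle cell disease)
    Carriers of exactly one sickle allele are the reference, with fitness 1.
    """
    p = 1.0 - q
    w_normal, w_carrier, w_disease = 1.0 - s, 1.0, 1.0 - t
    mean_fitness = p * p * w_normal + 2 * p * q * w_carrier + q * q * w_disease
    # Sickle alleles passed on: half from carriers, all from two-copy individuals.
    return (p * q * w_carrier + q * q * w_disease) / mean_fitness


def simulate(q_start, s, t, generations):
    """Iterate selection for a number of generations and return the final q."""
    q = q_start
    for _ in range(generations):
        q = next_generation(q, s, t)
    return q


# Hypothetical fitness costs, for illustration only.
S_COST, T_COST = 0.1, 0.8
q_final = simulate(q_start=0.01, s=S_COST, t=T_COST, generations=1000)
q_predicted = S_COST / (S_COST + T_COST)
print(f"Simulated frequency: {q_final:.4f}, predicted equilibrium: {q_predicted:.4f}")
```

Whether the mutation starts out rare or common, the simulated frequency drifts toward the same intermediate value: selection neither eliminates the mutation nor lets it take over.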
The human genome is a complex beast that took more than ten years and billions of dollars to sequence for the first time. Now, we can sequence a genome in about a week, for less than $1,500. Thousands of genomes have been finished since the initial draft sequence in 2001, uncovering fascinating tidbits at an unprecedented rate. Even so, the more we study the human genome, the more it becomes apparent that many of its mysteries have yet to be revealed.