Bence Kover introduces how advances in molecular biology and information technology brought the cost of genetic sequencing down from three billion dollars to only about $1,000, allowing biotech to transform our lives.
Due to the availability of immense computing power, data has become the most valuable currency of the 21st century. And what data could be more valuable than DNA, the code of life itself? Our understanding of this molecule transformed biology from the hobby of eccentric gardeners into the central scientific discipline of the past 70 years. We are now able to read DNA with high fidelity, build large databases, draw conclusions on how to improve medicine and agriculture, and understand our genetic past and indeed life itself. Recently, molecular genetics has expanded beyond academia: it is now the playground of large corporations and insurance companies, approaching the consumer market and integrating into our everyday lives.
The double-helix structure of DNA contains a very simple but powerful code of only four letters: A, C, T, G. These letters, or so-called bases, carry the information that codes for the proteins that make up the chemical factories (enzymes) and structural elements of our bodies. Determining the order of bases in DNA is called sequencing. The first sequencing methods appeared in the 1970s, but they could only "read" a few hundred bases at a time, using a manual, tedious process that required professional training. As our understanding of nanoscience, biochemistry, and informatics improved, new sequencing methods appeared. The Human Genome Project, finished in 2003, marked a milestone by sequencing the human reference genome: a 10-year collaboration of thousands of scientists that cost 3 billion dollars.
Initially, the project relied on a relatively tedious and slow technique called Sanger sequencing, which typically reads only hundreds of bases, far short of the 3 billion letters of the human genome. By the end, sequencing had been automated and machines had taken over the job. One can only imagine the difficulty of assembling tiny hundred-base reads into a 3-billion-letter genome. It was the improvements in information technology that allowed scientists all over the world to sequence the genome bit by bit and piece it together using sophisticated software. The project produced the so-called reference genome, a template for future sequencing efforts: instead of assembling a genome from scratch, they can simply align their reads to this reference, making the computation far easier and faster.
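The core idea of reference-based alignment can be sketched in a few lines. The snippet below is a toy illustration (the reference and reads are made up, and real aligners such as BWA use indexed, mismatch-tolerant search rather than exact substring matching), but the principle is the same: each short read is placed at the position of the reference where it fits.

```python
# Toy reference genome and short sequencing reads (both invented for
# illustration; real reads are hundreds of bases, references billions).
reference = "ACGTTAGCCGATACGTTAGC"
reads = ["TAGCCGAT", "GATACGTT", "ACGTTAGC"]

for read in reads:
    # Exact substring search stands in for a real alignment algorithm;
    # find() returns -1 if the read does not map to the reference.
    position = reference.find(read)
    print(read, "maps at position", position)
```

Note that the last read occurs twice in this toy reference: repeated sequence is precisely what makes assembling a genome from scratch so much harder than aligning to a known template.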
In the early 2000s, so-called second-generation sequencing technologies revolutionized biotechnology. These approaches all used some form of DNA amplification to create a large pool of fragmented sample, then sequenced it by synthesizing DNA with the sample as a template. Each incorporated base was detected through a luminescent, fluorescent, or pH signal. Among these methods, the leader was, and still is, Illumina's dye sequencing. This method creates clusters of small DNA fragments, adds a fluorescently dyed base to each growing chain, takes a picture with a camera, then adds the next coloured base, and so on. This happens simultaneously for each DNA fragment on millions of spots across a tiny chip (Figure 1). The method covers the entire genome dozens of times and produces terabytes of image data; computation is then required to make sense of these pictures and assemble the genome. It is this method that has dominated the past 15 years and driven the genomics revolution to where it is today.
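At its heart, turning those pictures into sequence ("base calling") means asking, for each cluster in each cycle, which of the four dye channels lit up brightest. The sketch below uses invented intensity values and ignores everything that makes real base calling hard (crosstalk between dyes, phasing errors, quality scores), but it shows the basic logic.

```python
# Hypothetical per-cycle intensities for the four dye channels (A, C, G, T)
# at one cluster. In each cycle, the brightest channel names the base.
cycles = [
    {"A": 0.9, "C": 0.1, "G": 0.0, "T": 0.2},
    {"A": 0.1, "C": 0.8, "G": 0.1, "T": 0.0},
    {"A": 0.0, "C": 0.2, "G": 0.9, "T": 0.1},
]

# Pick the channel with the maximum intensity in every cycle.
read = "".join(max(cycle, key=cycle.get) for cycle in cycles)
print(read)  # ACG under these toy intensities
```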
Figure 1. A picture of Illumina sequencing data.
Illumina is currently the biggest biotech company, with a market cap of 55 billion dollars and a near monopoly on the genomics sector. However, it is perhaps no coincidence that Illumina has made multiple acquisitions in recent years to solidify its future, as third-generation sequencing methods might overthrow its monopoly. Third-generation methods sequence single DNA molecules without any amplification, providing real-time analysis, faster sequencing, and individual reads of a few million bases, compared to the hundreds of bases that previous methods could achieve. The most promising of these is Oxford Nanopore sequencing, which measures the electric current around a nanopore as a single DNA molecule is translocated through it. The method has one major limitation: an accuracy of only 92-97%, considerably lower than the 99.9% of other methods. Once a way around this problem is found, nanopore sequencing will most likely take over the industry and bring the cost of sequencing a human genome under 100 dollars. The exponential trend of the genomics industry will not stop for a while, and will undoubtedly continue to transform science (Figure 2).
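One standard way to compensate for noisy individual reads, whatever the platform, is to sequence the same stretch of DNA many times and take a per-position majority vote. The toy example below (invented sequences, one error per read) sketches why deep coverage lets a consensus be more accurate than any single read.

```python
from collections import Counter

# Three noisy reads of the same DNA stretch; each contains one error,
# yet a simple per-position majority vote recovers the true sequence.
true_seq = "ACGTACGT"
reads = ["ACGTACGA", "ACGAACGT", "TCGTACGT"]

consensus = "".join(
    Counter(column).most_common(1)[0][0]  # most frequent base per column
    for column in zip(*reads)
)
print(consensus == true_seq)  # True
```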
Figure 2. The rate of decrease in the price of DNA sequencing has outpaced Moore's law.
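The comparison with Moore's law is easy to check from the figures in this article: the Human Genome Project cost about 3 billion dollars, and roughly 15 years later a genome cost about $1,000. Had sequencing costs merely halved every two years, as Moore's law would suggest, the drop would have been far smaller.

```python
# Figures from the article: ~$3 billion (Human Genome Project, 2003)
# down to ~$1,000 roughly 15 years later.
cost_start, cost_end, years = 3e9, 1e3, 15

actual_drop = cost_start / cost_end   # actual cost reduction factor
moores_law_drop = 2 ** (years / 2)    # halving every 2 years, per Moore's law

print(f"actual: {actual_drop:,.0f}x cheaper")        # 3,000,000x
print(f"Moore's law: {moores_law_drop:,.0f}x cheaper")  # ~181x
```

A three-million-fold drop versus roughly 181-fold: sequencing outpaced Moore's law by more than four orders of magnitude over that period.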
The availability of large genomic datasets has enabled us to study genetic diseases and to understand the subtle differences between individuals that can lead to different drug responses. These databases also let us scan through the genomes of organisms: for example, once the well-known CRISPR system had been discovered, it only took a scan of bacterial datasets to reveal that thousands of similar CRISPR systems exist in nature. We have also mapped evolutionary relationships and unveiled our past in great detail, inferring what the universal common ancestor of all complex life must have been like. The power of DNA databases will only increase as more data comes in and as accessory technologies improve. One such technology is AI-based protein structure determination powered by Google's DeepMind, which, together with genomics data, will be revolutionary.
In some cases, we do not actually need to read the complete book of the genome; it is enough to read the headline of each chapter, which is certainly faster and cheaper. More scientifically, this means looking only at well-known characteristic bases in the DNA (so-called SNPs, single-nucleotide polymorphisms) that tend to go together with certain traits. For example, if a given location has an "A" base, the person carries a certain disease (or is more likely to have brown eyes, etc.). The well-known direct-to-consumer genetic testing companies such as 23andMe or MyHeritage do exactly this, and are able to read our genome for a rather cheap price. These products are not without controversy, though, as some of these companies' claims, such as "determining your fear of heights", are rather pseudoscientific.
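Conceptually, this kind of genotyping reduces to a table lookup. The sketch below uses an entirely made-up SNP identifier and made-up associations (real SNP IDs look like "rs12913832", and real reports combine many SNPs into probabilities rather than single facts), but it captures the headline-reading idea: check one known position, report the associated tendency.

```python
# Hypothetical lookup table: (SNP id, observed base) -> reported tendency.
# Both the SNP id and the associations are invented for illustration.
snp_table = {
    ("rs0000001", "A"): "higher likelihood of brown eyes",
    ("rs0000001", "G"): "higher likelihood of blue eyes",
}

genotype = ("rs0000001", "A")  # one base read at one known location
print(snp_table.get(genotype, "no known association"))
```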
Furthermore, there is hardly anything more personal than one's genetic material, and handing it over to such companies can lead to the data being shared with third parties. It is not hard to imagine how valuable this is for insurance companies or employers, who would want to know all the genetic risks of a given individual. Being denied insurance based solely on genetic risk factors is certainly unfair; from the point of view of insurance companies, however, it is a vital part of risk assessment. How this plays out in the future will be very interesting to see.
The concept of DNA will almost certainly become a more central part of people's lives through GM foods, gene therapies, personalized medicine, and synthetic biology. However, none of these would be possible without the exponential improvements in sequencing over the past decades.