Summary: Chinese scientists have developed a new hybrid genome assembly technology that significantly lowers the sequencing coverage required to assemble genomes from third-generation sequencing reads.
A team from the Kunming Institute of Zoology of the Chinese Academy of Sciences, led by Prof. Ya-Ping Zhang and Prof. Zhanshan (Sam) Ma, has developed a novel hybrid genome assembly technique for the latest third-generation DNA sequencing (3GS) technologies. The technique significantly lowers the sequencing coverage that 3GS requires and correspondingly reduces 3GS sequencing costs. It builds on two 3GS assembly software packages, DBG2OLC and SPARC, previously developed by Dr. Chengxi Ye and Prof. Sam Ma. On the hardware (DNA sequencer) side, the technique takes advantage of 10x-Genomics® technology, which integrates a novel barcoding strategy with Illumina® next-generation sequencing (NGS) to reveal long-range sequence information and is therefore particularly suitable for augmenting the assembly of long, error-prone 3GS reads. In a test, the new hybrid assembly technique lowered the 3GS sequencing coverage (Oxford Nanopore®, which some consider fourth-generation sequencing) required to assemble a human genome from 35X to 7X. Based on current market pricing of the various sequencing technologies, the lower coverage translates into a cost reduction of approximately 70%. The technique is likely to make current 3GS technologies more competitive with the currently prevalent NGS technologies, open new applications for 10x-Genomics® technology, and reshape the gene sequencing market.
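The coverage figures above can be put into rough numbers. The following Python sketch is illustrative only and is not taken from the publication: it assumes a human genome size of about 3.1 billion bases, assumes that 3GS cost scales linearly with the number of bases sequenced, and uses hypothetical function names. Under those assumptions, going from 35X to 7X cuts the 3GS data volume by 80%, so the reported overall saving of about 70% would leave roughly 10% of the original budget for the complementary 10x-Genomics® data.

```python
# Back-of-the-envelope coverage arithmetic (illustrative assumptions only).

HUMAN_GENOME_BP = 3.1e9  # ASSUMPTION: approximate human genome size in bases


def bases_needed(coverage, genome_size_bp=HUMAN_GENOME_BP):
    """Total bases to sequence for a given average coverage (e.g. 35 for 35X)."""
    return coverage * genome_size_bp


def fractional_reduction(before, after):
    """Fraction by which a quantity shrinks when going from `before` to `after`."""
    return 1.0 - after / before


baseline_coverage = 35  # 3GS coverage without the hybrid technique (from the article)
hybrid_coverage = 7     # 3GS coverage with the hybrid technique (from the article)
stated_saving = 0.70    # approximate overall cost reduction (from the article)

# Reduction in 3GS data volume; equals the 3GS cost reduction only under the
# ASSUMPTION that 3GS cost is proportional to the number of bases sequenced.
reduction_3gs = fractional_reduction(baseline_coverage, hybrid_coverage)

# Share of the baseline budget implied for the complementary short-read data:
# remaining total cost (1 - 0.70) minus remaining 3GS cost (1 - 0.80).
implied_extra_share = (1.0 - stated_saving) - (1.0 - reduction_3gs)

print(f"3GS bases at 35X: {bases_needed(baseline_coverage):.3e}")
print(f"3GS bases at 7X:  {bases_needed(hybrid_coverage):.3e}")
print(f"Reduction in 3GS data volume: {reduction_3gs:.0%}")
print(f"Implied budget share for complementary data: {implied_extra_share:.0%}")
```

The gap between the 80% drop in 3GS data and the 70% overall saving is consistent with a hybrid approach, since the short-read data that augments the long reads is not free.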
Gene (DNA) sequencing is one of a handful of indispensable technologies for modern life science and biotechnology. Without it, for example, the millennium-era Human Genome Project (HGP) would not have been possible. The HGP relied on first-generation sequencing technology, and its price tag was $3.8 billion. It was the second-generation technology, also known as next-generation sequencing (NGS), that dramatically lowered the cost of DNA sequencing by using a massively parallel approach to achieve exceptionally high throughput in reading gene sequences. High throughput and low cost have allowed NGS to dominate the DNA sequencing market since its development. The newest, third-generation sequencing (3GS; some call Nanopore® fourth-generation, or 4GS) reads nucleotide sequences at the single-molecule level. This differs drastically from the previous generations, which break long strands of DNA into small segments and then infer the nucleotide sequences through amplification and synthesis. 3GS generates ultra-long reads (up to 1 Mb), which make it possible to eliminate gaps and effectively resolve repeats in genome assembly. This advantage of long reads has critical implications for both genome science and the life sciences in general. However, 3GS technologies suffer from high base-level error rates and high sequencing costs. Because of the mechanistic differences between NGS and 3GS, as well as economies of scale, it is hardly possible to resolve both issues with 3GS hardware (sequencer) technology alone. This is where genome assembly software comes to the rescue.
A DNA sequencer without genome assembly software is much like a PC without an operating system. It is the computational work of genome assembly software that assembles the sequences of nucleotide bases (A, C, G, T) into the genome of an organism, such as the human genome. It is the genome assembled by software, rather than the raw sequencing reads produced by DNA sequencers, that is meaningful to biologists. The ultra-high error rates of 3GS sequencers also make genome assembly from long 3GS reads rather challenging. For example, back in 2014, one step of 3GS assembly (pairwise alignment) took nearly half a million (400,500) CPU hours on a Google cluster to assemble a human genome. The DBG2OLC software developed by Prof. Sam Ma's team cut the computational time of that step to 6 hours on an inexpensive workstation with only 128 GB of memory; it does so by circumventing the alignment altogether, and the total assembly time is under 1,500 CPU hours. To this day, DBG2OLC remains the fastest and most memory-efficient software for the de novo assembly of 3GS/4GS reads. Another software package the team developed, SPARC, lowered the error rate of 3GS assembly to under 0.5% while running 80% faster than other packages. The integration of DBG2OLC, SPARC, and SparseAssembler (an ultra-fast NGS assembler published by Ye, Ma et al. in 2012) constitutes the backbone of the newly developed hybrid genome assembly technology. The significant reduction in 3GS sequencing coverage achieved by the new hybrid assembly technology could mean huge opportunities for both 3GS (particularly Oxford Nanopore®) and 10x-Genomics® technologies.
Publication: Zhanshan (Sam) Ma, Lianwei Li, Chengxi Ye, Minsheng Peng & Ya-Ping Zhang (2018) Hybrid assembly of ultra-long Nanopore® reads augmented with 10×genomics® contigs: Demonstrated with a human genome. Genomics, vol. 110, https://doi.org/10.1016/j.ygeno.2018.12.013