Four million people in Northern England rely on the services of Yorkshire Water for clean water and waste treatment, including the processing of 150,000 tons of sewage sludge each year. Part of that processing is done by microorganisms: through anaerobic digestion they can reclaim biosolids and convert them into renewable energy. Professor James Chong, a Royal Society Industry Fellow and microbiologist at the University of York, studies those microorganisms to understand how to make that process more efficient and reduce greenhouse gases that harm the environment. Working with Yorkshire Water, Chong’s group collected sixty gigabases (sixty billion base pairs) of microbial DNA sequence. To analyze the data on high-performance computing (HPC) clusters, Chong turned to his colleagues Dr. John Davey, Bioinformatician in the York Bioscience Technology Facility (BTF), and Dr. Peter Ashton, Head of the Genomics and Bioinformatics Laboratory in the BTF.
Using Oxford Nanopore sequencing, Ashton and his lab can produce “long reads,” each spanning tens or hundreds of thousands of DNA base pairs. Davey then runs software to assemble the reads by joining overlapping pieces of sequence together. “We expect to find hundreds of different genomes in a digester sample, but older sequencing technology which produces very short reads hundreds of base pairs long typically produces assemblies with hundreds of thousands of pieces,” he explains. “With long reads, we typically get assemblies in thousands of pieces, making it much easier to identify the species in the digesters.” But those long reads generate huge datasets with heavy computational demands, especially for disk space. So the team turned to Cloud Technology Solutions (CTS), a UK-based Google Cloud Premier Partner that offers cloud migration, transformation, Big Data and support services, to pilot their workflow on Google Compute Engine’s virtual machines (VMs). The collaboration with Google Cloud and GÉANT enables CTS to offer unique services to the European Research and Education community.
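The idea of assembling reads by joining overlapping pieces can be illustrated with a toy sketch. This is not the software the York team ran, which the article does not name; it is a minimal greedy overlap merge, and the function names, the `min_len` threshold and the example sequence are all hypothetical. Real metagenome assemblers also have to handle sequencing errors, reverse complements and repeats, none of which are modeled here.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no remaining pair overlaps by at least `min_len`
    characters and returns the resulting pieces (contigs).
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        # Join the two sequences, keeping the shared overlap only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical example: three overlapping reads drawn from one short sequence.
reads = ["ATGCGTA", "GTACGTT", "CGTTAGC"]
print(greedy_assemble(reads, min_len=3))
```

In this sketch, the three reads merge into one piece when overlaps of three or more characters are accepted, and stay as three separate pieces if the threshold is raised to five. That loosely mirrors why reads with longer, less ambiguous overlaps tend to give assemblies in fewer pieces.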
"I can ask new questions now, like how that community of microorganisms changes over time and from system to system across the region. The challenges we’re facing in Yorkshire are duplicated all over the country so this project has the potential for significant impact." Professor James Chong, Royal Society Industry Fellow, University of York
Expanding memory capacity to nearly four terabytes
Google Cloud’s memory-optimized machine types are well suited for data analyses that require substantial virtual CPU and system memory; they are also ideal for many resource-hungry HPC applications. Google Compute Engine offers powerful ultra-memory custom machine types with up to 160 cores and 3.88TB of memory. Working with CTS, the York team started running the genome assembly with 3TB of disk space but found they needed even more storage. CTS created a ‘Quick Start’ five-day tailored training package that gave the research team the specific tools and knowledge they needed to get started with their cloud solutions. Within those five days they had solved the problem, completing their pipeline for the first time on a single Google Compute Engine VM set up as a virtual 96-core server attached to a 4x8TB striped LVM partition. Ashton marvels that “we hadn’t been able to run this workflow at all but using Google VMs makes this genome assembly possible, accessible to more researchers, and more affordable.” Davey adds that the shift to long reads makes “metagenome assemblies much more useful because they are easier to analyze. For example, we have been able to identify repetitive CRISPR arrays in the long read assemblies; these arrays are too complex to assemble whole with short reads. CRISPR arrays contain pieces of DNA from viruses that have previously attacked the bacteria, so we can trace the history of the digester ecosystem by studying these sequences. It was easier than I thought to get the sequence data onto the cloud server, and the Compute Engine tools made it easy to track what was happening on the machine, which greatly helped us to diagnose problems.”
"We hadn’t been able to run this workflow at all but using Google VMs makes this genome assembly possible, accessible to more researchers, and more affordable." Dr. Peter Ashton, Head of Genomics and Bioinformatics Laboratory, University of York
As datasets continue to grow, Ashton believes that scalable solutions like Google Cloud will be crucial to next-generation genome assembly. “We used to subdivide projects,” he says, “but now we can do a single project in one run. So we can apply this to larger and larger projects.” For Chong, this workflow takes some of the guesswork out of the analysis and accelerates his progress: “I can ask new questions now, like how that community of microorganisms changes over time and from system to system across the region. The challenges we’re facing in Yorkshire are duplicated all over the country so this project has the potential for significant impact.”