by Matt Wood

It’s the year 2000. A research group at an EU-based pharmaceutical company discusses an edgy, ambitious project involving whole-genome RNA sequencing. Like many nascent research projects, the informatics requirements are steep. The cost and process burdens of acquiring infrastructure for terabytes of data and teraflops of compute power, for an undefined number of use cases and methods, could dampen the spirits of even the most enthusiastic team. Without extra funding, the idea will likely sit on the backburner until the next fiscal year.
Now fast forward and add cloud computing to the mix. Resource constraints are replaced with enabling technologies such as scalable storage, elastic compute and dynamic analysis platforms. Where IT procurement and lengthy technical reviews once cast long shadows over research, organisations are now accessing on-demand technology infrastructure with no upfront cost or negotiations.
With a cloud computing strategy in hand, the lab funds the project out of discretionary budgets. The project begins by sending samples to a sequencing service provider, who ships the results to a secure cloud environment. The necessary storage is available on demand with pay-as-you-go pricing: the researchers pay nothing before the first byte is written, and nothing after the final file is removed. The collaborators get straight to work performing large-scale, distributed computations. Sharing results becomes as easy as sending an email. This is one of hundreds of examples of how cloud computing is changing IT acquisition: the way organisations acquired IT over the past three decades bears little resemblance to the world scientists are living in today.
Researchers in industry and academia use computers in ever-increasing quantities for molecular simulation, virtual screening, and DNA and protein sequence analysis. In the past, organisations purchased expensive, purpose-built cluster resources and data management systems. This required significant upfront investment, and labs were often surprised by the management costs of running dedicated infrastructure. Those unable to afford hardware would use shared infrastructure, often at supercomputing centres, and wait weeks in long queues for an opening.
These pressures have scientists earnestly exploring scalable, on-demand IT infrastructure that can meet the unpredictable demands of research and development. In six months, a project’s technology requirements may change three or four times (or more), so nimble technology is key. Scientists also benefit from IT resources that provide an affordable model for global collaboration, as in the 1,000 Genomes Project, the largest study of genetic differences between people to date. The project offers a comprehensive resource on human genetic variation and involves participants from Europe, North America, South America and Asia who share data and analysis in real time. To make the data available to a broader audience and to further innovation in genomic research, the 1,000 Genomes Project data can also be accessed through the cloud. This means scientists with modest computers and infrastructure have the same access to the raw data as those with supercomputer technology. This is the type of sharing and collaborative model life science professionals are getting excited about today.
A simple way to explain cloud computing is that instead of buying, owning and maintaining your own data centres or servers, you purchase compute power and storage services from third-party infrastructure providers on an as-needed basis. Database, messaging and content distribution services are also available in the cloud. The provider manages and maintains the entire infrastructure in a secure environment, and users interact with resources via the Internet. Capacity can grow or shrink almost instantly.
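The pay-as-you-go idea can be reduced to a single function of actual usage: no upfront purchase, no idle hardware, cost proportional to consumption. The sketch below is purely illustrative, and the rates are hypothetical examples, not any provider's real pricing.

```python
# Minimal sketch of a pay-as-you-go bill. The rates are hypothetical
# illustrations, not any real provider's pricing.

STORAGE_RATE_GB_MONTH = 0.10   # hypothetical: price per GB stored per month
COMPUTE_RATE_HOUR = 0.08       # hypothetical: price per instance-hour

def monthly_bill(gb_stored, instance_hours):
    """Return the month's bill. Zero usage means zero cost: there is
    no upfront purchase and no idle hardware to keep paying for."""
    return gb_stored * STORAGE_RATE_GB_MONTH + instance_hours * COMPUTE_RATE_HOUR

# Before the project starts, the bill is zero; it then scales with use.
print(monthly_bill(0, 0))        # 0.0
print(monthly_bill(2000, 500))   # 2000*0.10 + 500*0.08 = 240.0
```

The point of the sketch is the shape of the cost curve: it starts at zero and tracks consumption, which is what lets a lab fund a project from discretionary budgets.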
For an offering truly to be cloud computing, it must have the following characteristics: no upfront capital expenditure, pay-as-you-go pricing, elastic capacity, fast time to market (think server capacity in minutes) and the removal of undifferentiated heavy lifting. All of this must be delivered in compliance with regulatory requirements and without sacrificing data security.
An example of a company that has taken advantage of the on-demand nature and scalability of cloud computing is Cambridge-based Eagle Genomics, a bioinformatics services and software company specialising in genome content management. Eagle Genomics stores and analyses large quantities of genomic data for its customers. Recent projects have included biomarker discovery, microarray probe mapping and genome assembly from next-generation sequencing data. At the heart of Eagle’s analysis projects lies an adapted version of the eHive workflow management system. Eagle’s modifications enable eHive to scale automatically, starting up and spinning down resources in response to capacity demands. Eagle could only do this cost-effectively with its technology infrastructure in the cloud, which avoids both the expense of purchasing and maintaining HPC hardware in-house and the waste of under-utilised resources.
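The spin-up/spin-down behaviour described above comes down to a scheduling decision: compare the work queued against the workers running. The sketch below is a hypothetical illustration of that decision, not eHive's actual code; the threshold, the worker cap and the function names are all assumptions.

```python
# Illustrative autoscaling decision, in the spirit of a workflow manager
# that launches and terminates cloud workers on demand. The thresholds
# and names are hypothetical, not taken from eHive.

def workers_needed(pending_jobs, jobs_per_worker=10, max_workers=100):
    """Target worker count: enough to cover the queue, capped by a
    budget limit, and zero when the queue is empty (so idle machines
    are terminated rather than billed)."""
    if pending_jobs <= 0:
        return 0
    # Ceiling division: one worker per `jobs_per_worker` queued jobs.
    target = -(-pending_jobs // jobs_per_worker)
    return min(target, max_workers)

def scaling_action(running, pending_jobs):
    """How many workers to launch (positive) or stop (negative)."""
    return workers_needed(pending_jobs) - running

print(scaling_action(0, 45))   # 5  -> launch five workers for 45 queued jobs
print(scaling_action(5, 0))    # -5 -> queue drained, spin everything down
```

Run in a loop against the job queue, a function like this keeps capacity tracking demand, which is exactly the property that makes the cloud cheaper than a fixed-size cluster for bursty workloads.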
Another life sciences organisation taking advantage of cloud computing is the European Bioinformatics Institute (EBI), home to cutting-edge research that uses computers to study life science problems. One of the largest projects currently underway at the EBI is the genome browser Ensembl (www.ensembl.org), a central tool in bioinformatics research worldwide. When working as a global team, latency can become an issue, so the EBI has reduced the latency of the Ensembl service for its US collaborators by moving the service to the cloud. This makes the large amounts of information hosted in the genome databases more readily available, so researchers around the world can spend more time making discoveries and less time accessing the information.
Yet another example is Galaxy, an open-source web application and analysis platform designed to enable reproducible, shareable science. The team recently made Galaxy available as a cloud-optimised, deployable solution, allowing researchers anywhere in the world to run exactly the same pipelines and share data and results without investing in hardware or worrying about managing servers.
These examples show innovative uses of the cloud, but how quickly is this catching on in the life science industry? All signs point to rapid adoption. Through discussions with scientists, engineers, developers, and the CIOs and CTOs of start-ups and enterprises alike, I’ve consistently heard why the cloud is a growing part of their future plans. The most frequently cited reasons include:
IT consolidation is on the rise. Driven by the need to optimise expenses and gain efficiencies, the biopharma industry is consolidating IT to focus on core expertise and reduce capital expenditure. This includes IT infrastructure, which most do not see as a competitive advantage. As organisations grow and work is distributed to scientists across the globe, technology infrastructure running in the cloud will improve efficiency and utilisation in tandem with growth.
Agility is becoming necessary. When purchasing dedicated hardware, organisations can take months to procure, provision and make resources available to users. That can feel like years in the fast-moving scientific world, and it makes innovating on the science nearly impossible. IT managers and CIOs have discovered that, with the cloud’s ability to rapidly provision resources, scientists can do their jobs with minimal resource contention. Organisations get to say no less often and support more projects.
New methods lead to new collaborations. Science is all about collaboration, increasingly so as scientists investigate biology at a systems level and collaborate with experts in specialised research functions. This has led to more distributed partnerships, both public-private and between academic institutions and companies. The availability of shared data spaces with easy access to on-demand computing resources makes the cloud very attractive today. Public access to data sets and their associated analysis tools is creating an ecosystem for data sharing and analysis that could portend a larger trend in scientific collaboration.
Scientific practices are evolving. From its early days, cloud computing has enabled new business models. Many start-ups have flourished because access to cloud services empowered them to create innovative solutions that take advantage of massively distributed architectures without the capital investment of building those resources themselves. Life sciences are following a similar trend. We can expect to see more start-ups emerge to provide analysis and data support roles. Instrument and service providers are also leveraging the cloud to distribute data and provide on-demand access to computing pipelines. Of course, all of this is happening at a greater scale, and at a lower cost, than would be possible outside the cloud.
Computing paradigms have shifted. Large-scale modelling and simulation, and especially large-scale data analysis, challenge existing infrastructure and workflow methodologies. Data-intensive workloads require massively parallel frameworks, ideally built on top of commodity hardware. Such systems, like Hadoop and non-relational databases, are becoming part of the solution for difficult computing challenges, and these frameworks are now tuned to run well in the cloud. The availability of dynamic cluster computing resources in the cloud has multiplied the capabilities researchers can draw on to solve scientific problems at massive scale. Before cloud computing, these problems went untouched or were addressed at scales that limited their utility.
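The map/reduce pattern that frameworks like Hadoop scale across commodity machines can be illustrated in miniature: independent map tasks process separate data shards, and a reduce step merges their partial results. The toy in-process example below (with made-up sequencing reads and a k-mer counting task) sketches the pattern itself, not Hadoop's API.

```python
# A toy, single-process illustration of the map/reduce pattern.
# The reads and the k-mer counting task are invented for illustration.

from collections import Counter
from functools import reduce

def map_kmers(sequence, k=3):
    """Map step: count every k-length substring in one read."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def reduce_counts(a, b):
    """Reduce step: merge two partial count tables."""
    return a + b

reads = ["GATTACA", "TACAGAT", "ATTACCA"]  # hypothetical sequencing reads

# On a real cluster the map calls run on separate machines over separate
# data shards; the reduce then merges their partial results.
totals = reduce(reduce_counts, map(map_kmers, reads), Counter())
print(totals["TAC"])  # 3 -- "TAC" occurs once in each of the three reads
```

Because each map call touches only its own shard and the reduce is associative, the same computation parallelises across as many machines as the data demands, which is precisely what makes these frameworks a fit for elastic cloud clusters.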
A cloudy, data-driven future
Economics, a desire to foster more collaboration and the need for faster innovation cycles are leading the life science industry to a new world in which scientists have instant access to vast, scalable resources. In the next few years, third-generation sequencing, massive metagenomics sequencing projects and the increased availability of molecular diagnostics will produce unprecedented amounts of data at relatively low cost. Cloud computing will play a key role in providing the technology infrastructure for the data-driven future of life science.
Category: Guest Posts