GlaxoSmithKline R&D CDO on combining Hadoop and Docker to boost drug discovery efforts

Mark Ramsey, chief data officer of research and development at GSK, describes how banking on big data, containers and on-premise technologies is speeding up the company’s efforts to uncover new drugs and improve the hit rate of its clinical trials

Pharmaceutical industry estimates suggest that about 90% of drug trials fail, and do not result in new medicines being created and brought to market.

UK-based pharmaceutical giant GlaxoSmithKline (GSK) partly attributes this failure rate to a lack of understanding of how best to target drugs at human diseases biologically and accurately.

To help fill in some of these blanks and increase the chances of success, GSK is pursuing a research and development (R&D) strategy that draws on human genetics data to inform its drug-targeting efforts.

“Genetics data is extremely important in the development of new medicines,” says Mark Ramsey, chief data officer (CDO) for R&D at GSK.

“And the work we have been doing over the last year or so has specifically focused on bringing that data together with the knowledge we have in GSK about how genetics works, and using that information to make decisions around identifying new potential medicines.”

To this end, the organisation agreed a four-year data-sharing and collaboration partnership in July 2018 with US-based personal genomics research firm 23andMe, whose services have been used by more than five million people to gain a personalised view of their own genetic makeup.

This is also one of the motivating factors behind GSK’s decision to make a multimillion-pound investment in the UK Biobank, which focuses on establishing how genetics and environmental factors contribute to the onset of human diseases.

Both partnerships are giving GSK’s R&D teams access to huge amounts of genetics data and, over the last three years, Ramsey has overseen a big data analytics IT project to ensure the organisation’s scientists can process and analyse this information as efficiently as possible.

This work has centred on the creation of GSK’s Hadoop-based R&D Information Platform (RDIP), which Ramsey describes as a centralised hub where all the data – not just genetic – that the R&D team needs to aid the discovery of new medicines resides. 

“There was not a platform like this in GSK before, so we have established a completely new environment where we have integrated an ecosystem of a couple of dozen technologies, so we can put the ability to do really advanced analytics into the hands of our scientists,” says Ramsey.

Cross-organisational structure

Ramsey joined GSK as CDO in July 2015, when he was tasked with transforming the way GSK’s R&D organisation uses data by helping its members to derive as much value and insight from it as possible.

“We started by assessing the maturity level of the organisation, and establishing the roadmap we needed to put in place to help that transformation,” he says.

“So we started by establishing an information platform that would act as a central data and analytics hub, paving the way for RDIP’s creation.”

During Ramsey’s first six months at the firm, the team overseeing this work grew to 25 individuals, and the first iteration of RDIP that they worked on collectively went live in June 2016.

“Between then and the end of 2016, our objective was to upload at least 90% of the structured data from across R&D [into the platform] and then the objective for the next six months was to upload at least 90% of the unstructured data,” he says.

“We are talking about thousands of data sources and a good amount of data, so this was quite an aggressive timescale, but the team was able to push forward and get it done.”

Automation played an important part in helping to meet the data ingestion deadline, along with another core piece of technology called StreamSets.

“I think of it as a vacuum cleaner we can attach to all of the data sources within the organisation, and it pulls that into the RDIP data lake in a very structured way,” says Ramsey.

“The next step in the process is to use machine learning and probabilistic matching to take all of those disparate data sources and make them line up to industry-standard ontologies so it can be presented to the scientists and researchers in a way they can easily make use of.”
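
As a rough illustration of the matching step Ramsey describes, the sketch below lines up raw field names from different sources against a small list of standard terms. It is not GSK's implementation: the ontology terms, the threshold and the function names are all hypothetical, and a plain string-similarity score from Python's standard library stands in for a trained probabilistic model.

```python
from difflib import SequenceMatcher

# Hypothetical target vocabulary. A real platform would load an
# industry-standard ontology rather than a hard-coded list.
ONTOLOGY_TERMS = ["gene symbol", "compound identifier", "assay result", "patient age"]


def normalise(name: str) -> str:
    """Lower-case a field name and turn underscores and hyphens into spaces."""
    return name.lower().replace("_", " ").replace("-", " ").strip()


def best_match(field: str, terms=ONTOLOGY_TERMS, threshold: float = 0.6):
    """Return (term, score) for the closest term, or (None, score) if below threshold.

    The score is a string-similarity ratio between 0 and 1, used here as a
    simple stand-in for a match probability.
    """
    cleaned = normalise(field)
    scored = [(term, SequenceMatcher(None, cleaned, term).ratio()) for term in terms]
    term, score = max(scored, key=lambda pair: pair[1])
    return (term, score) if score >= threshold else (None, score)


def map_fields(fields):
    """Map each raw field name to its best ontology term (or None if unmatched)."""
    return {field: best_match(field)[0] for field in fields}


if __name__ == "__main__":
    for raw, term in map_fields(["Gene_Symbol", "patient-age", "zzzz"]).items():
        print(raw, "->", term)
```

Fields that score below the threshold come back unmatched, so they can be reviewed by a person rather than forced onto the wrong term.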

The importance of broadness

While genetics data is a rich information source for the R&D team, from the outset, RDIP was designed with a broad range of data types and use cases in mind, for a mix of competitive and practical reasons, says Ramsey.

“From the beginning, we really looked at how this information could be shared across the entire R&D organisation, and not just use it to solve a very specific use case, like genetics,” he says.

“Our aim was to open up the ability to collaborate and do analytics across the entire R&D process, which is something a lot of pharmaceutical companies struggle with, as everything is designed programme by programme.”

The problem with this approach is that the resulting data analytics platform can become so tightly tailored to address that particular use case that achieving the same thing in another area of scientific interest can be problematic, says Ramsey.  

“A number of other pharmaceutical companies have gone very narrow and deep, for example, and built solutions just for genetics or for different parts of the business, so it prevents them from getting this longitudinal or cross-R&D view,” he says.

“What they end up doing is painting themselves into a corner. They are really successful for one use case, but they can’t go from that one use case to 10 use cases or even two or three, because they have specialised so much in the way the platform is developed.”

The fact that the platform is designed to serve the needs of as broad a range of R&D use cases as possible is reflected in the way RDIP is built and the wide variety of big data workloads it can process, says Ramsey – thanks to its extensive use of Docker Enterprise containers.

RDIP itself is an on-premise platform, hosted within GSK’s own datacentre for performance reasons, and using Docker Enterprise allows the team to run multiple workloads in various configurations off the same physical hardware, says Ramsey.

“It gives us the ability to really maximise the different environments, and accelerate development because what a lot of folks don’t recognise is that in some cases, some solutions require specific versions of software in order to operate correctly,” he says.  

“If you have two of those kinds of environments and they require different software stacks, it makes things extremely difficult because you end up needing separate physical clusters.

“Docker gives us the ability to create those virtual environments and create those different combinations of the software and fit them all on the same hardware, so it really gives us a benefit in hardware productivity.”

GSK’s R&D team is also using Docker Enterprise to containerise workloads and move them around, depending on whether the data inside needs to be processed in a GPU-based or HPC environment.

“We look at the type of analytics or type of simulation being done, and move the workloads to the right underlying environment in order to support the best execution of those types of analytics,” says Ramsey.

“Some things are better suited to run in a more HPC model, whereas some you can get significant benefits from by running within a GPU environment. So that’s the other place here where Docker becomes interesting, because you can containerise the workload and move things around based on the characteristics of the type of analytics that is running.”
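
The routing decision Ramsey outlines can be pictured as a simple rule that looks at what kind of analytics a containerised workload performs and picks an environment for it. The sketch below is an assumption-laden illustration only: the workload kinds, the set of GPU-friendly kinds and the function names are hypothetical, and it does not call Docker or any scheduler.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    kind: str  # hypothetical label, e.g. "deep_learning" or "simulation"


# Hypothetical rule: these kinds of analytics are assumed to benefit from GPUs.
GPU_KINDS = {"deep_learning", "image_analysis"}


def choose_environment(workload: Workload) -> str:
    """Return "gpu" for GPU-friendly analytics and "hpc" for everything else."""
    return "gpu" if workload.kind in GPU_KINDS else "hpc"


def plan(workloads):
    """Group workload names by the environment chosen for each one."""
    placement = {"gpu": [], "hpc": []}
    for workload in workloads:
        placement[choose_environment(workload)].append(workload.name)
    return placement


if __name__ == "__main__":
    jobs = [
        Workload("variant-model-training", "deep_learning"),
        Workload("molecular-dynamics", "simulation"),
    ]
    print(plan(jobs))
```

Because each workload is packaged as a container, the same image could in principle be started in whichever environment the rule selects.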

Counting the cost of clouds

As for why GSK has opted for an on-premise setup for RDIP, Ramsey says the data volumes it handles, coupled with the amount of processing it requires, would make it too expensive to run the setup exclusively in the public cloud.

But that is not to say GSK does not use cloud at all – Ramsey is quick to point out that RDIP has “hybrid” and “cloud-bursting” capabilities built in.

“Cloud for us is just another environment to deploy to,” he says. “But one of the things that is misunderstood in the market is that cloud is a click-charge type of environment.

“You pay for the usage and, in many cases, if you’re going to have a large amount of persistent data or have very intense analytic loads, it is much more expensive to do those in the cloud than it is to have a physical, on-premise solution.”

There are exceptions where the converse is true, however. For example, Ramsey describes a hypothetical, variable workload involving a small quantity of data that requires tens of thousands of nodes of compute power on an irregular basis and for a relatively small amount of time.

“That type of variable workload is well suited to the cloud environment because you don’t want to take the [financial] hit of having that type of environment in place longer term if you don’t need it on an ongoing basis,” he says.

If GSK had not already possessed its own datacentre estate to run the platform out of, the situation could have played out differently, but it is important for enterprises to realise that public cloud is not always the cheapest way to run things – particularly where big data workloads are concerned.

“In many cases, the cloud is actually four to five times more expensive to do the same type of analysis than looking at the fully loaded cost of running an on-premise environment,” says Ramsey.

“Cloud can be easier and faster, potentially, to get started, but there is a point where you have to look at the persistency of data, and – if it’s something you’re going to be doing on an ongoing basis – then you really have to scrutinise the costs.”
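
The trade-off between pay-per-use cloud and a fixed-cost on-premise setup comes down to a break-even calculation. The sketch below uses entirely hypothetical figures (an hourly cloud rate of 10 and a fully loaded on-premise cost of 20,000 a year, in arbitrary currency units) to show how a persistent workload and a short, bursty one can land on opposite sides of that line. None of the numbers come from GSK.

```python
def cloud_cost(hours_used: float, rate_per_hour: float) -> float:
    """Pay-per-use cost: the bill grows with every hour consumed."""
    return hours_used * rate_per_hour


def break_even_hours(fixed_annual_cost: float, rate_per_hour: float) -> float:
    """Annual usage at which cloud spend equals the fixed on-premise cost."""
    return fixed_annual_cost / rate_per_hour


def cheaper_option(hours_used: float, rate_per_hour: float, fixed_annual_cost: float) -> str:
    """Compare one year of cloud usage against the fixed on-premise cost."""
    if cloud_cost(hours_used, rate_per_hour) < fixed_annual_cost:
        return "cloud"
    return "on-premise"


if __name__ == "__main__":
    RATE = 10.0        # hypothetical cloud rate per hour
    FIXED = 20_000.0   # hypothetical fully loaded on-premise cost per year

    print(break_even_hours(FIXED, RATE))      # hours per year at which costs match
    print(cheaper_option(8_760, RATE, FIXED))  # persistent load, running all year
    print(cheaper_option(100, RATE, FIXED))    # short, irregular burst
```

Under these assumed figures, a workload that runs around the clock is cheaper on-premise, while one that needs only a short burst of compute is cheaper in the cloud, which mirrors the distinction Ramsey draws.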

Weighing up the benefits of RDIP

RDIP has been in place for a while now, and the team supporting it has grown to about 100-strong. The productivity benefits it has brought to GSK’s R&D team are sizeable, says Ramsey.

“Before we established this environment, it took three to four days to do one [granular-level genetic] analysis and we had hundreds of thousands of them queued up, and now we can do the same analytics in less than 30 minutes,” he says.

“So moving from four days to 30 minutes enables the scientists to actually get in and really understand the genetic variants.”

The platform has also boosted the ability of GSK’s R&D teams to collaborate and share the results of any experiments they embark upon, as well as compare their findings with what others working in the same field may have come up with years before.

In many cases, this has had the added benefit of enabling its scientists to streamline their own workloads and processes, says Ramsey. “They now understand what experiments have already been conducted, so they don’t have to repeat things in order to progress,” he adds.

The platform is also helping GSK to shift the needle in improving the outcomes of its clinical trials, so that the drugs it tests stand a better chance of going on to become new medicines, because it is so much easier now for its scientists to access data from past experiments, for example.

“This means that when they are designing a new clinical trial, they can do a better job of making sure it is done as effectively as possible,” says Ramsey. “They can also look at all the results and characteristics of things that have happened in the past as a way to help influence the discovery of a new medicine.”

And there are other markers of success for the project, with Ramsey pointing out that GSK’s president of R&D, Hal Barron, declared in July 2018 that genetic data analytics will form the starting point for all new medicine developments at the firm.

“I think that is very exciting, that we are now setting some specific objectives for R&D that are focused on data and analytics,” says Ramsey, “and we are on that journey where we are starting to see the benefits, and can see how this will change the way drug discovery is embarked on in future.”
