GlaxoSmithKline R&D creates data platform using Hadoop for the internal sharing of scientific data
GlaxoSmithKline is making better use of the data it has about the development and trials of medicines through a Hadoop-based platform
Pharmaceuticals firm GlaxoSmithKline (GSK) has improved its research and development (R&D) capabilities through a programme to enable the sharing of data generated through the development of medicines across the R&D department.
In 2015, GSK embarked on a data strategy to address the challenge it faced in sharing data. There are around 10,000 scientists in GSK’s R&D operation, but very little data on medicine development and trials was shared between them.
Before the data strategy, which is now three years old, all data from medicine trials and experiments was in different formats and stored in different places, said Mark Ramsey, who was brought in as chief data officer for GSK’s R&D operation in 2015.
He said some work had been done on traditional data warehousing in the past, with attempts to structure and organise data using technologies such as Oracle and Teradata. “But what we were really looking for was something to tackle the problem on a broader scale,” said Ramsey.
“Pharmaceutical companies produce a large amount of data, but it is produced in vertical silos,” he said. “For example, in discovery there is experimental data produced which is used to progress individual new medicines, but there wasn’t really an ability to share that information across the R&D organisation and to use the power of the aggregation of that information to make better decisions.”
GSK recognised this as a constraint and recruited Ramsey as chief data officer to define a data strategy across the R&D operation, so that information could be used as a strategic asset rather than just for operations.
He started by identifying where the department was in terms of data use. “I initially did a survey across the entire R&D population of about 10,000 scientists using Competing with Data from MIT, which measures data maturity, and got a very high response rate,” he said.
“In general, the feedback confirmed the hypothesis that people could access the data they created themselves but could not really share.”
He followed this with an assessment of past efforts to create an integrated information platform, and found there had been no focused effort in R&D to share data, and that the technology required to facilitate sharing was not in place.
When the organisation develops medicines, its scientists run experiments, so thousands of scientists are carrying out experiments to determine whether a candidate medicine is succeeding. But at GSK, these experiments were all done within individual programmes. “There is value in putting all those experiments together,” said Ramsey.
“Before they start an experiment, they can analyse all the similar experiments already done and get insight from them. The worst-case scenario is somebody doing an experiment that has already been done,” he said.
The organisation also carries out lots of clinical trials. Each trial is designed around specific results the scientists are trying to achieve, which they either meet or do not. “But if you don’t put all the clinical trials together, you lose the value of that aggregated knowledge.”
Bringing information together
The organisation decided to use Hadoop as the foundation, giving it the ability to bring information from different operational sources together in the right format so it could start curating and rationalising it. Hadoop is an open source framework for storing and processing both structured and unstructured data.
The company had to start from scratch. “We put in place a new platform because the technology had not been used at GSK before,” said Ramsey.
It then integrated a number of other technologies to bring the data into the platform and rationalise it.
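GSK has not published the details of its pipeline, but the general pattern described here, landing heterogeneous source files on Hadoop and rationalising them into a shared, queryable form, can be sketched with PySpark. The paths, formats and column names below are illustrative assumptions, not GSK’s actual data or code.

```python
# Minimal, illustrative PySpark sketch of landing heterogeneous R&D data
# on a Hadoop cluster and rationalising it into one curated table.
# All paths, formats and column names are hypothetical, not GSK's.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rd-data-curation").getOrCreate()

# Source systems land files on HDFS in whatever format they produce.
assays = spark.read.option("header", True).csv("hdfs:///landing/discovery/assays/*.csv")
trials = spark.read.json("hdfs:///landing/clinical/trial_results/*.json")

# Rationalise both feeds into a shared schema so discovery experiments and
# clinical trial outcomes can be analysed side by side.
curated = (
    assays.select(
        F.col("compound_id"),
        F.col("assay_date").cast("date").alias("event_date"),
        F.lit("experiment").alias("source"),
        F.col("result"),
    )
    .unionByName(
        trials.select(
            F.col("compound_id"),
            F.col("completed_on").cast("date").alias("event_date"),
            F.lit("clinical_trial").alias("source"),
            F.col("outcome").alias("result"),
        )
    )
)

# Write the curated view back to the data lake in a columnar format for reuse.
curated.write.mode("overwrite").parquet("hdfs:///curated/rd_events")
```

Once data is held in a single curated table like this, a scientist could, for example, query all prior experiments on a given compound before starting a new one, which is the kind of reuse Ramsey describes.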
He said the project would never really end because the data team is constantly refining things and finding new use cases. Most of the work was completed in-house at GSK’s global hubs, without the traditional systems integrator relationships, although GSK does work with a number of smaller specialists in areas such as data science and analytics.
To this end, GSK has built an ecosystem of about a dozen smaller software suppliers to support the platform. This includes, for example, California-based startup Waterline Data, which provides metadata repository technology. This ensures that once the data is in the platform, GSK can search it and see where information exists and who has used it in the past.
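Waterline Data’s product is proprietary, so the snippet below is not its API; it is only a plain-Python sketch of the kind of metadata record and keyword search such a catalogue makes possible, with hypothetical dataset entries.

```python
# Illustrative sketch of a dataset catalogue lookup. This is NOT the
# Waterline Data API, just a plain-Python model of the idea.
from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    path: str                                   # location of the data on the platform
    description: str                            # what the dataset contains
    tags: list = field(default_factory=list)    # searchable keywords
    users: list = field(default_factory=list)   # who has used it before

catalogue = [
    DatasetRecord(
        path="hdfs:///curated/rd_events",
        description="Curated discovery experiments and clinical trial outcomes",
        tags=["experiment", "clinical-trial", "compound"],
        users=["discovery-analytics", "clinical-stats"],
    ),
]

def search(keyword: str):
    """Return catalogue entries whose description or tags mention the keyword."""
    keyword = keyword.lower()
    return [
        rec for rec in catalogue
        if keyword in rec.description.lower()
        or any(keyword in tag.lower() for tag in rec.tags)
    ]

for rec in search("clinical"):
    print(rec.path, "- previously used by:", ", ".join(rec.users))
```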
GSK is also looking at applying artificial intelligence (AI), backed by supercomputing technology, to the development of new medicines.