In 2015, GSK embarked on a data strategy to address the difficulty of sharing data across its R&D department, caused by fragmented data held in differing formats.

There are around 10,000 scientists in GSK’s R&D operation, but there was very little sharing of data on medicine development and trials.

Before the data strategy, which is now three years old, data from medicine trials and experiments was held in different formats and stored in different places, said Mark Ramsey, who was brought in as chief data officer for GSK’s R&D operation in 2015.

He said some work had been done on traditional data warehousing in the past, with attempts to structure and organise data using technologies like Oracle and Teradata. “But what we were really looking for was something to tackle the problem on a broader scale,” said Ramsey.

“Pharmaceutical companies produce a large amount of data, but it is produced in vertical silos,” he said. “For example, in discovery there is experimental data produced which is used to progress individual new medicines, but there wasn’t really an ability to share that information across the R&D organisation and to use the power of the aggregation of that information to make better decisions.”

GSK recognised this was a constraint, so it recruited Ramsey as chief data officer to define a data strategy for all of R&D, so that information could be used as a strategic asset rather than just for operations.

He started off by identifying where the department was in terms of data use. “I initially did a survey across the entire R&D population of about 10,000 scientists using Competing with Data from MIT, which measures data maturity, and got a very high response rate,” he said.

“In general, the feedback confirmed the hypothesis that people could access the data they created themselves but could not really share.”

He then assessed what had been done in the past to create an integrated information platform. He found there had not really been a focused effort within R&D to share data, and that the technology needed to do so was not in place.

When the organisation is developing medicines, scientists do experiments. Thousands of scientists run experiments on specific compounds as they try to determine whether each is a success. But at GSK, these experiments were all run within individual programmes. “But there is value putting all those experiments together,” said Ramsey.

“Before they start an experiment, they can analyse all the similar experiments already done and get insight from them. The worst case scenario is somebody doing an experiment that has already been done,” he said.

The organisation also runs many clinical trials. Each is designed to achieve specific results, and it either achieves them or it does not. “But if you don’t put all the clinical trials together, you lose the value of that aggregated knowledge.”

Bringing information together

The organisation decided to use Hadoop as the foundation, giving it the ability to bring information from different operational sources together in the right format so that it can start curating and rationalising it. Hadoop is open source software that can store both structured and unstructured data.

The company had to start from scratch. “We put in place a new platform because the technology had not been used at GSK before,” said Ramsey.

GSK then integrated a number of other technologies to bring the data into the platform and rationalise it.

He said the project will never really end because the data team is constantly refining things and finding new use cases. Most of the work was completed in-house at GSK’s global hubs, without the traditional system integrator relationships, but the company does work with a number of smaller specialists in areas such as data science and analytics.

To this end, GSK has built an ecosystem of about a dozen smaller software suppliers to support the platform. These include California-based startup Waterline Data, which provides metadata repository technology. This ensures that once the data is in the platform, GSK can search it and see where information exists and who has used it in the past.

GSK is also looking at using artificial intelligence, with supercomputing technology, in the development of new medicines.