Should we test data like we test software?
This is a guest blogpost by Maarten Masschelein, CEO & Founder, Soda
In the not-too-distant past, software quality was something more often discussed than practised. In fact, Reed Hastings, the founder of Netflix, made his first fortune by pioneering rigorous software testing. Since then, testing has become so ubiquitous that it is almost inconceivable that a serious developer would bring a product to market without using a Test-Driven Development (TDD) approach.
In a parallel movement, the same attitudinal shift is now happening in data quality management. The long-standing assumption that “all data is good data” until it breaks the application isn’t cutting it anymore, and the focus is shifting towards pre-emptive testing and monitoring to ensure data quality. This movement is driven by three independent but interlinked forces.
Firstly, there is the acceleration of automation taking place across the enterprise, powered by data. More products are being built with data as a core input, and once these “data products” are in production they need to be managed on an ongoing basis. Even the most sophisticated machine learning algorithms won’t function correctly if they are fed poor-quality data.
Secondly, as businesses change and evolve in response to fluctuating customer and market demand, so too does their data, often at a rate that teams cannot keep up with. As data passes from team to team, coordinating all the changes becomes more and more challenging, which inevitably degrades the analytical integrity of downstream data products.
And thirdly, there is the emergence of entire companies that run on data: think Amazon, Tesla and Deliveroo. They have shown that it is possible to disrupt markets at scale with data, but doing so brings real challenges in how that data is collected, processed and maintained.
Documented examples where a failure to test and monitor data has caused problems are not hard to find. In the UK, 100,000 people were incorrectly told that they were “extremely vulnerable” to Coronavirus and needed to self-isolate for four months. The speed and complexity of the “test and trace” process meant the NHS had written to people deemed at risk using a dataset containing a large number of errors, including records for people who had already died.
One of Europe’s largest ecommerce sites recently had to retrain its recommendation engine after an outage, only to see an abrupt downturn in sales. The investigation revealed that the retraining had happened right after a major promotion on women’s shoes. The result? An engine that would only recommend women’s shoes.
Both of these examples have one thing in common: they could have been prevented with the right approach to managing data quality. A practice where data quality testing and monitoring is instilled throughout the organisation would at least ensure that these data issues and errors can be identified and resolved. Missing rows can be checked, stale datasets can be quarantined, and newly refined data can be monitored for parameter compliance, but only if an enterprise-wide approach to data quality is adopted, as the sketch below illustrates.
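To make this concrete, here is a minimal sketch of what such pre-emptive checks might look like, written in plain Python with pandas. The dataset, column names, thresholds and file path are hypothetical illustrations, not a prescription; in practice these checks would typically live in a dedicated data quality tool and run automatically as each new batch of data arrives.

```python
# A minimal, hypothetical sketch of pre-emptive data quality checks.
# Column names, thresholds and the source file are illustrative only.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable failures; an empty list means the batch passes."""
    failures = []

    # Volume check: catch missing rows before they reach downstream consumers.
    if len(df) < 10_000:
        failures.append(f"row count {len(df)} is below the expected minimum of 10,000")

    # Freshness check: flag stale extracts instead of loading them.
    if df["loaded_at"].max() < pd.Timestamp.now() - pd.Timedelta(days=1):
        failures.append("latest record is more than one day old")

    # Completeness check: key identifiers must not be null.
    null_ratio = df["customer_id"].isna().mean()
    if null_ratio > 0.01:
        failures.append(f"{null_ratio:.1%} of customer_id values are null")

    # Validity check: parameters must stay within an agreed range.
    if not df["order_total"].between(0, 100_000).all():
        failures.append("order_total contains values outside the 0-100,000 range")

    return failures

new_batch = pd.read_parquet("orders_latest.parquet")  # hypothetical source
problems = run_quality_checks(new_batch)
if problems:
    # Quarantine the batch and alert the owning team rather than loading it.
    raise ValueError("data quality checks failed: " + "; ".join(problems))
```

The point is not the specific checks but the posture: each new batch of data is tested before it is trusted, exactly as a developer tests code before shipping it.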
Just as software is now in each and every department of businesses everywhere, data products are being built and used across entire organisations. Almost everyone now has a stake in data, which is why it has never been more important to understand it, trust it and stay on top of it. The poor software quality that was tolerated 20 years ago would be inconceivable today; in a few years’ time, the lack of data testing and monitoring may seem just as irresponsible.