Soda pops new sparkle into data engineering quality
You know soda: it’s what Americans call fizzy pop, meaning Coke, Pepsi (other colas are available) and all manner of carbonated soft drinks.
No, not that soda, we mean Soda, the provider of data reliability tools and cloud observability platform technologies.
Yeah, you know this Soda, because the company had a booth at the recent Snowflake Summit conference and exhibition – possibly to show that no data warehouse platform these days is complete without some sparkling, fizzing data quality and reliability tooling poured on top. Again, other Snowflake strategic partners are available.
The firm has now announced the general availability of Soda Core, an open source framework for data engineers to embed data reliability checks and quality management into data pipelines.
Data Engineering As-Code
Powered by SodaCL (Soda Checks Language), described as the first Domain-Specific Language (DSL) for data reliability, Soda Core introduces data engineering as-code (Ed: should that be Data Engineering As-Code or DEAC?) practices intended to create broad check coverage, eliminate data downtime and simplify the tasks of detecting and resolving issues across the data product lifecycle.
In most data teams, data engineers are responsible for building systems and pipelines to ingest, model and deliver reliable data products to the business. Once in production, these products need constant attention to address changes to data schemas and structures, broken transformation logic and concept drift, all of which affect reliability, quality and trust in the data. The challenge for data engineers is that they have to fix these data issues manually and at scale, without the tools, processes and expertise that would enable them to create more reliable, higher-quality datasets.
Soda Core introduces a free, open-source framework for data engineers to build and maintain data checks as-code at scale, across every data workload, from ingestion to transformation to consumption.
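To make the checks-as-code idea a little more concrete, here is a minimal Python sketch of data checks that live alongside pipeline code and stop bad data moving downstream. It is an illustration only and does not use Soda Core or SodaCL; the names CHECKS, run_checks and load_step, and the id and email columns, are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]

# Hypothetical checks, kept in version control next to the pipeline code.
# Each check returns True when the data passes.
CHECKS: Dict[str, Check] = {
    "row_count > 0": lambda rows: len(rows) > 0,
    "no missing email": lambda rows: all(
        r.get("email") not in (None, "") for r in rows
    ),
    "unique id": lambda rows: len({r.get("id") for r in rows}) == len(rows),
}


def run_checks(rows: List[Row], checks: Dict[str, Check] = CHECKS) -> List[str]:
    """Return the names of the checks that failed, in definition order."""
    return [name for name, check in checks.items() if not check(rows)]


def load_step(rows: List[Row]) -> List[Row]:
    """Hypothetical pipeline step: refuse to pass data on if any check fails."""
    failed = run_checks(rows)
    if failed:
        raise ValueError("data checks failed: " + ", ".join(failed))
    return rows
```

Because the checks are ordinary code, they can be reviewed, versioned and run at any stage of a pipeline, which is the general pattern the as-code label points to.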
The ‘shape & health’ of data
Soda Core offers a library of tools for data reliability. Its core components include dataset metadata, used to understand ‘the shape and health’ of the data, plus built-in metrics and broad check coverage that can be used to validate a wide range of data quality parameters.
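As a rough idea of what the shape and health of a dataset might mean in practice, the Python sketch below profiles a list of records for row count, column names and missing values per column. This is an assumption about the kind of metadata involved, not Soda Core's implementation; the function name profile is hypothetical.

```python
from typing import Dict, List, Optional


def profile(rows: List[Dict[str, Optional[object]]]) -> Dict[str, object]:
    """Summarise basic shape (rows, columns) and health (missing values)."""
    # Shape: every column name seen in any row, in a stable order.
    columns = sorted({col for row in rows for col in row})
    # Health: a value counts as missing if the key is absent or set to None.
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }
    return {"row_count": len(rows), "columns": columns, "missing_count": missing}
```

A profile like this can then feed checks such as a minimum row count or a maximum number of missing values in a given column.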
“This first public release of Soda Core and SodaCL is one of the most important milestones in our journey so far, giving data engineers the framework and language to get started and scale with reliability engineering and data quality management,” explains Tom Baeyens, CTO and co-founder of Soda. “We realised early on that when it comes to data quality, the needs of engineers are quite different when compared with the needs of the data team as a whole. A lot of people in a data team know what good data looks like but only a few can code the checks. With our releases today, we are providing the tools to remove the bottlenecks that exist around coding data reliability, enabling data engineers to build data quality checks-as-code directly into their pipelines and fundamentally change how teams set up and maintain reliable, high-quality data products.”
With Soda Core, data can be tested and validated against both fixed thresholds and dynamic ones, such as change-over-time and anomaly detection, as part of an end-to-end workflow that helps detect and resolve issues and automatically alerts the right people at the right time.
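The difference between fixed and dynamic thresholds can be sketched in a few lines of Python. The example is illustrative only: the function names are hypothetical, and the simple z-score test is an assumed stand-in for anomaly detection, not a description of the method Soda uses.

```python
from statistics import mean, stdev
from typing import Sequence


def fixed_check(value: float, minimum: float) -> bool:
    """Fixed threshold: pass if the metric meets a constant minimum."""
    return value >= minimum


def change_over_time_check(
    current: float, previous: float, max_change_pct: float
) -> bool:
    """Dynamic threshold: pass if the metric moved by at most max_change_pct
    percent compared with the previous measurement."""
    if previous == 0:
        return current == 0
    change_pct = abs(current - previous) / abs(previous) * 100
    return change_pct <= max_change_pct


def anomaly_check(
    current: float, history: Sequence[float], max_z: float = 3.0
) -> bool:
    """Dynamic threshold: pass if the metric lies within max_z standard
    deviations of the historical mean (a simple z-score test)."""
    if len(history) < 2:
        return True  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current == mu
    return abs(current - mu) / sigma <= max_z
```

A fixed threshold has to be chosen up front, whereas the two dynamic checks derive what counts as normal from earlier measurements of the same metric, such as yesterday's row count.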
Alerts and notifications can be created using a preferred ticketing or on-call system. Extending Soda Core with a Soda Cloud account allows notifications to be routed through to the right people and lets less technical users get involved by adjusting thresholds or adding new checks altogether.
The road to ‘good’ data
Also released today, SodaCL replaces the resource-intensive need to code checks in SQL with one language that is writable and readable by almost anyone, meaning that everyone on a data team can define the thresholds of what good data needs to look like.
SodaCL provides a language foundation that will evolve over time to address business-specific issues across multiple domains, such as Asset Management, Supply Chain and Customer Data.
The first iteration of SodaCL delivers checks-as-code for testing and monitoring data from ingestion through to transformation, with over 30 built-in metrics and check types available to validate a wide range of data quality parameters and generate value immediately.