Computer Weekly series: The age of data engineering
We have entered the age of data engineering.
That’s not quite true. In reality, we’ve been working with data engineering methodologies, tools and processes since the sixties, through the “IBM” PC era and on through the building of networks, the Internet, the cloud model of service-based computing and the new era of AI with its generative fanfares and functions.
But now does feel rather like a new era in terms of the technology industry’s focus on data engineering, perhaps primarily because we now understand the value of each individual datum in the wider march to build intelligence services and automation functions.
CWDN series
So then, we embark upon the Computer Weekly Developer Network data engineering series. This selection of contributed columns and editorials will come from bona fide software engineers and data scientists at all levels who have spent most of their careers caring about the value drawn from the “values” (data pun there) that permeate the applications and services that we now all depend upon.
But what is data engineering?
In simple terms, data engineering involves the creation of computing systems built to collect, manage and transform raw data (sometimes structured, often unstructured) into a usable form. Part of the wider discipline of data science, data engineering professionals are tasked with overseeing the development and management of database architectures and data processing systems.
Heavily focused on key tasks such as data integration, data engineering today (in the age of AI especially) involves identifying areas where automation can create more functional data pipelines that serve live production software applications effectively.
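To make that idea concrete, here is a minimal sketch of one pipeline stage that turns raw, semi-structured input into a usable form. All names and the record shape are illustrative assumptions, not a reference to any specific product or pipeline.

```python
import json

def transform(raw_lines):
    """Parse raw JSON lines and keep only well-formed, complete records."""
    usable = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Drop malformed input rather than failing the whole pipeline
            continue
        # Keep only records that carry the fields downstream apps need
        if "user_id" in record and "event" in record:
            usable.append(record)
    return usable

raw = ['{"user_id": 1, "event": "login"}', 'not json', '{"event": "orphan"}']
clean = transform(raw)
print(clean)  # only the first, well-formed record survives
```

The point of the sketch is the shape of the work: tolerate messy input at the edge, then hand tidy, predictable records to the production systems that depend on them.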
Other aspects of data engineering will involve data deduplication, data verification and data management in order to corral information resources inside appropriate business policy controls as well as privacy and security guardrails – and this needs to be executed in a way that ensures data meets regulatory compliance legislation and governance controls.
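Two of those tasks are simple enough to sketch directly. The following is a hypothetical illustration (field names and the notion of a "required" field are my own assumptions) of deduplicating records on a key and verifying that required fields are present before data moves downstream:

```python
def deduplicate(records, key="id"):
    """Keep the first record seen for each value of the key field."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

def verify(records, required=("id", "email")):
    """Return only records carrying every required field."""
    return [r for r in records if all(field in r for field in required)]

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "a@example.com"},  # duplicate, dropped
    {"id": 2},                            # missing email, fails verification
]
print(verify(deduplicate(rows)))
```

In practice these checks would sit inside governance tooling rather than ad-hoc functions, but the logic being automated is the same.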
Who are data engineers?
Data engineers typically have software engineering qualifications of some form and a background in science-related subjects, but automation means more business analysts and other domain specialists are being brought into the field.
According to IBM, “Data engineering is the practice of designing and building systems for the aggregation, storage and analysis of data at scale. Data engineers empower organisations to get insights in real-time from large datasets. From social media and marketing metrics to employee performance statistics and trend forecasts, enterprises have all the data they need to compile a holistic view of their operations. Data engineers transform massive quantities of data into valuable strategic findings.”
IBM also says that data engineers “govern data management” for downstream use including analysis, forecasting or machine learning.
As specialised computer scientists, data engineers excel at creating and deploying algorithms, data pipelines and workflows that sort raw data into ready-to-use datasets. Big Blue thinks that data engineering is an integral component of the modern data platform and makes it possible for businesses to analyse and apply the data they receive, regardless of the data source or format.
According to DWP Digital, “A data engineer takes raw data, transforms it and stores it in formats appropriate to the use cases. An analogy is the fuel industry. Oil is extracted from a well, transported, refined into different products (diesel, jet fuel, LPG, biofuels) and stored, available for further use. The whole process is monitored, secure and automated, with alerts in place when problems arise. Data engineering is the same concept, with data instead of oil.”
DWP thinks that data engineers need to be problem solvers, in order to be able to process information, present the problem statement and be solution-orientated. Ultimately, good data engineering can lead to the development of “data products” (often accessed via an API) that are optimised for different business departments to use inside what are now sometimes called “data streams” that enable immediate interrogation and support real-time decision-making.
Data product customers can be both internal departments and external entities and organisations.
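As a hypothetical sketch of what a “data product” might look like at its simplest, consider a thin function-as-API over a curated dataset, returning a slice optimised for one business department. The dataset, function name and field names here are invented for illustration only.

```python
# A curated dataset the data engineering team maintains upstream
SALES = [
    {"region": "EMEA", "revenue": 120},
    {"region": "APAC", "revenue": 90},
    {"region": "EMEA", "revenue": 60},
]

def revenue_by_region(region):
    """An API-style accessor a finance team might call to support
    real-time decision-making, without touching the raw data."""
    return sum(row["revenue"] for row in SALES if row["region"] == region)

print(revenue_by_region("EMEA"))  # 180
```

In a real deployment this accessor would sit behind a versioned API and draw on a live data stream rather than a static list, but the contract is the same: a department asks a business question and gets a ready-to-use answer.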
How has data engineering evolved?
According to the GovUK pages on data engineering, “Around the 1970s/1980s the term information engineering methodology (IEM) was created to describe database design and the use of software for data analysis and processing. These techniques were intended to be used by database administrators (DBAs) and by systems analysts based upon an understanding of the operational processing needs of organisations for the 1980s.”
This then evolved: as we entered the early 2000s, data and data tooling were generally held by the information technology (IT) teams in most companies.
“In the early 2010s, with the rise of the Internet, the massive increase in data volumes, velocity and variety led to the term big data to describe the data itself and data-driven tech companies like Facebook and Airbnb started using the phrase data engineer. Due to the new scale of the data, major firms like Google, Facebook, Amazon, Apple, Microsoft and Netflix started to move away from traditional ETL and storage techniques. They started creating data engineering, a type of software engineering focused on data and in particular infrastructure, warehousing, data protection, cybersecurity, mining, modelling, processing and metadata management,” notes the Government Digital and Data Profession Capability Framework.
Beyond ETL, let’s talk
So let’s embark upon this editorial series with some real focus.
The above exposition and introduction should serve as an appropriate taster and scene-setter to clarify what we want from the data engineering cognoscenti, who now have the opportunity to explain the status of this profession: what skills matter most, whether the oft-unloved DBA can now enjoy high status in the IT team, where automation (and let’s not have too much AI, please) can help data workflows and how data engineering differs from, and fits into, the wider realm of data science.