Data engineering - Percona: Measure twice, implement once, the art of thinking ahead
This is a guest post for the Computer Weekly Developer Network written by Bennie Grant in his role as chief operating officer at Percona.
Grant writes in full as follows…
The phrase “measure twice, cut once” sums up the importance of planning ahead whenever possible. By understanding the requirements of any given job or project, you can avoid re-work in the future. But how does this adage apply to data engineering?
Data engineering is a developing discipline in which software developers take on more responsibility for the infrastructure and the data that their applications create. There are two reasons for this. First, data is held in higher esteem than ever by businesses and used for more purposes, from being fed into analytics models to predict future trends through to experiments with generative AI. The choices that developers make when building applications affect how useful that data remains over time. And second, the cost to run this infrastructure is under more scrutiny than ever, so any choice around data has to make sense financially as well as architecturally.
What (app) do you really want?
In the past, developers would build applications and then hand them over to IT operations and database administrators to manage. Data scientists and business intelligence teams would get involved on the analytics side, responsible for their own projects. Today, however, developers are responsible for more and more of the infrastructure involved, even if they lack specialist expertise in data management or analytics.
Alongside this change in responsibilities, data has become so valuable that it is being used as soon as it’s generated. Data streaming technologies can take transactions and trigger specific analysis workflows when they meet certain criteria; analytics can take place in real time; new data can create vector embeddings for search and generative AI deployments. Each of these pipelines can have its own associated data infrastructure. Additionally, there are various deployment options to consider: Should you use a public cloud, opt for as-a-service products, run your own cloud-hosted solutions, or deploy within your own data centre environment?
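To make the streaming pattern concrete, here is a minimal Python sketch of the idea: each transaction is inspected as it arrives and a downstream workflow fires when the criteria are met. The field names, the 5,000 threshold and the trigger_fraud_review() function are hypothetical stand-ins, and a production pipeline would consume from a platform such as Apache Kafka rather than a hard-coded list.

```python
# Illustrative sketch only: inspect each transaction as it arrives and
# trigger an analysis workflow when a rule matches. All names and values
# here are hypothetical, not a real Percona or Kafka API.
from typing import Iterator


def transaction_stream() -> Iterator[dict]:
    """Stand-in for a real stream consumer (e.g. reading from Kafka)."""
    yield from [
        {"id": 1, "amount": 120.00, "country": "GB"},
        {"id": 2, "amount": 7800.00, "country": "GB"},
        {"id": 3, "amount": 35.50, "country": "DE"},
    ]


def needs_review(txn: dict) -> bool:
    """The 'certain criteria': here, a simple high-value rule."""
    return txn["amount"] >= 5000


def trigger_fraud_review(txn: dict) -> None:
    """Stand-in for the downstream analysis workflow."""
    print(f"Triggering review workflow for transaction {txn['id']}")


for txn in transaction_stream():
    if needs_review(txn):
        trigger_fraud_review(txn)
```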
In response to this, data engineering involves looking at how the data you create might get used and then implementing the right technology to simplify these tasks over time. Picking the right technology and approach reduces the need for rework, or for clunky integration deployments that developers have to support themselves over time. It also includes considering the costs to run one database over another: the right technology should provide the most predictable (or acceptable) cost increase as the data footprint goes up, while the wrong option will cost disproportionately more as the volume increases.
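As a purely illustrative piece of arithmetic (the per-GB prices below are invented to show the shape of the problem, not real vendor rates), two options that look similar at small scale can diverge sharply as the footprint grows:

```python
# Illustrative arithmetic only: invented per-GB prices showing how two
# pricing models diverge as the data footprint grows.
def flat_cost(gb: int) -> float:
    """Hypothetical option A: a predictable $0.10 per GB per month."""
    return gb * 0.10


def tiered_cost(gb: int) -> float:
    """Hypothetical option B: cheaper at small scale, pricier past 1 TB."""
    return gb * (0.05 if gb <= 1_000 else 0.20)


for gb in (100, 1_000, 10_000):
    print(f"{gb:>6} GB  option A: ${flat_cost(gb):>8,.2f}"
          f"  option B: ${tiered_cost(gb):>8,.2f}")
```

At 100 GB, option B is half the price; at 10 TB, it costs twice as much. The point is not the numbers, but that the comparison only becomes visible if you model growth before you commit.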
In the case of databases, choosing the wrong option can negatively affect your whole data approach across different deployments downstream. This results in unnecessary costs and more work to replatform or update systems later.
Data engineering & avoiding problems
To prevent this kind of issue, developers and data specialists need to work together from the start. It’s important to ask questions around how data will get used over time for analysis or for more advanced AI tasks. This helps identify potential problems before architectures get set in stone and ensures that developers pick the most suitable approach for the long term.
For applications, transactional databases are the most commonly implemented choice. Alongside this, many databases promise ‘multi-model’ support for different forms of data, from transactions and vector embeddings through to more analytical workloads. Using a multi-model approach can be easier than supporting multiple databases for different parts of the pipeline.
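As a sketch of what multi-model can look like in practice (assuming a PostgreSQL server with the open source pgvector extension available; the connection string, table and three-dimensional embeddings are simplified, hypothetical examples), one database can hold both the transactional rows and the vector embeddings used for search or generative AI:

```python
# Sketch only: one PostgreSQL database serving both transactional data and
# vector similarity search via the pgvector extension. Assumes the psycopg
# (v3) package and a running server; the DSN, table and tiny 3-dimension
# embeddings are hypothetical simplifications.
import psycopg

with psycopg.connect("dbname=appdb user=app") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS orders (
                id        bigserial PRIMARY KEY,
                amount    numeric NOT NULL,
                embedding vector(3)  -- toy size; real models use hundreds of dimensions
            )
        """)
        cur.execute(
            "INSERT INTO orders (amount, embedding) VALUES (%s, %s::vector)",
            (99.95, "[0.1, 0.9, 0.2]"),
        )
        # Nearest-neighbour search by L2 distance, in the same database
        # that serves the transactional workload.
        cur.execute(
            "SELECT id, amount FROM orders "
            "ORDER BY embedding <-> %s::vector LIMIT 5",
            ("[0.1, 0.8, 0.3]",),
        )
        print(cur.fetchall())
```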
However, you may want to use different tools where they deliver better performance for a specific workload. Similarly, you may want to consider options like lakehouses for real-time data analytics, where a project like ClickHouse may be a better fit.
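For illustration, here is a minimal sketch of that kind of real-time analytics workload (assuming a local ClickHouse server and the clickhouse-driver Python package; the events table and its columns are hypothetical):

```python
# Sketch only: a minimal real-time analytics table and query in ClickHouse.
# Assumes a ClickHouse server on localhost and the clickhouse-driver
# package; the events table and its columns are hypothetical.
from clickhouse_driver import Client

client = Client(host="localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ts      DateTime,
        user_id UInt64,
        amount  Float64
    ) ENGINE = MergeTree
    ORDER BY ts
""")

# Aggregate fresh data per minute: the kind of scan-heavy query that
# column-oriented analytical databases are built for.
rows = client.execute("""
    SELECT toStartOfMinute(ts) AS minute,
           count() AS events,
           sum(amount) AS revenue
    FROM events
    GROUP BY minute
    ORDER BY minute
""")
print(rows)
```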
Understanding all the different options is hard, because every project or product will talk about its ability to scale or meet performance needs. To cut through this, data engineering involves being precise about what your goals are and how you want to measure performance against them. The biggest challenge with data engineering is how to deliver infrastructure in the most efficient way possible, from transactions through to analytics, and how to get developers to understand the impact of the choices they make. Here, “efficiency” is not about the lowest-cost option for any single task, but about achieving the best and most effective overall deployment.
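One way to be that precise is to turn “meets performance needs” into a number you can test. In the sketch below, run_query() is a hypothetical stand-in for the real workload under test and the 50 ms budget is an invented example, not a recommendation:

```python
# Sketch only: measure p95 latency against an explicit budget.
# run_query() and the 50 ms target are hypothetical examples.
import random
import statistics
import time


def run_query() -> None:
    """Stand-in for the real query or transaction under test."""
    time.sleep(random.uniform(0.005, 0.030))


P95_BUDGET_MS = 50.0
samples_ms = []

for _ in range(200):
    start = time.perf_counter()
    run_query()
    samples_ms.append((time.perf_counter() - start) * 1000)

p95 = statistics.quantiles(samples_ms, n=100)[94]  # 95th percentile
print(f"p95 latency: {p95:.1f} ms (budget: {P95_BUDGET_MS} ms)")
print("PASS" if p95 <= P95_BUDGET_MS else "FAIL")
```

Measured against a goal like this, vendor claims about scale become something you can verify rather than take on trust.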
To paraphrase Albert Einstein, data engineering is about making implementations around data as simple as possible, but not simpler.
By improving our understanding ahead of any big decisions, we can improve performance around data, reduce costs and avoid rework. That is worth the time taken at the start.