Data engineering - Jesse Anderson: Recalculating the 'confused' team

This is a guest post for the Computer Weekly Developer Network written by Jesse Anderson in his role as managing director of the Big Data Institute.

In his current position, Anderson mentors companies all over the world ranging from startups to Fortune 100 companies on big data and this includes projects using technologies spanning Apache Kafka, Apache Hadoop and Apache Spark.

Anderson writes in full as follows…

Data engineering is the next big thing, right?

With all of the hype surrounding AI and data, our industry should have this squared away, but we don’t.

In fact, data engineering is the confused team.

The confused IT team

We face several outstanding challenges that vary by company and even by person. They include foundational questions such as:

What is a data engineer?
What do data engineers do?
Should we have all kinds of nuanced titles for variations?
Just how technical should data engineers be?

We’re in good company as the confused team.

We join data science as the “mistaken for” team… and data warehousing as the “no team” in terms of these units’ wider perception across the IT and business functions. Allow me to explain: data scientists have been (and still are) mistaken for people who could create models and do all of the engineering work required (hint: they can’t). And if you ever want to hear how something data-related can’t be done, just go to the no team (data warehouse) and you’ll be told no, it can’t be done. This perception doesn’t bode well for the industry, because it leaves us with the equation confused+mistaken for+no = data projects that don’t go anywhere.

… and that’s the state of the industry.

A tale of two data engineers

Part of our confusion comes from the two quite different definitions of data engineer. One definition is a SQL-focused person. This person can write SQL queries, but that’s the limit. The other definition is a software engineer with specialised knowledge in creating data systems. This person can write code and write SQL queries. More importantly, they can create complex systems for data, whereas a SQL-focused person is totally reliant on less complex systems, often low/no code systems.

Anderson: Data engineering team epiphanies require some profound organisational & technical changes.

You might think that we can easily bridge the gap between the no/low code systems now and LLMs writing code. The current reality is that no/low code systems are still limited. One of the reasons the no team always had to say no was the combination of their skill limitations and their system limitations.

The ability to write code is a key part of being a data engineer who is a software engineer. As complicated requirements come from the business and from system design, these data engineers can create systems at the complexity the solution requires.

Making the changes

Now comes the hard part: actually making the changes.

Companies will look around and find the data scientist who is the best coder (they’re still not a data engineer) or think their SQL-focused people can simply learn how to code (been there, seen it, largely doesn’t work). Building the team is like putting a new technology or methodology in place. If it were easy to create the right data engineering team in the first place, everyone would have done it. Some profound organisational and technical changes are necessary. You’ll have to convince your C-level to fund the team, convince HR that you’ll have to pay them well and convince the business that working with a competent data engineering team can solve the problems.

It’s a tall order, but it is possible.

I know this because I’ve helped turn around a company’s data engineering teams that were on the wrong path. It takes concerted effort and these projects and teams don’t turn themselves around organically. If your data initiative isn’t going well, several issues may be at play. You’ll need to find and sort them out to become successful.

I invite you to join me, take confused+mistaken for+no = data projects that don’t go anywhere and change it to the right people+process+technology = data projects that succeed.