Data engineering - Hycu: Bridging data & infrastructure, the evolving role of data engineers
This is a guest post for the Computer Weekly Developer Network written by Subbiah Sundaram, senior vice president of products at data protection platform company HYCU, Inc.
Subbiah spearheads product management, product marketing, alliances, sales engineering and customer success, with more than 20 years’ experience delivering multi-cloud and on-premises data protection solutions.
Subbiah writes in full as follows…
While data engineers and data scientists collaborate on the same projects, their day-to-day responsibilities differ. Data scientists focus on analysing data, building models and generating insights. Data engineers, on the other hand, ensure that data pipelines run smoothly, that data is properly stored and easily accessible, and that the underlying infrastructure is in place to support that work.
Fiddly fidelity
One of the most important aspects of a data engineer’s job is ensuring data quality, fidelity and accuracy.
This means taking raw information – often from multiple sources – and transforming it, in a consistent and reliable way, into a form that is ready for further analysis. Writing effective ETL (or, more commonly these days, ELT) jobs is at the heart of this process, but it’s hardly the entire story. You also have to think about scaling storage solutions, designing robust data models and planning how to protect sensitive data.
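By way of illustration only (the file, column and table names here are hypothetical and not tied to any particular platform), a minimal extract-transform-load step in Python might look something like this:

```python
import sqlite3
import pandas as pd

def run_etl(source_csv: str, db_path: str) -> None:
    # Extract: pull raw order data from a CSV export (hypothetical source).
    raw = pd.read_csv(source_csv)

    # Transform: enforce consistent types and drop obviously bad rows,
    # so downstream analysts get a predictable schema.
    raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
    raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
    clean = raw.dropna(subset=["order_date", "amount"])

    # Load: write the cleaned table to the warehouse (SQLite stands in here).
    with sqlite3.connect(db_path) as conn:
        clean.to_sql("orders_clean", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    run_etl("orders.csv", "warehouse.db")
```

In an ELT variant you would land the raw data in the warehouse first and push the transformation into SQL, but the same concerns about consistency and schema apply either way.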
With the growing popularity of public cloud platforms, these responsibilities have evolved significantly.
Instead of managing on-premises databases, data engineers must be familiar with cloud data warehouses like BigQuery, Snowflake or Redshift and their associated infrastructures. These platforms allow for tremendous scalability, but they also introduce a level of complexity that was traditionally handled by IT teams on-premises. High availability, disaster recovery and data protection – once mostly IT’s domain – now often reside in the data engineer’s toolkit.
But that’s not all.
Modern data pipelines involve a range of services that go well beyond a simple database or storage bucket.
Beyond buckets
Machine learning workloads, for example, might tap into various PaaS and serverless offerings.
You might have an orchestration tool like Airflow or a streaming component like Kafka. There may be APIs for real-time data processing, or serverless functions that transform data on the fly. When you start stacking these services together, it can feel like you’re assembling a complicated puzzle – one that must remain stable and secure at all times.
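To make that concrete, here is a minimal sketch of how such a pipeline might be wired together in Airflow (assuming Airflow 2.4 or later; the DAG name and task callables are placeholders, not a real pipeline):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Hypothetical placeholder: pull new records from an upstream API.
    print("extracting...")

def transform():
    # Hypothetical placeholder: reshape the extracted records for the warehouse.
    print("transforming...")

with DAG(
    dag_id="daily_orders_pipeline",   # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Run the transform only after the extract has succeeded.
    extract_task >> transform_task
```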
A single misconfiguration in your Terraform template can disrupt an entire pipeline.
Security and data protection are two areas where many organisations assume the cloud provider handles everything. While these providers do take on some responsibilities – such as ensuring their underlying hardware is secure – there’s still a lot that falls squarely on the data engineer’s shoulders.
This is where the concept of the “shared responsibility model” comes into play: the provider safeguards the infrastructure, but you, the tenant, must configure and maintain it properly. If you leave an S3 bucket wide open to the Internet, or if you accidentally delete your production dataset, that’s on you to fix.
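As a small example of what that responsibility looks like in practice, a data engineer might script a guardrail like the following with the AWS SDK for Python (boto3); the bucket name is hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "analytics-raw-landing"  # hypothetical bucket name

try:
    # Public access block settings are the tenant's job to configure,
    # not the cloud provider's.
    config = s3.get_public_access_block(Bucket=BUCKET)["PublicAccessBlockConfiguration"]
    if not all(config.values()):
        print(f"WARNING: {BUCKET} does not block all public access: {config}")
except ClientError as err:
    if err.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
        print(f"WARNING: {BUCKET} has no public access block configured at all")
    else:
        raise
```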
This shift in responsibilities means data engineers need to be prepared for incidents that might have been outside their scope in traditional environments.
Rather than relying on a dedicated IT department or DBA to handle recovery, data engineers are now frequently the ones to build and maintain backup strategies, manage failover processes and monitor system health. On top of that, they need to keep an eye on access controls and identity policies to ensure that only the right people – and services – can manipulate or view sensitive data.
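Again as a sketch rather than a complete backup strategy, even something as simple as switching on object versioning can be the difference between an awkward morning and a lost dataset (the bucket name is hypothetical):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "analytics-curated"  # hypothetical bucket name

# Versioning keeps prior object versions recoverable after an accidental
# overwrite or delete: a baseline safeguard, not a full backup strategy.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)
```

Versioning alone won’t cover every failure mode, of course, but it illustrates how these safeguards now sit in the data engineer’s hands rather than a separate IT team’s.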
Data engineering scope is widening
Ultimately, a data engineer in the cloud era is more than just a pipeline builder.
You become something of a systems architect, a security practitioner and a backup administrator all rolled into one. You’re responsible for knowing the ins and outs of each service you deploy, anticipating potential points of failure and ensuring that your data is both reliable and protected at all times.