Matillion developer lead: Scaling coding efforts with low-code ETL
The Computer Weekly Developer Network gets high-brow on low-code and no-code (LC/NC) technologies in an analysis series designed to uncover some of the nuances and particularities of this approach to software application development.
Looking at the core mechanics of the applications, suites, platforms and services in this space, we seek to understand not just how apps are being built this way, but also… what shape, form, function and status these apps take… and what the implications are for enterprise software built this way, once it exists in live production environments.
This piece is written by Ian Funnell, manager of developer relations at Matillion – a company known for its cloud data middleware platform designed to enable data integration.
Funnell writes as follows…
High on the list of challenges within modern data analytics is the need to integrate data from a wide and ever-increasing variety of applications, in an ever-growing variety of formats – especially semi-structured ones.
As part of this process, there are things that computers are good at, such as “work out the structure and datatypes in this file and load it into a database” – so, realistically, human operators’ capabilities are wasted if they spend time solving those kinds of problems.
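As a rough illustration of the kind of task that can safely be handed over to a machine, the sketch below infers a file’s structure and loads it into a database. The file path, table name and the choice of pandas with SQLite are assumptions made for the sake of the sketch, not a description of any particular tool.

```python
# Sketch: infer the structure and datatypes of a file and load it into a
# database. The CSV path, table name and SQLite database are hypothetical.
import sqlite3

import pandas as pd

# pandas inspects the file and infers column names and datatypes
df = pd.read_csv("incoming/events.csv")
print(df.dtypes)  # the structure a human would otherwise have to work out

# Create (or replace) a table whose schema mirrors the inferred datatypes
with sqlite3.connect("analytics.db") as conn:
    df.to_sql("raw_events", conn, if_exists="replace", index=False)
```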
That’s where low-code/no-code (LC/NC) presents a perfect solution – it can fulfil these seemingly low-value, tedious tasks quickly, reliably and in a maintainable way.
Then there are things that human operators are good at – such as deciding which files are the most valuable ones to be looking at in the first place. Complex decision-making, value judgements and business prioritisation are problems less suited to computers.
For those tasks that do lend themselves to automation, LC/NC tools essentially provide a structured framework for code generation. Of course, this isn’t anything new; it’s what compilers have always done. In the 1990s such tools were called CASE tools. In the data world, the concept has existed ever since the first person had the idea of using SQL to generate SQL.
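A minimal sketch of that idea, kept deliberately generic: the generator query and table names below are hypothetical, and the catalogue lookup is faked so the snippet stays self-contained, but the shape is the familiar one of metadata driving the generation of SQL.

```python
# Sketch: metadata-driven code generation, in the spirit of "SQL to generate SQL".
# The schema and table names are hypothetical; the catalogue lookup is faked so
# the example runs on its own.
GENERATOR_SQL = """
SELECT 'CREATE TABLE staging.' || table_name ||
       ' AS SELECT * FROM source.' || table_name || ';'
FROM information_schema.tables
WHERE table_schema = 'source';
"""

# Stand-in for the rows the generator query would return from a real catalogue
source_tables = ["orders", "customers", "invoices"]

generated = [
    f"CREATE TABLE staging.{name} AS SELECT * FROM source.{name};"
    for name in source_tables
]

for statement in generated:
    print(statement)  # a tool would execute these rather than print them
```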
Deploying a data-first strategy
In the context of data analytics, the interesting thing is that code is important for end-users who are looking at data (e.g. in reports, dashboards, or output from a machine learning recommendation system) because that’s what shifts the data around. But the code is not the end goal – the real goal is the availability of useful and highly consumable data. It’s completely possible (and regrettably common) for the code to run quickly, in adherence to the specification and with no errors, yet for the resulting data to be incorrect.
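One way to make that risk concrete is to check the data rather than the code. The sketch below – with entirely hypothetical table and column names – reconciles a source total against a reporting total after a load has apparently succeeded.

```python
# Sketch: the load ran without error, but does the data still add up?
# Table and column names are hypothetical.
import sqlite3


def reconcile(conn: sqlite3.Connection) -> None:
    src = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM source_orders").fetchone()[0]
    tgt = conn.execute("SELECT COALESCE(SUM(amount), 0) FROM reporting_orders").fetchone()[0]
    if src != tgt:
        raise ValueError(f"Data drifted: source total {src}, reporting total {tgt}")
```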
That gap between working code and correct data makes data analytics an ideal area in which to leverage LC/NC tools. The code is a mandatory part of the solution, but generating and maintaining the code efficiently using an LC/NC tool clears the way to focussing on the data itself.
It’s why such tools can become a key enabler for businesses to adopt a data-first strategy.
Adopting a declarative mindset
Another way to look at this through a data lens is to consider how LC/NC tools enable a more declarative approach to design. A data-oriented mindset is a declarative mindset, enabling developers to handle complex programs in a compressed form. LC/NC tools free operators from many of the important but mundane distractions that come with hand-coding, so they can make a difference where it matters.
Consider loading a data lake as an example:
A hand-coder might decide they ought to use data partitioning, so they have to get the subfolder layout right for that, remember the syntax of the Python library code, and deal with the orchestration constraints of loading the data as soon as it arrives (or handling it when it’s late) – all while considering error handling. Oh, and there’s probably a business reason for all of this too!
An LC/NC user, on the other hand, only needs to know the location the data is being sourced from, and can choose to tick a box to enable partitioning optimisation.
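For comparison, here is a rough sketch of what the hand-coded side of that trade-off can look like, assuming pandas with pyarrow is available and using entirely hypothetical paths and column names.

```python
# Sketch: hand-coded load into a data lake with date-based partitioning.
# Requires pyarrow; paths and column names are hypothetical.
import pandas as pd


def load_to_lake(csv_path: str, lake_root: str) -> None:
    df = pd.read_csv(csv_path, parse_dates=["event_time"])

    # The partition column drives the subfolder layout, e.g. event_date=2024-01-31/
    df["event_date"] = df["event_time"].dt.date.astype(str)

    # pandas delegates to pyarrow; partition_cols controls the folder structure
    df.to_parquet(lake_root, partition_cols=["event_date"], index=False)

    # Real code would also need handling for data that arrives late, retries
    # and error handling – exactly the distractions described above.
```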
The goal of data integration
One of the enduring myths about data quality is that source data is of poor quality to begin with. In fact, the opposite is true. Source data has perfect quality and is being used to run businesses successfully – it’s just difficult to read. The goal of data integration is to keep that initial high quality and to transform data from multiple sources in a connected way to make it much more consumable overall.
As a consequence, LC/NC tools in the data analytics realm tend to fall into two categories:
- The data colocation aspect of data integration tends to be more mechanical and in the domain of no-code tools.
A data colocation technique such as high watermark loading is conceptually exactly the same in finance as it is in aerospace or healthcare. Data colocation is a vital step along the way, but it’s not valuable on its own, and it can be completed just by turning the handle with minimal fuss. This type of work – copying data from place to place, virtually or physically – is highly declarative. Consequently, these no-code tools tend to have simple interfaces, such as lists of properties or sets of dialogs. (A sketch combining high watermark loading with the integration step that follows appears after this list.)
- Following on from colocation, the data integration steps tend to need more human input and guidance. A join is a join, and a filter is a filter. But both these require the injection of specific business rules at specific points.
Hence data integration tends to be more in the domain of low-code tools. The overall syntax of a join can be machine-generated, but a human is given the opportunity to supply the exact join criteria, according to a business rule. Sophisticated data transformations are built up in a low-code tool by chaining simpler, atomic functions together such that one happens after another. Consequently, the solutions usually end up in a graph-like structure. You can zoom right out to appreciate the overall design, and zoom right in to examine the individual implementation details.
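As a compressed sketch of both categories together: the high watermark extract below is the mechanical, colocation-style step, while the join and the filter carry the kind of business rules a low-code tool would ask a human to supply. The connection, tables, columns and threshold are all hypothetical, not taken from any particular platform.

```python
# Sketch: colocation (high watermark load) followed by integration (a join and
# a filter carrying business rules). All names and the threshold are hypothetical.
import sqlite3


def incremental_load_and_transform(conn: sqlite3.Connection, high_watermark: str) -> None:
    # 1. Colocation: copy only the rows newer than the last high watermark
    conn.execute(
        "INSERT INTO staging_orders SELECT * FROM source_orders WHERE updated_at > ?",
        (high_watermark,),
    )

    # 2. Integration: the join syntax could be machine-generated, but the join
    #    criteria and the filter below encode specific business rules
    conn.execute(
        """
        INSERT INTO reporting_orders (order_id, region, amount)
        SELECT o.order_id, c.region, o.amount
        FROM staging_orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE o.amount > 100
        """
    )
    conn.commit()
```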
Making sense of your strategy
Overall, we can say that low-code/no-code platforms are particularly appropriate for data analytics for two reasons.
- Firstly, because data analytics has its fair share of routine and prosaic tasks that it makes sense for computers to manage, so data teams can concentrate on more valuable and strategic work.
- Secondly, because the goal of data analytics is to make data useful and highly consumable.
Software is still required to operate on the data, but the software itself is not the primary goal – and that’s why it’s a prime candidate for automated generation by an LC/NC tool.