Auto-tech series - Snorkel AI: Data is the new input for business process automation

This is a guest post for the Computer Weekly Developer Network written in full by Alex Ratner, CEO at Snorkel AI – a company known for its technology which is designed to help developers build enterprise AI faster with programmatic labeling, data-centric development, and now, foundation models.

Now that foundation models and Large Language Models (LLMs) such as GPT-4, BERT and Stable Diffusion are widely available, businesses have entered an era in which they can automate and streamline processes at massive scale.

But the way companies approach software development is changing rapidly, so what do we need to think about next?

Engineering teams still need traditional techniques to build supporting infrastructure, but a new class of programmers is using data as the primary input for building Business Process Automation (BPA) systems.

Ratner writes as follows to explain and clarify…

Data as a programming language

The concept of data as a programming language is deceptively simple.

Feed a neural network a collection of inputs and desired outputs – referred to as labeled training data – and the network will write the ‘code’ to reproduce that transformation. The code, in this case, takes the form of neural network weights – millions or billions of them, depending on your needs and the Machine Learning (ML) model selected.
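
To make the idea concrete, here is a minimal sketch in PyTorch; the toy task, network size and hyperparameters are illustrative assumptions, not anything from Snorkel’s stack. The labeled pairs act as the source code, and training writes the weights:

```python
import torch
from torch import nn

# Labeled training data: toy inputs and the outputs we want reproduced.
X = torch.randn(256, 2)
y = (X[:, 0] + X[:, 1] > 0).long()  # desired output: which side of a boundary

# The "program" is this network; training writes its weights.
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# The learned weights now encode the input-to-output transformation.
print(model(torch.tensor([[1.0, 1.0]])).argmax(dim=1))  # expect tensor([1])
```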

The resulting program can perform more robustly than the traditional rules-based systems that have powered software automation for decades, because the network generalizes to inputs that no hand-written rule anticipated.

This approach offers several benefits.

For one, neural networks are powerful; they can capture far more subtle patterns in data than would be possible with heuristic code. You also have a wide choice of models based on your task, domain, accuracy requirements and cost constraints. Finally, once in production, you can rapidly adapt automation software to changing objectives or performance criteria by producing new training data and retraining the models.

Let’s look at how foundation models/LLMs accelerate this.

The concept of data as a programming language has defined our academic work for years. In 2016, we called it ‘data programming’.

Programming with your data was a strange new idea in 2016; by 2020 it had become a critical and widely accepted concept. But today, with the emergence of powerful foundation models that form the bedrock of how we build AI, data is effectively the only viable way to program AI.

The emergence of large, pre-trained and easily available foundation models enables businesses to benefit from collective human knowledge, represented by Internet data and years of research and development on deep learning algorithms. Engineering teams can take these models – which have already learned numerous relationships in data – and fine-tune them to their needs using their own data.

The central approach is the same: feed the model inputs and desired outputs as labeled training data.

The key to achieving the results you need is using the right data.
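
As a rough sketch of what that fine-tuning loop can look like in practice, here is an example built on the Hugging Face transformers library; the model choice, example texts, labels and hyperparameters are illustrative assumptions:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Your own labeled data: inputs and desired outputs.
train_data = Dataset.from_dict({
    "text": ["Thanks, that solved my problem!", "I want a refund immediately."],
    "label": [0, 1],  # 0 = satisfied customer, 1 = unhappy customer
})

# Start from a foundation model that has already learned language structure...
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

# ...and fine-tune its weights on the task your labeled data defines.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3),
    train_dataset=train_data,
)
trainer.train()
```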

Data as bottleneck

Programming software through data sounds easy, but efforts to do so consistently hit the same bottleneck: labeling and curating the data. Engineers have solved many business problems with Machine Learning (how many teams struggle to predict churn anymore?), but developing data for domain-specific tasks poses an enduring challenge.

What column in a call transcript indicates whether a customer had a bad experience? What pixels on an X-ray tell you that a patient has lung cancer?


This kind of data requires subject matter expertise. While businesses can sometimes hire a swarm of outside contractors to label data, security or privacy concerns, or the need for specialized expertise, often stop that approach in its tracks. Most organizations spend weeks or months labeling thousands of documents and images manually.

Recently, novel approaches such as programmatic labeling and weak supervision have emerged that let data science teams collaborate with subject matter experts to label a large amount of high-quality training data in minutes, as the sketch below illustrates.
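
Here is a minimal sketch of that workflow using the open-source Snorkel library; the task, keyword heuristics and example transcripts are illustrative assumptions. Experts encode their intuition as labeling functions, and a label model combines those noisy, overlapping votes into probabilistic training labels:

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, GOOD, BAD = -1, 0, 1

# Each labeling function encodes one expert heuristic; -1 means "no opinion".
@labeling_function()
def lf_mentions_refund(x):
    return BAD if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_says_thanks(x):
    return GOOD if "thank" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "I want a refund now",
    "Thanks, that fixed it",
    "Still on hold after an hour",
]})

# Apply every labeling function to every example...
L_train = PandasLFApplier(lfs=[lf_mentions_refund, lf_says_thanks]).apply(df=df_train)

# ...then learn how to weigh and combine their votes.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100)
probs = label_model.predict_proba(L_train)  # probabilistic labels for training
```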

Automating away the challenge

Once a business solves a core task through data, automating it is a simple matter of orchestration. Cloud service providers offer tools that trigger processes and pipelines based on relevant business factors – when a file appears, for example, or on a set schedule. Depending on the problem, you may solve the automation side with an old-fashioned cron job.

NOTE: The (above) cron command-line utility is a job scheduler on Unix-like operating systems. Users who set up and maintain software environments use cron to schedule jobs (commands or shell scripts), also known as cron jobs, to run periodically at fixed times, dates or intervals. It typically automates system maintenance or administration, although its general-purpose nature makes it useful for tasks like downloading files from the Internet or fetching email at regular intervals.
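
For illustration, a single crontab entry is enough to run a pipeline on a fixed schedule; the script path and log location below are hypothetical:

```
# Run the classification pipeline every 15 minutes, appending output to a log.
*/15 * * * * /usr/bin/python3 /opt/pipelines/classify_new_files.py >> /var/log/classify.log 2>&1
```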

Using the right training data, ML models and orchestration software is a powerful combination. With AI, you can turn a task that used to take a human 10 minutes into one that runs in less than a second. But you have to start with the data. That’s your primary input to automating tasks.