Data engineering - DataStax: Building the Gen-AI stack, how to plan ahead

This is a guest post for the Computer Weekly Developer Network written by Dom Couldwell, head of field engineering for EMEA at DataStax.

Couldwell writes in full as follows…

In certain circles – the majority of CEOs and many developers – there is a lot of excitement around generative AI.

Over the next year, we should see these applications move from pilot projects into real world production deployments.

But delivering any kind of AI experience relies on data – and of course, data engineering.

In turn, that data depends on having the right infrastructure to get it ready for use with gen-AI.

RAG defined

Getting data ready for gen-AI involves creating vector embeddings of any records or data sets that you want to use. Vector embeddings are numerical representations of data that allow it to be compared semantically. When a user request comes in, the request is turned into a vector and used to search your vectorised company data for semantically similar results. Those results are then passed to your LLM along with the request, and the LLM delivers a response. This approach is termed retrieval augmented generation, or RAG – it’s becoming a standard for any AI application.
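The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model (it matches on shared words, not on meaning), the in-memory list stands in for a vector database, and `build_prompt` is a hypothetical helper that shows what would be handed to an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Hypothetical helper: combine retrieved context and the user request."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


documents = [
    "Returns are accepted within 30 days of purchase.",
    "The X200 router supports Wi-Fi 6 and has four Ethernet ports.",
    "Our support desk is open on weekdays from 9am to 5pm.",
]
# Vectorise the company data once, ahead of any user request.
index = [(doc, embed(doc)) for doc in documents]

question = "How many Ethernet ports does the X200 router have?"
context = retrieve(question, index, k=1)
prompt = build_prompt(question, context)  # this string would be sent to the LLM
```

In a production system the embedding function and the index would be replaced by a real model and a vector store, but the three steps (vectorise, search, pass results to the LLM) stay the same.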

How to keep your data relevant

From a data engineering perspective, there are some challenges around data and AI – the first is data quality.

Developers are not normally responsible for the data that they have to work with – however, if that data is not clean to start with, it will affect the quality of any responses. Understanding at the start what data you have – anything from specific sets of PDFs for products through to huge catalogues of data – can help. This will impact the rest of your data infrastructure and, in turn, how you design your systems.

Funky (data) chunking

As an example, data chunking is how you chop up the data before it’s converted into vectors.
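As a minimal sketch of one common chunking strategy, the function below splits text into fixed-size windows of words that overlap slightly, so that a sentence cut at a boundary still appears whole in at least one chunk. The function name and the `size` and `overlap` parameters are illustrative assumptions, not a recommendation from any particular tool.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a chunk has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, size=4, overlap=1)
# chunks == ["w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"]
```

Other strategies split on sentences, paragraphs or document structure instead; which works best depends on the data, which is why it is worth being able to change this step easily.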

Without the right tooling for chunking, your search results will suffer. Alongside this, you will need to decide on your search approach – should you go with a pure vector search model, or combine it with additional metadata to perform a hybrid search? Lastly, how much data do you have stored – will your vector data be enough, or do you need to consider creating graphs of your data as well? All these decisions will affect your application performance around AI responses.
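To illustrate the difference between pure vector search and the metadata-assisted hybrid approach mentioned above, here is a small sketch. The record layout, the hand-written three-dimensional vectors and the `hybrid_search` function are all assumptions made for the example; a real vector database would expose its own filtering syntax.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0


def hybrid_search(query_vector, records, filters, k=3):
    """Keep only records whose metadata matches every filter,
    then rank the survivors by vector similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:k]]


records = [
    {"id": "router-manual-en", "vector": [0.9, 0.1, 0.0],
     "metadata": {"lang": "en", "type": "manual"}},
    {"id": "router-manual-de", "vector": [0.95, 0.05, 0.0],
     "metadata": {"lang": "de", "type": "manual"}},
    {"id": "returns-policy-en", "vector": [0.1, 0.9, 0.0],
     "metadata": {"lang": "en", "type": "policy"}},
]
query_vector = [1.0, 0.0, 0.0]

# An empty filter behaves like pure vector search.
pure = hybrid_search(query_vector, records, filters={}, k=1)
# Adding a metadata filter restricts the candidates before ranking.
hybrid = hybrid_search(query_vector, records, filters={"lang": "en"}, k=1)
```

Here the pure search returns the German manual because its vector is marginally closer, while the filtered search returns the English one, which is the kind of difference that metadata can make to relevance.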

Creating a set of vector data is one thing, but your data set might change over time. You don’t want to have old or out-of-date information in any responses that you serve up to users. While RAG can improve the relevance of your responses, it is only as good as the quality of the data in your vector database. Updating that set of vector data over time requires its own pipeline.

To implement this, data developers and engineers can use data streaming to bring data in.

The theory is that new relevant data can be streamed across for conversion into new vector embeddings alongside existing values. However, this pipeline needs to be managed in its own right, and companies like Unstructured are emerging to fill that gap.
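One way to picture such an update pipeline is as a consumer of change events that re-embeds a record only when its content has actually changed. The sketch below assumes a hypothetical event shape of `(operation, record_id, text)`, uses a plain dictionary in place of a vector database, and uses a placeholder `embed` function in place of a real embedding model; none of these come from a specific streaming product.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint of the source text, used to detect real changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed(text: str) -> list[float]:
    """Placeholder for a call to an embedding model (assumption)."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]


def apply_events(store: dict, events) -> dict:
    """Apply a batch of change events to the vector store.

    store maps record_id -> {"hash": str, "vector": list[float]}.
    Each event is a tuple (operation, record_id, text).
    """
    stats = {"embedded": 0, "skipped": 0, "deleted": 0}
    for operation, record_id, text in events:
        if operation == "delete":
            if store.pop(record_id, None) is not None:
                stats["deleted"] += 1
            continue
        if operation != "upsert":
            raise ValueError(f"unknown operation: {operation}")
        digest = content_hash(text)
        existing = store.get(record_id)
        if existing is not None and existing["hash"] == digest:
            stats["skipped"] += 1  # unchanged content, keep the old vector
            continue
        store[record_id] = {"hash": digest, "vector": embed(text)}
        stats["embedded"] += 1
    return stats


store = {}
first = apply_events(store, [
    ("upsert", "p1", "Price: 10 GBP"),
    ("upsert", "p2", "Price: 20 GBP"),
])
second = apply_events(store, [
    ("upsert", "p1", "Price: 10 GBP"),  # unchanged, so skipped
    ("upsert", "p2", "Price: 25 GBP"),  # changed, so re-embedded
    ("delete", "p1", None),             # removed from the source system
])
```

Skipping unchanged records matters because embedding calls are usually the expensive part of the pipeline, and handling deletes is what stops out-of-date information lingering in responses.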

Planning ahead

So what does this mean for developers around data engineering?

The challenge is how to integrate these different sources of data and keep them running. At the same time, developers don’t want to spend time looking after databases of information or vectors if they can help it. Couple this with the challenges of adding new features or further sources of data, and running gen-AI applications looks like a significant infrastructure headache.

The solution here is to automate the integration side using platforms that provide integrations with best-in-class tooling, rather than building everything yourself. Tools like Langflow make it easier to build applications based on components that can be linked together through APIs, while the back-end systems can be managed separately. By taking this composable approach, developers can concentrate on building applications rather than managing infrastructure. We should bring AI as a service to developers, rather than turn every developer into an expert data engineer.
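The composable idea can be illustrated without reference to any particular platform. In the sketch below, each stage of an application is a plain function with an agreed signature, so one stage can be swapped without touching the others. The `Pipeline` class and all of the component functions are hypothetical names made up for this example and do not represent Langflow's API.

```python
from dataclasses import dataclass, replace
from typing import Callable

# Each component is just a callable with an agreed signature.
Chunker = Callable[[str], list[str]]
Retriever = Callable[[str, list[str]], list[str]]
Generator = Callable[[str, list[str]], str]


@dataclass(frozen=True)
class Pipeline:
    chunker: Chunker
    retriever: Retriever
    generator: Generator

    def answer(self, question: str, document: str) -> str:
        chunks = self.chunker(document)
        context = self.retriever(question, chunks)
        return self.generator(question, context)


def sentence_chunker(document: str) -> list[str]:
    return [s.strip() for s in document.split(".") if s.strip()]


def paragraph_chunker(document: str) -> list[str]:
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def overlap_retriever(question: str, chunks: list[str]) -> list[str]:
    """Pick the chunk sharing the most words with the question (toy retriever)."""
    words = set(question.lower().split())
    if not chunks:
        return []
    return [max(chunks, key=lambda c: len(words & set(c.lower().split())))]


def echo_generator(question: str, context: list[str]) -> str:
    """Stand-in for an LLM call: shows what the model would receive."""
    return f"Q: {question} | context: {' / '.join(context)}"


document = (
    "The X200 has four ports. Returns are accepted within 30 days."
    "\n\nSupport is open on weekdays."
)
question = "How many ports does the X200 have?"

base = Pipeline(sentence_chunker, overlap_retriever, echo_generator)
# Swap a single component; the rest of the application is unchanged.
variant = replace(base, chunker=paragraph_chunker)

base_answer = base.answer(question, document)
variant_answer = variant.answer(question, document)
```

Running both pipelines against the same question makes it straightforward to compare how a change in one component, here the chunker, alters the context that reaches the model.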

The other benefit from this approach is that it makes it easier to review and test new projects.

With so much investment in new models and capabilities, you will want to test out different LLMs for specific use cases. What works for a general chatbot using gen-AI might not be suitable for an app that uses video or image responses, for example. Similarly, developers might want to look at small language model (SLM) options rather than using the same LLM for everything. Or they might want to look at different data chunking and management options for vector creation, to improve the relevance of vector search results.

Knowing how different components work in the context of an AI application is the next big hurdle for developers.

Being able to test services and swap out components without having to redevelop the whole application from scratch each time will significantly improve developer productivity. By looking at the results you deliver and how your data engineering approach supports these results over time, you can make the right decisions to scale up your applications.