LLM series - SymphonyAI: Core realities for developers coding AI

This is a contributed blog by Vijay Raghavendra in his capacity as CTO of SymphonyAI in line with the Computer Weekly Developer Network’s series discussing software engineering work in the Large Language Model (LLM) arena, in obvious relation to the use of this technology in the generative AI space and beyond.

SymphonyAI builds AI SaaS across what it calls ‘resilient growth verticals’ including retail, consumer packaged goods, finance, manufacturing, media and IT/enterprise service management. 

Raghavendra writes as follows…

When we think about how developers should approach code in the LLM AI space, first off, these software engineers need to consider what LLM they’re actually using and how it was trained. For the most part, developers will be accessing foundational LLMs or fine-tuning foundational LLMs, not training fully new models as that remains cost-prohibitive for all but the largest enterprises.

In terms of fine-tuning, it’s important to think about LLMs in this context: not all the LLMs available on the Hugging Face leaderboard are built the same way, both in terms of number of parameters and in terms of how they perform on various scoring benchmarks. 

Not every model is good at all tasks and some of these different benchmarks can help with making the right decisions on which models to use and when. 

Into the black box

For developers, the closed models i.e. OpenAI’s GPT and Anthropic’s Claude are going to be the black boxes. With those models, you’re relying on OpenAI and Anthropic to have “done the right things” when they trained these models. None of them have fully published their training datasets. Each company makes some assertions about what they’ve trained them on, but you (the developer) can’t really know. When building applications on those models, it’s not fully under your control what data has been accessed to support your application. 

Developers will find that the open models, however, have more transparency about their training data and its provenance. Llama, Falcon and Mistral have published what data went into the training. 

Training brain drain

At a most basic level, developers need to be acutely aware of the models they’re building with, how they were trained and the data that went into them. 

At SymphonyAI, we finely tuned a few different types of models hosted on Microsoft Azure. For our industrial LLM, for example, we finely tuned Meta’s Llama 2 model. We found the open models were much better suited to fine-tuning for specific industry use cases. 

We’re just starting to use Mistral AI’s open-source models as well. For us, it was important to build our knowledge of the developer personas within the verticals who will end up using the model to build applications and copilots. To encapsulate all that knowledge, we had to be able to fine-tune. Llama 2 and the other open models offered much more flexibility for fine-tuning and we expect that to continue to be the case

So then, on the question of which LLM do you start with (and does everything start with a foundation model or not?) which is the most complex, which offers the most support etc? Everything does start with a foundational model, yes. The value and accuracy of the LLM scales linearly, giving the large foundational models a big head start – and until the cost of GPU compute drops precipitously, training LLMs will remain prohibitively expensive.

We’d recommend developers start with the open foundational models as they offer the most granular fine-tuning compatibility. They’re more, although not fully, transparent and you can usually identify the source of any hallucinations by reviewing the training data and parameters. 

When we are asked whether working with LLMs requires a wider grounding in Machine Learning (ML) algorithm knowledge for the average programmer – and if there are shortcuts available here in the form of low-code tools, are they enough? The simple answer is no, modern developers shouldn’t need a specialty background in deep or machine learning to use LLMs. 

In fact, even in the one year since OpenAI launched ChatGPT, I’ve seen a sea change in the types of tools available to developers to get them building on these LLMs. We’re seeing the same abstraction in the space that we saw in application development with no/low code and RPA, we’re just seeing it happen even faster. 

Underlying SDKs & frameworks 

What we’re doing for our developer community is investing heavily in the underlying SDKs and frameworks so that not every developer needs to become an expert in fine tuning.

I do think developers building gen AI applications ought to come from the industries they’re building for, not necessarily AI labs. All developers, even AI ones, should have a fundamental understanding of the business they’re building applications for. Going from proof of concept to in-production-generation AI products is very difficult. You need to know what a day in the life of one of your users will look like to build gen AI applications successfully. 

While we’re on safety, should we be using closed source LLMs or open source LLMs and what’s the essential difference? Even Llama 2 is a “gray box” in some sense. It’s said that even the data scientists and AI researchers who built these models don’t fully understand everything that’s going on inside the model.

Vijay Raghavendra, CTO of SymphonyAI.

Having said that, when using both closed and open LLMs, developers have to pay attention to safety and provide the appropriate hooks and guardrails. When we were building our industrial LLM, we went through a number of scenarios to jailbreak models and to fool them. Google’s research team recently did the same with GPT 4.

At SymphonyAI, we built out an internal “Red Team” that is taking a large set of prompts and scoring responses on different attributes. It’s probing the fine-tuned LLM for problems. Just like we as developers wouldn’t release a product without regimented and rigorous testing, we wouldn’t publish a fine-tuned LLM without doing the same. 

On the subject of AI safety in 2024 and beyond, I expect we’ll also see new and emerging tools getting used more frequently to reduce hallucinations and more clarity from these LLMs. We’re currently using Retrieval Augmented Generation (RAG) with Llama 2, for example.

Users must also plan for and put guardrails in place to protect against prompt injection attacks. This last part is a cat-and-mouse game just like online fraud, social engineering fraud etc. are today.