Cloudinary developer lead: When (and when not) to use multi-modal LLMs for visual content

This is a guest post by Sanjay Sarathy, VP of developer experience and self-service at Cloudinary, an organisation which helps companies unleash the full potential of their media to create the most engaging visual experiences.

Sarathy wants us to know when (and when not) to use multi-modal Large Language Models (LLMs) for visual content.

As we know, LLMs have proven their worth as writing and coding assistants, though they generally require considerable human oversight. Likewise, it’s risky business to generate visual content from scratch using LLMs. Today LLMs are best used to boost productivity by augmenting creativity, reducing error rates and optimising workflows at scale.

Let’s back up to see why – Sarathy writes in full as follows…

We’re teaching machines to see the world as we do. But it’s early days.

As you may know, the first generation of LLMs was trained only on text data. This gave us ChatGPT, the platform most responsible for bringing LLM adoption into the mainstream.

Next came multimodal LLMs that were trained on a wider range of data sources like images, video and audio clips. This evolution made it possible for them to handle more dynamic use cases such as generating sound, pictures and animations from text-based prompts.

By expanding and integrating a wider range of data sources, gen AI is able to closely analyse and “understand” how we humans experience the world: through our senses. Or, as the Microsoft researchers describe in their introduction to the multimodal LLM Kosmos-1, “The goal is to align perception with LLMs so that the models are able to see and talk.”

How LLMs learn nuance & context

Because multimodal LLMs are trained with such massive, diverse data sets, they are getting much better at conveying nuance and contextualising in response to text and verbal prompts. One stage of training the LLM, for example, might focus on learning to describe images – and crucially, the relationships between objects – in a natural human style. Another might involve fine-tuning a model so that it accurately follows human instructions, delivered in different ways, to transform or correct an image.

As a result, some impressive tools are emerging as we move to multimodal models. Because of how they’re trained, these models can give you a textual description – or caption – in natural language, which is more readable and expressive than a set of tags.

So an LLM could describe an image as “a large white cat sitting on a chair next to a plant” as opposed to a set of unrelated tags like “cat”, “chair”, “plant” and “room”.

Indeed we’ve started building the technology into our own products.

Cloudinary already uses a multimodal LLM to recognise the content of an image and generate a caption. The caption is returned during the upload process and stored as image metadata within the platform, so it can be accessed by screen readers or search engines, for example.
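
To make that concrete, here is a minimal sketch using the Cloudinary Python SDK, assuming the AI content analysis add-on is enabled on the account; the response path used below is an assumption and may differ by API version.

import cloudinary
import cloudinary.uploader

cloudinary.config(cloud_name="demo", api_key="YOUR_KEY", api_secret="YOUR_SECRET")

# detection="captioning" asks the platform to run its captioning model at upload time.
result = cloudinary.uploader.upload("white-cat.jpg", detection="captioning")

# The generated caption comes back in the upload response and is stored as asset
# metadata; treat the exact field names as an assumption to verify against the docs.
caption = result["info"]["detection"]["captioning"]["data"]["caption"]
print(caption)  # e.g. "a large white cat sitting on a chair next to a plant"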

Best use cases for multimodal LLMs today

Optimising images for search engine visibility and ensuring accessibility for all users are great use cases for multimodal LLMs’ ability to boost developer productivity. The output can be highly expressive and paint a detailed picture of an image’s setting, helping developers improve search engine rankings and web accessibility and making visual assets easier to find (asticaVision, Pallyy and CaptionIt offer AI description generators, for example). And you’ve probably found more than once that manually adding image descriptions or alt tags is time-consuming and error-prone.
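
Once a caption exists, wiring it into markup is straightforward. A small, hypothetical helper that reuses an AI-generated caption as an alt attribute – the URL and caption here are placeholders:

from html import escape

def img_tag(url: str, caption: str) -> str:
    # Escaping keeps machine-generated captions safe to embed in HTML.
    return f'<img src="{escape(url, quote=True)}" alt="{escape(caption, quote=True)}">'

print(img_tag(
    "https://res.cloudinary.com/demo/image/upload/white-cat.jpg",
    "a large white cat sitting on a chair next to a plant",
))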

Our own research backs this up. Last year we ran a Global Web Developer Survey on AI adoption, which found that 99% of web developers believe AI tools have the potential to improve the developer experience overall; 64% were already using generative AI tools to streamline the development process and 54% were using them for workflow automation.

You probably suspect a ‘but’ is coming and you’d be right.

Most of us will have seen AI-generated images on social media or even in brand advertising, and you may already have worked with AI image generators like DALL-E and Midjourney to create images or videos from text input. But these have the potential to go badly wrong and put your brand at risk. There are countless examples in the media of LLMs creating images that are offensive, distorted, unnatural, out of context, or all of the above. One of the main challenges of working with machines is to ‘translate’ the concept that the user has in mind into something the machine can understand.

Multimodal LLMs have undoubtedly improved this process. For instance, CLIP from OpenAI can analyse and learn the correlations between what’s happening in images and the text prompts associated with them. However, getting models to do what users expect is still something of a gamble.
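
As an illustration of that image-text alignment (this is not Cloudinary’s own pipeline), here is a minimal sketch that scores candidate captions against an image using the public CLIP checkpoint via the Hugging Face transformers library; the file name and captions are placeholders.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("white-cat.jpg")
captions = [
    "a large white cat sitting on a chair next to a plant",
    "a dog running on a beach",
]

# CLIP embeds the image and each caption into the same space and scores the match.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))  # higher probability = better match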

What is outpainting?

Take outpainting as an example. It’s a technique for extending an image by adding new content that blends seamlessly into the original, preserving its style and detail so the result is cohesive.

Outpainting is a powerful use case that can seriously boost productivity. But it’s essential to provide specific instructions and moderate the output to ensure brand alignment and accuracy. For example, if there is a person in the picture, without instruction the LLM might logically decide to add another person in the extended background. So a human needs to specify exactly what should appear in the extended background.
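
In Cloudinary terms, outpainting maps onto generative fill. A minimal sketch with the Python SDK, assuming generative fill is available on the account; the prompt syntax mentioned in the comment is an assumption to check against the current docs.

import cloudinary
from cloudinary import CloudinaryImage

cloudinary.config(cloud_name="demo")

# Pad a portrait hero image out to 16:9 and let generative fill extend the scene
# into the new canvas instead of leaving flat padding.
url = CloudinaryImage("campaign/hero-portrait").build_url(
    aspect_ratio="16:9",
    crop="pad",
    background="gen_fill",
    # To steer what gets generated, gen_fill can also take an explicit prompt,
    # e.g. background="gen_fill:prompt_bookshelf" - exact syntax may vary by version.
)
print(url)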

While it takes some oversight, outpainting still accelerates time to market tremendously by eliminating the need to directly execute each stage of the transformation for every image. To help balance developer productivity and moderation, our tools, for example, let you tag certain images as ‘requiring moderation’ during upload. Then moderators can go through them in batches and accept or reject them.
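
That workflow looks roughly like this with the Cloudinary Python SDK – a minimal sketch assuming Admin API access; exact response fields are an assumption.

import cloudinary
import cloudinary.api
import cloudinary.uploader

cloudinary.config(cloud_name="demo", api_key="YOUR_KEY", api_secret="YOUR_SECRET")

# 1. Flag the AI-extended asset for a human check at upload time.
cloudinary.uploader.upload("hero-extended.jpg", moderation="manual")

# 2. Later, a moderator pulls everything still awaiting review...
pending = cloudinary.api.resources_by_moderation("manual", "pending")

# 3. ...and accepts or rejects each asset in the batch.
for asset in pending.get("resources", []):
    cloudinary.api.update(asset["public_id"], moderation_status="approved")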

Cloudinary’s Sarathy: Let’s paint the world, digitally.

Of course, there are other challenges in working with LLMs to produce visual content. First, most open source LLMs don’t yet support multimodal input or output, which means you have to look to proprietary models. Second, if you want to work in a language other than English, you may struggle. Third, regulators are only just getting to grips with this new technology and working out how to respond. For example, it could be problematic to use an LLM to create images in the style of an existing artist.

The wizard’s apprentice, not the wizard

For these and the other reasons outlined above, the best course to steer right now for any website owner or ecommerce team lead is to use AI not for image creation but to boost human productivity.

With large-scale visual content management, there’s a lot of tedious, repetitive work to do that’s vital to accessibility and the user experience generally: things like auto-tagging, intelligent cropping, resizing and other transformations. In the context of e-commerce, this includes tasks like applying different kitchen-style backgrounds behind tables in your new campaign hero image.
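
Much of that repetitive work collapses into a single delivery URL. A minimal sketch with the Cloudinary Python SDK; the public ID and dimensions are placeholders.

import cloudinary
from cloudinary import CloudinaryImage

cloudinary.config(cloud_name="demo")

# Content-aware cropping keeps the subject in frame, while automatic format and
# quality selection handle per-device optimisation.
url = CloudinaryImage("products/oak-table").build_url(
    width=800, height=800,
    crop="fill", gravity="auto",
    fetch_format="auto", quality="auto",
)
print(url)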

Undoubtedly, AI-supported image production is the way forward, which is why we’ve recently launched a tool to help brands transform generic imagery into more compelling material using AI approaches.

As it stands, it’s still too risky to jump wholesale into using multimodal LLMs to generate visual media without human oversight – which, to be fair, is about where we’ve all landed with ChatGPT anyhow.