How Vast Data is simplifying data infrastructure
Vast Data CEO Renen Hallak explains how the company's unified data platform, which combines storage, database and compute capabilities in a single technology stack, can improve the efficiency of analytics and artificial intelligence workloads
When IBM introduced the System/38 back in 1978, it came up with the novel concept of embedding a database in the minicomputer’s operating system, blurring the lines between file systems and databases, and allowing data to be queried easily.
But because the hardware of the time could not keep up with the compute requirements, the concept did not become practical until the System/38's successor, the AS/400, came along a decade later. The AS/400 shipped with the IBM Db2 relational database and enabled customers to continue running the same applications even if the underlying processor architecture was changed.
Today, Vast Data, one of the industry’s fastest-growing data infrastructure software providers, has taken a similar architectural approach with its unified data platform that combines storage, database and compute capabilities in a single technology stack, enabling businesses to reduce data movements and simplify their data pipelines to support analytics and artificial intelligence (AI) workloads.
Following a Series E funding round worth $118m in December 2023, the company announced a record fiscal year earlier this month, going from $1m to $100m in annual recurring revenue within three years of debuting its product.
More recently, the cashflow-positive company cast its sights on the Asia-Pacific (APAC) market, having established a regional headquarters in Singapore to grow its business in the city-state, as well as in key markets such as Australia, Korea, Japan and India, where it is eyeing business from large enterprises such as banks and telcos that are looking to reduce their data footprint and infrastructure costs.
In an interview with Computer Weekly in Singapore, Vast Data’s CEO and co-founder, Renen Hallak, shares more about the company’s roots and the problems it is trying to solve with its data platform, which has contributed to the meteoric rise of one of the industry’s newest tech unicorns.
The idea behind the Vast data platform isn’t exactly new – IBM was probably ahead of its time when it had a similar architecture for the System/38. What brought the company to what it is today?
Hallak: I’m a big believer in ideas whose time has come. A good idea will come back to us over and over again, but when the time is right, that’s when it takes off. I think that’s a big part of why we’ve been successful over the last eight years. We were, at the right time, looking at the right problems and finding new solutions for them.
It’s never about someone sitting in the ivory tower and thinking of a breakthrough. It’s always about how the rules of the game have changed, and the cracks that allow us to try new things that weren’t possible before.
Before we started in 2016, I was head of engineering at XtremIO, which was acquired by EMC. XtremIO was one of the first all-flash systems that was very small and fast, but it was also very expensive. While customers loved the idea of all-flash systems – they are fast, don’t take up a lot of space, are very resilient and easy to use – they hated the fact that they were so small and expensive, which meant that they could only put a very small subset of their data in those systems.
At the same time, they had multiple tiers of storage systems that were bigger and more cost-effective, but much slower, so they had to move data between those different systems to work around these compromises and trade-offs. Everyone was yearning for something simpler.
Eight years ago, the more forward-looking of those organisations saw analytics as something that was going to change the world, so they needed fast access to everything. That meant the old model of storage tiering didn’t work anymore.
That was the idea behind Vast Data. How can we simplify the world? How do we support new analytics and AI workloads by breaking those fundamental trade-offs? So I started a team and hired some good architects and developers. I spoke to potential customers about what we wanted to build, and they all said they would buy it. But when I said that to the team, they looked at me like I was the dumbest person in the room. They said, of course customers would buy it, but it was impossible to build.
We spoke to as many vendors as we could, like Mellanox, which was not yet acquired by Nvidia at the time, about new protocols, and Intel about new media. We took these new building blocks that weren’t available yet and found a way to architect them into something that would break those fundamental trade-offs, at least on the whiteboard.
Then we started to build. We didn’t have the underlying hardware yet, but we built the software first and the hardware became available over the next couple of years. By the end of 2018, we were able to piece everything together, and started selling to customers at the beginning of 2019.
You talked about the challenges of storage tiering, but there are also challenges with ETL (extract, transform, load) and data pipelines that organisations need to grapple with in analytics workloads. Talk to me about some of these other problems that you are trying to solve with your platform.
Hallak: In the early days, it was all storage. We didn’t have anything beyond storage. We had this architecture, but we realised after a few years that the architecture was our invention, not the storage system. The storage system was the first instance that took advantage of this architecture.
After we deployed in 2019, our customers told us they needed features like snapshots, replication, encryption, multi-tenancy and quality of service. We didn’t have any of that, and it took us a few years to add all that functionality. When we didn’t need to worry about storage anymore, we started to focus on other things like building a new type of database.
Traditionally, databases weren't built with AI workloads in mind – they had the same trade-offs that storage systems had. And so, we started to build a database. You mentioned ETL – we found that in the database space, beyond those fundamental trade-offs, there were new ones. For example, in the database world, you always had row-based transactional databases and column-based analytics databases. It's the same data, but you needed to keep it in two forms because, historically, data on hard drives was read sequentially, so you needed ETL processes to coordinate everything.
With random-access media, and with NVMe-over-Fabrics giving every node access to all of the metadata – which is what our architecture allows – you don't have those limitations any more. So you don't need a separate transactional and analytics database with ETLs and copies of data – all of that complexity goes away. We started building a database for AI, allowing our customers to analyse metadata associated with unstructured data. From there, we went on to build a general-purpose database that allows us to do things in ways that weren't possible before.
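To make the two-copy pattern Hallak describes concrete, here is a minimal, hypothetical sketch using SQLite as a stand-in row-based transactional store and DuckDB as a stand-in column-based analytics engine; neither is part of Vast's stack, and the table is invented for illustration:

```python
# Illustrative sketch of the two-copy pattern described above, using SQLite as
# a stand-in row store and DuckDB as a stand-in column store.
# It is not Vast's implementation.
import sqlite3

import duckdb

# 1. Transactions land in the row store, one record at a time.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
oltp.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 50.0)])
oltp.commit()

# 2. A periodic ETL job copies the same records into a columnar store so that
#    analytics queries can scan them efficiently.
rows = oltp.execute("SELECT id, amount FROM orders").fetchall()
olap = duckdb.connect()
olap.execute("CREATE TABLE orders (id INTEGER, amount DOUBLE)")
olap.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# 3. Analytics runs against the second copy, which always lags the first.
print(olap.execute("SELECT sum(amount) FROM orders").fetchone())
```

In the architecture Hallak describes, a single copy held on random-access media serves both the transactional and analytical access patterns, so the ETL job and the second, lagging copy in the sketch above are what disappear.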
Is this like a multimodal database where you can support all kinds of file and data structures?
Hallak: It is, and to give you an example, one of our customers went from using an Apache stack to just using Vast. Historically, let's say you have a Parquet file or object – essentially a table packaged inside a file or object. You would store it on Amazon S3, and then you would need to read it up into a higher layer to do the analysis.
Let’s say you want to run a query – you need to read 2GB (gigabytes) of information to run the query and then you end up with 20 bytes of the result. With Vast, everything is multiprotocol. You can write it as a file and read it as an object, or write it as an object and query it as if it’s a table. You don’t need those higher layers anymore. Everything becomes more efficient and effective. What we have is a combination of a file system, object store and database all put together. It’s not one stacked on top of the other.
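As a rough illustration of that contrast, the hypothetical sketch below first pulls an entire Parquet object down just to filter it locally, then queries the same data as a table so that only the small result travels. The bucket, object key and column names are invented, and DuckDB's filter pushdown is only a stand-in for querying a table where it lives – it is not Vast's API:

```python
# Hypothetical illustration of the two access patterns described above.
# Bucket, key and column names are invented; this is not Vast's API.
import io

import boto3
import duckdb
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Pattern 1: the table is just an object. The client must download the whole
# (potentially multi-gigabyte) Parquet object before it can filter it locally.
obj = boto3.client("s3").get_object(Bucket="analytics", Key="events.parquet")
table = pq.read_table(io.BytesIO(obj["Body"].read()))
local_result = table.filter(pc.equal(table["user_id"], 42)).num_rows

# Pattern 2: the same bytes are exposed as a queryable table, and the filter
# is pushed down so only the tiny result set comes back to the client.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
pushed_down_result = con.execute(
    "SELECT count(*) FROM read_parquet('s3://analytics/events.parquet') "
    "WHERE user_id = 42"
).fetchone()[0]
```

In Vast's case, the point is that the file, object and table views are the same data in one system, so the filtering happens inside the platform rather than in a separate analytics layer stacked on top of the storage.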
Vast Data is expanding into Asia-Pacific. What are your expectations for the region to contribute to the company’s growth trajectory?
Hallak: I think this region is going to grow a lot faster than other regions for two reasons. First, we’re just starting and it’s always easier to grow 10 times off a smaller base. But more than that, I think there’s a hunger here for what’s innovative and bleeding edge. There is a lot less legacy to contend with, so you can leapfrog other countries and regions. We’re seeing that across the region, not just in Singapore, but also in Korea and Japan where organisations and governments are embracing new technologies a lot faster than those in other regions.
How are you approaching customers who are still invested in existing data platforms? How can Vast fit into their broader strategy as they transition to your platform?
Hallak: It doesn’t happen overnight, and our business model is to land and expand. We usually ask the customer for their hardest workloads, so we can prove ourselves and show value on day one. They don’t usually have a good alternative because it’s their most challenging workload. Over time, they realise that Vast has become their easy button. If they have a challenge with something, rather than trying to fix it, they will migrate that workload to Vast.
As other problems occur and other systems reach end-of-life, they expand their usage and put more data on our platform. It usually takes three to five years before they have the majority of their data on Vast. Then they can start unlocking the true value of our platform and do things that they couldn’t do before.
How are you educating the market on the capabilities of your platform and getting developers and partners up to speed on what you’re doing?
Hallak: We’re an engineering-led company. More than half the company is in R&D. Even on the go-to-market side, more than half are engineers. We find a common language with engineers on the customer side, talking to them about their challenges and how we solved those challenges for other customers. That gives customers a lot of comfort and confidence that there’s something real here, not smoke and mirrors. And then we encourage them to start using the platform, which proves itself a lot better than we can through words.
Most of our expansions happen through word of mouth. When a customer sees that our platform is really good, they tell colleagues in the same space and other companies that they should look at Vast. If one of our customers joins a new company, they will call us and say they want Vast there, too. We’ll never try to be a company with tens of thousands of customers. We’re focused on the most data-intensive organisations on the planet, and we want every one of them to be very happy with what we provide. It’s a focused approach that seems to be working so far.
Can you give me a sense of your technology roadmap? For example, some data platform players have been talking about allowing customers to bring in large language models (LLMs) and hosting some of those models on their platforms.
Hallak: We’re not in the LLM space because that’s the application layer and we are underneath that. We’re also not hosting models because we are a software company, not a services company. We will provide our software to hosting companies that can provide it to their customers. We will also provide our software to LLM companies so that they can leverage our infrastructure underneath their models.
From a roadmap perspective, when we first came out with the data storage system, it was very new and didn't have all the functionality that was required. It's a similar story today. We came out with the Vast database about a year ago and we're still adding all the functionality organisations need to use it across the board. The data engine, which is the compute layer, is not generally available yet.
The first version of our data engine will come out later this year, enabling customers to run functions on CPUs [central processing units] and GPUs [graphics processing units]. Some functions like inference can run on older GPUs as they don’t need as much power, while training will require the latest and greatest in GPU power.
The scheduling of jobs and functions happens based on what’s required to run optimally and where the data is. If the data sits on the other side of the world, you’d want a function to run close to where the data is, rather than move the data across oceans. That vertical integration gives us the strength that we otherwise wouldn’t have.
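The data engine is not yet generally available, so the following is only a toy sketch of the locality-aware placement rule Hallak describes: run a function where its data lives, and match the hardware class to the workload. The site names, dataset names and device classes are invented and bear no relation to Vast's actual scheduler:

```python
# Toy sketch of data-locality-aware placement -- not Vast's data engine.
# Sites, datasets and device classes are hypothetical.
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    datasets: set[str]
    devices: set[str]  # e.g. {"cpu", "gpu-old", "gpu-latest"}


def place(function: str, dataset: str, sites: list[Site]) -> Site:
    """Run the function where its data lives, choosing hardware by workload."""
    # Inference tolerates older GPUs; training wants the latest generation.
    needed = "gpu-latest" if function == "train" else "gpu-old"
    for site in sites:
        if dataset in site.datasets and needed in site.devices:
            return site
    # Fall back to any site that holds the data rather than moving the data.
    return next(s for s in sites if dataset in s.datasets)


sites = [
    Site("us-east", {"clickstream"}, {"cpu", "gpu-old"}),
    Site("eu-west", {"clickstream", "images"}, {"cpu", "gpu-latest"}),
]
print(place("train", "images", sites).name)          # eu-west
print(place("inference", "clickstream", sites).name)  # us-east
```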
Every three months, we come out with a new version that has a lot of new functionality across different parts of our platform. There’s still a lot of work to do just to realise the vision that we’ve set out.
I understand there are plans to take the company public. How far are you away from that?
Hallak: I think there are two parts – one is being ready to be public and the other is pulling the trigger and performing an IPO [initial public offering]. On the readiness front, I think we're very close from a numbers perspective. If we wanted to go public tomorrow, I think we could. We still have some work to do to get the company ready to operate as a public company without the reporting and compliance requirements slowing us down. That will happen over the next few quarters.
As for when we will pull the trigger, I don't know. We enjoy being private very much. We can think for the long term, and we can innovate for our customers without having that 90-day clock attached to us. We're taking advantage of that as much as we can. But my expectation is that over the next couple of years, we'll get to a size where it won't make sense to remain private.
Read more about data and storage infrastructure in APAC
- Pure Storage’s cloud-compatible and energy-efficient modular flash storage architecture is well-poised to address the data storage demands of artificial intelligence workloads.
- In this expert guide to storage management in Asia-Pacific, we unpack the developments shaping storage management and key strategies to extract business value from data.
- NetApp’s senior executives talk up the company’s efforts to support AI initiatives and deliver first-party storage services on public cloud platforms.
- Databricks is making it easier for organisations to adopt a data lakehouse architecture through support for industry-specific file formats, data sharing and streaming processing, among other areas.