Dethawing the data lake, Amazon S3 managed Apache Iceberg Tables

Among the many news items that surfaced at AWS re:Invent this year were new features in Amazon Simple Storage Service (Amazon S3) that make S3 the first cloud object store with fully-managed support for Apache Iceberg.

As many will know, Apache Iceberg is a high-performance format for huge analytic tables that brings the reliability and simplicity of SQL tables to big data.

The new Amazon S3 updates are designed to enable faster analytics and make it easier to store and manage tabular data at scale.

These new features also include the ability to automatically generate queryable metadata, the aim of which is to simplify data discovery and understanding.

Bucket nuggets

Amazon S3 Tables introduces a new “bucket type” that is purpose-built to optimise the storage and querying of tabular data as Apache Iceberg tables, delivering up to 3x faster query performance, up to 10x higher transactions per second (TPS) and automated table maintenance for analytics workloads.

As TechTarget reminds us, an Amazon S3 bucket provides object-based storage, where data is stored inside S3 buckets in distinct units called objects instead of files.

“Amazon S3 buckets are similar to file folders and can be used to store, retrieve, back up and access objects. Each object has three main components — the object’s content or data, a unique identifier for the object and the descriptive metadata, including the object’s name, URL and size,” explains Kinza Yasar.

Amazon S3 Metadata delivers queryable object metadata in near real-time, covering both system-defined and custom metadata, and stores it in S3 Tables so that it is easier to query for business analytics.

“As the leading object store in the world with more than 450 trillion objects, S3 is used by millions of customers and we continue to innovate to remove the complexity of working with data at unprecedented scale,” said Andy Warfield, vice president of storage and distinguished engineer at AWS. “We have seen the rapid rise of tabular data and increasingly, customers want to query across tables, increase query performance and understand and organise troves of data so they can easily find exactly what they need. S3 Tables and S3 Metadata remove the overhead of organising and operating table and metadata stores on top of objects, so customers can shift their focus back to building with their data.”

S3 Tables and S3 Metadata are compatible with Apache Iceberg tables, so users can query their data using AWS analytics services and open source tools, including Amazon Athena, Amazon QuickSight and Apache Spark.

Apache Parquet

According to AWS, many data-developer teams today organise the data they use for analytics as tabular data, most often stored in Apache Parquet, a file format optimised for data queries.

“Parquet has become one of the fastest-growing data types in S3 and customers increasingly want to be able to query these growing tabular data sets – often turning to an open table format (OTF), an open source standard for storing data in tables – because it helps organise, update and track changes to large amounts of data. Apache Iceberg has become the most popular OTF to manage Parquet files, with customers using Apache Iceberg to query across billions of files containing petabytes or even exabytes of data. However, Apache Iceberg can be challenging for customers to manage as they scale, often requiring dedicated teams to build and maintain systems to handle table maintenance and data compaction, as well as manage access control. These external systems are costly and complex and they require skilled teams to maintain, using up valuable resources,” notes AWS, in a re:Invent product statement.

Amazon S3 Tables are purpose-built for managing Apache Iceberg tables in data lakes and are specifically optimised for analytics workloads.

Snapshots & schema evolution

Organisations can use S3 Tables by creating a table bucket that optimises the storage and querying of tabular data in fully-managed Iceberg tables. With S3 Tables, users benefit from Iceberg capabilities such as row-level transactions, queryable snapshots via time travel and schema evolution.
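For readers less familiar with the idea of snapshots, time travel and schema evolution, a toy in-memory model may help. This is only a sketch of the concept and not how Iceberg or S3 Tables work internally: the ToyTable and Snapshot names are hypothetical, and real Iceberg tracks metadata and data files rather than copying rows, with schema changes applied as metadata-only operations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table at one commit (toy model only)."""
    snapshot_id: int
    columns: tuple
    rows: tuple


@dataclass
class ToyTable:
    """Hypothetical in-memory table that keeps every snapshot it commits."""
    columns: tuple
    snapshots: list = field(default_factory=list)

    def _commit(self, columns, rows):
        # Each commit appends a new snapshot; earlier ones are never modified.
        snap = Snapshot(len(self.snapshots) + 1, tuple(columns), tuple(rows))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def _current_rows(self):
        return self.snapshots[-1].rows if self.snapshots else ()

    def append(self, new_rows):
        """Add rows by committing a new snapshot."""
        rows = self._current_rows() + tuple(tuple(r) for r in new_rows)
        return self._commit(self.columns, rows)

    def add_column(self, name, default=None):
        """Schema evolution: the new column exists only from this snapshot on."""
        self.columns = self.columns + (name,)
        widened = tuple(row + (default,) for row in self._current_rows())
        return self._commit(self.columns, widened)

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or time travel to an earlier one by id."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snap = self.snapshots[-1]
        elif 1 <= snapshot_id <= len(self.snapshots):
            snap = self.snapshots[snapshot_id - 1]
        else:
            raise ValueError(f"unknown snapshot id: {snapshot_id}")
        return [dict(zip(snap.columns, row)) for row in snap.rows]


# Made-up sample data for illustration.
table = ToyTable(columns=("sku", "qty"))
first = table.append([("A-1", 3)])
second = table.append([("B-2", 5)])
third = table.add_column("region", default="unknown")
```

Calling table.read(first) returns the table as it stood after the first commit, while table.read() returns the latest state including the new region column. The point of the sketch is that older snapshots stay readable after later writes and schema changes.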

S3 Tables also provide table-level access controls, allowing customers to define permissions for individual tables.

According to AWS, “As more customers use S3 as their central data repository, the volume and variety of data have grown exponentially, with metadata becoming increasingly important as a way to understand and organize large amounts of data so customers can find the exact objects they need. To address this problem, many customers resort to building and maintaining complex metadata capture and storage systems to enrich their understanding of data. But these metadata systems are expensive, time-consuming and resource-intensive, often requiring data engineers to manually track and update metadata as it flows through their processing pipelines, as well as data analysts to manually inspect massive object stores to find the specific data they need for analytics and AI/ML data processing workflows.”

Amazon S3 Metadata automatically generates queryable object metadata in near real-time to help accelerate data discovery and improve data understanding, eliminating the need for customers to build and maintain their own complex metadata systems.

S3 Metadata lets customers query, find and use data for business analytics, real-time inference applications and more. S3 Metadata automatically generates object metadata, which includes system-defined details like size and source of the object, and makes it queryable via new S3 Tables.

Customers can add their own custom metadata, using object tags to annotate objects with information specific to their business, such as product SKUs, transaction IDs, content ratings or customer details.
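As a rough illustration of what tagging an object involves, the sketch below builds the tag structure that S3 object tagging expects and checks it against the documented limits of 10 tags per object, 128 characters per key and 256 characters per value. The build_tagging helper, the tag names and the bucket and key in the closing comment are all hypothetical, and nothing here calls AWS.

```python
MAX_TAGS_PER_OBJECT = 10  # S3 allows at most 10 tags on an object
MAX_KEY_LENGTH = 128      # maximum tag key length in characters
MAX_VALUE_LENGTH = 256    # maximum tag value length in characters


def build_tagging(tags):
    """Turn a plain dict into the {'TagSet': [...]} shape used for object tagging."""
    if len(tags) > MAX_TAGS_PER_OBJECT:
        raise ValueError(f"at most {MAX_TAGS_PER_OBJECT} tags are allowed per object")
    tag_set = []
    for key, value in tags.items():
        key, value = str(key), str(value)
        if not key or len(key) > MAX_KEY_LENGTH:
            raise ValueError(f"invalid tag key: {key!r}")
        if len(value) > MAX_VALUE_LENGTH:
            raise ValueError(f"tag value too long for key {key!r}")
        tag_set.append({"Key": key, "Value": value})
    return {"TagSet": tag_set}


# Made-up business tags for illustration.
tagging = build_tagging({"sku": "SKU-1042", "transaction_id": 98765})

# With boto3 installed and credentials configured, the result could be applied
# like this (bucket and key are placeholders, not real resources):
#   import boto3
#   boto3.client("s3").put_object_tagging(
#       Bucket="example-bucket", Key="orders/98765.json", Tagging=tagging)
```

Validating locally before the request is a design choice rather than a requirement; S3 rejects oversized tag sets itself, but failing early gives a clearer error.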

Users can also query metadata using a simple SQL query, enabling them to find and prepare data for use in business analytics and real-time inference applications, as well as to fine-tune foundation models, perform retrieval augmented generation (RAG), integrate data warehouse and analytics workflows and perform targeted storage optimisation tasks.
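To give a flavour of what such a query might look like, here is a local stand-in that uses Python's built-in sqlite3 module instead of S3 Tables or Athena. The object_metadata table, its column names and the sample rows are assumptions made for illustration; the real S3 Metadata schema and query engine will differ.

```python
import sqlite3

# In-memory database standing in for a metadata table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE object_metadata ("
    "object_key TEXT, size INTEGER, storage_class TEXT, content_rating TEXT)"
)
conn.executemany(
    "INSERT INTO object_metadata VALUES (?, ?, ?, ?)",
    [
        # Made-up rows: the first three columns mimic system-defined metadata,
        # content_rating mimics a custom tag added by the customer.
        ("videos/intro.mp4", 52_000_000, "STANDARD", "PG"),
        ("videos/trailer.mp4", 8_000_000, "STANDARD", "PG"),
        ("videos/feature.mp4", 910_000_000, "STANDARD_IA", "R"),
        ("images/poster.png", 400_000, "STANDARD", "PG"),
    ],
)


def find_objects(connection, rating, min_size):
    """Return (object_key, size) pairs matching a rating and a minimum size."""
    cursor = connection.execute(
        "SELECT object_key, size FROM object_metadata "
        "WHERE content_rating = ? AND size >= ? "
        "ORDER BY size DESC",
        (rating, min_size),
    )
    return cursor.fetchall()


large_pg_objects = find_objects(conn, "PG", 1_000_000)
```

Here large_pg_objects holds the two PG-rated videos of at least 1 MB, largest first. The same filter-by-metadata pattern is what lets analysts locate objects without scanning the object store itself.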