Getting Started with Data Lakes
A hands-on walkthrough of querying data lake tables, accelerating them with MergeTree, and writing results back to Iceberg. All steps use public datasets and work on both Cloud and OSS.
Screenshots in this guide are from the ClickHouse Cloud SQL console.
Query Iceberg data directly
The fastest way to start is with the icebergS3() table function — point it at an Iceberg table in S3 and query immediately, no setup required.
Inspect the schema:
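A minimal sketch — the S3 path below is a placeholder; substitute the location of your own (or any public) Iceberg table:

```sql
-- Schema is inferred from the Iceberg metadata files, no DDL needed.
DESCRIBE icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits/');
```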
Run a query:
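For example, a simple aggregation over the same placeholder table path:

```sql
-- Reads Parquet data files referenced by the Iceberg metadata directly from S3.
SELECT count()
FROM icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits/');
```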
ClickHouse reads the Iceberg metadata directly from S3 and infers the schema automatically. The same approach works for deltaLake(), hudi(), and paimon().
Learn more: Querying open table formats directly covers all four formats, cluster variants for distributed reads, and storage backend options (S3, Azure, HDFS, local).
Create a persistent table engine
For repeated access, create a table using the Iceberg table engine so you don't need to pass the path every time. The data stays in S3 — no data is duplicated:
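A sketch using the same placeholder path as above; for private buckets, pass credentials as additional engine arguments:

```sql
-- Schema is inferred from the table's Iceberg metadata; no columns need to be listed.
CREATE TABLE hits_iceberg
ENGINE = IcebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits/');
```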
Now query it like any ClickHouse table:
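For example, assuming the table was created as `hits_iceberg`:

```sql
-- Standard ClickHouse SQL; the data is still read from S3, not copied locally.
SELECT counterid, count() AS c
FROM hits_iceberg
GROUP BY counterid
ORDER BY c DESC
LIMIT 10;
```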
The table engine supports data caching, metadata caching, schema evolution, and time travel. See the Querying directly guide for details on table engine features and the support matrix for a full feature comparison.
Connect to a catalog
Most organizations manage Iceberg tables through a data catalog to centralize table metadata and data discovery. ClickHouse supports connecting to your catalog using the DataLakeCatalog database engine, exposing all catalog tables as a ClickHouse database. This is the more scalable path: as new Iceberg tables are created in the catalog, they become accessible in ClickHouse without additional work.
Here's an example connecting to AWS Glue:
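A sketch of the database DDL — the region and credentials are placeholders to substitute with your own:

```sql
-- Exposes every table registered in the Glue catalog as a ClickHouse database.
CREATE DATABASE glue_catalog
ENGINE = DataLakeCatalog
SETTINGS
    catalog_type = 'glue',
    region = 'us-east-1',
    aws_access_key_id = '<key>',
    aws_secret_access_key = '<secret>';
```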
Each catalog type requires its own connection settings — see the Catalogs guides for the full list of supported catalogs and their configuration options.
Browse tables and query:
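Assuming a catalog database named `glue_catalog` containing a hypothetical `analytics.hits` table:

```sql
SHOW TABLES FROM glue_catalog;

-- The catalog namespace and table name are quoted together as one identifier.
SELECT count()
FROM glue_catalog.`analytics.hits`;
```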
Backticks are required around `<namespace>.<table>` because ClickHouse doesn't natively support nested namespaces — the catalog's namespace and table name are quoted together as a single identifier.
Learn more: Connecting to a data catalog walks through a full Unity Catalog setup with Delta and Iceberg examples.
Issue a query
Regardless of which method you used above — table function, table engine, or catalog — the same ClickHouse SQL works across all of them:
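A side-by-side sketch using the placeholder names from the sections above (`hits_iceberg`, `glue_catalog`):

```sql
-- Table function:
SELECT counterid, count()
FROM icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits/')
GROUP BY counterid;

-- Table engine:
SELECT counterid, count()
FROM hits_iceberg
GROUP BY counterid;

-- Catalog database:
SELECT counterid, count()
FROM glue_catalog.`analytics.hits`
GROUP BY counterid;
```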
The query syntax is identical — only the FROM clause changes. All ClickHouse SQL functions, joins, and aggregations work the same way regardless of the data source.
Load a subset into ClickHouse
Querying Iceberg directly is convenient, but performance is bounded by network throughput and the file layout. For analytical workloads, load data into a native MergeTree table.
First, run a filtered query over the Iceberg table to get a baseline:
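For example, filtering on a single counter (the path is a placeholder, as before):

```sql
-- Baseline: filter pushed down to the Iceberg scan over S3.
SELECT count()
FROM icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits/')
WHERE counterid = 38;
```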
This query scans most of the dataset in S3 because the Iceberg data files aren't organized by counterid, so the filter prunes very little — expect it to take several seconds.
Now create a MergeTree table and load the data:
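A sketch that creates the table and loads it in one statement; the source path is the same placeholder:

```sql
-- Ordering by counterid builds a sparse primary index on that column,
-- so filters on counterid can skip most granules.
CREATE TABLE hits_local
ENGINE = MergeTree
ORDER BY counterid
AS SELECT *
FROM icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits/');
```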
Re-run the same query against the MergeTree table:
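The same filter, now against the local copy:

```sql
SELECT count()
FROM hits_local
WHERE counterid = 38;
```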
Because counterid is the first column in the ORDER BY key, ClickHouse's sparse primary index skips directly to the relevant granules — only reading the rows for counterid = 38 instead of scanning all 100 million rows. The result is a dramatic speedup.
Learn more: Accelerating analytics with MergeTree takes this further with LowCardinality types, full-text indices, and optimized ordering keys, demonstrating a ~40x improvement on a 283 million row dataset with a complete before/after performance comparison.
Write back to Iceberg
ClickHouse can also write data back to Iceberg tables, enabling reverse ETL workflows — publishing aggregated results or subsets for consumption by other tools (Spark, Trino, DuckDB, etc.).
Create an Iceberg table for output:
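A sketch of an output table; unlike the read-only examples above, an explicit schema is required because no Iceberg metadata exists yet. The path and column names are placeholders, and depending on your ClickHouse version, Iceberg writes may need to be explicitly enabled:

```sql
CREATE TABLE daily_counts
(
    day Date,
    counterid UInt32,
    hits UInt64
)
ENGINE = IcebergS3('https://example-bucket.s3.amazonaws.com/warehouse/daily_counts/');
```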
Write aggregated results:
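For example, assuming the `daily_counts` output table above and a source table `hits_local` with a hypothetical `eventtime` column:

```sql
-- Publishes a daily per-counter rollup as Iceberg data files in S3.
INSERT INTO daily_counts
SELECT
    toDate(eventtime) AS day,
    counterid,
    count() AS hits
FROM hits_local
GROUP BY day, counterid;
```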
The resulting Iceberg table is readable by any Iceberg-compatible engine.
Learn more: Writing data to open table formats covers writing raw data and aggregated results using the UK Price Paid dataset, including schema considerations when mapping ClickHouse types to Iceberg.
Next steps
Now that you've seen the full workflow, dive deeper into each area:
- Querying directly — All four formats, cluster variants, table engines, caching
- Connecting to catalogs — Full Unity Catalog walkthrough with Delta and Iceberg
- Accelerating analytics — Schema optimization, indexing, ~40x speedup demo
- Writing to data lakes — Raw writes, aggregated writes, type mapping
- Support matrix — Feature comparison across formats and storage backends