Benchmarking Modern Data Formats


Your modest laptop is capable of processing dozens of GBs with a simple change of file format and tooling.

In a world that pushes users towards often unnecessarily expensive and slow cloud systems, DuckDB and its competitors are a breath of fresh air - and badly needed.

Here I’ll benchmark modern data formats and data analysis tools and show how they can help with real-world, locally run analysis. This is still the way most analysts in banking and finance do their jobs.

The File Format Problem

File-based data exchange happens in all industries. A lot of data simply originates as .csv or .xlsx files. CSV still powers data transactions in finance; it’s actually remarkably efficient, and it can be zipped easily on the Windows machines most analysts use for roughly a 10x compression (I didn’t test this, but 10x is approximately what I normally see). But we can still do better.

Parquet files are columnar formatted files. What does that mean? Data is stored column by column rather than row by row, so a reader can load only the columns it needs for a given operation, which can drastically increase performance.
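
For instance, a columnar reader lets you ask for just the columns a calculation needs. Here is a minimal sketch in Polars; the file name and column names are placeholders, not the benchmark dataset:

import polars as pl

# Only the two requested columns are read from disk; every other
# column in the Parquet file is skipped entirely.
df = pl.read_parquet("results.parquet", columns=["policy_id", "loss"])
print(df.head())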

Let’s get straight to the benchmarks then discuss.

Benchmark

The source code for the benchmark is on GitHub here.

Dataset size: 5.0GB | 8.3M records

Why this size? Because most analysis takes place at these volumes or lower. Much larger than this and we aren’t really analysing any more; we are likely running automated data pipelines. Datasets larger than this do exist in the wild, but they are quite rare.

The GitHub repo shows the code used to generate a pseudo Monte Carlo simulation output of the type often used in banking and insurance. The dataset was then run through a scaling function which takes a target file size as a parameter. Using this I generated a dataset of 5.3GB in CSV format. This was then converted to Parquet using both Gzip and Snappy compression, to compare two of the most popular compression codecs.
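
The conversion step itself is straightforward. A minimal sketch, not the exact benchmark code, with placeholder file names:

import polars as pl

# Read the generated CSV once, then write it out as Parquet with
# each of the two codecs being compared.
df = pl.read_csv("simulation_output.csv")
df.write_parquet("simulation_output_snappy.parquet", compression="snappy")
df.write_parquet("simulation_output_gzip.parquet", compression="gzip")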

Straight up we can see a large storage win for Parquet, which uses 3-4x less space. Not such a big deal on its own; I can always zip a CSV file and get good storage characteristics. Still - a nice to have.

Benchmark Results

The baseline is Pandas loading the uncompressed CSV and performing the group-by in 34s.

The best case is DuckDB loading a Snappy-compressed Parquet file, with a total processing time of 0.88s.

| Tool | Format | Large Dataset Time | Large Dataset Memory | Time vs Baseline | Memory vs Baseline |
|---|---|---|---|---|---|
| DuckDB Streaming | Snappy Parquet | 0.887s | 409.9 MB | 38.3x faster | 4.3x better |
| DuckDB Streaming | Gzip Parquet | 1.411s | 488.7 MB | 24.1x faster | 3.6x better |
| Polars Lazy | Snappy Parquet | 1.871s | 478.1 MB | 18.2x faster | 3.7x better |
| Polars Lazy | Gzip Parquet | 2.519s | 235.4 MB | 13.5x faster | 7.4x better |
| DuckDB Streaming | CSV | 2.845s | 186.3 MB | 11.9x faster | 9.4x better |
| Polars Eager | Snappy Parquet | 4.035s | 1,159.5 MB | 8.4x faster | 1.5x better |
| Polars Lazy | CSV | 6.839s | 808.3 MB | 5.0x faster | 2.2x better |
| Polars Eager | CSV | 7.914s | 1,940.3 MB | 4.3x faster | 0.9x worse |
| Pandas | Snappy Parquet | 8.693s | 1,753.1 MB | 3.9x faster | 1.0x same |
| Pandas | CSV | 33.961s | 1,753.1 MB | 1.0x (baseline) | 1.0x (baseline) |

Note on file sizes

Comparable file sizes can be achieved by compressing CSV files, and this often leads to files even smaller than a compressed Parquet.

You can use tar or the gzip Python module to compress CSV files. Ideally we want lossless compression when working with data - perhaps a topic for a different blog post.
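
For example, a minimal sketch using only the standard library (the file name is a placeholder; gzip is lossless):

import gzip
import shutil

# Stream-compress the CSV without loading the whole file into memory.
with open("simulation_output.csv", "rb") as src:
    with gzip.open("simulation_output.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)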

Extracts of code for DuckDB and Polars Lazy

Polars Lazy uses the scan_csv and scan_parquet functions in the data loader and collect in the query. DuckDB is lazy by default. These two options are the most performant.

DuckDB

def query(conn, filename):
    # DuckDB can query a CSV or Parquet file directly by path and
    # streams it rather than loading everything into memory first.
    return conn.execute(
        f"""
        SELECT simulation, SUM(loss) as total_loss, COUNT(*) as claim_count
        FROM '{filename}'
        GROUP BY simulation
        ORDER BY simulation
    """
    ).df()
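
Called with an in-memory connection, something like this (the file path is a placeholder):

import duckdb

conn = duckdb.connect()
result = query(conn, "simulation_output_snappy.parquet")
print(result.head())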

Polars Lazy

import polars as pl


def query(lf: pl.LazyFrame):
    # The aggregation is defined lazily and only executed at collect().
    return (
        lf.group_by("simulation")
        .agg(
            [
                pl.col("loss").sum().alias("total_loss"),
                pl.col("loss").count().alias("claim_count"),
            ]
        )
        .sort("simulation")
        .collect()
    )
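
And the matching call site, using the lazy scan functions mentioned above (the file path is again a placeholder):

lf = pl.scan_parquet("simulation_output_snappy.parquet")
result = query(lf)
print(result.head())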

Personal experience

I’ve been using DuckDB and Polars for a while, so I’m not surprised by the results; they’re entirely in line with what most users experience.

What is interesting is just how easy it is to use them and get good performance. I was surprised by how much better DuckDB performed than Polars in this test, especially since I personally much prefer the workflow of DuckDB. Writing data workflows in SQL is just a natural fit and doesn’t require learning a new tool.

Polars still has its place, and it plays well with Python type checkers and linters like mypy.

OK so here’s the breakdown.

  • DuckDB Streaming: 38x faster than Pandas baseline
  • Snappy Parquet: 3.2x faster than CSV, 68% smaller files
  • DuckDB: up to 2x the performance of Polars

Arrow

The Arrow format means the choice between Pandas, Polars and DuckDB doesn’t have to be a problem. Each tool can work on Arrow data and transfer it between the others.

Arrow is essentially an in-memory columnar format that allows data to be accessed with zero copying between systems. So we can work on a dataset in DuckDB, summarize and transform it in SQL, and then immediately swap to Polars or Pandas on the same dataset.

Sometimes there is a feature in Polars I want to use; sometimes DuckDB is the cleanest solution. Using Arrow I can just pick whichever makes the task easier and swap between them with negligible overhead.
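
A rough sketch of what that hand-off looks like (the file path is a placeholder; pl.from_arrow avoids copying where the underlying buffers allow it):

import duckdb
import polars as pl

# Aggregate in DuckDB, then pass the result to Polars via Arrow.
conn = duckdb.connect()
arrow_table = conn.execute(
    """
    SELECT simulation, SUM(loss) AS total_loss
    FROM 'simulation_output_snappy.parquet'
    GROUP BY simulation
    """
).arrow()

# Continue working on the same in-memory data in Polars.
df = pl.from_arrow(arrow_table)
print(df.sort("total_loss", descending=True).head())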

Check the benchmark code here.