Benchmarking Modern Data Formats


Your modest laptop is capable of processing dozens of GBs with a simple change of file format and tooling.

In a world that pushes users towards often unnecessarily expensive and slow cloud systems, DuckDB and its competitors are a breath of fresh air - and badly needed.

Here I’ll benchmark modern data formats and data analysis tools and show how they can help with real-world, locally run analysis. This is still the way most analysts in banking and finance do their jobs.

The File Format Problem

File-based data exchange happens in all industries. A lot of data simply originates as .csv or .xlsx files. CSV still powers data transactions in finance; it’s actually remarkably efficient, and it can be zipped easily on the Windows machines most analysts use for roughly a 10x compression (I didn’t test this, but 10x is approximately what I normally see). But we can still do better.

Parquet files are columnar formatted files. What does that mean? Data is stored column by column rather than row by row, so a reader can load only the columns it needs for a given operation, which can drastically increase performance.
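
For instance, a columnar reader lets you ask for just the columns a calculation needs. Here is a minimal sketch in Polars; the file name and column names are placeholders, not the benchmark dataset:

import polars as pl

# Only the two requested columns are read from disk; every other
# column in the Parquet file is skipped entirely.
df = pl.read_parquet("results.parquet", columns=["policy_id", "loss"])
print(df.head())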

Let’s get straight to the benchmarks then discuss.

Benchmark

The source code for the benchmark is on GitHub here.

Dataset size: 5.0GB | 8.3M records

Why this size? Because most analysis takes place at these volumes or lower. Much larger than this and we aren’t really analysing any more; we are likely running automated data pipelines. Datasets larger than this do exist in the wild, but they are quite rare.

The GitHub repo shows the code used to generate a pseudo Monte Carlo simulation output of the type often used in banking and insurance. The dataset was then run through a scaling function which takes a target file size as a parameter. Using this I generated a dataset of 5.3GB in CSV format. This was then converted to Parquet using both Gzip and Snappy compression, to compare two of the most popular compression codecs.
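
The conversion step itself is straightforward. A minimal sketch, not the exact benchmark code, with placeholder file names:

import polars as pl

# Read the generated CSV once, then write it out as Parquet with
# each of the two codecs being compared.
df = pl.read_csv("simulation_output.csv")
df.write_parquet("simulation_output_snappy.parquet", compression="snappy")
df.write_parquet("simulation_output_gzip.parquet", compression="gzip")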

Straight up we can see a large storage win for Parquet, which uses 3-4x less space. Not such a big deal on its own; I can always zip a CSV file and get good storage characteristics. Still - a nice to have.

Benchmark Results

The baseline is Pandas loading the uncompressed CSV and performing the group-by in 34s.

The best case is DuckDB loading a Snappy-compressed Parquet file, with a total processing time of 0.88s.

| Tool | Format | Large Dataset Time | Large Dataset Memory | Time vs Baseline | Memory vs Baseline |
|---|---|---|---|---|---|
| DuckDB Streaming | Snappy Parquet | 0.887s | 409.9 MB | 38.3x faster | 4.3x better |
| DuckDB Streaming | Gzip Parquet | 1.411s | 488.7 MB | 24.1x faster | 3.6x better |
| Polars Lazy | Snappy Parquet | 1.871s | 478.1 MB | 18.2x faster | 3.7x better |
| Polars Lazy | Gzip Parquet | 2.519s | 235.4 MB | 13.5x faster | 7.4x better |
| DuckDB Streaming | CSV | 2.845s | 186.3 MB | 11.9x faster | 9.4x better |
| Polars Eager | Snappy Parquet | 4.035s | 1,159.5 MB | 8.4x faster | 1.5x better |
| Polars Lazy | CSV | 6.839s | 808.3 MB | 5.0x faster | 2.2x better |
| Polars Eager | CSV | 7.914s | 1,940.3 MB | 4.3x faster | 0.9x worse |
| Pandas | Snappy Parquet | 8.693s | 1,753.1 MB | 3.9x faster | 1.0x same |
| Pandas | CSV | 33.961s | 1,753.1 MB | 1.0x (baseline) | 1.0x (baseline) |

Note on file sizes

Comparable file sizes can be achieved by compressing CSV files, and this often leads to files even smaller than a compressed Parquet.

You can use tar or the gzip Python module to compress CSV files. Ideally we want lossless compression when working with data - perhaps a topic for a different blog post.
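
For example, a minimal sketch using only the standard library (the file name is a placeholder; gzip is lossless):

import gzip
import shutil

# Stream-compress the CSV without loading the whole file into memory.
with open("simulation_output.csv", "rb") as src:
    with gzip.open("simulation_output.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)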

Extracts of code for DuckDB and Polars Lazy

Polars Lazy uses the scan_csv and scan_parquet functions in the data loader and collect in the query. DuckDB is lazy by default. These two options are the most performant.

DuckDB

def query(conn, filename):
    # DuckDB can query a CSV or Parquet file directly by path and
    # streams it rather than loading everything into memory first.
    return conn.execute(
        f"""
        SELECT simulation, SUM(loss) as total_loss, COUNT(*) as claim_count
        FROM '{filename}'
        GROUP BY simulation
        ORDER BY simulation
    """
    ).df()
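
Called with an in-memory connection, something like this (the file path is a placeholder):

import duckdb

conn = duckdb.connect()
result = query(conn, "simulation_output_snappy.parquet")
print(result.head())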

Polars Lazy

import polars as pl


def query(lf: pl.LazyFrame):
    # The aggregation is defined lazily and only executed at collect().
    return (
        lf.group_by("simulation")
        .agg(
            [
                pl.col("loss").sum().alias("total_loss"),
                pl.col("loss").count().alias("claim_count"),
            ]
        )
        .sort("simulation")
        .collect()
    )
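
And the matching call site, using the lazy scan functions mentioned above (the file path is again a placeholder):

lf = pl.scan_parquet("simulation_output_snappy.parquet")
result = query(lf)
print(result.head())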

Personal experience

I’ve been using DuckDB and Polars for a while, so I’m not surprised by the results; they’re entirely in line with what most users experience.

What is interesting is just how easy it is to use them and get good performance. I was surprised by how much better DuckDB performed than Polars in this test, especially since I personally much prefer the workflow of DuckDB. Writing data workflows in SQL is just a natural fit and doesn’t require learning a new tool.

Polars still has its place, and it plays well with Python type checkers and linters like mypy.

OK so here’s the breakdown.

  • DuckDB Streaming: 38x faster than Pandas baseline
  • Snappy Parquet: 3.2x faster than CSV, 68% smaller files
  • DuckDB: up to 2x the performance of Polars

Arrow

The Arrow format means the choice between Pandas, Polars and DuckDB doesn’t have to be a problem. Each tool can work on Arrow data and transfer it between the others.

Arrow is essentially an in-memory columnar format that allows data to be accessed with zero copying between systems. So we can work on a dataset in DuckDB, summarize and transform it in SQL, and then immediately swap to Polars or Pandas on the same dataset.

Sometimes there is a feature in Polars I want to use; sometimes DuckDB is the cleanest solution. Using Arrow I can just pick whichever makes the task easier and swap between them with negligible overhead.
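
A rough sketch of what that hand-off looks like (the file path is a placeholder; pl.from_arrow avoids copying where the underlying buffers allow it):

import duckdb
import polars as pl

# Aggregate in DuckDB, then pass the result to Polars via Arrow.
conn = duckdb.connect()
arrow_table = conn.execute(
    """
    SELECT simulation, SUM(loss) AS total_loss
    FROM 'simulation_output_snappy.parquet'
    GROUP BY simulation
    """
).arrow()

# Continue working on the same in-memory data in Polars.
df = pl.from_arrow(arrow_table)
print(df.sort("total_loss", descending=True).head())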

Check the benchmark code here.