Real-world DuckDB examples for analysts
I’m a huge fan of DuckDB - an analytics database that runs on your laptop and can crunch 100+ GB without breaking a sweat.
There are tons of DuckDB docs online for learning the tool, so instead I’ve compiled a short list of my favourite DuckDB features - with two straight-up SQL analytics hacks in spots one and two.
See my prior post for performance comparisons: Modern data formats
Analysis Hack 1: Hash aggregation keys
The main use cases for hashing row values are:
- Generating unique surrogate primary keys
- Detecting changed rows - if the row content changes, its hash changes
- Grouping multiple columns into a hash column for running aggregations
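For the change-detection case, a minimal sketch - the `orders_snapshot` table and the `order_id` join key are hypothetical, assuming a table with the same columns as the `orders` example further down:

```sql
-- rows whose content hash differs from yesterday's snapshot
select o.order_id
from orders o
join orders_snapshot s using (order_id)  -- order_id is an assumed key column
where sha256(concat_ws('|', o.account_id, o.region, cast(o.order_date as varchar)))
   <> sha256(concat_ws('|', s.account_id, s.region, cast(s.order_date as varchar)));
```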
It is the latter that results in a nice little trick below.
Here is a simple way to hash column values in DuckDB:
select
sha256(concat_ws('|', account_id, region, cast(order_date as varchar))) as sk_order,
*
from orders;
Analysis Hack 2: Macro to replicate Table-Valued Functions
DuckDB ships with a macro feature that can be used like a Table-Valued Function. One use case is writing aggregation queries with unique hash keys as the aggregation “columns”: the calling query builds the hash key from any number of grouping columns -> the macro aggregates and returns the result grouped by the hash key -> the calling query joins back to the original data to retrieve the non-hashed values.
This pattern is ludicrously useful. Almost every aggregation query can be written and called in this manner, meaning SQL code stays DRY and pipeline logic clean.
-- hash the grouping dimensions into one key
create temp view transactions_keyed as
select
sha256(concat_ws('|', customer_id, region, product_category)) as hash_key,
*
from transactions;
-- macro works on any table that has a hash_key column
create macro calc_metrics(data) as table
select
hash_key,
sum(amount) as total_sales,
count(*) as tx_count,
avg(amount) as avg_transaction
from data
group by hash_key;
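The join-back step completes the round trip: the macro aggregates by hash, then a join to the keyed view recovers the readable dimension values. A sketch using the view and macro above:

```sql
-- join aggregated metrics back to the human-readable dimensions
select
    d.customer_id,
    d.region,
    d.product_category,
    m.total_sales,
    m.tx_count,
    m.avg_transaction
from calc_metrics(transactions_keyed) m
join (
    -- one row per unique dimension combination
    select distinct hash_key, customer_id, region, product_category
    from transactions_keyed
) d using (hash_key);
```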
select * from calc_metrics(transactions_keyed);
Alternative - Grouping Sets
Grouping sets let you run multiple GROUP BY combinations in a single pass. ROLLUP is the most useful shorthand — it generates a hierarchy of grouping sets:
-- ROLLUP generates n+1 grouping sets
select year, quarter, month, sum(sales)
from sales
group by rollup(year, quarter, month);
-- equivalent to:
-- group by grouping sets (
-- (year, quarter, month),
-- (year, quarter),
-- (year),
-- () -- grand total
-- )
CUBE generates all possible combinations (2^n grouping sets), which gets expensive fast.
But you must know all the grouping columns upfront. Change the dimensions and you’re rewriting the query. The macro approach just needs a new keyed view.
Grouping sets are the right tool for fixed hierarchies — data warehouse cubes, standard year/quarter/month rollups where the structure never changes. For ad-hoc analysis where dimensions shift, the macro + hash pattern is cleaner.
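For completeness, a CUBE sketch over two of the same dimensions - with n = 2 columns it produces 2^2 = 4 grouping sets:

```sql
select year, quarter, sum(sales)
from sales
group by cube(year, quarter);
-- grouping sets: (year, quarter), (year), (quarter), ()
```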
The hash pattern lets you iterate quickly: often you don’t know whether you’ll add or remove a column until it emerges from the code iteration process. Once the dimensions settle, it becomes clearer whether to stick with the hash macro or switch to grouping sets.
Use calculated columns in the same query
Quality-of-life improvement: reference calculated columns later in the same query without having to nest subqueries.
select
table_id,
sales,
sales / 12 as monthly_sales,
costs / 12 as monthly_costs,
monthly_sales / monthly_costs as margin
from annual_sales;
Exclude columns keyword
When a dataset is wide, you can select everything except a couple of columns. This is especially handy for removing messy columns without listing 50 names.
select
* exclude(your_column)
from './your-csv.csv';
Unpivot
I love long-format data: use unpivot to reduce query complexity, future-proof against column changes, and keep data sets manageable.
unpivot sales
on jan, feb, mar, apr
into name month value revenue;
Wrap-up
This post showed some of the most common little tricks DuckDB has in store, plus an analytics hack that - although it shouldn’t have - took me years of experience to land on. The hash-key aggregation trick is the realm of true artisans.