What is DuckDB?
DuckDB is an open-source, embedded, column-oriented relational database optimized for OLAP (Online Analytical Processing; think ClickHouse, Redshift, or BigQuery) rather than OLTP (Online Transactional Processing; PostgreSQL, MySQL, Oracle).
It was started by Mark Raasveldt and Hannes Mühleisen at Centrum Wiskunde & Informatica (CWI), a Dutch research institute, with its first public release in 2019. It is MIT-licensed and governed by the non-profit DuckDB Foundation. One of the motivating reasons behind its creation was that many data scientists work in data-frame tools like Pandas instead of full databases because of friction (setup, moving data, managing infrastructure, and so on). The short sales pitch: SQLite for analytics.
What can DuckDB do?
One of DuckDB’s most powerful features is its ability to work directly with data files such as CSV, JSON, Parquet, and Apache Arrow. This means that you don’t have to import data into the database first; you can simply query them in place.
For a quick demo, I downloaded NYC taxi trip data from the New York City Taxi and Limousine Commission for 2024 in Parquet format. These are broken down by Yellow taxis, Green taxis, and High-Volume FHV (For-Hire Vehicles; think your Ubers and your Lyfts).
The Yellow taxi dataset has 41 million rows and I can get that quick count by doing:
❯ duckdb -c "SELECT COUNT(*) FROM read_parquet('yellow/*.parquet');"
┌─────────────────┐
│ count_star() │
│ int64 │
├─────────────────┤
│ 41169720 │
│ (41.17 million) │
└─────────────────┘
This took around 0.6 seconds to run on an old Intel-based MacBook. That's not very interesting, though, so let's try something else: a count of trips per month.
❯ duckdb -c "
SELECT
STRFTIME(tpep_pickup_datetime, '%Y-%m') AS month,
COUNT(*) AS trips
FROM read_parquet('yellow/*.parquet')
WHERE tpep_pickup_datetime >= TIMESTAMP '2024-01-01'
AND tpep_pickup_datetime < TIMESTAMP '2025-01-01'
GROUP BY month
ORDER BY month;
"
┌─────────┬─────────┐
│ month │ trips │
│ varchar │ int64 │
├─────────┼─────────┤
│ 2024-01 │ 2964617 │
│ 2024-02 │ 3007533 │
│ 2024-03 │ 3582611 │
│ 2024-04 │ 3514295 │
│ 2024-05 │ 3723843 │
│ 2024-06 │ 3539170 │
│ 2024-07 │ 3076876 │
│ 2024-08 │ 2979192 │
│ 2024-09 │ 3633025 │
│ 2024-10 │ 3833780 │
│ 2024-11 │ 3646372 │
│ 2024-12 │ 3668350 │
├─────────┴─────────┤
│ 12 rows │
└───────────────────┘
This took around 9 seconds on that same MacBook. Not bad. You don't even need the dataset locally:
❯ duckdb -c "SELECT COUNT(*) FROM read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet');"
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 2964624 │
│ (2.96 million) │
└────────────────┘
Being able to work with data files directly like this can be a godsend. It cuts out lengthy ingestion steps and ETL pipeline setup, and there's no extra infrastructure to manage.
Ecosystem
Beyond SQL counts and aggregations, DuckDB can do a whole lot more. We've seen it "speak" standard SQL, but it also has client libraries for many languages, including:
- Python
- Ruby
- R
- C/C++
- Java
- Rust
- Go
- Node.js
- WebAssembly
- and many more
Performance
There are several reasons why DuckDB is so fast. First, it's a columnar database with vectorized execution, meaning data is processed column-by-column in CPU-cache-friendly batches. It also parallelizes automatically across multiple CPU cores. If your dataset is bigger than RAM, it can handle that too by streaming from disk, and it uses metadata (such as min/max statistics) to skip irrelevant chunks in some queries.
Analytical Capabilities
In the realm of analytics, DuckDB is of course no slouch. It supports full SQL, which means joins, window functions, CTEs, and common analytical operators, along with views and tables. It also includes math and statistical functions such as correlations, quantiles, and histograms.
Extensibility
DuckDB is modular and many advanced features come from extensions such as:
- Spatial / geospatial analytics (aka the ST_* functions)
- HTTPFS (for HTTPS/S3 access)
- JSON extension to support nested JSON data
- Full-text search
- Excel import and export
- and many more
Final Thoughts
If you work with data in any capacity, it’s worth giving DuckDB a shot. Chances are, it’ll surprise you how much you can do with it and how little setup it takes.