DuckDB: The database Swiss Army Knife

I initially used DuckDB to prototype a few features for a dashboard app, with plans to swap it out for a "real database." Instead it became a central component of our application and a useful do-it-all data-processing system.

Posted by Tejus Parikh on April 2, 2025

Preamble

I was recently rewatching “Silicon Valley” on a flight and found Gavin Belson’s rant about “Data-geddon” even funnier this time around.

Granted, the show is fiction, but we have not run out of storage capacity and the hardware has kept up. It was a conceit of an era when everything had to be “web-scale” and there were mandates against traditional database systems because of scalability concerns.

The clip contains a common fallacy: that all exponents are the same and that there are no upper bounds on growth. There are certainly problem spaces where this is true, but I’ve mostly found them to be rare. One of my early jobs involved processing data about people in the state of Georgia, with the main table being approximately 6 million rows. Getting the performance we needed required a lot of specialized knowledge of our database system, I/O settings, and disk layout, plus buying specialized hardware. Now I can regularly crunch a dataset that size in a minute or two on my laptop, and that’s before thinking about optimizations.

A key thing about that example is that the fundamental driver of the dataset size, population, has grown, but that growth has been nothing compared to hardware.

Year | US Population | Change (%) | Max Apple Laptop Memory | Change (%)
1990 | 258,709,873   | -          | 16 MB                   | -
2000 | 291,421,906   | 13.2       | 1,024 MB                | 6,300
2010 | 308,745,538   | 9.7        | 16,000 MB               | 1,462
2020 | 331,449,281   | 7.4        | 64,000 MB               | 300

We could create similar tables for speed as well as capacity.

The point of all this preamble is that I’ve been revisiting my approach to building applications around data and have increasingly found myself reaching for a small database.

The database I’ve been finding myself reaching for the most often is DuckDB.

DuckDB

Launched in 2019, this engine has been gaining a lot of (well-deserved, in my opinion) hype over the last few years. Unlike more traditional in-process database engines, DuckDB is a column store, so it’s a compact version of the really big databases that drive Business Intelligence use cases like Snowflake and Redshift. It’s also blazing fast for the types of workloads where you need to crunch data across multiple columns.

Since the launch, they’ve stabilized the database format, created a commercial version called MotherDuck, and fleshed out their feature set.

The Swiss Army Knife

So that’s what it is. This is how my team and I are using it.

We’re working with performance data from moderately sized photovoltaic systems. We get the data in multiple formats, need to run some calculations, and then need to make it available to end users. DuckDB offers a lot of features that help with this use case.

We can run it anywhere

This has easily been the most powerful feature. In most data systems I’ve worked on, the database has either been a single point of failure or the single largest cost. Oftentimes, it’s both.

With DuckDB, I can run an independent database in every processing node: AWS Lambda can launch container images, so each function runs from an image we built with DuckDB included. This means I never have to worry about the database going down, and we can fully leverage serverless technologies for reliability. It’s not possible for a poorly written query to bring everyone else down.

It also allows us to run every customer’s pipeline completely independently, with full single tenancy through the dataflow. This process produces a collection of Parquet files.
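
As a rough sketch of that last step (the table, columns, and file path here are illustrative, not our actual schema), each pipeline run ends with a COPY that writes its results out as Parquet:

-- Hypothetical end-of-pipeline step: export computed results as Parquet.
-- Table, column, and file names are placeholders.
copy (
	select date, location, sunlight, energy, power
	from performance_data
) to 'performance_data.parquet' (format parquet);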

In our browser application, I can make use of DuckDB WASM to point at the Parquet files and then just run normal SQL queries to build up the interface. It has been a pleasure to not have to write code like:

// Group rows by date and location, summing sunlight and energy and taking the max power.
const aggregated = performance_data.reduce((acc, [date, location, sunlight, energy, power]) => {
  const key = `${date}-${location}`;
  if (!acc[key]) {
    acc[key] = {
      date,
      location,
      totalSunlight: 0,
      totalEnergy: 0,
      maxPower: 0
    };
  }
  acc[key].totalSunlight += sunlight;
  acc[key].totalEnergy += energy;
  acc[key].maxPower = Math.max(acc[key].maxPower, power);
  return acc;
}, {});

const result = Object.values(aggregated);

result.sort((a, b) => {
  const dateComparison = a.date.localeCompare(b.date);
  if (dateComparison !== 0) return dateComparison;
  return a.location.localeCompare(b.location);
});

and write:

select
	date
	, location
	, sum(sunlight)
	, sum(energy)
	, max(power)
from performance_data
group by all
order by date, location
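
In practice, the browser-side query points straight at the published Parquet files. A minimal sketch of that, with a placeholder URL, looks like:

-- Query the published Parquet files directly over HTTP; the URL is a placeholder.
select
	date
	, location
	, sum(sunlight)
	, sum(energy)
	, max(power)
from read_parquet('https://example.com/data/performance_data.parquet')
group by all
order by date, location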

Ad-hoc data processing

Closely related to the above, the whole database is a single file. This is especially useful when running experiments with AI or other advanced query features. Since everyone has their own version of the database, there’s no risk of a gnarly query bringing everything down. Just copy over the latest *.duckdb files, then hack and join away.
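
As a small sketch (the file, alias, and table names are made up), a copied snapshot can be attached read-only next to a local scratch database:

-- Attach a copied snapshot read-only alongside the default local database.
-- File, alias, and table names are illustrative.
attach 'latest_snapshot.duckdb' as snapshot (read_only);
select count(*) from snapshot.performance_data;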

It’s “immutable”

Since we point DuckDB at files, the database is essentially immutable. Sure, a clever user could figure out how to get at the db object, or exploit some sloppy coding, and execute DROP TABLE very_important_table. And good for them: they get their very own custom database, right up until they hit refresh in the browser.

Powerful time bucketing and statistical functions

At its heart, DuckDB is an analytical engine, so it comes out of the box with a lot of the functions that you need.

A few examples:

  • time_bucket groups data into standard time increments. Very useful when your data comes in an assortment of time increments (see the sketch after this list).
  • quantile_cont calculates percentiles directly in the database (also shown below).
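
A quick sketch of both, against a hypothetical performance_data table with a reading_time column:

-- Roll raw readings up into 15-minute buckets and compute a total energy and
-- median power per bucket; the table, columns, and interval are illustrative.
select
	time_bucket(interval '15 minutes', reading_time) as bucket
	, sum(energy) as total_energy
	, quantile_cont(power, 0.5) as median_power
from performance_data
group by all
order by bucket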

Timezone support

Databases and times are a little funky. The timezone function makes conversions a little simpler from a syntactic perspective.
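
For instance (the column name here is assumed), shifting timestamps into a local timezone for display might look like:

-- Shift a timestamp into a named timezone (uses DuckDB's ICU timezone support);
-- the column name is illustrative.
select
	timezone('America/New_York', reading_time) as local_time
from performance_data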

The drawbacks

Everything in software is a tradeoff and DuckDB is no different. There are serious drawbacks that make it unusable for a lot of use cases.

The largest, by far, is that you can only have a single connection to a writable database. Our architecture is specifically designed to ensure that there is only ever one writer, which essentially means we’ve implemented our own synchronization primitives. We also end up making copies and moving the databases around frequently to work around this limitation. For smaller data sizes this is effectively free, especially compared to standing up a full-featured, fault-tolerant traditional DB. There is a size where the latency and bandwidth costs start to become impractical and this advantage is negated.

This also means our primary use cases have to be read-only. We’ve leveraged that as a feature, but our setup would fall apart if our value prop were real-time information rather than historical and predictive analytics on relatively slow-moving data. We knew our usage pattern and designed around it.

Similarly, pointing DuckDB WASM in our web application at Parquet files vastly simplified the deployment of our MVP. The tradeoff is that any row-level security is totally out the window. We are also completely at the mercy of the user’s internet connection for latency.

A minor issue is that DuckDB is still relatively niche and not everything is done. Within Ruby, reading and writing JSON takes an additional step. The WASM implementation is still experimental. We haven’t hit anything major, but have found that bigint- and datetime-related bindings seem to be temperamental and require revisiting after a version upgrade.

Final thoughts

We were able to use DuckDB’s benefits while avoiding the drawbacks to quickly test out features and deliver value for customers. There may be a point in the future where we have to move to a more traditional database and replace parts of our stack with more traditional-looking components. Even then, with so much of our data in CSV-style formats, the processing power of DuckDB will still give it a place.

Most customer problems do not look like ‘data-geddon.’ Any tool that can reduce mental overhead and increase focus on the customer problem is invaluable when you are trying to just get something off the ground. If we outgrow our database, that’s great. It means we managed to get the first set of problems right.

Original image is CC-licensed [original source]

Tejus Parikh

I'm a software engineer that writes occasionally about building software, software culture, and tech adjacent hobbies. If you want to get in touch, send me an email at [my_first_name]@tejusparikh.com.