Cobbled-together data engineering with Ruby, SQLite3, and reveal.js

I needed to understand a sizeable amount of data. I decided to leverage the tools I already knew instead of learning a full data-processing framework. This was my approach.

Posted by Tejus Parikh on January 21, 2024

The blog post momentum died pretty hard pretty quickly, didn’t it? I guess one of the major challenges of blogging about technical topics is that, unless you are an actual researcher, when you are doing things that are potentially blog-worthy you don’t actually have the time to write about them. Though in today’s publish-or-perish world it might just be better to write about things even if you don’t have anything great to talk about. This might be one of those posts.

I’ve been really busy because, while wrapping up teaching, I decided to co-found a company (we’ve been so busy, we haven’t even had time to update LinkedIn). Our early days have been focused on gathering and analyzing proprietary datasets, getting them in front of experts, and gathering feedback. After all, one of the fastest ways to a right answer is “to be wrong on the internet.” Being wrong in a conference room is a very close second. So I’ve found myself having to parse, connect, and analyze data for an industry that I’m just learning about. And I had to do it quickly.

The “right way” to do it would be to use tools like jq and pandas like a real data engineer. Unfortunately, I don’t know any of those. However, I do know lots of other things, so maybe I could leverage them to get the data I needed and put it into a format presentable to a potential customer.

I ended up building my own data-engineering stack with:

SQLite3

I knew this was going to be the backbone of my analysis stack. I find SQL a much easier data-processing layer than map-reduce-style technologies, especially when you know which datasets relate to each other but are still unsure how. SQL is built for that use case.

I also didn’t want to invest in a real server of any kind, and I wanted the ability to blow away the whole database when I realized I had gone down the wrong path: rm data.sqlite3 && bin/reload. It’s very easy to recover from an error.
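The reload script itself isn’t shown here, but the shape is simple: recreate the schema, then re-import the raw files. A minimal sketch of what a script like that might look like, assuming the sqlite3 and csv gems (the table, columns, and paths are purely illustrative):

#!/usr/bin/env ruby
require "csv"
require "sqlite3"

# Recreate the database from scratch: define the schema, then re-import
# every raw CSV. Table, column, and path names are illustrative.
db = SQLite3::Database.new("data.sqlite3")

db.execute(<<~SQL)
  CREATE TABLE IF NOT EXISTS line_items (
    vendor TEXT,
    description TEXT,
    amount REAL
  )
SQL

Dir.glob("./raw/*.csv").each do |path|
  CSV.foreach(path, headers: true) do |row|
    db.execute(
      "INSERT INTO line_items (vendor, description, amount) VALUES (?, ?, ?)",
      [File.basename(path, ".csv"), row["description"], row["amount"].to_f]
    )
  end
end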

SQLite’s support for different output formats made it especially useful for this work. It let me dump JSON, HTML tables, and CSVs that I could use almost directly in the presentations I was creating. You just have to set .mode json along with .output <filename> and you are good to go.

Rainbow CSV & Numbers

I knew the end result was SQLite, but I still had a lot of files that needed to make their way into the database.

I had a few different data-processing tasks. The files that were already CSVs all contained the same kind of information, but with different labels and structures. This required figuring out the mappings, and I couldn’t find a better tool than the Numbers app that comes with a Mac. These files were large, with dozens of columns, and Google Sheets was unresponsive. The native Mac app opened the CSVs quickly, navigating them was responsive, and the clear treatment of headers made it a lot easier to find the relevant information.

Two additional features of Numbers came in handy. Sometimes it’s easier to view the data as rows instead of columns, and Numbers includes a “Transpose Rows and Columns” feature. It also provides an automatic roll-up of highlighted cells, which was nice when investigating which fields summed up into which total.

As fast as Numbers is, it is still far from instantaneous. Vim is a lot faster for a quick double-check of the data, as well as for quick manipulations like changing one string to another. However, I am not a machine (remember, this is a GPT-free space), and visually parsing columns of disparate sizes does not come naturally to me. This is where the Rainbow CSV plugin helps: every column gets its own color, which greatly enables visual scanning.

Ruby

For the actual ETL, I just used Ruby. The CSV library has a collection of useful features, including mechanisms to skip rows and translate data. I used the former to automatically skip front-matter that wasn’t relevant to me, and the latter to downcase all headers and snake_case them into column names. These allowed me to quickly map the majority of the data and normalize out small vendor differences.
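A minimal sketch of the kind of call I mean (the file name, front-matter pattern, and header names below are made up for illustration):

require "csv"

# Normalize headers into column names: "Total Amount" -> "total_amount"
snake_case = ->(header) { header.strip.downcase.gsub(/[^a-z0-9]+/, "_") }

rows = CSV.read(
  "vendor_a_export.csv",
  headers: true,
  skip_lines: /\A(Report|Generated)/,  # drop front-matter lines above the header row
  header_converters: snake_case
)

rows.first["total_amount"]  # fields are now addressable by the normalized names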

Ruby’s functional nature also made it easy to swap these out for the few non-conforming files. The language is good at keeping simple things simple, and I didn’t need much beyond map, each, and each_with_object. No ETL is completely without a few hacks, so having a full language was very useful.
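For example, rolling rows from vendors that label the same fields differently up into a single total is one small each_with_object block (the field names and numbers are illustrative):

# Fold rows with vendor-specific labels into totals keyed by item.
rows = [
  { "item_description" => "Widget", "total_amount" => "12.50" },
  { "description"      => "Widget", "amount"       => "3.25"  }
]

totals = rows.each_with_object(Hash.new(0)) do |row, acc|
  description = row["item_description"] || row["description"]
  acc[description] += (row["total_amount"] || row["amount"]).to_f
end
# totals => {"Widget" => 15.75}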

Data Presentation with reveal.js and ECharts

As mentioned before, I’m not very good at reading raw data. After a few false starts copying and pasting data into Numbers to draw graphs, I decided to build myself a quick data-visualization scaffold with Reveal.js and Apache ECharts. With the scaffold in place, I leveraged SQLite’s JSON output functionality to dump the data directly in the format the charts I built required. One complication was that, since I wanted it to be able to run from a file:// URL, I needed to load the data in via a script tag, and it was a lot easier to get the outcome I wanted if the JSON was assigned to a variable.

To achieve this, I wrote a wrapper that utilized sed to add the variable. A variation of the script follows:

#!/bin/bash

PUDDLE_LOCATION=${PUDDLE_LOCATION:-./data/db.sqlite3}

echo "Using Puddle ${PUDDLE_LOCATION}"

execute_sql() {
    QUERY=$1
    VAR_NAME=$2
    OUTPUT_FILE=./output/${QUERY}_data.js

    echo "RUNNING ${QUERY}"

    # Runs the query in QUERY, outputs JSON in OUTPUT_FILE
    echo ".mode json
.output ${OUTPUT_FILE}
.read ./queries/${QUERY}.sql
    " | sqlite3 $PUDDLE_LOCATION

    # Prepend a variable assignment so the visualization scaffold can
    # reference the JSON once it is loaded via a <script> tag. The
    # assignment stays on one line because BSD sed does not expand \n
    # in the replacement text.
    sed -i '' "1 s/^/const ${VAR_NAME} = /" "${OUTPUT_FILE}"
}

rm -rf ./output/*.js

execute_sql 'analysis_1' 'ANALYSIS_1_DATA'
execute_sql 'analysis_2' 'ANALYSIS_2_DATA'

Conclusions

This is a lot of work to avoid learning a tool that does it all. However, learning tools requires brain cycles, and I needed those cycles to learn the data.

With the small exception of sed, these are tools that I’ve used before and understand reasonably well. This gave me more time to spend analyzing, testing theories, and finding potentially unique value in the information.

This approach also had the added benefit of providing artifacts for collaboration. If someone wants to crunch the numbers, I just send them the SQLite file. The same goes for the presentation, which can also be shared through a source-control system like GitHub.

I wouldn’t necessarily roll this system into production. But for building enough understanding to figure out what the production system might be, it worked out really well.

Original image is CC-licensed [original source]

Tejus Parikh

I'm a software engineer that writes occasionally about building software, software culture, and tech adjacent hobbies. If you want to get in touch, send me an email at [my_first_name]@tejusparikh.com.