I’ve known about DBT for quite some time but never had time to take a deeper look. I went through their tutorial and really liked the concept. However, I found it quite limiting: it can’t easily configure many database objects like primary or foreign keys, indexes, or serial/identity columns, and it supports a relatively small set of database engines (just a few major cloud data warehouses).
I thought that the DBT concept deserved more flexibility, so I started a little pet project that I’ve named dbd ;) The idea is basically the same: use a hierarchical structure of directories and files to define database objects (schemas, tables, views, indexes, constraints, etc.).
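For example, a minimal dbd model directory might look like this (a simplified sketch; the file names and layout details are illustrative):

```
model/
└── covid_cz/            <- a directory maps to a database schema
    ├── country.csv      <- a data file creates and loads a table
    ├── daily_cases.ref  <- a list of URLs of remote data files
    └── weekly_trend.sql <- a SELECT whose result is stored as a table
```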
I used dbd for creating a small analytics project that tracks COVID-19 trends in the Czech Republic (I live in Prague).
Here is what I did.
First, I loaded some lookup tables from data files into my Postgres database. I mostly use CSV, but I’ve also implemented support for JSON, Excel, and Parquet. This was as easy as copying the files to the dbd model directory and executing the tool.
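For instance, dropping a hypothetical country.csv lookup file like the one below into the model directory creates and populates a table named after the file:

```
code,name
CZ,Czech Republic
SK,Slovakia
DE,Germany
```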
Then I loaded COVID data from the Czech Ministry of Health site (refreshed daily). dbd allows me to stream online data files directly to my database. Again, the configuration is as easy as copying a file with a list of URLs to the dbd model directory.
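Such a URL-list file is just plain text with one URL per line. The sketch below shows what a hypothetical daily_cases.ref file could contain; the Ministry of Health endpoint is one example of its daily-refreshed CSV feeds:

```
https://onemocneni-aktualne.mzcr.cz/api/v2/covid-19/nakazeni-vyleceni-umrti-testy.csv
```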
Finally, I created a few ELT transformations: SQL files with SELECT statements that dbd executes, storing the results in the database.
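For example, a transformation file like the hypothetical weekly_trend.sql below rolls the daily feed up into weekly numbers (the table and column names are illustrative):

```sql
-- weekly_trend.sql: dbd stores the result of this SELECT as a table
SELECT
    date_trunc('week', report_date) AS week,
    sum(new_cases)                  AS new_cases,
    sum(new_deaths)                 AS new_deaths
FROM covid_cz.daily_cases
GROUP BY 1
ORDER BY 1;
```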
I execute this shell one-liner to load and transform fresh COVID data every day:
dbd run covid_cz
Then I created a nice analytics dashboard using GoodData.CN. You can read a short 5-minute how-to tutorial that shows the detailed steps.
I’ve shared the dbd code in this GitHub repo under a very permissive BSD license in case you want to try it out.
Let me know if you find it useful!