Data Analyst’s Introduction to dbt
So you’re a data analyst and you’ve heard people talking about dbt, but don’t really know what it’s all about? If so, this article is for you. It explains how we at Raybeam use dbt and covers what a data analyst who is new to one of our projects would need to know. Hopefully it will give you a practical insight into how dbt could help you in your day to day work.
This article is written with the data analyst audience in mind, for someone who primarily writes complex SQL in their role. It is worth bearing in mind that there is infrastructure and setup required to run dbt that we don’t consider in this article, for a few reasons:
A data analyst shouldn’t need to worry about it! That’s one of the joys of dbt.
There are several ways in which you might want to run dbt, the main choice being between dbt Cloud (the hosted dbt service) and dbt Core (running the setup yourself).
Finally, this would warrant an article in itself. Watch this space!
Here’s our outline:
As a data analyst, I keep hearing about dbt. Why should I care?
What’s a typical set up in a dbt project?
OK, I am intrigued. Tell me more
How does it work?
Where is the code kept?
CTEs and Ephemeral models
The basics of using dbt
Getting started developing SQL with dbt
I want to change the SQL in an existing table
I want to add a new table. Its source tables already exist in the database
Tools - VSCode
Tests
Why do we need tests?
What does a test look like?
When do I need to add tests?
How do I add tests?
Making your own tests
Macros
Documentation and viewing the whole project
Documenting your models
In summary
Resources to learn more
As a data analyst, I keep hearing about dbt. Why should I care?
As a data analyst, dbt is made for you. Although it is often promoted by data engineering teams, it’s developed with the data analyst in mind.
dbt makes it very easy for data analysts to manage changes to their data models without having to get involved with data engineering processes. Overall, it means less work setting up data engineering processes such as organizing DAGs and scheduling jobs. This means less work for the data engineers too!
In our projects, we use dbt for scheduling and organizing the SQL scripts we’ve written to create tables in our data model.
Without dbt, if we want to add a new table to our data warehouse, first we would write the SQL script to create the table. But then we would need to figure out when in the process to run this CREATE script. And that would involve setting up a DAG and understanding where in our data model the processing of our table sits.
With dbt, we let its framework figure all of this out for us. All we need to do is write the SELECT statement that defines our table, and use some dbt-specific tagging to reference other tables in the process.
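To make that concrete, here is a rough sketch of what a complete model file can look like. The file, table and column names are invented for this illustration; the point is that the file contains only a SELECT statement, plus the reference tagging described later in this article.

-- models/semantic/dim_active_users.sql (hypothetical model)
-- dbt wraps this SELECT in the appropriate CREATE statement at run time
SELECT
    user_id,
    MIN(first_seen_at) AS first_seen_at
FROM {{ ref('raw_users') }}   -- reference to another (made-up) model
WHERE is_test_user = FALSE
GROUP BY user_id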
dbt also provides other useful components in its framework. Testing, documentation and macro functionality are all available, and we will talk about these later in the article. But for our projects, this referencing, lineage and automatic creation of tables has been the core benefit, and I view it as the key value of dbt.
What’s a typical set up in a dbt project?
dbt is most useful when you are setting up data pipelines or some kind of rerunnable, scheduled SQL processing. It's less useful if you are just writing one-off SQL scripts, though be careful about assuming your work really is just one-off scripts. If you're writing the same thing again and again, using similar processes, then dbt can help there too!
The below diagram describes a typical project where we’ve used dbt. For this client, we were setting up a new Snowflake data warehouse to consolidate and modernize their data stack. This data warehouse was used initially to drive Power BI dashboards but also to answer ad-hoc questions from around the organization and to provide direct access to curated data sets for another area of the organization.
Our database schemas were set up as shown below:
As you can see, we use four schemas for our data (a sketch of how this layout might be configured in dbt follows the list):
Ingress: This contains the data as it arrives from the source. We don’t modify it in any way.
Raw: We process the data from Ingress: we remove any test or otherwise useless data, deduplicate redundant records, and clean and process the data without applying business logic (for example, we don't uppercase all strings at this stage, since that counts as a business rule). Finally, we write those records into the Raw schema.
Semantic: From the Raw schema, we construct more business appropriate tables and relationships. Some of these we want to display in dashboards, and/or send to our end users. The SQL that we write between Raw and Semantic contains all our business logic.
Reporting: Although the Semantic schema contains business objects, these are still at the lowest level of granularity. For reporting or performance purposes we often want to provide data that is aggregated at a higher level. We write this data to the reporting layer.
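For the curious, a layout like this is typically expressed in the project's dbt_project.yml by pointing each model subdirectory at a custom schema. The snippet below is a simplified, hedged sketch rather than our actual configuration; the project name is invented and most other settings are omitted.

# dbt_project.yml (simplified, hypothetical project name)
name: warehouse_project
models:
  warehouse_project:
    ingress:
      +schema: ingress      # models under models/ingress build into the ingress schema
    raw:
      +schema: raw
    semantic:
      +schema: semantic
    reporting:
      +schema: reporting
# note: by default dbt appends custom schemas to the target schema name;
# this behaviour can be changed by overriding the generate_schema_name macro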
OK, I am intrigued. Tell me more
How does it work?
dbt uses Jinja, a templating tool, to mark up pieces of your SQL scripts. For example, when creating a model, if you want to reference another table, you use the ref markup shown below:
SELECT DISTINCT
    user_id,
    contract_id
FROM {{ ref('dim_users_contracts') }}
WHERE contract_type = 'Internal'
This has a couple of key advantages over just hard coding the table name and schema:
dbt uses these references to understand which models depend on which, and orders your data pipeline based on those dependencies. It will also give you an error if you have circular dependencies between your models.
If you want to run your models in different environments, dbt can change which schema (or database) each table is built in, depending on your configuration (see the sketch after this list).
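To illustrate the environments point, dbt reads its connection details from a profiles.yml file that can define several targets. The profile below is a hedged sketch rather than our real setup: the profile name, account details and databases are placeholders, and the exact keys depend on which warehouse adapter you use.

# profiles.yml (hypothetical Snowflake profile, simplified)
warehouse_project:
  target: dev                  # default target; switch with e.g. dbt run --target prod
  outputs:
    dev:
      type: snowflake
      account: <your_account>
      user: <your_user>
      password: <your_password>
      database: ANALYTICS_DEV  # development builds land here
      schema: dbt_dev
      warehouse: TRANSFORM_WH
    prod:
      type: snowflake
      account: <your_account>
      user: <service_user>
      password: <service_password>
      database: ANALYTICS      # production builds land here
      schema: raw
      warehouse: TRANSFORM_WH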
We can, and do, use the Jinja functionality for a whole lot of other configuration. Another use case relevant to a data analyst is macros; more on these later. But references, as in the example above, are likely the main feature you will use to get started.
Where is the code kept?
All dbt code is contained in the dags/dbt directory in the repository, in the below structure:
|- models
This is the main content of our data model. Each SQL file in here is a model, which you can think of as a table. Each model goes in the subdirectory for the schema it lives in:
|- ingress
|- raw
|- semantic
|- reporting
|- tests
"tests" contains tests for the data (more on this later). Tests can be used to check for anything from simple uniqueness constraints, to whether a table is empty, to evaluating the results of complex custom SQL.
|- macros
Contains "functions" that generate SQL - use this to extract out pieces of SQL logic that would otherwise be repeated across the codebase.
CTEs and Ephemeral models
When writing SQL code, we often make use of CTEs.
CTEs are GREAT, but we often find ourselves reusing the same ones in different models. To share this logic between models in dbt, you can create an ephemeral model in the same way as you would a normal model file. This model is then never written to the database but can be referenced from other places in exactly the same way as a model that is.
Behind the scenes, dbt injects this model as a CTE into the compiled SQL of any model that references it. Note that you can still use regular CTEs in combination with these, and you should, if your CTE is not useful outside of the model you are working on.
In order to define a model as ephemeral, add:
{{ config( materialized='ephemeral' ) }}
at the top of your file and dbt will take care of it for you. (Spoiler: The above is Jinja templating too!)
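As a hedged example (the file and column names are invented, reusing model names from elsewhere in this article), an ephemeral model and a model that consumes it might look like this:

-- models/semantic/internal_users.sql (hypothetical ephemeral model)
{{ config(materialized='ephemeral') }}
SELECT DISTINCT user_id
FROM {{ ref('dim_users_contracts') }}
WHERE contract_type = 'Internal'

-- models/reporting/internal_user_page_views.sql (hypothetical consumer)
-- at compile time, dbt injects the ephemeral model above as a CTE inside this query
SELECT
    user_id,
    COUNT(*) AS page_views
FROM {{ ref('fact_web_events') }}
WHERE user_id IN (SELECT user_id FROM {{ ref('internal_users') }})
GROUP BY user_id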
The basics of using dbt
We’ve had a look at simple dbt model files and the structure of the code base.
But once we've created our templated dbt SQL files, we need to convert those source files into the SQL that will actually run against the database. To generate this SQL, run dbt compile from the command line. This takes the source model files and source test files and compiles them under the target/compiled directory.
Note that this doesn't run anything in your database! Nor does it check that your SQL is valid, as a regular code compiler would. You've just taken the templated SQL (or test configuration files) and converted them to actual SQL that can be run against the database.
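For instance, the templated query shown earlier might compile to something like the following. The database and schema names are assumptions about the target configuration, so what you see under target/compiled will differ.

-- illustrative contents of the compiled file under target/compiled
SELECT DISTINCT
    user_id,
    contract_id
FROM ANALYTICS.SEMANTIC.dim_users_contracts   -- the ref() call resolved to a concrete table
WHERE contract_type = 'Internal'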
So, if you want to test the SQL you have written in templated format, you have a choice:
You can load the compiled SQL file from under the target/compiled directory into your database query tool and run it there. The compiled file contains the SELECT statement that will be used to create your model when dbt runs, so running it shows you what your table will look like without updating the database itself.
You can run dbt run --select <model_name>. This does update the database with your model (assuming it is not ephemeral), because it uses dbt's framework to wrap the compiled SELECT statement with the relevant DDL to create your table.
Getting started developing SQL with dbt
My general workflow varies with what I am doing. If I just want to play around with the data and quickly modify queries, I do this in the built-in editor for my database, e.g. the worksheet editor in Snowflake. Only when I want to put something into the data warehouse as a proper change do I follow the processes below.
I want to change the SQL in an existing table
Open the relevant SQL file in your code editor (I use VSCode). Make the changes to the file so that the final SELECT statement will produce what you want to put in your table.
To test your changes:
Run dbt compile.
Load the compiled SQL file under `targets/ci/<schema>/<filename>.sql` into the tool you use to run SQL against your database.
Run the SQL, and see if it produces the correct output. If not, alter your source file and repeat this process.
Once you are happy the output is correct, you can run your changes into the development database using dbt run --select <name of your model>. If you want to update your model and all the downstream models, run dbt run --select <name of your model>+. For example:
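On the command line that looks something like the below, with a placeholder model name standing in for yours:

# compile the templated SQL into plain SQL under target/compiled
dbt compile

# build (or rebuild) a single model in the development database
dbt run --select dim_contact

# build that model plus everything downstream of it
dbt run --select dim_contact+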
Consider adding tests for your change (see below), and don’t forget to extract any reusable or complex functionality into macros (also see below) or ephemeral models (for CTEs).
I want to add a new table. Its source tables already exist in the database
If it’s complicated, you can begin writing your SQL in raw SQL format and test it and develop it iteratively in the same way you would develop any ad hoc SQL query.
But when you are happy with it, or if it’s straightforward, create a new .sql file for your model, under the folder corresponding to the schema your table should appear in. The name of the sql file is the name of the table that will be created, within the schema in which it’s located.
The last statement in your SQL file should be a SELECT statement that produces the data you want in your model. Do not add any CREATE TABLE statements or similar, because dbt takes care of this for you. Don't forget that any other tables/models you reference should use the {{ ref('<model_name>') }} syntax rather than hard-coded names. A sketch of such a file is below.
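As a hedged illustration (the model name, columns and join are all made up, reusing model names from earlier in the article), a new reporting model might look like this:

-- models/reporting/rpt_contract_page_views.sql (hypothetical new model)
{{ config(materialized='table') }}

SELECT
    c.contract_id,
    COUNT(*) AS page_views
FROM {{ ref('fact_web_events') }} e
INNER JOIN {{ ref('dim_users_contracts') }} c
    ON c.user_id = e.user_id
GROUP BY c.contract_id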
Make sure to extract any CTEs that will be useful elsewhere into separate ephemeral models.
You should also add tests for your table, as appropriate (see below), and you can also consider extracting any complex functionality, or functionality that is likely to be reused into macros.
Tools - VSCode
What tools you use to edit your SQL is obviously up to you, but I have had great success with the "dbt Power User" VS Code plugin, which lets you browse your data model quickly (seeing parent and child models) and easily run and compile your models without having to leave VS Code.
Tests
Why do we need tests?
We add tests to:
Type I: Check the source data we are importing for freshness (that new data arrives as expected) and consistency (primary keys are unique, etc.)
Type II: Check that our output tables contain the meaningful data that we expect. If we accidentally change business logic and get it wrong, then hopefully our Type II tests break.
What does a test look like?
For each model in the repository, we add a .yml file to the models/tests directory. These files contain standard tests such as:
Checking for the number of rows in a table
Checking a column is unique
Checking the freshness of the data in ingress tables (a sketch of this one follows the list)
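Freshness checks work slightly differently from the other tests: they are configured against dbt sources rather than models, and are run with dbt source freshness. The snippet below is a hedged sketch; the source, table and loaded_at_field names are assumptions and the thresholds are purely illustrative.

# an illustrative source freshness configuration (any .yml file under models/ will do)
version: 2
sources:
  - name: ingress
    schema: ingress
    tables:
      - name: web_events_raw           # hypothetical landing table
        loaded_at_field: _loaded_at    # column recording when each row arrived
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 2, period: day}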
When do I need to add tests?
You should always be thinking about adding tests when you are adding new functionality/changing the code base.
That's not to say you always need to add tests when you make a change, but if it's a large change, and/or if you are adding a new table, you should add tests. Here are some things to think about (a sketch showing how several of these map to test configuration follows the list):
How many rows should the table have? If it’s more/less than a particular number, you can add a test.
Which column(s) should be unique? You can add a test for this.
Are there columns that should take only specific values, or match a certain regex?
Are there any date columns that refer to incoming data? Can we check that the latest date in such a column is within the last week, perhaps, or matches a value in a control table somewhere?
Does the table have foreign key relationships? E.g. should every user_id in this table also be present in the dim_users table? If so, add a test!
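Several of those checks map directly onto tests that already exist in dbt itself, dbt_utils or dbt_expectations, so they usually need configuration rather than code. The model and column names below are invented, but the test names are real ones from those packages:

version: 2
models:
  - name: fact_orders                    # hypothetical model
    tests:
      # row count should not fall below an expected floor
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000
    columns:
      - name: order_status
        tests:
          # column restricted to a fixed set of values
          - accepted_values:
              values: ['open', 'shipped', 'cancelled']
      - name: user_id
        tests:
          # "foreign key" style check against another model
          - relationships:
              to: ref('dim_users')
              field: user_id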
How do I add tests?
We add tests for our models in a file called <model_name>.yml under models/tests. All models have a test file. Many of the above examples can be covered using dbt's standard tests, dbt_utils tests or the test package we already use, dbt_expectations. That means you don't need to write any code to set up a test; you can just add the relevant configuration to the yml file.
For example, to check whether a column is unique and not null, a test looks like this:
version: 2
models:
  - name: dim_contact
    columns:
      - name: contact_key
        tests:
          - not_null
          - unique
Or to check that multiple columns are unique, like this:
version: 2
models:
  - name: dim_users
    tests:
      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: ["user_id", "user_id_sfdc"]
Making your own tests
If the test you want isn’t there, you can create your own test. You do this by adding your test under tests/generic.
A custom test uses jinja templating as shown below. You can pass in parameters, reference the table and column you’re calling from, and reference other tables too.
The example below tests that the model we are testing (`model`) has an exact number of events (`value`) for the distinct users that we find for the given `contract_id_short` and `year`.
A test can be on the column of a model, or at the model level.
{% test web_views_year(model, contracts, contract_id_short, year, value) %}
WITH users AS (
    SELECT DISTINCT user_id
    FROM {{ contracts }} cwu
    WHERE cwu.contract_id_short = '{{ contract_id_short }}'
),
page_views AS (
    SELECT COUNT(*) AS pvs
    FROM {{ model }} we
    INNER JOIN users USING (user_id)
    WHERE YEAR(we.date_time) = {{ year }}
)
SELECT * FROM page_views WHERE pvs != {{ value }}
{% endtest %}
Once you have written your custom test, you call it like any other:
version: 2
models:
  - name: fact_registered_user_page_views
    tests:
      - web_views_year:
          contracts: "ref('dim_contracts_with_users')"
          contract_id_short: '004590'
          year: 2021
          value: 12864
      - web_views_year:
          contracts: "ref('dim_contracts_with_users')"
          contract_id_short: '664087'
          year: 2021
          value: 286471
Note that the first parameter to the test, `model`, is not supplied in the yml file; dbt passes the model being tested in automatically.
Macros
We use macros to extract pieces of SQL that we use repeatedly throughout the codebase. For example, we currently format last names in the same way in several places. Extracting the code into a macro means that if we want to change the logic of how we do this, we can do it in one place.
Macros live under macros, and once you have defined one, you can call it by simply inserting {{ macroname(param1, param2) }}, etc. in your SQL code.
Create a macro file with `.sql` extension under `macros`. We usually organize our macros by the schema they run in. You can have more than one macro in a single file, and we tend to group them by functionality - for example a set of macros that refers to a particular business problem all live within one file.
Example - Format last name
{% macro format_last_name(last_name) %}
{% set last_names_not_ilike = ["d''%", "da %", "de %", "del %", "den %", "di %", "du %", "ter %", "van %", "von %", "zu %"] %}
CASE
    WHEN ((SUBSTRING({{ last_name }}, 1, 1) != UPPER(SUBSTRING({{ last_name }}, 1, 1))
          {% for not_ilike in last_names_not_ilike %}
          AND {{ last_name }} NOT ILIKE '{{ not_ilike }}'
          {% endfor %})
      OR {{ last_name }} = UPPER({{ last_name }}))
    THEN INITCAP({{ last_name }})
    ELSE {{ last_name }}
END
{% endmacro %}
You can see here that this macro takes one parameter, which is the column.
This way you can apply this macro to any column you like (e.g. your name columns may be called different things in different tables)
It also has a list of last names to compare against and treat differently. You can see we put these in a list at the top and generate the SQL code through a loop.
In order to call the macro in your model code, you can simply call (for example):
{{format_last_name('employee_last_name')}}
There is no need to declare the file the macro is in, nor to import it in any way. (Because of this, all macro names must be unique.)
Note that you can reference dbt models within your macros, but not ephemeral models.
Documentation and viewing the whole project
dbt also makes it easy to document and explore your data model, by providing tools to generate and display documentation.
You can run the documentation locally by running:
dbt docs generate - creates an index.html file for your project, compiles it, and creates a json configuration file representing its objects
dbt docs serve - runs a webserver on port 8000 to serve your documentation
Go to http://localhost:8000 to view the documentation.
You should see something like the below which shows how your models depend on one another.
As you can see, even without documenting your models, dbt gives you something for free here: you can see all your models, some basic information about them, their columns (and types), and any tests that you have set up:
It also gives a very useful interactive lineage graph of your models:
On our project, we also have a GitHub Action that builds and deploys the documentation to GitHub Pages. This way we have openly accessible documentation that is guaranteed to be up to date with our code.
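If you want to set up something similar, a workflow along the lines of the sketch below is one way to do it. This is a hedged outline rather than our actual pipeline: the file name, the adapter, the credential handling and the publishing action are all assumptions to adapt to your own project (in particular, dbt docs generate needs to connect to your warehouse, which we gloss over here).

# .github/workflows/dbt-docs.yml (illustrative only)
name: Publish dbt docs
on:
  push:
    branches: [main]
jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - run: pip install dbt-snowflake        # adapter choice is an assumption
      - run: dbt docs generate                # requires warehouse credentials/profiles.yml, omitted here
      - uses: peaceiris/actions-gh-pages@v3   # third-party action that publishes to the gh-pages branch
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./target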
Documenting your models
We also document our models. We put the documentation of our tables and columns within our test YML files. This makes it easy to find the tests and documentation for a single model in one place.
To document anything in dbt, add a description tag to the element. You can either add the documentation inline (recommended for a short, one-off description), or use a "doc" tag as shown below (better if the description is longer, or likely to be reused in several models). In the latter case, the documentation itself lives in a separate ".md" file, as shown underneath.
version: 2
models:
  - name: fact_web_events
    description: "{{ doc('web_events_table') }}"
{% docs web_events_table %}
# fact_web_events

The `fact_web_events` table is our core model for on-site behavior. It contains all tagged events from the website. This table has several downstream tables for more specific use cases, such as individual page loads, web sessions, etc.
{% enddocs %}
Note that if you use a doc tag, it’s possible to include markdown content, and even images.
In summary
This document has been written with the data analyst in mind, for those users who would not expect to set up data engineering pipelines and detailed configuration of dbt itself.
We hope it has been useful to give you a taste of the advantages that dbt can offer to a data modeling project. We think it helps streamline your data warehouse setup, and makes SQL work more robust with tests, documentation and modularization.
If you want to read more about dbt, see the resources below. If you hope to read more about dbt from a data engineering perspective, you can expect more from us on this in the near future.
Resources to learn more
dbt best practices (from dbt)
dbt expectations testing package
dbt also maintains a particularly active and helpful Slack workspace if you have any questions.