
SQL and dbt: The future of modern data transformation


This article looks at how companies process data. It highlights both the long-established SQL (Structured Query Language) and the modern data-modeling tool dbt (Data Build Tool). SQL is a language for querying and transforming data in relational databases; dbt is a tool that streamlines data transformation in data warehouses by applying software engineering practices to analytics work.

Introduction

Let me tell you a story. It's about companies that have set themselves the goal of extracting valuable insights from their data. For decades, this has been a battle that is convoluted and challenging, but also promising and exciting. We have a wasteland that has been deserted over time, where battles for an uncertain future are raging, deciding important resource distributions and power structures. These battles require the most experienced commanders and the most flexible, adaptive fighters.

The battle in this story is for valuable corporate data hidden deep within historical data platforms. This is also a tribute to one of the most legendary generals, who has won countless glorious victories, as well as his novice and right-hand man: SQL and dbt.

1. Why are SQL and RDBMS anything but outdated?

SQL and relational database management systems (RDBMS) have been established on the market for many decades. However, that does not mean they are outdated—quite the contrary.

Over the decades, our wise commander has faced countless critics, rivals, and enemies, all of whom he has outlasted. This article summarizes the key points of a fundamental study by Google, which mentions that SQL was initially declared "obsolete," only to come back stronger than ever over the years.

However, this phenomenon is not limited to Google: Michael Stonebraker (Turing Award winner and creator of Postgres) and Andrew Pavlo report on it in this paper, which is also available as a presentation. They show that RDBMS and SQL absorb emerging technologies, principles, and ideas, and thereby continuously improve. One could say that our commander learns something from every adversary he defeats, becoming ever more experienced.

Over the decades, systems have been developed that have continuously matured and undergone immense further development. But it is not only RDBMS that have evolved; their lingua franca, SQL, has also been continuously improved. In BigQuery, for example, semi-structured data in JSON format can be processed natively. Machine learning workloads can also be expressed in BigQuery using SQL, and with the support of these ML functions, even unstructured data can be processed directly in BigQuery. The icing on the cake is that even exotic workloads such as geospatial processing are possible, all natively via SQL. That is why Google no longer refers to BigQuery as just a data warehouse, but as an "autonomous data and AI platform."
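As a brief, hedged illustration of native JSON support: JSON_VALUE is a real BigQuery function, but the table shop.orders and its JSON column payload are invented for this sketch.

```sql
-- Sketch: querying a native JSON column in BigQuery.
-- `shop.orders` and `payload` are illustrative names.
SELECT
  JSON_VALUE(payload, '$.customer.id')            AS customer_id,
  CAST(JSON_VALUE(payload, '$.total') AS NUMERIC) AS order_total
FROM shop.orders
WHERE JSON_VALUE(payload, '$.status') = 'shipped';
```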

This exciting project is a showcase for the wide range of operations that can be implemented with SQL. It involves implementing the classic game Tetris in SQL to demonstrate the Turing completeness of SQL. This is a gimmick that won't change the world. However, it illustrates that SQL is so flexible that it can be used to implement operations for which it was never designed. One can therefore imagine the possibilities that exist in the context for which SQL was actually designed, namely the transformation of relational data. I would even go so far as to say that there is no data transformation that cannot be expressed in SQL.

In addition to the gradual advances in language over many years, there have also been developments that have implemented or at least proposed fundamental changes and improvements. One example of this is the fundamental redesign of SQL syntax with pipes, which is already available in Databricks and BigQuery. Another fascinating idea is the proposal in this paper. It aims to do away with the long-standing fact that the result of an SQL query must be a single relation/table. Instead, an entire data model can be delivered as the result.
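To give a feel for the redesigned syntax, here is a sketch contrasting a conventional aggregation with its pipe-syntax counterpart. The |> operators follow the GoogleSQL pipe syntax; table and column names are invented.

```sql
-- Conventional syntax:
SELECT status, COUNT(*) AS order_count
FROM shop.orders
WHERE order_date >= '2024-01-01'
GROUP BY status;

-- Pipe syntax: the same query reads top-down, one step per operator.
FROM shop.orders
|> WHERE order_date >= '2024-01-01'
|> AGGREGATE COUNT(*) AS order_count GROUP BY status;
```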

Both aspects underscore the continuous efforts to innovate and improve SQL. Overall, the many years that SQL and RDBMS have remained in the industry can therefore be read not as a sign of aging, but as a sign of categorical superiority.

2. What weaknesses does the legendary army commander SQL have?

Data preparation projects are often viewed as a major one-time effort that can be managed by investing enough time and resources in an initial analysis and design phase. In practice, however, some aspects are never known or considered in advance, and requirements change over time, so this approach usually falls short of an iterative one. This insight is widely accepted in general software development, but not always in analytical use cases. The same is reflected in the tooling for SQL-based development, which often makes it difficult to apply established software development practices to data processing. These include, above all, practices from DevOps (version control, continuous integration and deployment across multiple environments, test automation) and software engineering (separation of concerns, the single responsibility principle, reuse (DRY), abstraction and encapsulation).

Although there is an official standard for SQL that aims to ensure that the same SQL code is compatible across different platforms, this standard is not implemented 1:1 by any platform provider. As a result, different platforms have their own nuances, resulting in different SQL dialects. This means that there is no guarantee that the same SQL query can be executed on different platforms.

In addition, although some transformations are quite simple to express in SQL, they can only be represented with many repetitive steps. It is also said that a lot of "boilerplate code" is necessary, i.e., a lot of code has to be written to implement minor functionality. Such language features are therefore described as verbose. The following example shows how verbose and repetitive common transformations can be in SQL:
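A minimal sketch of such a pattern, with illustrative table and column names: cast the key columns to strings, guard against NULLs, concatenate, and hash.

```sql
-- Verbose surrogate-key generation, repeated verbatim wherever a
-- key is needed (table/column names are illustrative).
SELECT
  MD5(
    COALESCE(CAST(id  AS VARCHAR), '') || '-' ||
    COALESCE(CAST(sku AS VARCHAR), '')
  ) AS supply_sk,
  *
FROM raw_supplies;
```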

A unique hash key ("supply_sk") is generated from the "id" and "sku" attributes of a table. Besides being verbose, this approach is error-prone and most likely inconsistent across different applications. Imagine, for example, having to change the hashing algorithm at every point in a project where a surrogate key is generated this way; that would consume an unnecessary amount of resources.

3. Who is this young novice called dbt?

To clarify this question, an excursus on data pipelines and ETL or ELT processes is necessary. Both terms describe a similar purpose, namely to manage the following situation:

The data set D in system S has at least one of two problems: Either D in its current form is difficult or impossible to use for a specific purpose, i.e., you want to change the structure from D to D'. Or D is simply in the wrong place or in the wrong system, i.e., you want to change S. Or both. Or, in short: The "how" (structure) or the "where" (system) of the data is not suitable for a specific use case.

For this reason, data is extracted from one system (Extract = E), loaded into another system (Load = L) and transformed/restructured if necessary (Transform = T). Depending on when the transformation is carried out, the acronyms ETL or ELT are used. These terms tend to be used when data is actually to be transferred from one system to another. But what if the source and target systems are the same system or no transformation is carried out at all?

The term "data pipeline" is a generalization of the ETL process. It simply describes an automated process that transfers data from a source to a destination. Data transformation is optional. The source and destination can also be the same system. Another aspect that is usually used to differentiate between the two is that ETL processes are generally batch-based, while data pipelines can also operate in real time or be event-based.

4. But how and where does dbt fit into this context?

In short, dbt is a Python-based, open-source tool for transforming D to D' within an RDBMS using SQL. But how exactly does it work?

An SQL query is simply a character string that must follow a certain syntax so that a specific system (RDBMS) can translate it into concrete processing steps.

This also applies to websites. They are always based on an HTML document, and a specific system (web browser) knows how to process and display it. However, most websites are not static, but have dynamic content, such as specific content for logged-in users, which can change or be updated during use (at runtime). So-called templating engines can be used for this purpose, which enable text-based document content to be generated programmatically. The most widely used engine in the Python context is Jinja.

This is the first special feature of dbt: dbt uses Jinja to generate SQL queries dynamically at runtime. To revisit the hash key example above, in the dbt context it condenses to the following:
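With the generate_surrogate_key macro from the widely used dbt-utils package, the hash key logic shrinks to a single call. The source reference and column names below are illustrative.

```sql
-- models/stg_supplies.sql: the macro expands at runtime into the
-- null-safe concat-and-hash SQL of the target platform's dialect.
SELECT
  {{ dbt_utils.generate_surrogate_key(['id', 'sku']) }} AS supply_sk,
  *
FROM {{ source('shop', 'raw_supplies') }}
```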

This dynamic content can be as complex as you like, ranging from simple functionalities to the creation of entire tables. These encapsulated functionalities can be reused in different places to avoid harmful code duplication (copy & paste). Another major advantage of dbt is its extensive ecosystem with numerous high-quality functionalities that are available out-of-the-box and collected in the dbt package hub.

Another core function of dbt builds on the fact that every SQL query can take any number of relations as input, but always produces a single relation as output. That output can in turn serve as input for another query.

This creates a directed acyclic graph (DAG) that represents the data flow. Each node in this graph is an SQL query; in dbt jargon, it is called a dbt model.

The result of each dbt model is represented as an artifact in the database (in a broader sense, a view or table). dbt resolves the model references into concrete database references at runtime.
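As a sketch of this mechanism (file, source, and column names are illustrative), two chained dbt models might look like this; the ref() call is what dbt uses to derive both the DAG edge and the concrete database reference:

```sql
-- models/stg_orders.sql: cleans up the raw source table.
SELECT id AS order_id, customer_id, amount_cents
FROM {{ source('shop', 'raw_orders') }}

-- models/customer_revenue.sql: depends on stg_orders via ref(),
-- so dbt builds stg_orders first and resolves the reference at runtime.
SELECT customer_id, SUM(amount_cents) AS revenue_cents
FROM {{ ref('stg_orders') }}
GROUP BY customer_id
```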

The DAG also determines the execution sequence of the SQL queries. Independent queries are processed in parallel to ensure the most efficient data flow possible. In this way, dbt also takes over the workflow orchestration.

However, dbt comprises not only models that become database artifacts, but also other concepts/resources (an overview of all dbt resource types can be found in the dbt documentation). At this point, we will only discuss the second most important one: dbt tests.

A major challenge in building data pipelines is maintaining adequate data and process quality. One of the most effective quality assurance measures in software development and related fields is automated testing. For data pipelines, this includes both tests of the implemented logic (code tests) and data quality tests, in which the data itself is checked for required properties. dbt supports both natively in the form of data_tests and unit_tests. These tests are attached to dbt models and executed after them, so the checks run every time the pipeline runs. This significantly improves quality transparency and enables more stable solutions suited to iterative, fast-paced development. For this reason, testing should be regarded as an essential core feature of dbt (tests as first-class citizens).
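A singular data test in dbt, for example, is simply a SQL file under tests/ whose query returns the rows that violate an expectation; the test passes when the result set is empty. A sketch with an illustrative model name:

```sql
-- tests/assert_no_negative_revenue.sql
-- Returns all rows that violate the expectation; an empty result
-- means the test passes.
SELECT customer_id, revenue_cents
FROM {{ ref('customer_revenue') }}
WHERE revenue_cents < 0
```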

5. Architecture and basic mode of operation of dbt

dbt-core defines the scope of features and configurations that can be implemented with dbt; this area is also called the authoring layer. To use dbt with a specific RDBMS, a so-called dbt adapter must exist for it, which defines how each functionality is implemented on that particular system. A list of available adapters can be found here. The goal is a platform-independent definition of SQL-based data pipelines. As mentioned above, different systems speak different SQL dialects, so a dbt model may not be executable on two different platforms. To address this, dbt offers what is known as adapter dispatch, which allows platform-dependent definitions to be resolved dynamically at runtime. Here is an example:
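A macro along the lines of the cents_to_dollars example from the dbt documentation; the default__, postgres__, and bigquery__ prefixes are dbt's dispatch naming convention, and the fabric__ variant here is an illustrative sketch:

```sql
{% macro cents_to_dollars(column_name) -%}
  {{ return(adapter.dispatch('cents_to_dollars')(column_name)) }}
{%- endmacro %}

{% macro default__cents_to_dollars(column_name) -%}
  ({{ column_name }} / 100)::numeric(16, 2)
{%- endmacro %}

{% macro postgres__cents_to_dollars(column_name) -%}
  ({{ column_name }}::numeric(16, 2) / 100)
{%- endmacro %}

{% macro bigquery__cents_to_dollars(column_name) %}
  round(cast(({{ column_name }} / 100) as numeric), 2)
{% endmacro %}

{% macro fabric__cents_to_dollars(column_name) %}
  round(cast({{ column_name }} as decimal(16, 2)) / 100, 2)
{% endmacro %}
```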

The "cents_to_dollars" macro can then be used directly in any dbt model. If the dbt workflow is executed on Postgres, BigQuery, or MS Fabric, the platform-specific implementation is used; for all other adapters/platforms, the default version applies.

6. What makes dbt the ideal right hand of SQL?

A good right hand knows about its master's weaknesses and compensates for them as far as possible. The above should have given you an idea of how dbt can eliminate or at least mitigate the described weaknesses of SQL.

dbt makes it possible to apply the established DevOps and software development practices mentioned above to SQL-based data processing, bringing it to a whole new level of maturity. In addition, dbt enables the easy development of reusable components that encapsulate verbose or complex logic as well as platform-specific language features and resolve them dynamically at runtime.

Below you will find a list of the core functions of dbt:

  • Dynamic SQL code generation at runtime
  • Workflow orchestration (DAG)
  • Creation of transparency through management of model dependencies (data lineage)
  • Integrated quality assurance mechanisms (data and unit tests as first-class citizens)
  • Use of established software engineering best practices
  • Platform independence through adapter dispatch
  • Open source as the basis for an extensive ecosystem (e.g. dbt Package Hub) and own specific extensions

7. What might dbt not be suitable for?

dbt in combination with SQL is powerful, but it is only a tool designed for a specific purpose.

"If all you have is a hammer, everything looks like a nail" - Abraham Maslow

dbt is great for non-interactive (asynchronous), batch-based, idempotent, SQL-based data transformations. For use cases that do not fit this description, dbt may still be a possible solution, but you are moving outside of dbt's comfort zone and other tools may serve the purpose better. For example, if a user wants to configure/parameterize a data transformation via a user interface, then start it, and wait for the result, other tools may be better suited for this than dbt.

Another weakness of dbt is that it only considers SQL queries as strings and has no way of interpreting them to detect errors at an early stage.

The company behind dbt, dbt Labs, has recognized this shortcoming and is responding with a completely new product. At the time of writing, however, it is still in an early beta phase.

8. Conclusion

SQL and modern data platforms (RDBMS) are the result of decades of research and development. These systems are virtually unbeatable at the highest level.

dbt uses these great developments, places them in the context of modern software development best practices and eliminates many of the weaknesses of SQL. Particularly noteworthy are the extensive possibilities for test automation with integrated quality assurance.

This combination is a bulwark that should make future battles for valuable company data far easier to win.

Your contact person for SQL & dbt

Would you like to delve deeper into the topic? I look forward to talking to you about it.

 

 

Christiane Grimm
Inside Sales

Published by:

Tobias Vogler, Author

