Navigating the Evolving Data Landscape: Optimizing Cost and Performance of the Modern Data Stack with dbt Labs and Databricks
The second half of 2022 saw organizations facing new challenges due to economic downturns. The favorable economic conditions and low interest rates that once helped enterprises fund innovation were replaced by rising rates, prompting companies to shift focus toward optimizing the cost and performance of existing solutions. Technology-focused industries and organizations were hit particularly hard, with these changes also making it more difficult to procure the resources necessary to drive continued modernization.
Market Conditions Foster Innovation in Data and Analytics
“Do more with less” became a common theme as enterprises sought to reduce costs and increase developer productivity. Data and analytics were especially subject to scrutiny – and modern, innovative data platforms began to integrate multiple tools to reduce overhead and streamline data operations. Thanks to the emergence of new data challenges, the market saw an explosion of data companies, offerings, concepts, and business applications, especially for generative AI.
This flood of innovation created a need for methodical approaches to solving data problems, but the constant influx of new tools made this a daunting challenge. Businesses ultimately needed to evaluate their current challenges to understand what platforms and capabilities would be most impactful in accelerating their desired outcomes.
Whether copying data from one location to another, generating business intelligence (BI), implementing machine learning (ML), debugging and fixing data pipelines, or resolving schema and metadata issues that break data models – forward-thinking data teams are continually trying to surface new ways to drive value back to the business. And organizations are now seeking to streamline their data stacks in an effort to shorten their time to value while keeping costs down. Never ones to shy away from a challenge, we set out to explore how dbt™ and Databricks can be combined to optimize data architectures for scale, performance, improved collaboration, and lower total cost of ownership.
dbt Labs and the Development of the Modern Data Stack
In 2017, dbt Labs embarked on an expedition to explore the uncharted territories of data management. What began as a simple compass evolved into a full-fledged navigational system for databases, guiding the way through the complex landscape of data pipelines. dbt Labs’ journey expanded to new horizons, discovering paths to semantic objects, governance frameworks, security and privacy. It was a pioneering venture in the modern data stack with several key benefits:
- Enabling self-service and democratization of data transformation
- Making data engineering accessible to a wider audience with SQL skill sets
- Providing better controls and treating data like software development
- Introducing key components of the modern data stack to facilitate workflow management from data engineering to warehousing to BI
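The approach behind these benefits can be seen in dbt's core building block: a model is just a SQL select statement, with dependencies declared through `ref()` so dbt can build the transformation graph and apply version control, testing, and documentation on top. A minimal sketch (model and table names here are hypothetical, not from the original):

```sql
-- models/customer_orders.sql — a hypothetical dbt model
-- Select-only SQL; dbt resolves {{ ref() }} into upstream tables
-- and uses it to order the dependency graph at build time.
select
    c.customer_id,
    count(o.order_id) as lifetime_orders
from {{ ref('stg_customers') }} c
left join {{ ref('stg_orders') }} o
  on o.customer_id = c.customer_id
group by c.customer_id
```

Because models are plain SQL files in a repository, analysts with SQL skills can contribute transformations under the same review and deployment controls used in software development.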
These new capabilities and wider usage were obviously valuable, but the associated costs could still limit dbt’s successful implementation and adoption in organizations. The need for an integrated and collaborative solution to optimize efficiency and contain the costs of growth was apparent, and Databricks’ Data Lakehouse seemed to be the perfect counterpart for dbt.
Databricks Lakehouse Platform Expands and Optimizes the Modern Data Stack for All Use Cases (ETL, BI, and AI)
As data volumes and transformation workloads continue to grow at an unprecedented pace, modern use cases now require multi-language support beyond SQL. Transformation use cases are also extending beyond traditional warehouses into AI/ML use cases. The traditional cloud data warehouse began to fall short due to:
- Growing costs and complexity for data integration
- Poor transformation performance (since dbt is only as performant as the underlying warehouse’s engine)
- Lack of a unified platform to handle use cases spanning ML/AI/BI/ETL, requiring managing multiple platforms
- Lack of end-to-end control with no functionality for data lifecycle lineage
- Data governance incompatibility between different platforms as well as files and tables
Databricks’ Data Lakehouse provides a single platform to support multiple personas in BI and data warehousing, data engineering, data streaming, data science, and ML. That means it utilizes the same governance layer and analytics engines for these use cases and data workloads – enabling dbt to be further optimized for performance and cost. Leveraging dbt with Databricks also provides:
- Reduced complexity and enhanced collaboration with an all-in-one platform
- Expanded capabilities for modern use cases with multi-language support
- Lower TCO and improved runtime by automating incremental data refresh with Materialized Views
- Increased development agility by enabling continuous, scalable ingestion with Streaming Tables
- Unified governance and lineage for real-time and historical data via Unity Catalog
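The Materialized View and Streaming Table capabilities above surface directly in dbt through the dbt-databricks adapter's materialization configs. A minimal sketch, assuming a hypothetical aggregation model (the model and source names are illustrative):

```sql
-- models/daily_revenue.sql — hypothetical model
-- Materializing as a Databricks materialized view lets the
-- platform refresh results incrementally rather than recomputing
-- the full query on every dbt run.
{{ config(materialized='materialized_view') }}

select
    order_date,
    count(*)    as order_count,
    sum(amount) as total_revenue
from {{ ref('stg_orders') }}
group by order_date
```

Swapping `materialized='materialized_view'` for `materialized='streaming_table'` on an ingestion model is the analogous route to the continuous, scalable ingestion described above; both objects register in Unity Catalog, so lineage and governance carry over automatically.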
The Results: Together, dbt and Databricks Yield Massive Cost Savings and Speed when Building Data Pipelines
There are a number of options when it comes to integrating dbt with data platforms, and Effectual's partner Databricks offered what appeared to be the most promising integration, having open-sourced its dbt integration benchmark code to demonstrate superior performance and lower cost compared to other data platforms. Effectual independently verified these tests to compare the costs of running dbt on Databricks versus other traditional cloud providers, and the results, which we've highlighted below, were nothing short of impressive.
We set out to test this architecture for high-volume data loads. The test was to use dbt to process and transform ~100 GB of flat files in Amazon Simple Storage Service (S3) and store the results in a Databricks Delta table. We found that for the size of this data set, the traditional warehouse needed to be scaled up to a 2XL, as the smaller sizes hit resource limits and could not handle the data load. We tested equivalent Databricks Warehouse sizes and found that:
- Databricks was up to 60x less expensive and 38x more performant when comparing 2XL warehouses
- The medium-sized Databricks Warehouse can process the ~100 GB of data and write to Databricks successfully in 6.8 minutes for a cost of $1.90
- dbt is better suited to running on Databricks for large data volumes
- Databricks with dbt demonstrates a unique capability to split files during processing, which sets it apart from other integrations
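Reproducing a setup along these lines starts with pointing dbt at a Databricks SQL Warehouse via the dbt-databricks adapter. A minimal `profiles.yml` sketch — every value below is a placeholder, not a detail from the benchmark:

```yaml
# profiles.yml — placeholder values throughout
effectual_benchmark:
  target: prod
  outputs:
    prod:
      type: databricks
      catalog: main                 # Unity Catalog catalog
      schema: analytics
      host: <workspace-host>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
      threads: 8
```

With the profile in place, `dbt run` executes the project's transformations on the configured warehouse, so warehouse size (medium vs. 2XL in the tests above) is purely a platform-side setting.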
Together, dbt and Databricks can help data analysts and engineers collaborate more effectively, run faster and more cost-efficient data pipelines, and unify data governance.
Effectual as a Trusted Partner for Implementing the Modern Data Stack
With the proliferation of new data platforms and services, internal teams are hard-pressed to find the time to evaluate and test new ways to increase the performance of their data, let alone develop the expertise to properly conduct comprehensive assessments of these tools. Teams are prioritizing the needs of their organization, tasked with using data to drive value back into the business. As a result, many organizations turn to a strong partner with the ability to cut through the noise in the data ecosystem. As a cloud-native systems integrator and cloud service provider, Effectual holds deep expertise in Databricks and dbt technologies deployed on the AWS cloud.
Effectual’s engagement methodology has helped customers across all industries complete successful transformation initiatives. Effectual begins by taking a comprehensive review of each customer’s operations, with special attention to their people and processes. Understanding the underlying business drivers, desired outcomes, available skill sets, organizational culture, day-to-day processes, and the entire technology landscape helps Effectual develop a full picture. This understanding is critical to creating a tailored go-forward plan and change management strategy to transform the organization, in terms of technology as well as culture.
Optimize your Data Stack with Effectual’s dbt + Databricks Optimization Accelerator
We’ve developed an accelerator offering to optimize customers’ data stacks by integrating dbt + Databricks for a proven solution that delivers faster and more economical data processing. By leveraging the right tools for the right job, dbt + Databricks are truly better together when architecting data pipelines optimized for cost, performance, and analytics.