bradleybuda 15 hours ago

I really wish data engineers didn't have to hand-roll incremental materialization in 2024. This is really hard stuff to get right (as the post outlines) but it is absolutely critical to keeping latency and costs down if you're going to go all in on deep, layered, fine-grained transformations (which still seems to me to be the best way to scale a large / complex analytics stack).

My prediction a few years back was that Materialize (or similar tech) would magically solve this - data teams could operate in terms of pure views and let the database engine differentiate their SQL and determine how to apply incremental (ideally streaming) updates through the view stack. While I'm in an adjacent space, I don't do this day-to-day so I'm not quite sure what's holding back adoption here - maybe in a few years more we'll get there.
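To make the "hand-roll" part concrete, here's a toy sketch (Python with sqlite3; the tables, columns, and watermark scheme are invented for illustration) of the high-water-mark pattern teams end up maintaining by hand — exactly the bookkeeping a view-differentiating engine like Materialize is supposed to absorb:

```python
import sqlite3

# Hand-rolled incremental materialization: keep a high-water mark and fold
# only new source rows into a summary table, instead of full recompute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, amount REAL);
    CREATE TABLE user_totals (user_id INT PRIMARY KEY, total REAL);
    CREATE TABLE watermark (last_id INT);
    INSERT INTO watermark VALUES (0);
""")

def incremental_refresh(conn):
    """Apply only events newer than the stored watermark, then advance it."""
    (last_id,) = conn.execute("SELECT last_id FROM watermark").fetchone()
    conn.execute("""
        INSERT INTO user_totals (user_id, total)
        SELECT user_id, SUM(amount) FROM events WHERE id > ?
        GROUP BY user_id
        ON CONFLICT(user_id) DO UPDATE SET total = total + excluded.total
    """, (last_id,))
    conn.execute(
        "UPDATE watermark SET last_id = (SELECT COALESCE(MAX(id), 0) FROM events)"
    )
    conn.commit()

conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [(1, 7, 10.0), (2, 7, 5.0)])
incremental_refresh(conn)
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [(3, 7, 2.0)])
incremental_refresh(conn)  # only event 3 is re-aggregated
```

Even this toy version has the sharp edges the post describes: the watermark, the merge semantics, and what happens on backfill all live in your code rather than in the engine.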

  • matthewhelm 10 hours ago

    I wholeheartedly agree. When I worked at Shopify, we had to hand-roll our incremental data models using Spark, and the complexity of managing deep DAGs made tasks like backfilling and refactoring a huge pain. Tools like dbt and SQLMesh face similar challenges.

    The chaos of existing approaches was a large part of what drove me to join Materialize. With Materialize, you can use dbt on “easy-mode”, while Materialize handles incremental logic, removing the usual headaches around processing time and keeping everything up to date within a second or two.

    I recently gave a talk at Data Council about this unlock, it’s total magic: https://youtu.be/pLb5sFZ7nWw

    For anyone interested, my colleague Seth also discussed this in a recent blog post: https://materialize.com/blog/migrating-postgres-materialize/

  • tessierashpool9 4 hours ago

    The Databricks AutoLoader for Delta Live Tables with Checkpointing and Watermarking comes to your rescue.

0cf8612b2e1e 16 hours ago

Is anyone using SQLMesh in production? I love “lessons learned” tools which have the opportunity to improve core design after seeing the weak points of the initial product in the space. That being said, I hate being an early adopter, so will let others determine if the new tool has an entirely novel set of shortcomings vs dbt.

  • captaintobs 16 hours ago

    There are many teams using SQLMesh in production. Fivetran, Harness, Hopper, Pitchbook to name a few.

    You can read some case studies here https://tobikodata.com/harness.html or join Slack to meet with folks to learn more about their experiences.

    • abtinf 13 hours ago

      How does Fivetran use SQLMesh?

      • captaintobs 11 hours ago

        They're using it for data transformation. They're long-time dbt users, but are switching to SQLMesh because it's extremely efficient, provides a better development experience, and can help them become warehouse-agnostic.

pdr94 18 hours ago

Great to see dbt finally rolling out microbatch incremental models! It's a much-needed feature and a step forward for data transformation. Excited to see how this evolves and complements tools like SQLMesh. Keep up the good work!

  • captaintobs 18 hours ago

    Thanks! Yes, it's a much requested feature but it's difficult to get right!

pella 10 hours ago

Side question: For ETL development based on PostGIS and pgRouting, which tool is more suitable? In other words, which one is easier to work with for geometric data? (e.g., visual test case data display, etc.)

whinvik 15 hours ago

Can someone who understands it explain what dbt is and how it's used? I hear a lot about it, but I just haven't figured out what it's useful for.

  • gkapur 14 hours ago

    Basically, people are constantly calculating metrics based on existing tables. Think of something as simple as a moving average or the sum of two separate columns in a table. Once upon a time you would set up a cron job to populate these every day with a SQL query embedded in some Python or Perl script.

    dbt introduced a framework for managing these “metrics” at scale, including the ability to use variables and more complex templating (Jinja).

    Then you do dbt run (https://docs.getdbt.com/reference/commands/run) and kapow, the metric is populated in your database.

    More broadly, dbt did two other things: 1. It pushed the paradigm from ETL to ELT (stick all the data in your warehouse and then transform it, rather than transforming it at extraction time). 2. It created the concept of an “analytics engineer” (previously known as “guy who knows SQL” or “business analyst”).
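    A toy sketch of that pre-dbt cron-job era (Python with sqlite3; the table, metric, and template are invented for illustration): a script renders a parameterized query and materializes the result as a table, which is roughly what `dbt run` now does per model, with Jinja instead of `string.Template`:

```python
import sqlite3
from string import Template

# The old pattern: a hand-maintained script renders templated SQL and
# materializes a metric table, typically fired from cron once a day.
METRIC_SQL = Template("""
    CREATE TABLE $target AS
    SELECT day,
           AVG(revenue) OVER (ORDER BY day ROWS 2 PRECEDING) AS revenue_3d_avg
    FROM daily_revenue
""")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO daily_revenue VALUES (?, ?)",
    [("2024-01-01", 100.0), ("2024-01-02", 200.0), ("2024-01-03", 300.0)],
)

# "dbt run" for one model, by hand: render the template, execute it.
conn.execute(METRIC_SQL.substitute(target="metrics_revenue_3d_avg"))
rows = conn.execute("SELECT * FROM metrics_revenue_3d_avg ORDER BY day").fetchall()
```

    dbt takes this same idea but adds dependency ordering between models, testing, and documentation on top.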

  • christoff12 8 hours ago

    I built the first half of my career as "a guy who knows SQL" (and Excel macros but I digress). I then rode the early wave of Analytics Engineering.

    dbt is kinda like Vite (dbt = data build tool) for folks working with data warehouses. Their biggest contribution was a mindset shift that applied principles of the SDLC to the traditional BI/Analytics space.

    Almost overnight, analysts went from building business logic in GUIs like Talend or Tableau to code-based models (SQL) checked into git repos. It took what Looker was doing with LookML and generalized it across the BI stack.

    This shift (+ associated tooling) resulted in less brittle data pipelines, increased uptime for dashboards/reporting, and more sanity when working with more than 2-3 people in a data environment.

    Imagine a situation where you're at an e-commerce company and need to reconcile orders from Woocommerce with shipments in ShipStation, returns from tickets in HubSpot, and refunds issued in Stripe. dbt simplifies the management of the relationships between these various systems.

    Based on this, you can build data models that allow you and, increasingly, your business stakeholders to answer questions like "Which SKUs have seen an uptick in refunds due to reason X this quarter?" and "Where were they shipped?"

    The benefit of having standard abstractions means you can build metrics on top of the models, as [gkapur](https://news.ycombinator.com/item?id=41853925) mentions, such that “revenue” is the same when marketing pulls it to calculate CAC as when finance pulls it for their monthly reports, etc.
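    A hypothetical sketch of that reconciliation (Python with sqlite3; the staging-table names mimic dbt's `stg_<source>__<entity>` convention but are invented, as is the data): staging models per source system feed a "mart" query that answers the SKU/refund question directly:

```python
import sqlite3

# Staging tables standing in for per-source dbt models: orders landed from
# Woocommerce, refunds landed from Stripe, keyed by a shared order_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_woocommerce_orders (order_id TEXT, sku TEXT);
    CREATE TABLE stg_stripe_refunds (order_id TEXT, reason TEXT);
    INSERT INTO stg_woocommerce_orders VALUES ('o1','SKU-A'),('o2','SKU-A'),('o3','SKU-B');
    INSERT INTO stg_stripe_refunds VALUES ('o1','damaged'),('o2','damaged');
""")

# The "mart" model: refunds by SKU and reason -- the shape of question
# stakeholders ask ("which SKUs saw an uptick in refunds for reason X?").
refunds_by_sku = conn.execute("""
    SELECT o.sku, r.reason, COUNT(*) AS refunds
    FROM stg_woocommerce_orders o
    JOIN stg_stripe_refunds r USING (order_id)
    GROUP BY o.sku, r.reason
    ORDER BY refunds DESC
""").fetchall()
```

    In a real dbt project each of those tables would be its own versioned, tested model, and the join would reference them via `ref()` so the DAG stays explicit.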

  • tiew9Vii 14 hours ago

    Some opinionated conventions around templated SQL queries (configured via YAML files) for ETL.

    Then it provides additional tooling around that: GUIs, governance, everything your average large corporate asks for.

  • bitlad 15 hours ago

    I am not sure it is that popular these days. A couple of years ago it was pretty popular.

    • jburbank 11 hours ago

      The hype may have died down, but its usage is healthy. It's used where I work, and it has a Slack channel that's pretty busy.

      • christoff12 8 hours ago

        dbt isn't going anywhere. It's the standard.

        That said, SQLMesh and other tools are pretty interesting and I look forward to new growth in the space.

    • riku_iki 12 hours ago

      sounds like a typical hype-tech lifecycle.