Visual ETL made sense when data teams were small and transformations were simple. Drag a connector, wire a mapping, schedule a job. But at some point the DAG viewer became unreadable, the version control story became "ask Dave," and the server running Talend became the single point of failure nobody wanted to touch.
That's when the conversation about dbt starts.
We've run this migration enough times — across Talend Open Studio, Talend Cloud, and a few Informatica instances for good measure — to know where it goes smoothly and where it doesn't. This is the playbook.
Why teams move
The reasons are remarkably consistent:
- SQL-native transformations. dbt runs inside your warehouse. No external compute. No data movement. The warehouse you're already paying for does the work.
- Git as the source of truth. Every transformation is a SQL file in a repo. PRs, code review, CI — the same workflow your software engineers already use.
- Testing that actually runs. dbt tests execute on every build. Talend quality components exist but nobody enforces them consistently.
- Lineage you can trace. `dbt docs generate` produces a full dependency graph. In Talend, lineage means opening every job and manually following the connections.
The 60–70% improvement in query times we typically see post-migration is a bonus, not the reason. The real win is that your data team can move at the speed of a PR instead of the speed of a change-request ticket.
The audit nobody wants to do (but everyone needs)
Before writing a single dbt model, catalogue every Talend job. For each:
- Sources and destinations. Where does data come from, where does it land?
- Transformations. Rename, cast, join, aggregate, filter — name each one.
- Schedule and owner. Who runs it, how often, what breaks when it doesn't?
- Consumer. Who actually reads the output?
That last column is where the savings hide. In our experience, 20–30% of Talend jobs are orphaned — they run on schedule, they consume compute, and nobody has looked at their output in months. Retire those. Don't migrate dead weight.
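Finding those orphans doesn't have to be guesswork. Most warehouses expose query history you can mine; here's a hedged sketch assuming BigQuery (the region and 90-day window are examples — adapt to your warehouse's equivalent, such as Snowflake's ACCESS_HISTORY):

```sql
-- Illustrative: last time each table was read by any query in the past 90 days.
-- Tables missing from this result haven't been queried at all in the window.
SELECT
  rt.dataset_id,
  rt.table_id,
  MAX(j.creation_time) AS last_read
FROM `region-us`.INFORMATION_SCHEMA.JOBS AS j,
  UNNEST(j.referenced_tables) AS rt
WHERE j.creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
GROUP BY rt.dataset_id, rt.table_id
ORDER BY last_read
```

Cross-reference the Talend job outputs against this list before deciding what to retire.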
Split the rest:
| Bucket | What's in it | Migration path |
|---|---|---|
| Clean SQL mappings | ~70% of jobs | Direct dbt model conversion |
| Iteration / file handling | ~10% of jobs | Orchestrator + dbt vars |
| Obsolete | ~20% of jobs | Archive and delete |
Extraction is not dbt's job
This trips people up. dbt transforms data that's already in the warehouse. It doesn't extract.
Replace Talend's tDBInput and tFileInput components with purpose-built extraction:
- Fivetran for managed connectors — Salesforce, Shopify, HubSpot, Google Ads, 300+ others.
- Airbyte for self-hosted or custom sources.
- Cloud Functions / Workflows for bespoke API pulls.
Raw data lands in your warehouse untouched. Register those tables as dbt sources so lineage starts clean from the first hop.
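Registering sources is one small YAML file per source system. A minimal sketch — the schema and column names below are illustrative, and the metadata column varies by connector:

```yaml
# models/staging/sources.yml -- names are examples, adjust to your landing schema
version: 2

sources:
  - name: erp
    schema: raw_erp                              # where Fivetran/Airbyte lands data
    tables:
      - name: raw_orders
        loaded_at_field: _airbyte_emitted_at     # connector metadata column (varies)
        freshness:
          warn_after: {count: 24, period: hour}  # flag stale loads in dbt source freshness
      - name: raw_customers
```

With this in place, staging models reference `{{ source('erp', 'raw_orders') }}` and the lineage graph includes the raw layer.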
Converting tMap to SQL models
Each tMap component becomes a SQL file. Structure them in layers:
Staging models (stg_*.sql) — one per source table. Rename columns, cast types, filter junk. No joins, no aggregations. One source in, one clean table out.
```sql
-- models/staging/stg_orders.sql
SELECT
  order_id,
  CAST(order_date AS DATE) AS order_date,
  LOWER(TRIM(customer_email)) AS customer_email,
  order_total_cents / 100.0 AS order_total
FROM {{ source('erp', 'raw_orders') }}
WHERE order_id IS NOT NULL
```

Mart models (mart_*.sql) — this is where joins and business logic live. These are what dashboards read.
```sql
-- models/marts/mart_revenue_by_month.sql
SELECT
  DATE_TRUNC(o.order_date, MONTH) AS month,
  COUNT(DISTINCT o.order_id) AS orders,
  SUM(o.order_total) AS revenue
FROM {{ ref('stg_orders') }} o
GROUP BY 1
```

Don't replicate tMap logic 1:1. The visual abstractions in Talend often paper over bad join logic — rewriting in SQL exposes assumptions you didn't know existed.
The iteration trap
dbt is set-based. It doesn't loop.
If your Talend job iterates over a list of client IDs, date ranges, or file paths, that iteration belongs in your orchestrator:
- Airflow generates the parameter list.
- Airflow calls dbt with variables: `dbt run --vars '{"client_id": "abc"}'`.
- The dbt model reads the variable: `{{ var('client_id') }}`.
Clean separation. The orchestrator decides what to run. dbt decides how to transform it.
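Inside the model, the variable is just a filter. A minimal sketch — the model and `client_id` column are illustrative, not from the earlier examples:

```sql
-- models/marts/mart_client_daily.sql
-- Invoked per client by the orchestrator, e.g.:
--   dbt run --select mart_client_daily --vars '{"client_id": "abc"}'
SELECT
  order_date,
  SUM(order_total) AS revenue
FROM {{ ref('stg_orders') }}
WHERE client_id = '{{ var("client_id") }}'
GROUP BY order_date
```

The model stays a plain set-based SELECT; the looping lives entirely in the scheduler.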
Teams that try to force iteration into dbt — Jinja loops generating dynamic SQL, macros that call macros — end up with something harder to maintain than the Talend job they replaced.
Testing: the part Talend never enforced
dbt's testing framework is its quiet superpower. Start with the basics:
```yaml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_email
        tests:
          - not_null
```

Then layer on business-specific tests:
```yaml
      - name: order_total
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
```

Run `dbt test` in CI. Failures block merges. You'll catch more data quality bugs in week one than Talend caught in a year.
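For rules that don't fit the column-level YAML shape, dbt also supports singular tests: any SQL file under `tests/` that returns rows counts as a failure. A sketch of one business rule, using the staging model from earlier:

```sql
-- tests/assert_no_future_orders.sql
-- Fails the build if any order is dated in the future.
SELECT
  order_id,
  order_date
FROM {{ ref('stg_orders') }}
WHERE order_date > CURRENT_DATE()
```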
Parallel validation: the non-negotiable step
Two to four weeks of running both systems side by side. No exceptions.
Compare daily:
- Row counts on critical tables.
- Aggregate values — revenue, user counts, whatever your dashboards report.
- Dashboards built on both outputs.
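The aggregate comparison can be a single query run daily. A hedged sketch — `talend_revenue_by_month` is a placeholder for whatever table the legacy Talend job writes:

```sql
-- Surface any month where the two pipelines disagree on revenue.
SELECT
  COALESCE(t.month, d.month) AS month,
  t.revenue AS talend_revenue,
  d.revenue AS dbt_revenue
FROM talend_revenue_by_month t
FULL OUTER JOIN {{ ref('mart_revenue_by_month') }} d
  ON t.month = d.month
WHERE t.revenue != d.revenue
   OR t.revenue IS NULL
   OR d.revenue IS NULL
```

Zero rows means the outputs match; any row is a discrepancy to chase before cutover.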
When the numbers match for a full week, retire the Talend job. Not before.
Teams that skip this step spend months discovering tiny discrepancies in production, usually after the Talend server has been decommissioned and the fix is no longer simple.
What the stack looks like after
| Layer | Talend world | dbt world |
|---|---|---|
| Extraction | tDBInput, tFileInput, tREST | Fivetran / Airbyte |
| Transformation | Talend jobs on a dedicated server | dbt models in your warehouse |
| Orchestration | Talend scheduler or cron | Airflow / Dagster / Prefect |
| Testing | Manual spot checks | dbt tests in CI, every build |
| Version control | "Ask Dave" | Git-native, PR-reviewed |
| Deployment | Export + import job archives | dbt run triggered by CI merge |
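The deployment row maps to a short CI config. A minimal sketch assuming GitHub Actions and the BigQuery adapter — swap both for whatever your team runs:

```yaml
# .github/workflows/dbt.yml -- illustrative; credentials and profile are assumptions
name: dbt
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-bigquery     # use the adapter for your warehouse
      - run: dbt deps
      - run: dbt build --target prod      # runs models and tests in dependency order
        env:
          DBT_PROFILES_DIR: .
```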
Timeline
For a medium-complexity estate (30–80 Talend jobs, 2–3 source systems):
| Phase | Duration | What happens |
|---|---|---|
| Audit + bucketing | 1 week | Catalogue, retire dead jobs, scope the migration |
| Extraction setup | 1 week | Fivetran / Airbyte connectors, raw tables landing |
| Core model conversion | 2–3 weeks | Staging + mart models, tests, documentation |
| Parallel validation | 2 weeks | Both systems running, daily comparison |
| Cutover + cleanup | 1 week | Retire Talend, update schedules, close tickets |
Total: 7–8 weeks for a team of two. Faster if the Talend estate is clean. Slower if there's iteration logic or undocumented tribal knowledge baked into the jobs.
The uncomfortable truth
The hardest part of this migration isn't technical. It's getting the team to stop thinking in visual mappings and start thinking in SQL layers. The engineers who built those Talend jobs often have years of muscle memory — they know which tMap to open, which connection to check, which schedule to restart.
That muscle memory is valuable. What changes is the medium. Instead of opening a job designer, you open a SQL file. Instead of checking a tMap, you read a ref(). Instead of restarting a schedule, you re-run a CI pipeline.
The knowledge transfers. The tooling gets out of the way.
We've run this migration for teams across Snowflake, BigQuery, and Databricks — from 20-job Talend estates to 200+. If you're weighing the move, book a discovery call and we'll walk through what it looks like for your stack.