To Nha Notes | Feb. 25, 2026, 11:24 a.m.
A reflection on the shift from ETL pipelines to context architecture — and what it means for data engineers today.
The question isn't whether AI is changing data engineering. It clearly is. The real question is: what kind of thinking was always too important to automate — and why did we let it stay buried under mechanical work for so long?
That's the central provocation in Ananth Packkildurai's recent piece in Data Engineering Weekly. And it's one worth sitting with.
Extract, Transform, Load made perfect sense for its moment in history. Source systems were siloed, formats were inconsistent, and data engineers were the people who wrote the code to move data from where it lived to where it could be used.
But here's what we quietly knew the whole time: the transformation step was always the most brittle part. Teams buried business rules deep inside SQL logic or Python functions, version-controlled them alongside infrastructure, but rarely treated them with the rigor they deserved. When the definition of "active user" changed — and it always changed — someone had to hunt down every place that definition lived and update it manually, hoping nothing got missed.
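A minimal sketch of the alternative, using a hypothetical rule and hypothetical names: when the definition has exactly one home, a change to "active user" is a change to one function, not a hunt through every pipeline that inlined it.

```python
from datetime import datetime, timedelta

# Hypothetical single source of truth for the business rule. The 30-day window is an
# assumption for illustration; the point is that the rule lives in one importable place.
ACTIVE_USER_WINDOW_DAYS = 30

def is_active_user(last_event_at: datetime, as_of: datetime,
                   window_days: int = ACTIVE_USER_WINDOW_DAYS) -> bool:
    """True if the user has at least one event inside the activity window."""
    return (as_of - last_event_at) <= timedelta(days=window_days)
```

Every pipeline that imports this function inherits the new definition the day it changes; every pipeline that copies it does not.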
AI is now competent at generating this kind of transformation code. Not perfect, but competent enough that the mechanical work of pipeline construction is no longer a meaningful professional differentiator. If your identity as a data engineer is built around being good at writing transformation logic, that identity is under real pressure.
But this isn't a story about loss. It's a story about clarity. The mechanical work was always obscuring more important work underneath it. AI forcing that reckoning is, in a strange way, a gift.
The framework that's emerging as a replacement isn't a technical architecture so much as a reorientation of purpose. Instead of ETL, think ECL — Extract, Contextualize, Link.
Extract remains. Data still needs to move from source systems to analytical environments, and that work still requires engineering judgment about reliability, latency, volume, and failure modes. AI handles the mechanical parts; the architectural decisions about what to extract, when, and how still belong to humans who understand both the source systems and the downstream consequences.
Contextualize is where the real shift happens. This is the work of giving data semantic meaning. Understanding that "revenue" is calculated differently by Finance and Sales. That a timestamp in a clickstream event means something different from a timestamp in a billing record. That a null value in one system represents the absence of information while in another it represents an explicit user choice.
AI can draft this work at scale — inferring field definitions, classifying entities, mapping relationships across a data landscape no human team could manually annotate in full. What AI cannot do is be accountable for those inferences. The judgment of whether an inference is correct, the organizational authority to declare a definition, the decision to formalize a discovered pattern into an enforced contract — that belongs to humans. Contextualize is precisely where AI inference and human judgment meet.
Link is about entity relationships across the data landscape — connecting a customer record in your CRM to a user record in your product database, linking an analytics event to a support session. As AI generates more of the code that consumes data, the ability to reason about how entities relate across systems becomes more valuable, not less. Linkage is what makes context portable — what allows meaning built in one part of the landscape to carry into the rest.
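A hedged sketch of what a link can look like when it is made explicit rather than left implicit in join logic; the systems, identifiers, and field names here are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical link record: a queryable statement that two identifiers in different
# systems refer to the same real-world entity, plus how the link was established.
@dataclass(frozen=True)
class EntityLink:
    source_system: str   # e.g. "crm"
    source_id: str       # e.g. a CRM customer_id
    target_system: str   # e.g. "product_db"
    target_id: str       # e.g. a product user_id
    method: str          # "declared" (shared key), "matched" (deterministic rule), "inferred" (AI)
    confidence: float    # 1.0 for declared keys, lower for matched or inferred links

links = [
    EntityLink("crm", "cust_8842", "product_db", "user_1931", "declared", 1.0),
    EntityLink("support", "session_554", "product_db", "user_1931", "inferred", 0.82),
]
```

A declared link and an inferred one look identical to a consumer joining across systems, which is exactly why the method and confidence need to travel with the link itself.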
One way to supply that meaning is to bind it early: declared by the producer, before the data ever moves downstream. Data contracts — agreements between producers and consumers specifying schema, quality expectations, and the semantic meaning of each field — are the practical implementation of early binding. The data industry spent years debating what contracts were. Meanwhile, software engineering had quietly converged on treating them as interfaces: things that could break, that had versioning implications, that enforced behavior rather than merely described it.
A contract that lives in a wiki and gets updated when someone remembers is documentation. A contract that fails a pipeline when a schema changes without notice, that alerts a consumer when quality thresholds are violated, that an AI agent can reason about deterministically — that is architecture.
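As a deliberately minimal sketch of that distinction, here is what an enforced contract can look like in plain Python; the field names, types, and threshold are assumptions, not a reference to any particular contract tool.

```python
# Hypothetical contract for an orders dataset, expressed as code the pipeline runs on
# every load. Violations raise, so the failure is loud and immediate.
EXPECTED_SCHEMA = {"order_id": str, "amount_usd": float, "ordered_at": str}
MAX_NULL_RATE = 0.01  # quality expectation: at most 1% of amounts may be missing

def enforce_contract(rows: list[dict]) -> None:
    if not rows:
        raise ValueError("Contract violation: empty load")
    for field, expected_type in EXPECTED_SCHEMA.items():
        if any(field not in row for row in rows):
            raise ValueError(f"Contract violation: field '{field}' missing from some rows")
        bad = [row for row in rows if row[field] is not None and not isinstance(row[field], expected_type)]
        if bad:
            raise TypeError(f"Contract violation: '{field}' is not {expected_type.__name__} in {len(bad)} rows")
    null_rate = sum(1 for row in rows if row["amount_usd"] is None) / len(rows)
    if null_rate > MAX_NULL_RATE:
        raise ValueError(f"Contract violation: amount_usd null rate {null_rate:.1%} exceeds {MAX_NULL_RATE:.1%}")
```

Whether this runs as an Airflow task, a Glue job step, or a CI check matters less than the fact that it executes: the contract enforces behavior instead of describing it.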
But early binding alone has a fundamental limitation: it can't prevent meaning from eroding at every downstream hop.
Consider what happens to a well-contracted dataset as it moves through a Medallion architecture. At the Bronze layer, data lands raw with the contract's guarantees largely intact. Silver applies conformance rules. By the time data reaches Gold, the pipeline has made a series of editorial decisions on the data's behalf — aggregations collapse events into metrics, business logic gets baked into the shape of the table. The Gold layer becomes an artifact optimized for a specific set of questions that seemed important when the pipeline was built.
By the time a consumer queries it, they're working with something several editorial decisions removed from the original intent. This is the telephone game playing out silently in your pipeline.
The answer isn't to bind context even earlier. It's to build a dedicated pipeline specifically for context — one that runs alongside your data infrastructure, not through it.
This Contextualize pipeline is event-driven, not scheduled. Every new dataset automatically triggers it. Continuous profiling monitors existing datasets for meaningful changes — a new column, a dropped field, a distribution shift that suggests something changed upstream. Any of these events re-triggers the pipeline for affected entities.
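A hedged sketch of the kind of triggering logic this implies; the profile shape and the drift measure are assumptions standing in for whatever your profiler actually emits.

```python
# Hypothetical change detection between two profiling runs of the same dataset.
# Profiles are assumed to look like {"columns": [...], "null_rate": {column: rate}}.
def detect_context_events(prev: dict, new: dict, drift_threshold: float = 0.2) -> list[str]:
    events = []
    prev_cols, new_cols = set(prev["columns"]), set(new["columns"])
    events += [f"column_added:{c}" for c in sorted(new_cols - prev_cols)]
    events += [f"column_dropped:{c}" for c in sorted(prev_cols - new_cols)]
    for col in sorted(prev_cols & new_cols):
        # Null-rate movement is a crude stand-in for a real distribution-shift metric.
        drift = abs(new["null_rate"].get(col, 0.0) - prev["null_rate"].get(col, 0.0))
        if drift > drift_threshold:
            events.append(f"distribution_shift:{col}")
    return events  # each event re-triggers the Contextualize pipeline for the affected entities
```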
The pipeline itself is agentic. An AI agent analyzes incoming data — schema, sample values, statistical profiles, lineage — and infers semantic meaning. It produces structured, versioned context artifacts: inferences about meaning that didn't require a domain expert to pre-specify every scenario.
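One plausible shape for such an artifact, with every field name an assumption made for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical context artifact: an inference about meaning recorded as data, so it can
# be versioned, reviewed, and eventually served from the Context Store.
@dataclass
class ContextArtifact:
    subject: str          # what the inference is about, e.g. "events.clickstream.ts"
    kind: str             # "field_definition", "entity_classification", "relationship"
    statement: str        # the inferred meaning, stated plainly
    evidence: list[str]   # the schema, samples, profiles, and lineage the agent relied on
    confidence: float     # the agent's own confidence in the inference
    version: int = 1
    status: str = "proposed"  # becomes "validated" only after the validation layer agrees
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

artifact = ContextArtifact(
    subject="events.clickstream.ts",
    kind="field_definition",
    statement="Client-side event time in UTC, not server receipt time.",
    evidence=["column name", "value skew versus server logs", "upstream SDK lineage"],
    confidence=0.74,
)
```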
Critically, those inferences don't automatically commit. They route to a validation layer that works like a labeling workflow. High-confidence inferences get validated by an LLM-as-Judge before any human review is triggered. Medium-confidence ones surface to domain experts. Low-confidence or contested inferences get flagged for deeper investigation. Humans aren't reviewing every artifact — they're reviewing the uncertain ones.
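Continuing the ContextArtifact sketch above, the routing itself can be simple; the thresholds below are placeholders, and `llm_judge` stands in for whichever automated reviewer sits in front of people.

```python
# Hypothetical confidence-based routing for proposed context artifacts.
def route_inference(artifact, llm_judge, high: float = 0.9, medium: float = 0.6) -> str:
    if artifact.confidence >= high:
        # High confidence: an automated judge screens it; humans only see it if the judge balks.
        return "validated" if llm_judge(artifact) else "needs_expert_review"
    if artifact.confidence >= medium:
        return "needs_expert_review"    # surfaced to a domain expert
    return "needs_investigation"        # low-confidence or contested: deeper look

status = route_inference(artifact, llm_judge=lambda a: len(a.evidence) >= 2)
```

The design choice worth noticing is that human attention is spent on the uncertain middle, not spread evenly across everything the agent produces.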
Validated artifacts land in a Context Store — a dedicated, versioned, queryable store of semantic definitions, entity classifications, and relationship maps. This is the new infrastructure component that ECL requires. Downstream agents don't query raw data and infer meaning on the fly. They query the Context Store first, ground their understanding in validated context, and then query the data. The context is stable, reusable, and auditable.
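A rough sketch of that query-first pattern, again building on the ContextArtifact sketch; the in-memory store is a stand-in for illustration, not a claim about what the real component looks like.

```python
# Hypothetical Context Store: validated-only, versioned by append, queryable by subject.
class ContextStore:
    def __init__(self) -> None:
        self._artifacts: dict[str, list] = {}

    def put(self, artifact) -> None:
        if artifact.status != "validated":
            raise ValueError("Only validated artifacts enter the store")
        self._artifacts.setdefault(artifact.subject, []).append(artifact)  # append, never overwrite

    def latest(self, subject: str):
        versions = self._artifacts.get(subject, [])
        return versions[-1] if versions else None

def answer_with_context(store: ContextStore, subject: str, run_query):
    context = store.latest(subject)            # 1. ground the question in validated meaning
    if context is None:
        raise LookupError(f"No validated context for {subject}; trigger the Contextualize pipeline")
    return run_query(subject, context)         # 2. only then query the data itself
```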
The decision between binding context early and discovering it through the Contextualize pipeline isn't about semantic maturity or how well a domain is understood. It's about where the data comes from relative to your accountability boundary.
When a dataset originates within a controlled environment — produced by a team within your organization's sphere of accountability — early binding is the right tool. Contracts can be negotiated, enforced, and held to.
When a dataset originates outside that boundary — third-party feeds, partner data, public datasets — early binding isn't available. The schema can change without notice. The semantics are inferred, not declared. This is exactly where the Contextualize pipeline earns its place.
The feedback loop works in both directions. Discovered context built through repeated profiling and validation can graduate into prescribed context over time. An external dataset ingested consistently enough to profile, validate, and republish as an internal data product crosses the boundary from uncontrolled to controlled. The Contextualize pipeline is what makes that transition possible.
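A hedged sketch of what a graduation check could look like; the stability rule and threshold are placeholders for whatever an organization actually decides to require.

```python
# Hypothetical promotion rule: discovered context graduates toward a prescribed contract
# once its validated versions have been both numerous and stable.
def ready_to_graduate(history: list, min_validations: int = 5) -> bool:
    validated = [a for a in history if a.status == "validated"]
    if len(validated) < min_validations:
        return False
    recent = validated[-min_validations:]
    # Stable meaning: the most recent validations all agree on the same statement.
    return len({a.statement for a in recent}) == 1
```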
Here's a reframe that changes everything: context doesn't travel through the data pipeline — it travels alongside it, as metadata, lineage records, and contract provenance. The transformations change the data; the metadata preserves the meaning.
Think of it like git. A file can be heavily modified across dozens of commits — refactored, renamed, rewritten — but the context of how it got there is never lost because it lives in the commit history, not in the file itself. The Gold layer is the latest commit. The lineage graph is the git log. The Context Store is the understanding you build by reading that log systematically.
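A small illustration of what "context alongside the data" can mean in practice; the table names and decisions here are invented for the example.

```python
# Hypothetical lineage entry: a commit message for data. The Gold table is the latest
# state; entries like this are the log that preserves how its meaning was shaped.
lineage_entry = {
    "output": "gold.daily_revenue",
    "inputs": ["silver.orders", "silver.refunds"],
    "transformation": "sum(amount_usd) by order_date, refunds netted out",
    "editorial_decisions": [
        "refunds subtracted on the day they are issued, not the day of the original order",
        "orders with a null amount_usd excluded before aggregation",
    ],
    "contract_provenance": {"silver.orders": "v3", "silver.refunds": "v1"},
}
```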
This reframe changes what data engineers are actually responsible for building. The transformations are increasingly automatable. The metadata infrastructure, the lineage graph, the Contextualize pipeline that reads it, the Context Store that accumulates from it — that is the engineering surface that requires sustained human judgment.
Ananth's article ends with an honest acknowledgment: ECL is a reorientation, not a finished methodology. The tooling is maturing. The organizational patterns for governing the Context Store — how conflicts between teams get adjudicated, how discovered context earns formalization — don't yet have established templates. Practitioners are working it out as they go.
But the direction is clear. The data engineer of the next decade owns the architecture of meaning. They design contractual foundations that are executable and enforced, not just documented. They build lineage infrastructure that carries context through transformations without losing it. They govern the Contextualize pipeline and the Context Store — the infrastructure where definitions get built, validated, and formalized into what everything downstream depends on.
And crucially, this isn't only a technical role. Context erosion is as much an organizational failure as a technical one. Teams don't share semantic definitions because no ownership model incentivizes them to do so. Nobody enforces contracts because producing teams have no accountability to the consumers they serve. The new data engineer sits at the intersection of architecture and coordination — the two things that are genuinely irreducible to automation.
The title "Data Engineer" might need updating. What we're actually describing is a Context Architect — someone whose primary material is not data movement but data meaning, not pipelines but provenance, not transformation logic but the semantic infrastructure that makes transformation logic trustworthy.
If you're working in data engineering — and especially if you're working on AWS infrastructure like RDS, MWAA, DMS, or Glue — the ECL framework offers a useful lens for thinking about the value you're creating.
Building another ETL pipeline is increasingly table stakes. Building the lineage graph that tracks what that pipeline does to the semantics of the data — that's the durable investment. Documenting a data contract in Confluence is documentation. Enforcing it as an executable constraint that fails loudly when violated — that's architecture.
The practitioners who invest in the architectural and organizational work of context now will define the discipline for the decade ahead. The frontier is genuinely open. That's not a threat — it's an invitation.
Based on "Data Engineering After AI" by Ananth Packkildurai, Data Engineering Weekly (February 2026). All interpretations are my own.