Databricks (DBRX), published Jun 16, 2026

What is data pipeline architecture?


Summary

A well-designed data pipeline architecture separates ingestion, transformation, storage, and serving into distinct layers, with the choice of pattern (batch, streaming, medallion, Kappa, etc.) driven by your latency and cost requirements, not convention.

ELT has largely replaced ETL as the dominant approach because modern cloud platforms make it practical to load raw data first and transform it in place, preserving flexibility for reprocessing and downstream reuse.

Databricks unifies batch and streaming pipelines on a single platform (Lakeflow + Delta Lake + Unity Catalog), eliminating the duplicate infrastructure and governance gaps that make traditional Lambda-style architectures brittle.

Data pipeline architecture is the end-to-end design of how data is collected, processed, stored and delivered from source systems to the people, applications and models that use it. The word “architecture” refers to the blueprint, not the pipeline itself. It covers the choices about how data flows, where it gets transformed and which tools handle each step along the way.

Good architecture is matched to the use case rather than picked off a shelf. A data pipeline built for real-time fraud detection looks very different from one that produces a nightly sales report, even though both move data from source to destination.

This glossary page covers the core layers every pipeline shares, the common stage models, the major architectural patterns and the best practices that keep pipelines reliable as they scale.

How does data pipeline architecture work?

A data pipeline moves data through a series of stages, and each stage has a specific job: gather the data, clean it up, store it and make it usable. Architecture is the plan for how those stages connect. It defines what happens to the data at each step, in what order and under what rules.

Architecture decisions sit at two levels. The logical design defines which stages exist and what each one does: this is “the what.” The physical design defines which specific tools and infrastructure run each stage: this is “the how.” Orchestration (the automatic scheduling and coordination of each step) and monitoring don’t belong to any single stage. They run across the whole pipeline.

Modern platforms have also collapsed an old divide. With Lakeflow, Databricks unifies batch and streaming pipelines on a single foundation, so teams don’t have to build and maintain two parallel systems.

The core layers of a data pipeline

Regardless of the pattern a team chooses, every data pipeline is built on the same four layers.
Each layer answers a different question about the data: how it gets in, how it becomes useful, where it lives and who consumes it.

Ingestion

Ingestion pulls data into the pipeline from source systems: databases, applications, APIs, files in cloud storage, event streams and sensors. Data ingestion comes in two flavors. Batch ingestion pulls data on a schedule, such as every hour or every night. Streaming ingestion captures data continuously as events happen. Many pipelines also use change data capture (CDC), a method that tracks row-level changes in a source database so the pipeline moves only what’s new or updated instead of reloading everything.

Processing and transformation

This layer is where raw data gets cleaned, reshaped, enriched and prepared for use. Typical work includes fixing missing values, standardizing formats, joining datasets and applying business logic, the same tasks at the heart of ETL. Processing follows the same split as ingestion. Batch processing works on large chunks of data together, while stream processing handles records one at a time or in tiny micro-batches as they arrive.

Storage

Storage is where processed data lands so it can be queried, analyzed or fed to models. The destination is typically a data lake, a data warehouse or a lakehouse, a single system that combines the strengths of both. Format matters as much as location. Open formats like Delta Lake and Apache Iceberg let multiple tools read the same data without copying it from system to system. Delta Lake also adds reliability features such as ACID transactions (a guarantee that writes either fully succeed or fully fail, preventing corruption) and time travel (the ability to query older versions of a table).

Serving and consumption

The final layer delivers prepared data to the people and systems that need it: analysts running SQL queries, business users working in dashboards, data scientists training models and applications calling APIs.
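
As a rough illustration of how these layers fit together, the sketch below applies a CDC-style change feed to a stored table and then serves it with a SQL query. It is a minimal sketch under stated assumptions, not how Lakeflow or Delta Lake implement any of this: Python's built-in `sqlite3` stands in for the storage layer, and the `CHANGES` feed, the `orders` table and the function names are all hypothetical.

```python
import sqlite3

# Hypothetical change feed: (operation, primary key, row values).
# A real CDC tool would read these events from the source database's log.
CHANGES = [
    ("insert", 1, {"customer": "Ada", "total": 40.0}),
    ("insert", 2, {"customer": "Grace", "total": 15.5}),
    ("update", 1, {"customer": "Ada", "total": 55.0}),
    ("delete", 2, None),
    ("insert", 3, {"customer": "Linus", "total": 9.0}),
]


def build_store() -> sqlite3.Connection:
    """Storage layer stand-in: an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    return conn


def apply_changes(conn: sqlite3.Connection, changes) -> None:
    """Ingestion: apply only the changed rows, in order, instead of reloading."""
    for op, key, row in changes:
        if op == "delete":
            conn.execute("DELETE FROM orders WHERE id = ?", (key,))
        else:
            # Inserts and updates are both written as an upsert on the key.
            conn.execute(
                "INSERT OR REPLACE INTO orders (id, customer, total) VALUES (?, ?, ?)",
                (key, row["customer"], row["total"]),
            )
    # Commit once so the whole batch of changes lands together.
    conn.commit()


def serve_orders(conn: sqlite3.Connection):
    """Serving: the kind of SQL query an analyst or dashboard might run."""
    return conn.execute(
        "SELECT customer, total FROM orders ORDER BY id"
    ).fetchall()


conn = build_store()
apply_changes(conn, CHANGES)
rows = serve_orders(conn)
print(rows)
```

After the feed is applied, the table holds only the current state of each surviving row: the update to order 1 overwrote its earlier value and order 2 is gone.
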
Destinations range from BI tools to ML platforms to operational systems, with a data warehouse often sitting at the center of analytics workloads.

Across all four layers, orchestration and observability do the connective work: scheduling jobs, tracking data quality and raising alerts when something breaks.

How many stages are in a data pipeline? (3 vs. 4 vs. 5)

Different sources describe data pipelines as having three, four or five stages, which causes plenty of confusion. The reality is simpler. All three models describe the same underlying work at different levels of detail.

Model | Stages | When you'll see it used
3-stage | Sources → Processing → Destination | High-level explanations, executive overviews, intro-level content
4-stage | Ingestion → Processing → Storage → Serving | Most common in modern data engineering. Balances clarity and detail
5-stage | Collection → Ingestion → Processing → Storage → Analysis | Detailed technical breakdowns. Splits “getting data” into collection (from the source) and ingestion (into the pipeline)
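
One way to see that the three models in the table describe the same work is to write the work once and group it three ways. The sketch below is only an illustration: the five step functions, the sample data and the way steps are assigned to the 3-stage and 4-stage labels are assumptions made for this example, not a mapping the source defines.

```python
from typing import Any, Callable, Dict, List

Step = Callable[[Any], Any]


# Five fine-grained steps operating on hypothetical sample data.
def collect(_: Any) -> List[str]:
    return ["  42 ", "7", "", "13 "]  # raw values as the source emits them


def ingest(raw: List[str]) -> List[str]:
    return list(raw)  # copy the raw values into the pipeline


def process(rows: List[str]) -> List[int]:
    # Trim whitespace, drop empty values, convert to integers.
    return [int(r) for r in (x.strip() for x in rows) if r]


def store(values: List[int]) -> List[int]:
    return sorted(values)  # stand-in for writing to a table


def analyze(table: List[int]) -> int:
    return sum(table)  # stand-in for a report or dashboard metric


# The same five steps, grouped under each model's stage labels.
STAGE_MODELS: Dict[str, Dict[str, List[Step]]] = {
    "3-stage": {
        "Sources": [collect],
        "Processing": [ingest, process],
        "Destination": [store, analyze],
    },
    "4-stage": {
        "Ingestion": [collect, ingest],
        "Processing": [process],
        "Storage": [store],
        "Serving": [analyze],
    },
    "5-stage": {
        "Collection": [collect],
        "Ingestion": [ingest],
        "Processing": [process],
        "Storage": [store],
        "Analysis": [analyze],
    },
}


def run(model: Dict[str, List[Step]]) -> Any:
    """Run every step of every stage in order, passing data along."""
    data: Any = None
    for steps in model.values():
        for step in steps:
            data = step(data)
    return data


results = {name: run(model) for name, model in STAGE_MODELS.items()}
print(results)
```

All three groupings execute the same steps in the same order, so they return the same value.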

The number of stages is a labeling choice. The work the pipeline performs is the same.

Common data pipeline architecture patterns

Architectural patterns are the established designs teams choose from when building pipelines. The right one depends on latency requirements, data volume and how the data will be used downstream.

Batch architecture

Batch architecture processes data in scheduled chunks: every hour, every night or every week. It fits reporting, historical analysis, ML training data and any use case where minutes or hours of delay are acceptable. Batch pipelines are simpler to build, cheaper to run and easier to debug than their streaming counterparts. The trade-off is freshness. When decisions depend on what happened seconds ago, batch can’t keep up.

Streaming architecture

Streaming architecture processes...

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Routine educational blog post by Databricks.