Posts

Showing posts from March, 2026

Python, Spark, PySpark — a small tech story

Monday morning. Typical data team standup. Ravi says, “Got the dataset. Around 2 GB. I’ll handle it with Python.” No drama. He opens his laptop, writes a quick Python script, uses pandas, does some transformations, done. Fast, clean, local.

A few weeks later… same team, same Ravi, different situation. “Guys… the data is now 2 TB.” He tries the same Python script. The laptop slows down. Memory crashes. Everything stalls.

This is where Apache Spark enters. Now Ravi realizes this is no longer a single-machine problem. Spark distributes the data across multiple machines and processes it in parallel. But there’s a catch. Ravi is comfortable in Python. Not Scala. Not Java. So instead of switching languages, he uses PySpark. Now he writes code in Python style, but under the hood, Spark runs it across a cluster.

So what’s really different?

Aspect   | Python               | Spark                          | PySpark
Type     | Programming language | Distributed processing engine | Python API for Spark
Best for | Smal...
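To make the contrast concrete, here is a minimal sketch of the same aggregation in both worlds. The file paths and the column names (region, amount) are hypothetical, purely for illustration:

```python
# --- Small data: pandas on one laptop ---
import pandas as pd

df = pd.read_csv("sales.csv")                    # whole file fits in local memory
summary = df.groupby("region")["amount"].sum()   # runs on a single machine
print(summary)

# --- Big data: PySpark on a cluster ---
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales").getOrCreate()

sdf = spark.read.csv("s3://bucket/sales/", header=True, inferSchema=True)
summary = sdf.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()   # same Python-style code, executed in parallel across the cluster
```

The point is not the syntax, which stays close to pandas, but where the work happens: pandas computes on one machine, while PySpark hands the plan to Spark to run across many.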

Databricks CDC Explained Simply: With an Oracle-to-Databricks Comparison

In many enterprise data programs, one phrase appears again and again: CDC, or Change Data Capture. To technical teams, CDC is familiar. But to managers, business stakeholders, and even many experienced database professionals entering the Databricks world, it can sound like one more cloud-era term that needs decoding.

The good news is that the core idea is simple. CDC means: instead of reloading all data every time, capture and process only what changed. That is it. Yet behind that simple idea lies one of the most important shifts in modern data engineering. And for organizations moving from Oracle-driven data platforms to Databricks-based lakehouse architectures, understanding CDC properly can make a big difference in cost, speed, design, and operating model.

A simple way to understand CDC: imagine a school register with 2,000 students. Every day, only a handful of things change. A few students are absent. One student joins. Another leaves. Someone’s contact number is corrected...
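In Databricks terms, the register example often ends up as a Delta Lake MERGE: only the day’s changed rows are applied to the target table. Here is a minimal sketch, assuming the delta-spark package and hypothetical names (school.students, student_id, and an op column marking the change type):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Only today's changes: a student joins, one leaves, a contact number is corrected.
changes = spark.read.format("delta").load("/landing/student_changes")

students = DeltaTable.forName(spark, "school.students")

(students.alias("t")
    .merge(changes.alias("c"), "t.student_id = c.student_id")
    .whenMatchedDelete(condition="c.op = 'DELETE'")        # student leaves
    .whenMatchedUpdateAll(condition="c.op = 'UPDATE'")     # e.g. corrected contact number
    .whenNotMatchedInsertAll(condition="c.op = 'INSERT'")  # new student joins
    .execute())
```

The full register of 2,000 students is never reloaded; only the handful of changed rows are touched.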

DLT vs Non-DLT in Databricks: An attempt to come up with guidelines for Oracle and Traditional ETL Teams

The first time I heard people talking seriously about DLT in Databricks, I noticed something interesting. The room was full of smart people, but everyone seemed to mean slightly different things when they said “DLT.” Some people were talking about automation. Some were talking about data quality. Some were treating it like just another pipeline feature.

And people coming from Oracle or traditional ETL backgrounds often had the same silent question: “Fine, but what is it really? And how is it different from the way we already build pipelines?” I could relate to that question immediately. Because if you have spent years around Oracle, PL/SQL, scheduler jobs, ETL tools, control tables, recovery scripts, and operational dashboards, then modern cloud data platform language can sometimes sound more complicated than it needs to be.

In the older world, even when things were complex, the mental model was clear. You had jobs. You had dependencies. You had scheduling. You had ...
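For readers who want to see the shape of it, here is a minimal sketch of the declarative DLT style. It assumes the code runs inside a Databricks DLT pipeline, where the dlt module and the spark session are provided by the runtime; the dataset names and the quality rule are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

# `spark` is supplied by the DLT pipeline environment.

@dlt.table(comment="Raw orders landed from cloud storage")
def orders_raw():
    return spark.read.format("json").load("/landing/orders")

@dlt.table(comment="Cleaned orders, with bad rows dropped by an expectation")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_clean():
    return dlt.read("orders_raw").withColumn("loaded_at", F.current_timestamp())
```

You declare the tables and the rules; DLT infers the dependency order, handles retries, and tracks data quality, the parts an Oracle team would otherwise build with scheduler jobs, control tables, and recovery scripts.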

Data Story @ Bricks

A few years ago, building a data platform felt like managing a crowded marketplace. There was a data lake sitting quietly in object storage. A warehouse lived somewhere else for dashboards. ETL pipelines ran in their own tool. Streaming had another engine. Machine learning experiments happened in separate notebooks. Each team had its space. Each system did its job. But they didn’t naturally work together.

Now picture a fast-growing retail company expanding across cities. Sales data flows in daily. Engineers load raw files into cloud storage. Analysts copy pieces into a warehouse for reports. Data scientists request extracts to build models. Meanwhile, governance teams try to answer simple questions like, “Who accessed this table?” The answers aren’t always clear. Nothing is completely broken. But everything feels stitched together.

Databricks entered this story with a different idea. Instead of improving the stitching, it asked: What if the lake itself could act like a wareh...