Python, Spark, PySpark — a small tech story
Monday morning. Typical data team standup.
Ravi says, “Got the dataset. Around 2 GB. I’ll handle it with Python.”
No drama. He opens his laptop, writes a quick Python script, uses pandas, does
some transformations — done. Fast, clean, local.
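Roughly what that script looks like. A minimal sketch, assuming a hypothetical sales.csv with order_date and amount columns:

```python
import pandas as pd

# Load the ~2 GB file into memory in one go (fine at this size).
# File and column names are hypothetical.
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# A couple of typical transformations.
df = df[df["amount"] > 0]                       # drop bad rows
df["month"] = df["order_date"].dt.to_period("M")
monthly = df.groupby("month")["amount"].sum()   # aggregate

monthly.to_csv("monthly_totals.csv")
```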
A few weeks later…
Same team, same Ravi — different situation.
“Guys… the data is now 2 TB.”
He tries the same Python script.
Laptop slows down. Memory errors. Everything stalls. The reason: pandas pulls the entire dataset into RAM, and no laptop has 2 TB of it.
This is where Apache Spark enters.
Now Ravi realizes — this is no longer a single-machine problem.
Spark splits the dataset into partitions, spreads them across multiple machines, and processes those partitions in parallel.
But there’s a catch.
Ravi is comfortable in Python. Not Scala. Not Java.
So instead of switching languages, he uses PySpark.
Now he writes code in Python style, but under the hood, Spark runs it across a
cluster.
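The same job in PySpark might look like this. A minimal sketch with the same hypothetical file and columns, using standard PySpark DataFrame calls:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On a real cluster this session talks to the cluster manager;
# locally it just runs in-process.
spark = SparkSession.builder.appName("monthly-totals").getOrCreate()

# Spark reads the data in partitions across worker nodes instead of
# loading it all into one machine's RAM.
# File and column names are hypothetical.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

monthly = (
    df.filter(F.col("amount") > 0)                    # drop bad rows
      .withColumn("month", F.date_format("order_date", "yyyy-MM"))
      .groupBy("month")
      .agg(F.sum("amount").alias("total_amount"))     # aggregate in parallel
)

monthly.write.mode("overwrite").csv("monthly_totals")
```

Same Python, same DataFrame mindset. The difference: Spark plans these steps lazily and runs them across the cluster when the write kicks off the job.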
So what’s really different?
| Aspect | Python | Spark | PySpark |
|---|---|---|---|
| Type | Programming language | Distributed processing engine | Python API for Spark |
| Best for | Small to medium data | Massive datasets (GB → PB) | Large data with Python |
| Execution | Single machine | Cluster (multiple nodes) | Cluster (via Spark) |
| Ease of use | Very easy | Medium (needs setup, cluster) | Easy if you know Python |
| Performance | Limited by local resources | Highly scalable | Scalable + Python-friendly |
| Typical tools | pandas, NumPy | Spark Core, Spark SQL | PySpark DataFrames |
| Use cases | Scripting, ML, APIs | ETL, big data pipelines | Data engineering on Spark |
Quick reality check
If you process a CSV on your laptop with pandas → mostly Python
If you process 10 TB of logs across a cluster → Spark
If you write that Spark job in Python → PySpark
Closing thought
Start with Python.
Move to Spark when scale demands it.
Use PySpark when you want scale without leaving Python.
