Jeanxyz
Valued Contributor

partitioning concept

I'm studying Apache Spark, and the concept of partitioning caught my interest. I would like to check whether my understanding below is correct:

Partitioning only works when writing data out as files, using the PySpark code below:

df.write \
  .mode("overwrite") \
  .partitionBy("department") \
  .parquet(output_path)

 

However, if I write the data into a table, it seems I am not allowed to partition it. The code below throws an error:

df.write.mode("overwrite").format("delta").partitionBy("department").saveAsTable("data_withoutpart")

So if my understanding is correct, partitioning is not useful in my case at all, since all my data needs to be written into Delta tables in the end for use by Power BI reports.

 

Please correct me if I got it wrong or refer me to relevant tutorials.

 

 
 
6 REPLIES
Ugk161610
Contributor

Hi @Jeanxyz ,

Your understanding is close, but there's one important detail missing: you can partition Delta tables; it just depends on where and how the table is created. The error you're seeing doesn't mean Spark doesn't support partitioning; it usually means the target table already exists without partitions, or that Fabric created the table with a different structure than what your write command expects.

 

If the table already exists as a non-partitioned Delta table, Spark won't let you overwrite it with a partitioned version. In that case, you'd have to drop the table first and then recreate it with partitionBy(). But writing a partitioned Delta table itself is completely supported.
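
For example, a drop-and-recreate could look like the sketch below. It reuses the table and column names from the earlier snippets, and df is assumed to be your source DataFrame. (Delta also has an overwriteSchema write option that can replace the partition layout on overwrite, but drop-and-recreate is the simplest to illustrate.)

# Drop the existing non-partitioned table, then recreate it partitioned.
spark.sql("DROP TABLE IF EXISTS data_withoutpart")

df.write \
  .mode("overwrite") \
  .format("delta") \
  .partitionBy("department") \
  .saveAsTable("data_withoutpart")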

Also, even for Power BI scenarios, partitioning can help, especially when your dataset grows and you want faster refresh or more efficient incremental loads. Power BI doesn't care whether the underlying data is partitioned; it only reads the Delta table through the SQL endpoint or Spark.

 

So in short:
Partitioning is definitely not useless for Delta tables. The issue comes from trying to overwrite an existing table with a different partition structure. If you create the Delta table fresh with partitionBy(), it will work normally.

 

Hope this helps. If so, please give a Kudos 👍 or mark as Accepted Solution ✔️

 

– Gopi Krishna

Jeanxyz
Valued Contributor

You are right. I made a syntax mistake when writing partitioned tables. Below is the corrected version:

df2.write.mode("overwrite").format("delta").saveAsTable("data_withoutpart2", partitionBy="department")

Ugk161610
Contributor

Hi @Jeanxyz ,

Your understanding is close, but there's an important detail to clarify. Partitioning does work when writing Delta tables; it doesn't only apply to writing files. The issue you hit is just the syntax.

 

In PySpark, the usual pattern is to call the partitionBy() method on the writer before saveAsTable(), so your corrected syntax is one valid way to express it. The writer-method form looks like this:

 

df.write \
  .mode("overwrite") \
  .format("delta") \
  .partitionBy("department") \
  .saveAsTable("data_with_partition")

 

Spark will create a fully valid partitioned Delta table, and this works perfectly fine with Power BI, Fabric Lakehouse, and SQL endpoint queries.
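
If you want to double-check the layout, one quick sketch is to inspect the table's partitionColumns field with Delta's DESCRIBE DETAIL command (the table name matches the snippet above):

# partitionColumns should list ["department"] for the table created above.
spark.sql("DESCRIBE DETAIL data_with_partition") \
     .select("partitionColumns") \
     .show(truncate=False)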

 

So partitioning is still useful even if your final destination is a Delta table. As long as your table has a natural grouping column with low-to-moderate cardinality (like department, date, region, or category), partitioning can improve performance for large datasets.
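
As a small illustration of the benefit, here is a sketch that reuses the data_with_partition table from above (the 'HR' value is made up): a read that filters on the partition column lets Spark prune straight to the matching department folders instead of scanning the whole table.

# Partition pruning: only the department='HR' partition is read.
hr_only = spark.table("data_with_partition").filter("department = 'HR'")
hr_only.show()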

 

Your updated example using:

 

saveAsTable("data_withoutpart2", partitionBy="department")

is also valid; PySpark's saveAsTable() accepts partitionBy as a keyword argument.

 

So yes, your data can be written as Delta and still be partitioned without any problem. You weren't running into a conceptual limitation, just a small syntax detail.

 

Hope this helps. If it does, please give a Kudos 👍 or mark as Accepted Solution ✔️

 

– Gopi Krishna

Jeanxyz
Valued Contributor

Thanks for the clarification, @Ugk161610.

One last question: if my Delta table is partitioned by 'department' but I'm running a merge query on 'employee_id', the partitioning won't help in that case. Is that correct?

 

Ugk161610
Contributor

Hi @Jeanxyz ,

 

Yes, you're absolutely right. If your Delta table is partitioned by department but your merge condition is based on employee_id, then the partitioning doesn't really help during the merge. Spark will still need to scan all partitions to find the matching employee_id values, because the partition column isn't part of the predicate.

 

Partitioning only speeds things up when your filter, join, or merge condition includes the partition column itself. For example, a merge or query that filters on department = 'HR' would benefit immediately, but a merge on employee_id won't take advantage of the department partitions.
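
One common workaround, sketched below with the delta-spark Python API and illustrative names (an employees table partitioned by department, and an updates_df source DataFrame), is to add the partition column to the merge condition whenever the source rows carry it. That lets Delta prune partitions during the merge:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "employees")

# Adding t.department = s.department lets Delta skip partitions whose
# department values don't appear in the source.
target.alias("t") \
    .merge(updates_df.alias("s"),
           "t.employee_id = s.employee_id AND t.department = s.department") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()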

 

So the general rule is:
Pick a partition column only if your downstream workloads actually filter or process data using that column. Otherwise, partitioning doesn't provide much benefit.

 

Hope this helps, and glad the earlier correction was useful. If this answers your question, please mark it as Accepted Solution ✔️

 

– Gopi Krishna

