I'm currently trying to run a script in an MS Fabric notebook environment that is attached to a Lakehouse using a table shortcut:
spark.conf.set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark.conf.set("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark.conf.set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
spark.conf.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/lakehouse/default/Files/gcs_key.json")
spark.conf.set("spark.hadoop.google.cloud.auth.null.enable", "false")
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.json.keyfile", "/lakehouse/default/Files/gcs_key.json")
# Read the shortcut table and export it to the GCS path as JSON
query = "SELECT * FROM dbo_Geography_shortcut"
max_records_per_file = 50000
mode = "append"
format = "json"
gcs_path = "gs://qa_dmgr_audience-prod-replica-1_eu_6bf6/15070/raw_test_5/"
df = spark.sql(query)
df.write.option("maxRecordsPerFile", max_records_per_file).mode(mode).format(format).save(gcs_path)

This script creates staging files in the GCS path while uploading and then tries to delete them once the final files are written. The service account doesn't have any delete permission assigned to it, so the Spark job fails. I can't get delete permission added, because that is restricted by my company.
Error:
Py4JJavaError: An error occurred while calling o5174.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Authorized committer (attemptNumber=0, stage=27, partition=0) failed; but task commit success, data duplication may happen. reason=ExceptionFailure(org.apache.spark.SparkException,[TASK_WRITE_FAILED] Task failed while writing rows to gs://qa_dmgr_audience-prod-replica-1_eu_6bf6/15070/raw_test.,[Ljava.lang.StackTraceElement;@7e3f1bb3,org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to gs://qa_dmgr_audience-prod-replica-1_eu_6bf6/15070/raw_test.
at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:776)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:499) ..........
.........
Caused by: java.io.IOException: Error deleting 'gs://qa_dmgr_audience-prod-replica-1_eu_6bf6/15070/raw_test/_temporary/0/_temporary/attempt_202507291107556685858892966191353_0027_m_000000_129/part-00000-bb5e042b-8f4d-4e6d-84a5-91149d3429ad-c000.json', stage 2 with generation 1753787276139795

I need to somehow avoid the creation of these staging files in the first place and upload directly to the GCS path using the same script.
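For completeness, here is how the missing permission could be confirmed directly against the bucket. This is only a sketch, it is not part of the script above, and it assumes the google-cloud-storage Python package is available in the notebook session (otherwise install it with %pip install google-cloud-storage):

from google.cloud import storage

# Sketch: list which object permissions the service-account key actually holds
# on the target bucket (bucket name and keyfile path are the same ones used above).
client = storage.Client.from_service_account_json("/lakehouse/default/Files/gcs_key.json")
bucket = client.bucket("qa_dmgr_audience-prod-replica-1_eu_6bf6")

granted = bucket.test_iam_permissions(
    ["storage.objects.create", "storage.objects.list", "storage.objects.delete"]
)
print(granted)  # storage.objects.delete is expected to be missing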
Please help me on this!
Thanks in advance!
Hello @Harsha_k_111,
Thank you for reaching out to the Microsoft Fabric Community Forum.
We understand that the script is failing because Spark writes temporary staging files under a _temporary directory in the GCS path and then tries to delete them when the job commits, and your service account lacks delete permissions due to company restrictions.
To resolve this, you can configure the job to write to the GCS path more directly, so that it does not depend on the default staging-and-cleanup behavior that requires delete permissions. Specifically, the GCS connector exposes output-stream / direct-upload settings for this, and Spark's file output committer can be tuned as well; a sketch of these settings is shown below. This should allow the job to complete successfully while adhering to your permission constraints.
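For illustration, here is a minimal sketch of the kind of settings meant above. The exact property names are assumptions based on the GCS Hadoop connector documentation and may differ with the connector and Spark versions bundled in the Fabric runtime, so please verify them in your environment rather than treating this as a confirmed fix:

# Assumption: the connector supports direct uploads, so objects are written
# straight to their destination instead of being staged first.
spark.conf.set("spark.hadoop.fs.gs.outputstream.direct.upload.enable", "true")

# Commit algorithm v2 moves each task's output to the final path at task-commit
# time, reducing the rename work done under the _temporary directory. Note the
# committer may still attempt to clean up _temporary at the end of the job.
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")

# Then write exactly as before:
df.write.option("maxRecordsPerFile", max_records_per_file).mode(mode).format(format).save(gcs_path)

If the committer still needs delete access for its cleanup step, an alternative that only requires create permission is to write the JSON output to the Lakehouse Files area first and then upload the resulting part files to GCS with the google-cloud-storage client.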
Best regards,
Ganesh Singamshetty
Hello @Harsha_k_111,
We hope you're doing well. Could you please confirm whether your issue has been resolved or if you're still facing challenges? Your update will be valuable to the community and may assist others with similar concerns.
Thank you.
Hello @Harsha_k_111,
Hope everything's going great with you. Just checking in: has the issue been resolved, or are you still running into problems? Sharing an update can really help others facing the same thing.
Thank you.
Hello @Harsha_k_111,
Could you please confirm if your query has been resolved by the provided solutions? This would be helpful for other members who may encounter similar issues.
Thank you for being part of the Microsoft Fabric Community.