Hi all,
I'm currently using a notebook connected to a lakehouse to build my bronze layer tables. I have a relatively large Parquet file (approx. 7 GB) that cannot be read in. I've copied my 3 files from an S3 connection into my lakehouse; two are relatively small and the third is the 7 GB one. The first two were being read in correctly but had data issues that prevented saving them as tables, which I resolved by adding the following configuration changes:
Hi @TNastyCodes,
1. Check the connection and confirm S3 access: verify that your Spark cluster can reach the S3 bucket and has the necessary permissions by listing files or performing a simple read (see the first sketch below).
2. Increase resources with memory and executor tuning: adjust Spark's executor memory, driver memory, and partition settings so the session has enough resources to handle a large file efficiently (sketch below).
3. Optimize the read with mergeSchema=false and ignoreMetadata=true: disabling schema merging and extra metadata handling can avoid issues when reading large or inconsistent Parquet files (sketch below).
4. Check for file corruption: use pyarrow or another tool to verify that the Parquet file is readable outside of Spark (sketch below).
5. Split the file: if all else fails, break the file into smaller chunks (for example, copy it with aws s3 cp and re-write it as several smaller Parquet files), then process each chunk separately (sketch below).
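For the first suggestion, a minimal sketch of a quick access check, assuming a default lakehouse is attached and the copied files sit under a hypothetical Files/bronze/ folder:

```python
# Confirm the Spark session can see the copied files.
# Relative "Files/..." paths assume a default lakehouse is attached; adjust as needed.
for f in mssparkutils.fs.ls("Files/bronze/"):
    print(f.name, f.size)

# Smoke test: read a few rows from one of the smaller files.
spark.read.parquet("Files/bronze/small_file.parquet").limit(5).show()
```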
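For the second suggestion, a rough sketch. In a Fabric notebook, driver and executor sizes are normally set when the session starts, for example with the %%configure cell magic (run in its own cell before other code) or through the attached Environment / Spark pool, rather than mid-session. The values below are placeholders, not recommendations:

```
%%configure
{
    "driverMemory": "28g",
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 4
}
```

Partition-related settings can still be adjusted within a running session:

```python
# Session-level settings; values are illustrative only.
spark.conf.set("spark.sql.shuffle.partitions", "400")             # more, smaller shuffle tasks
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")  # 128 MB input splits
```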
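For the third suggestion, a minimal sketch of the read. mergeSchema is a standard Parquet read option; ignoreMetadata is reproduced from the suggestion above and isn't an option I can confirm in core Spark, so treat it as an assumption (Spark's file sources generally ignore option keys they don't recognize):

```python
# Read the large file with schema merging disabled (file name is a placeholder).
df_big = (
    spark.read
         .option("mergeSchema", "false")    # don't merge schemas across files/row groups
         .option("ignoreMetadata", "true")  # reproduced from the suggestion above
         .parquet("Files/bronze/big_file.parquet")
)
df_big.printSchema()
```

If corrupt footers or row groups are suspected, setting spark.sql.files.ignoreCorruptFiles to true is another lever, at the cost of silently skipping unreadable data.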
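For the pyarrow integrity check, a sketch assuming the attached lakehouse is available at the usual /lakehouse/default local mount and a hypothetical file name:

```python
import pyarrow.parquet as pq

# Local mount path of the attached lakehouse; file name is a placeholder.
path = "/lakehouse/default/Files/bronze/big_file.parquet"

pf = pq.ParquetFile(path)
print(pf.metadata)       # rows, row groups, created_by
print(pf.schema_arrow)   # column names and types

# Read one row group at a time; a corrupted row group raises an error here
# without needing to hold the whole 7 GB file in memory at once.
for i in range(pf.metadata.num_row_groups):
    _ = pf.read_row_group(i)
    print(f"row group {i} OK")
```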
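For the last suggestion, note that aws s3 cp copies an object but does not split it. One way to get smaller pieces, once the file is in the lakehouse, is to re-write it row group by row group with pyarrow (paths are placeholders):

```python
import os
import pyarrow.parquet as pq

src_path = "/lakehouse/default/Files/bronze/big_file.parquet"   # placeholder
out_dir = "/lakehouse/default/Files/bronze/big_file_chunks"     # placeholder
os.makedirs(out_dir, exist_ok=True)

src = pq.ParquetFile(src_path)
for i in range(src.metadata.num_row_groups):
    chunk = src.read_row_group(i)   # one row group at a time keeps memory bounded
    pq.write_table(chunk, f"{out_dir}/part_{i:05d}.parquet")

# Spark can then read the folder of smaller files as one dataset.
df = spark.read.parquet("Files/bronze/big_file_chunks")
```

If the file was written as a single very large row group, splitting by row group won't help, and the file would need to be re-written with smaller row groups at the source.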
Hope this helps.
Thank you
Hi @v-echaithra,
Thanks for your response! My Spark cluster does have access to the S3 location, since the other two files from the same source are read in and saved as Delta tables after transformations. My read logic already includes those two optimization options, but it still fails.
I'll try playing with the memory and executor tuning, as well as attempting the read with pyarrow/pandas.
It might be ideal to just break the file down into smaller chunks, but how is that done with s3 cp?