TNastyCodes
New Contributor II

Unable to open large Parquet file from S3 connection

Hi all, 

 

I'm currently using a notebook connected to a lakehouse to establish my bronze layer tables. I have a relatively large Parquet file (approx. 7 GB) that I'm unable to read in. I've copied three files from an S3 connection into my lakehouse: two are relatively small, and the third is the 7 GB one. The first two were read in correctly but had data issues that prevented me from saving them as tables, which I resolved by adding the following configuration changes:

spark.conf.set("spark.sql.parquet.writeLegacyFormat","true")
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
From there, I was able to read and set up tables for the first two files. For the third, large file, I'm only trying to read it in for now, without setting up a table for it yet, but the read consistently fails with the following error:
Py4JJavaError: An error occurred while calling o10861.parquet. : Operation failed: "Internal Server Error", 500, HEAD, http://onelake.dfs.fabric.microsoft.com/{warehouse_id}/Files/{file_path}?upn=false&action=getStatus&...

I tried adding other configuration changes such as the following:
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
spark.conf.set("spark.sql.parquet.timestampNTZ.enabled", "true")
 
I also tried adding the following options to my read statement:
.option('mergeSchema', 'false')
.option('parquet.ignoreMetadata','true')
.option('parquet.maxMetadataLength', 256*1024*1024)
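Roughly, the full read is shaped like this (the path below is a placeholder for my actual file path):
# Shape of the read attempt; the path is a placeholder for the real file in the lakehouse
df = (
    spark.read
        .option('mergeSchema', 'false')
        .option('parquet.ignoreMetadata', 'true')
        .option('parquet.maxMetadataLength', 256*1024*1024)
        .parquet('Files/bronze/large_file.parquet')
)
df.printSchema()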
 
I then tried to read it using OPENROWSET and BULK:
spark.sql('''
    SELECT *
    FROM OPENROWSET(
        BULK 'abfss://*@onelake.dfs.fabric.microsoft.com/*/file_path',
        FORMAT = 'PARQUET'
    ) AS test
''')
 
And lastly, to make the operation easier, I tried passing the schema explicitly into the read (roughly the sketch below), but to no avail; the same error persists.
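The schema attempt was shaped roughly like this (the field names here are placeholders, not my actual columns):
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Placeholder schema; the real one lists the file's actual columns and types
manual_schema = StructType([
    StructField('id', LongType(), True),
    StructField('value', StringType(), True),
])

df = spark.read.schema(manual_schema).parquet('Files/bronze/large_file.parquet')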
Using the option to create a Delta table directly from the file manager also does not work.
None of these attempts worked. Any idea why, or ways I can remedy this? Am I missing something?
 
v-echaithra
Honored Contributor

Hi @TNastyCodes ,

Check Connection and Confirm S3 Access: Verify that your Spark cluster can access the S3 bucket and has the necessary permissions by listing files or performing a simple file read operation.
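For example, something like this can confirm the copied file is actually visible from the notebook (the relative path assumes a default lakehouse is attached and is only a placeholder):
from notebookutils import mssparkutils

# List the lakehouse Files area and check the large Parquet file shows up with the expected size
for f in mssparkutils.fs.ls('Files/bronze'):
    print(f.name, f.size)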

Increase Resources with Memory and Executor Tuning for Spark: Adjust Spark's executor memory, driver memory, and partition settings to allocate sufficient resources for handling large files efficiently.
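For example (values are illustrative only; in Fabric the executor and driver memory come from the capacity/pool or session-level configuration, so only partition-level settings can be changed mid-session):
# Smaller input splits and more shuffle partitions reduce the memory needed per task
spark.conf.set('spark.sql.files.maxPartitionBytes', str(128 * 1024 * 1024))
spark.conf.set('spark.sql.shuffle.partitions', '400')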

Optimize the Read: Disable schema merging and metadata reading (mergeSchema=false, ignoreMetadata=true) to avoid issues when reading large or inconsistent Parquet files.

Check for File Corruption: Use pyarrow or another tool to verify whether the Parquet file is corrupted or unreadable outside of Spark.
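For example, reading only the footer metadata and then one row group at a time tends to surface corruption without loading all 7 GB at once (the path is a placeholder and assumes the default lakehouse mount):
import pyarrow.parquet as pq

# Footer/metadata check - fails fast if the Parquet footer itself is unreadable
pf = pq.ParquetFile('/lakehouse/default/Files/bronze/large_file.parquet')
print(pf.metadata)

# Scan row group by row group to localize any corrupt section
for i in range(pf.num_row_groups):
    batch = pf.read_row_group(i)
    print(f'row group {i}: {batch.num_rows} rows OK')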

Split the File: If all else fails, split the large file into smaller chunks (using tools like aws s3 cp), then read and process each chunk separately.
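One way to do the actual splitting, once the file is accessible (sketched here with pyarrow by row group; paths are placeholders, and this assumes the individual row groups are readable):
import os
import pyarrow.parquet as pq

src_path = '/lakehouse/default/Files/bronze/large_file.parquet'   # placeholder
out_dir = '/lakehouse/default/Files/bronze/chunks'                # placeholder
os.makedirs(out_dir, exist_ok=True)

src = pq.ParquetFile(src_path)
# Write each Parquet row group out as its own smaller file
for i in range(src.num_row_groups):
    pq.write_table(src.read_row_group(i), f'{out_dir}/part_{i:04d}.parquet')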

Hope this helps.
Thank you

TNastyCodes
New Contributor II

Hi @v-echaithra,

Thanks for your response! My Spark cluster does have access to the S3 location, since the other two files from the same source can be read in and saved as Delta tables after transformations. My read logic already includes those two optimization options as well, but it sadly still fails.

I'll try playing with the memory and executor tuning, as well as attempting a read with pyarrow/pandas.

It might be ideal to just break the file down into smaller chunks, but how is that done with s3 cp?
