Overcoming Challenges with Apache Hudi in AWS Glue
This article is co-authored by Sanee Salim & Sathiyakugan Balakrishnan
Brief Overview of AWS Glue and Apache HUDI
AWS Glue is a serverless data integration service that makes it easy to prepare and load data for analytics. It provides a managed environment for running ETL (Extract, Transform, Load) jobs.
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework used to simplify incremental data processing and data pipeline building.
Introduction
While working with AWS Glue to write data to S3, we faced significant challenges with data deduplication. Our data analysis involved reprocessing all-time data, resulting in resource-intensive jobs. To address these issues, we explored various options, including Apache Iceberg, Delta Lake, and Apache Hudi. Ultimately, we chose Apache Hudi due to its native support in Glue 4 (with Hudi version 0.12), favorable benchmarks, and ease of implementation. However, integrating Hudi brought its own set of challenges and learnings.
Writing Path
The following is an inside look at the Hudi write path and the sequence of events that occur during a write. This sequence ensures efficient handling and processing of data with Hudi by performing essential steps such as deduplication, indexing, partitioning, maintaining data integrity, and optimizing storage. Learn more about this at Writing Path.

Issues Encountered and Resolutions
Issue 1: Data Upsert Complications
With the default Hudi version 0.12 in Glue 4, we encountered issues with the hoodie.bloom.index.update.partition.path configuration during data upserts. In our case, we wanted to set this to true so that an updated record moves to its new partition path. For example:
- Record 1:
{name: Sanee, gender: Female, date: 2024-07-07, status: ACTIVE}
- Record 2:
{name: Sanee, gender: Male, date: 2024-07-08}
- Initial Outcome:
{name: Sanee, gender: Male, date: 2024-07-08}
- Desired Outcome:
{name: Sanee, gender: Male, date: 2024-07-08, status: ACTIVE}
In this scenario, while no data appeared in the 2024-07-07 partition as expected, the 2024-07-08 partition did not retain the status from Record 1 due to the upsert behavior.
Resolution: We resolved this issue by upgrading to Hudi version 0.14.0.
Further Reading: We could not find this issue documented anywhere online.
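For reference, here is a minimal sketch of the writer configuration involved, as it might look in a Glue 4 PySpark job. The table name, field names, and S3 path are illustrative rather than our production values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative batch matching the example records above.
df = spark.createDataFrame(
    [("Sanee", "Male", "2024-07-08")], ["name", "gender", "date"]
)

hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.recordkey.field": "name",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.write.precombine.field": "date",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "BLOOM",
    # When a record's partition path changes, delete it from the old
    # partition and write it into the new one instead of leaving it behind.
    "hoodie.bloom.index.update.partition.path": "true",
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/users"))
```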
Issue 2: Metadata Synchronization with AWS Glue and Athena
Initially, we utilized Hudi's metadata sync feature to synchronize data with Glue tables for querying in Athena. Due to the serverless nature of AWS Glue, we faced synchronization discrepancies where the data in S3 did not match the data queried in Athena. Besides these discrepancies, we also saw our Glue jobs failing with the error "Failed to sync hive metastore".
Resolution: Coinciding with our needs, AWS Glue released crawler support for Hudi formats. We disabled the Hudi sync and opted to use Glue crawlers to keep Athena up to date.
Sidenote: AWS CloudFormation does not support Hudi crawler targets yet.
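A minimal sketch of the change on the writer side, continuing the hudi_options from the Issue 1 sketch and assuming the sync was previously enabled through the standard datasource options:

```python
# Before: Hudi's built-in catalog sync (the part that failed for us).
# "hoodie.datasource.hive_sync.enable": "true",
# "hoodie.datasource.hive_sync.mode": "hms",

# After: disable the built-in sync and let a Glue crawler with a Hudi
# target keep the Glue Data Catalog (and therefore Athena) up to date.
hudi_options["hoodie.datasource.hive_sync.enable"] = "false"
```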
Further Reading: This issue was also highlighted in the following articles and issues:
- https://github.com/apache/hudi/issues/9134
- https://medium.com/@life-is-short-so-enjoy-it/hudi-0-11-aws-glue-doesnt-work-yet-c61f3aec6638
Issue 3: Handling Duplicates in the Same Batch
Our custom payload classes were designed to handle upserts intelligently. Specifically, we wanted to manage the retention of duplicate status records based on their timestamps. Here are examples illustrating our challenges and solutions:
Scenario 1: Retain the earliest record if the status is unchanged.
- Record 1:
{name: Sanee, gender: Male, date: 2024-07-07, status: ACTIVE}
- Record 2:
{name: Sanee, gender: Male, date: 2024-07-08, status: ACTIVE}
- Initial Outcome:
{name: Sanee, gender: Male, date: 2024-07-08, status: ACTIVE}
- Both records are in the same batch, leading to the retention of the most recent record due to Hudi's default pre-combine logic.
- Desired Outcome:
{name: Sanee, gender: Male, date: 2024-07-07, status: ACTIVE}
Scenario 2: Retain the most recent record if the status changes.
- Record 1:
{name: Sanee, gender: Male, date: 2024-07-07, status: ACTIVE}
- Record 2:
{name: Sanee, gender: Male, date: 2024-07-08, status: REMOVED}
- Desired Outcome:
{name: Sanee, gender: Male, date: 2024-07-08, status: REMOVED}
Challenge: Hudi's default pre-combine behavior on the record key retained the most recent record regardless of whether the status had changed, which was not desirable in Scenario 1.
Resolution: To resolve this, we set hoodie.combine.before.upsert to false in the Hudi configuration. We then implemented custom deduplication logic within our Spark code to handle the records appropriately within each mini-batch, ensuring data integrity and consistency. A sketch of this logic follows.
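The deduplication can be expressed with window functions. Below is a minimal sketch of the idea, not our exact production code; the column names name, date, and status mirror the examples above.

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Let our own logic decide which duplicate survives within a batch.
hudi_options["hoodie.combine.before.upsert"] = "false"

# One window per record key, spanning the whole key group in the batch.
w = Window.partitionBy("name")

deduped = (
    df
    # Count how many distinct statuses this key has in the batch.
    .withColumn("n_statuses", F.size(F.collect_set("status").over(w)))
    # Unchanged status -> keep the earliest record (Scenario 1);
    # changed status   -> keep the most recent record (Scenario 2).
    .withColumn(
        "keep_date",
        F.when(F.col("n_statuses") == 1, F.min("date").over(w))
        .otherwise(F.max("date").over(w)),
    )
    .filter(F.col("date") == F.col("keep_date"))
    .drop("n_statuses", "keep_date")
)
```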
Issue 4: Handling Nulls in Record Updates
Initially, we used the OverwriteWithLatestAvroPayload class (https://github.com/a0x8o/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/OverwriteWithLatestAvroPayload.java) with Hudi version 0.12 for handling upserts. This approach presented significant challenges, as it indiscriminately upserted null values when they were present in new incoming records. This was particularly problematic in scenarios where only a subset of fields needed updating while the others needed to retain their existing values.
Challenge: OverwriteWithLatestAvroPayload upserts all fields from the incoming record, including nulls. This behavior led to unwanted overwriting of existing data with null values, thereby compromising data integrity. For example:
- Original Record:
{name: "Sanee", gender: "Female", status: "Active"}
- Incoming Record:
{name: "Sanee", gender: null, status: "Inactive"}
- Undesired Outcome Using OverwriteWithLatestAvroPayload:
{name: "Sanee", gender: null, status: "Inactive"}
Here, the gender information is inadvertently lost because the incoming record contains a null value for gender.
Desired Functionality: We needed a payload class that could selectively update fields that are non-null in the incoming record, leaving all other fields untouched.
Resolution: To address this, we switched to Hudi version 0.14, which supports the PartialUpdateAvroPayload class. This class allows for partial updates, where only the non-null fields in the new record overwrite their corresponding fields in the existing record, preserving data accuracy and completeness.
- Optimal Outcome Using PartialUpdateAvroPayload:
{name: "Sanee", gender: "Female", status: "Inactive"}
The introduction of the PartialUpdateAvroPayload class in Hudi version 0.14 offered a more nuanced approach to data handling, allowing us to maintain the integrity of our dataset by preventing unwanted null overwrites.
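Switching payload classes is a single configuration change; a sketch, continuing the options from the earlier examples:

```python
# Only non-null fields in the incoming record overwrite the stored values.
hudi_options["hoodie.datasource.write.payload.class"] = (
    "org.apache.hudi.common.model.PartialUpdateAvroPayload"
)
```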
Issue 5: Optimising Index Performance
We transitioned to using the Record Level Index with the upgrade to Hudi version 0.14.0, which drastically improved the performance of our data upsert operations by reducing lookup times.
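A sketch of the relevant 0.14 options; note that the record-level index is stored in Hudi's metadata table, which therefore has to be enabled for the index to be built:

```python
hudi_options.update({
    # The record-level index lives in the metadata table.
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
    # Use the record-level index for upsert key lookups.
    "hoodie.index.type": "RECORD_INDEX",
})
```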
Issue 6: Partition Optimisation for Enhanced Performance
We reduced the number of partitions by removing an unnecessary partition path, which resulted in a 10x improvement in query performance and a 3x improvement in upsert performance.
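As an illustration (the field names are hypothetical, not our schema), dropping a date component from the partition path looks like this:

```python
# Before (hypothetical): a date component in the partition path exploded
# the table into far too many small partitions.
# "hoodie.datasource.write.partitionpath.field": "category,date",

# After: partitioning only on the coarser column.
hudi_options["hoodie.datasource.write.partitionpath.field"] = "category"
```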
Issue 7: Metadata Management and Performance Degradation
After disabling Hudi's metadata table (hoodie.metadata.enable), we observed a 20% increase in query performance and further reductions in batch runtime, leading to more efficient and predictable data management cycles. Note that we were querying this data via AWS Athena (which is Presto-based), so results may vary from case to case.
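The change itself is a single option; be aware that turning the metadata table off also forgoes features that live inside it, such as the record-level index from Issue 5:

```python
# Stop maintaining Hudi's internal metadata table. In our case, Athena
# queries were faster without it.
hudi_options["hoodie.metadata.enable"] = "false"
```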
Further Reading: This issue also seems to be raised here: https://github.com/apache/hudi/issues/6881
More About Our Data and the Results
In our journey to optimize data processing with Apache Hudi within AWS Glue, we focused on handling two main raw tables:
- Raw data: These tables contained 915,389,967 records amounting to 339.7 GB.
Our objective was to process these tables efficiently and store the results in conformed Hudi tables for the “Raw” dataset.
Statistics Comparison: Raw vs. Hudi
We observed significant improvements when transitioning from raw data storage to Hudi-conformed tables. The comparison metrics are as follows:

| Metric | Raw Data | Hudi Conformed |
| --- | --- | --- |
| Number of Records | 915,389,967 | 874,965,596 |
| Size of Data (GB) | 339.7 | 218.7 |
| Number of Objects | 2,027 | 61,928 |
Processing Times
One of the key benefits of using Hudi was the reduction in processing times:
| Operation | Time Taken Now | Time Taken Without Resolving the Above Issues |
| --- | --- | --- |
| Initial Dataset Load | 40 minutes and 25 seconds | 16 hours, 57 minutes, and 1 second |
| Daily Upsert | 12 minutes and 57 seconds | Timeout (could not proceed) |
- Average records received daily: 95,000
- Daily data size (GB): 4.5
Partition Count Optimisation
By optimizing our partition strategy, we achieved a drastic reduction in the number of partitions:
- Before Optimisation: 400,000 partitions (including an unnecessary date partition)
- After Optimisation: 20,000 partitions
Summary of Improvements
The transition to Hudi and the optimization strategies we implemented led to:
- Improved query and upsert performance due to better partitioning and metadata management.
- Drastic reduction in processing times, enabling more efficient data handling and faster insights.
Conclusion
Implementing Apache Hudi within AWS Glue presented a learning curve and several technical challenges. Through these experiences, we have enhanced our data handling capabilities and established a more reliable and efficient data pipeline within our AWS environment. Please comment below or reach out to us via LinkedIn to share your thoughts!