Introduction
In the era of data-driven decisions, managing storage costs efficiently is crucial for sustainable growth. Delta Lake, an open-source storage layer that integrates with Apache Spark, provides a powerful foundation for building reliable and scalable data pipelines. It brings ACID transaction capabilities to data lakes, ensuring data consistency and reliability. One key feature of Delta Lake is its vacuum operation, which optimizes storage by removing data files that are no longer referenced by the table, reducing storage costs and keeping the table lean and manageable.
This blog explores the purpose and benefits of the vacuum operation in Delta Lake, detailing how it works, when to use it, and the best practices to implement it effectively. Real-world examples will help illustrate its practical value in optimizing storage and maintaining a cost-effective data infrastructure.
Purpose of the Vacuum Operation
Delta tables, a key feature in Delta Lake, ensure data reliability and scalability in a data lake architecture. Over time, as data is modified or deleted, old versions of files accumulate due to Delta’s versioning and transaction log mechanism. These obsolete files increase storage costs unnecessarily. The vacuum operation removes such outdated files, reclaiming storage while maintaining the integrity of the table for query performance and history access. Here’s why these files accumulate:
1. Overwriting Data
When new data is written to a Delta table using operations like INSERT OVERWRITE or similar, the old data files associated with the table are no longer needed in the active dataset. However, Delta Lake retains them for a certain period (default 7 days) to support time travel and version control.
Example:
- Scenario: You overwrite a dataset containing sales data for a specific date.
- Result: The previous files with outdated sales data are marked as inactive but remain in storage until vacuum removes them.
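For illustration, here is a minimal sketch of this pattern (the table names are hypothetical, not from the original example):
-- Replace the current contents of a daily sales table with freshly staged data.
-- The files holding the previous version stay on storage for time travel
-- until VACUUM removes them after the retention period.
INSERT OVERWRITE TABLE sales_daily
SELECT * FROM staging_sales WHERE sale_date = '2024-01-15';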
2. Performing Updates or Deletes
Delta Lake uses a copy-on-write mechanism for updates and deletes. Instead of modifying data in-place, it writes new data files reflecting the changes and marks the old files as stale.
Example:
- Scenario: You delete all rows where region = 'North' in your table.
- Result: Delta Lake writes new files excluding the North region rows while retaining the original files for time travel.
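A brief sketch of the copy-on-write behaviour (the table name is illustrative):
-- Delete one region's rows; Delta rewrites the affected files rather than
-- editing them in place, so the pre-delete files remain on storage until VACUUM.
DELETE FROM sales_transactions WHERE region = 'North';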
3. Optimizing Operations Like OPTIMIZE
The OPTIMIZE operation in Delta Lake consolidates smaller files into larger ones to improve query performance. During this process, old, fragmented files are replaced by optimized files, and the fragmented ones are marked as unnecessary.
Example:
- Scenario: After a series of incremental data loads, you optimize the table to reduce file fragmentation.
- Result: Optimized files replace the smaller files, and the original fragmented files are retained for time travel until vacuum removes them.
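A hedged sketch of this step (the ZORDER column shown is just an example):
-- Compact many small files into fewer, larger ones; the replaced small files
-- are kept on storage for time travel until VACUUM removes them.
OPTIMIZE sales_transactions;
-- Optionally, co-locate data on a frequently filtered column while compacting:
-- OPTIMIZE sales_transactions ZORDER BY (region);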
By regularly performing the vacuum operation, you clean up these stale files, reducing storage costs and maintaining efficient table performance. This ensures that only the necessary files for the active table version are retained, aligning with your data retention policy.
Benefits of Vacuum Operation
1. Reduced Storage Costs: Removes unnecessary data, reducing overall storage footprint.
2. Improved Performance: Clearing out stale files keeps the table directory lean, reducing file-listing overhead for maintenance and metadata operations.
3. Enhanced Manageability: Helps maintain a clean and efficient storage environment.
4. Data Governance Compliance: Ensures only relevant data files are stored, aiding in audits and regulatory compliance.
Best Practices for Running Vacuum
1. Test in a Non-Production Environment: Always test the vacuum operation before deploying it in production to avoid accidental data loss.
2. Balance Retention Needs and Costs: Determine an optimal retention period that aligns with your data lifecycle policies.
3. Monitor Regularly: Use monitoring tools to track storage costs and ensure the vacuum operation is effective.
Steps to Perform Vacuum Operation
1. Understand Retention Periods
Delta Lake maintains a default retention threshold of 7 days to protect against accidental data deletion. Before running a vacuum, analyse the data lifecycle to determine the optimal retention period.
Example Scenario: A sales transactions Delta table stored in Azure Blob Storage is updated daily. You determine that retaining unreferenced files for only 3 days is sufficient.
Command:
-- Disable the safety check that blocks retention periods shorter than the default 7 days
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
-- Vacuum the Delta table, retaining only the last 72 hours (3 days) of unreferenced files
VACUUM sales_transactions RETAIN 72 HOURS;
Result: The operation removes all unreferenced files older than 72 hours.
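If you want to preview the cleanup before deleting anything, Delta Lake also supports a dry run; a quick sketch using the same table:
-- List the files that would be removed, without actually deleting them
VACUUM sales_transactions RETAIN 72 HOURS DRY RUN;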
2. Analyse Storage Utilization Before Vacuum
Use commands to check the current storage footprint of your Delta table. This provides a baseline to quantify the impact of the vacuum operation.
Example Scenario: You run the following in Databricks to check storage usage:
-- Inspect storage metrics for the Delta table at this path
DESCRIBE DETAIL '/mnt/delta/sales_transactions';
Result: Displays the total number of files and size before vacuum.
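The table history is also worth a quick look before shortening retention, since it shows how far back recent writes, deletes, and optimizes go; for example:
-- Review recent commits and their timestamps to confirm how much history you still need
DESCRIBE HISTORY sales_transactions;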
3. Perform Vacuum Operation
Execute the vacuum command on the Delta table to remove stale data files. Be cautious about the retention period to avoid accidental data loss.
Example Scenario: Vacuum the table while ensuring no data under active analysis is removed.
VACUUM sales_transactions RETAIN 168 HOURS; -- Retain one week of files
Result: Unreferenced files are deleted, reducing storage footprint.
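If you are unsure about the right window, a safer sketch is to omit RETAIN and rely on the default threshold:
-- Uses the default retention threshold of 168 hours (7 days)
VACUUM sales_transactions;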
4. Validate Post-Vacuum State
After vacuuming, verify that the table is consistent and the performance has improved.
Example Scenario: Query the table and check the execution plan to ensure reduced file scanning:
-- Query to check data availability
SELECT COUNT(*) FROM sales_transactions;
-- Check file scanning efficiency
EXPLAIN SELECT COUNT(*) FROM sales_transactions;
Result: An unchanged row count confirms that no active data was removed, and the execution plan shows queries scanning only the files referenced by the current table version.
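Re-running the inspection from step 2 gives a concrete before/after comparison; for example:
-- Compare numFiles and sizeInBytes against the pre-vacuum baseline
DESCRIBE DETAIL '/mnt/delta/sales_transactions';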
5. Automate and Monitor the Process
Schedule vacuum operations and monitor storage trends to ensure sustained optimization.
Example Scenario: Set up a weekly job in Databricks that runs on a schedule and executes a notebook containing the vacuum script.
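A minimal sketch of the notebook cell such a job could run (the retention value is an assumption; align it with your own policy, and the OPTIMIZE step is optional):
-- Weekly maintenance: optionally compact files, then remove stale files
OPTIMIZE sales_transactions;
VACUUM sales_transactions RETAIN 168 HOURS;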
Result: Automatic cleanup ensures consistent storage efficiency without manual intervention.
6. Final Results
Before Vacuum: In the same example, let’s say the sales_transactions table consumes 2 TB of storage spread across 20,000 files, incurring a monthly storage cost of $140.
After Vacuum: After performing the vacuum steps (1 to 5), storage utilization drops to 500 GB across 8,000 files, resulting in a savings of 1.5 TB. This not only reduces the storage cost to $35/month (a 75% reduction) but also improves query runtime by 15%, optimizing both cost and performance.
Conclusion
Vacuum operations in Delta Lake offer a simple yet powerful way to manage storage costs while maintaining performance. By understanding the retention policy and leveraging automated workflows, organizations can optimize storage and focus resources on actionable data insights.
👉 Learn how ProCogia’s data engineering team helped a client reduce storage costs by 90%. [Read the full story]