
How ProCogia Reduced Azure Storage Costs by 95% for a Marine & Logistics Client


Introduction

Storage costs in cloud environments can quickly become unmanageable. For a Marine and Logistics client, ProCogia implemented a Delta Lake vacuuming strategy on Azure, reducing storage from 40 TB to 500 GB and cutting monthly storage costs by more than 95%. Here’s how we did it.

 

Challenge  

ProCogia’s client faced significant expenses related to their Azure resource utilization, with storage costs being a primary concern. They were aware of the substantial amount of data stored in their Azure Blob Storage and were keen to explore strategies to reduce these costs. Our task was to identify and implement effective solutions to optimize their storage and lower expenses.

 

Approach  

To address the client’s storage cost concerns, we leveraged Delta Lake’s VACUUM operation on their Blob Storage. This operation systematically removes obsolete, unreferenced data files that accumulate in Delta tables through versioning and transactional updates. A minimal sketch of the command is shown first for context, followed by the steps we used to apply it across the client’s tables.
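The snippet below is only an illustrative sketch: the table path is a placeholder rather than the client’s actual location, and 720 hours corresponds to the 30-day retention period used throughout this post.

# VACUUM deletes data files that are no longer referenced by the Delta
# transaction log and are older than the retention threshold (720 hours = 30 days).
# "/mnt/<mount_name>/<delta_table_path>" is a placeholder path.
spark.sql("VACUUM '/mnt/<mount_name>/<delta_table_path>' RETAIN 720 HOURS")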

  • Identify the Azure Blob Storage container that holds the Delta tables and agree on a retention period of 30 days. This retention period was specified by the client so that only files older than 30 days are removed during the vacuum operation; the duration can be adjusted to balance data availability (for example, time travel) against cost. An alternative way to configure the retention period at the table level is sketched after this list.
  • Develop a PySpark script in Databricks to perform the vacuum operation.
  • Create a mount point for the Blob Storage container by specifying the container name and retrieving the storage account key securely from Azure Key Vault. Mount points in Databricks provide seamless access to files stored in the Blob Storage account. The following script can be used as a reference for setting up the mount point.
# Mount the Blob Storage container; the storage account key is retrieved from an
# Azure Key Vault-backed secret scope instead of being hard-coded in the notebook.
dbutils.fs.mount(
    source='wasbs://<container-name>@<storage_account_name>.blob.core.windows.net/',
    mount_point='/mnt/<mount_name>',
    extra_configs={
        'fs.azure.account.key.<storage_account_name>.blob.core.windows.net':
            dbutils.secrets.get(scope='storage-account-key', key='account-key')
    }
)
  • The next step is to use the mount point created above to locate the Delta tables. Once the path is identified, run the vacuum operation with the 30-day retention period (RETAIN 720 HOURS), so that obsolete files older than 30 days are removed while data integrity is maintained. The script below can be used as a reference.
# Path to the directory in the mounted container that holds the Delta table folders
delta_file_path = "/mnt/<mount_name>/<delta_file_path>"

def calculate_folder_sizes(delta_file_path, vacuum="no"):
    """Calculate the size of each Delta table folder under delta_file_path and,
    if requested, run the vacuum operation on each table first."""

    # List the Delta table folders in the storage container
    folder_names = [f.name.rstrip("/") for f in dbutils.fs.ls(delta_file_path)]

    # Accumulate per-folder size information
    folder_info_list = []

    for folder_name in folder_names:
        delta_folder_path = f"{delta_file_path}/{folder_name}"
        if vacuum.lower() == "yes":
            # Remove unreferenced files older than the retention period
            # (720 hours = 30 days)
            spark.sql(f"VACUUM '{delta_folder_path}' RETAIN 720 HOURS")

        # Sum the sizes of the files listed directly under the folder
        folder_size_bytes = sum(f.size for f in dbutils.fs.ls(delta_folder_path))

        # Convert the folder size to gigabytes (GB)
        folder_size_gb = folder_size_bytes / (1024 ** 3)
        # Append folder name and size to the list
        folder_info_list.append({"Folder Name": folder_name, "Size (GB)": folder_size_gb})

    # Create a DataFrame from the list
    return spark.createDataFrame(folder_info_list)

# Calculate and print the total size of all folders before vacuum
folder_df_before_vacuum = calculate_folder_sizes(delta_file_path)
total_size_gb = folder_df_before_vacuum.selectExpr("sum(`Size (GB)`)").collect()[0][0]
print(f"\nTotal size of all folders before vacuum: {total_size_gb:.4f} GB")

# Calculate and print the total size of all folders after vacuum
folder_df_after_vacuum = calculate_folder_sizes(delta_file_path, vacuum="yes")
total_size_gb = folder_df_after_vacuum.selectExpr("sum(`Size (GB)`)").collect()[0][0]
print(f"\nTotal size of all folders after vacuum: {total_size_gb:.4f} GB")

# Join the before/after sizes per folder, renaming the size columns first so the
# joined DataFrame has unambiguous column names
df = (
    folder_df_before_vacuum.withColumnRenamed("Size (GB)", "Size Before Vacuum (GB)")
    .join(
        folder_df_after_vacuum.withColumnRenamed("Size (GB)", "Size After Vacuum (GB)"),
        on="Folder Name",
        how="left",
    )
)
df.show(50, truncate=False)
  • The script above calculates the size of each Delta table folder both before and after the vacuum operation. With the retention period set to 30 days, a Databricks job was scheduled to run the script automatically on the first day of each month (see the scheduling note below).
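In the Databricks Jobs scheduler, a run on the first day of each month can be expressed with a Quartz cron expression such as 0 0 0 1 * ? (midnight on day 1 of every month); the exact expression and timezone shown here are illustrative assumptions rather than the client’s configuration.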
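As a complementary option to the path-based SQL VACUUM used in the script, the retention period can also be recorded as a property on each table and the vacuum invoked through the Delta Lake Python API. The snippet below is a minimal sketch, not the client’s production code; it assumes the delta-spark library bundled with the Databricks runtime, and the table path is a placeholder.

from delta.tables import DeltaTable

# Hypothetical table path, for illustration only
table_path = "/mnt/<mount_name>/<delta_table_path>"

# Record the 30-day retention period as a table property so it lives with the
# table definition rather than only in the job code.
spark.sql(f"""
    ALTER TABLE delta.`{table_path}`
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')
""")

# Run the vacuum through the Python API; 720 hours corresponds to 30 days.
DeltaTable.forPath(spark, table_path).vacuum(720)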

 

Result 

  • Before Vacuum: In the given scenario, the Delta folders consumed a total of 40 TB of storage across multiple folders, resulting in a monthly storage cost of $2,800.
  • After Vacuum: Following the vacuum steps outlined in the approach, storage usage dropped to just 500 GB across the same folders, saving 39.5 TB and reducing the monthly cost to $35.
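For context, these figures imply an effective rate of roughly $0.07 per GB per month: $2,800 ÷ ~40,960 GB ≈ $0.068 per GB, and 500 GB × $0.068 ≈ $34, consistent with the reported $35. The per-GB rate is derived from the numbers above rather than from a specific Azure price sheet.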

 

By implementing the vacuum operation, storage size was reduced from 40 TB to 500 GB, a reduction of more than 95%. This optimization not only slashed storage costs but also enhanced the query performance of the Delta tables. The client was delighted with the transformative results and the overall efficiency achieved.

 

Conclusion 

To conclude, optimizing storage costs using vacuum operations on Delta tables demonstrates the power of strategic data management. By removing obsolete files while retaining essential data for time travel and query performance, organizations can achieve significant cost savings and operational efficiency. The approach outlined not only reduces storage overhead but also ensures the Delta Lake ecosystem remains performant and manageable. ProCogia’s implementation for the client showcased a dramatic reduction in storage size, highlighting the value of leveraging advanced data governance techniques. With a clear strategy, automation, and adherence to best practices, businesses can unlock the full potential of their data systems while maintaining cost-effectiveness.

Looking to optimize cloud storage costs? Contact ProCogia’s Data Engineering experts today to maximize efficiency and savings.
