Transforming Data in Azure Data Factory Pipelines


[Header image: raw data formats such as CSV, JSON, and XML flowing through a series of transformation stages, shifting from bronze (raw) to silver (structured) to gold (aggregated) data stored in the cloud.]

Introduction

As Data Engineers, one of our main responsibilities is moving data, whether incrementally or in bulk. With the advent of data lakes, or more recently data lakehouses, the target for such moves is typically a cloud-hosted storage location such as AWS S3 or Azure Blob Storage. The data sources, of course, can be many and varied.

The curation process may follow the well-known medallion architecture, with engineering pipelines tasked with delivering data in both raw and processed forms. Curation of raw, or bronze, data is straightforward: data is saved in its source form, which could be CSV, XML, JSON, or even something like PDF. Data at the silver and gold levels, however, is typically stored in tabular form, with the Delta format from Databricks being a popular choice.

For bronze-to-silver workloads, data isn’t just transferred; it often needs to be transformed from the source format to the format used in your lakehouse. This is even more important for silver-to-gold workloads, which involve specific business aggregations. In both cases, your pipeline should not only move the data but also manipulate or modify it in the process.

This blog addresses that transformation process and discusses options for doing so in the context of silver-to-gold workloads. Here we adopt Azure Data Factory (ADF) as our technology of choice and propose a generic solution that supports the definition of transformations solely using SQL.


Moving Data: ADF's Raison d'être

As any Data Engineer working with Azure will know, Data Factory is a key technology. Its core function, and its reason to exist, is to move data from one place to another, which it does very well, even at scale, supporting many data types and connectors as sources or sinks.

An Azure Data Factory pipeline is built graphically, in a UI, by sequencing together activities. We're not going to go into all the different activity types, or how they work. Suffice to say that when it comes to moving data, the Copy activity does all the heavy lifting for you! It also supports some limited ability to transform data. For example, it can take a JSON source, flatten it, and convert it to a tabular form such as Parquet. That can be pretty handy!

After landing bronze data into a storage account, you'll need a bronze-to-silver pipeline to convert the data into a standard form. Also typical at the silver layer is the curation of each logical data set in a tabular form. On top of that, as well as flattening the data, a little cleaning may be needed; for example, flagging or removing badly formed email addresses.
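As a sketch of that kind of light clean-up in PySpark (the path, column name, and pattern here are purely illustrative):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative only: read data already flattened to Parquet at the bronze layer
# (the path and column names are placeholders).
customers = spark.read.parquet(
    "abfss://bronze@<storage-account>.dfs.core.windows.net/customers"
)

# Deliberately simple email check; tighten or replace as your data requires.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

customers_silver = customers.withColumn(
    "email_is_valid", F.col("email").rlike(EMAIL_PATTERN)
)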

But what if you need more done? Often you'll be asked to aggregate, join, filter, or otherwise combine multiple silver tables into a single gold table, for a sales report, for example. How can we do that within ADF?


Options for Transforming Data

There are actually a myriad of options! Let me list a few for you.

  1. Build an Azure Function App and call it from your pipeline
  2. Use Azure Databricks
    • Create a notebook and call it
    • Create a Python/Spark module and call that
  3. Build a Data Flow inside ADF and embed it into your pipeline
  4. Use Azure Synapse
    • Create a Synapse notebook and call that

The list goes on! Anyway, you can appreciate that there are many approaches to choose from, many of which might be influenced by your choice of target curation format. If it is Parquet, you could build Function Apps in Python and process source data using Pandas dataframes. Pandas is a great choice and something all Data Engineers should be familiar with. Furthermore, I really like Azure Function Apps: they are lightweight and easy to build, and Microsoft has done a great job in providing support directly within VS Code, making it easy to develop and deploy a Function App right from your IDE. Plus, there are now Python libraries available that support the Delta format (e.g. pip install deltalake). These are definitely worth a try.
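As a rough sketch of that lightweight route (file locations below are placeholders, and the deltalake API is worth checking against its current documentation):

import pandas as pd
from deltalake import write_deltalake  # pip install deltalake

# Sketch only: load a raw file with pandas, do a little shaping, then persist it
# as a Delta table, no Spark cluster required (paths are placeholders).
orders = pd.read_csv("raw/orders.csv")
orders["order_date"] = pd.to_datetime(orders["order_date"])

write_deltalake("silver/orders", orders, mode="overwrite")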

However, when it comes to working with Delta files, my experience is that it's best to go with Spark. In fact, all but the first option in the list above use Spark! Even Data Flow in ADF will spin up a Spark cluster behind the scenes; it's just a little hidden from you, that's all. If you want to steer clear of coding, Data Flow is a good option.

While UI-based tools such as the aforementioned Data Flow offer an easy and friendly way to manipulate data, they can sometimes exhibit frustrating quirks or limitations, especially if you're trying to do something a little more complex. Anyway, from an earlier blog post [ref: 1], you can see that I like to build Python-based Spark apps. Coding is the most comprehensive tool, offering the greatest flexibility when developing a solution. This is option 2(b) above. The Python/Spark module can be built and tested in a local development environment, such as VS Code, version controlled in its own git repo, and, when ready, easily deployed to an Azure Databricks cluster. Then all you need to do is call it from your pipeline.

Azure DevOps provides a rich set of tools to automate many engineering tasks. If you’re interested, see [ref: 2] for details on how to automate deployment of Python/Spark modules into a Databricks cluster.  


Databricks SQL Serverless 

At ProCogia, we have developed a number of Python/Spark modules that run in an Azure Databricks environment. The approach is solid. For each new report, a gold table would be developed in Python/Spark. Unit tests would be identified, implemented, and added to the project to help ensure a degree of quality and support regression testing. Each would be version-controlled as an Azure DevOps git project. I find that these software engineering disciplines prove very conducive to building high-quality, robust data pipelines.

However, with some of the business reports, the gold tables, while not trivial, were simple to pull together from their dependent silver sources. In those cases, the software engineering effort of building a new project seemed a little onerous; overkill, you might say. So we looked into another approach: something simpler and more lightweight.

Upon receipt of a request for a new gold table, we found ourselves initially developing the aggregation in SQL, in this case using Databricks SQL Serverless. For this, we had already curated a version of the necessary data in a SQL Warehouse.

So it seemed natural to leverage this effort. The question was: how could we enable the SQL, any SQL, to be executed from a Data Factory pipeline?


A Generic Spark SQL Approach

Our solution was to develop, just once, a generic parameterized Python/Spark module that could run any SQL for us, and which in turn would be deployed to Databricks. The requirements for this script would be as follows:

  1. Accept a string containing the SQL to run as a parameter
  2. Accept, as a parameter, the name of the table to be given to the results of the aggregation
  3. Execute the SQL using the silver tables in the lakehouse as the source
  4. Write the result out to a gold table in the lakehouse, named according to (2), supporting both:
    • overwriting the existing contents of the target gold table, or
    • incrementally appending to it


Details

Let’s get into the details a little. 

So the general idea is a single, generic Python/Spark script that takes a SQL string as a parameter and executes it in a Spark context. Seems easy enough.
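In outline it might look something like the sketch below, where the use of argparse and the parameter names are choices of mine rather than a prescribed interface:

import argparse
from pyspark.sql import SparkSession

# Parameter names are illustrative choices, not a prescribed interface.
parser = argparse.ArgumentParser()
parser.add_argument("--sql", required=True)            # the aggregation SQL to run
parser.add_argument("--target-table", required=True)   # where to write the gold table
parser.add_argument("--write-mode", default="overwrite",
                    choices=["overwrite", "append"])    # replace, or add incrementally
args = parser.parse_args()

spark = SparkSession.builder.getOrCreate()

result_df = spark.sql(args.sql)                         # execute the supplied SQL

(result_df.write
          .format("delta")
          .mode(args.write_mode)
          .save(args.target_table))                     # e.g. a path in the gold container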

However, what about the references to the silver tables? Those have to be made visible to Spark. One way to do that is to use the statement:
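# Continuing the sketch: load a silver table (assumed here to be Delta; the path
# is a placeholder) and register it under the name the SQL refers to.
orders_df = spark.read.format("delta").load(
    "abfss://silver@<storage-account>.dfs.core.windows.net/sales/orders"
)
orders_df.createOrReplaceTempView("orders")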

This would have to be done for each source table. No problem, we can pass the names of those source tables in as a parameter to the Python module, just like the SQL statement. Then simply call createOrReplaceTempView() in a loop.
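Wrapped up as a small helper, that might look something like this; note that deriving the view name from the last segment of each slash-separated path is an assumption about your layout, not a requirement:

SILVER_ROOT = "abfss://silver@<storage-account>.dfs.core.windows.net"  # placeholder

def register_silver_views(spark, source_table_paths):
    """Register each silver table as a temp view named after its last path segment."""
    for table_path in source_table_paths:
        view_name = table_path.rstrip("/").split("/")[-1]   # "sales/orders" -> "orders"
        (spark.read.format("delta")
              .load(f"{SILVER_ROOT}/{table_path}")
              .createOrReplaceTempView(view_name))

# e.g. the paths arrive as one more pipeline parameter and are split on commas
register_silver_views(spark, ["sales/orders", "sales/customers"])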

A quick aside:

One disadvantage with this approach is that it does not easily support unit testing. This seems to be a general statement when it comes to testing SQL. While there are many tools such as SQL Server Management Studio (SSMS), DBeaver, or Azure Data Studio, I find it odd that no one has yet developed an IDE for SQL that supports SQL coding, unit test development, and git integration in the way that VS Code or IntelliJ do.

While this worked fine, we found that it was far too easy for errors to creep into our parameters, particularly the naming of the source tables. Our table names weren't just table names, by the way; they were slash-separated file paths to the folder locations in the Azure blob store, and as such they were easily mistyped. To help with this we developed regular expressions that would cross-check the provided SQL against the list of table names. We used two separate regular expressions: one to extract the table name that follows a SQL FROM or JOIN keyword, and another to pick out the table names inside a WITH clause.
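In outline they looked something like the patterns below; the exact character classes depend on how your table references are written, so treat these as a sketch rather than the exact production expressions:

import re

# Table references that follow FROM or JOIN, e.g. "FROM orders o" or "JOIN customers c"
# (illustrative; widen or narrow the character class to match your naming).
FROM_JOIN_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w./-]+)", re.IGNORECASE)

# Names defined in a WITH clause, e.g. "WITH recent_orders AS (",
# so that CTEs can be told apart from real source tables during the cross-check.
WITH_CTE_RE = re.compile(r"\b(\w+)\s+AS\s*\(", re.IGNORECASE)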

To help you see how this works (as it does look pretty complicated, doesn't it 😊), it helps to see the surrounding code, which is composed of two methods.
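In shape they look something like this, reusing the patterns above; the method names, the handling of CTE names, and the error handling are illustrative choices:

def extract_table_references(sql: str) -> set:
    """Pull out the table names the SQL refers to, ignoring any CTE names
    defined in a WITH clause."""
    referenced = set(FROM_JOIN_RE.findall(sql))
    cte_names = set(WITH_CTE_RE.findall(sql))
    return referenced - cte_names


def validate_source_tables(sql: str, source_table_paths: list) -> None:
    """Cross-check the SQL against the slash-separated table paths passed in as
    parameters, failing fast on a likely typo."""
    provided = {p.rstrip("/").split("/")[-1] for p in source_table_paths}
    referenced = extract_table_references(sql)
    missing = referenced - provided
    unused = provided - referenced
    if missing or unused:
        raise ValueError(
            f"SQL/table-name mismatch. Referenced but not provided: {sorted(missing)}; "
            f"provided but never referenced: {sorted(unused)}"
        )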

Example Usage

Finally, here is an example of how the generic Python/Spark module slots into an ADF pipeline.
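In the Databricks Python activity, it boils down to pointing at the deployed script and passing the parameters; something along these lines, where every value is a placeholder:

# Roughly what goes into the activity's "Python file" and "Parameters" settings
# (all values are placeholders that follow the earlier sketches).
python_file = "dbfs:/pipelines/run_sql_to_gold.py"

parameters = [
    "--sql", "SELECT region, SUM(amount) AS total_sales FROM orders GROUP BY region",
    "--source-tables", "sales/orders",
    "--target-table", "abfss://gold@<storage-account>.dfs.core.windows.net/sales_by_region",
    "--write-mode", "overwrite",
]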

It really is simply a matter of adding a Python activity to your pipeline and providing the necessary parameters. Of course, you also need to have an Azure Databricks cluster set up and the script deployed to it. We have found this approach to be very effective and low effort. The only downside, I would say, is that there is no support for unit testing the SQL itself; instead, we rely on testing and running the SQL manually beforehand in the Databricks Warehouse. However, the ability to quickly build and deploy gold tables as part of our data pipelines greatly outweighs this concern, and this is now our preferred approach to building gold tables.

Conclusion

If you are interested in more details of this approach, feel free to reach out to me at sarbjit@procogia.com. And if you’re looking to unlock the full potential of your data pipelines and streamline your data engineering processes, visit our Data Engineering page to learn more about how ProCogia can help you transform data effectively with tools like Azure Data Factory, Databricks, and more.

References:

  1. Writing Spark Apps and deploying to Databricks
  2. Automating Deployment of Python Spark Scripts into Databricks using Azure DevOps
