Introduction
As a Data Engineer with a long history in software development and engineering, my tool of choice for building Python ETL components is VS Code. That’s right, I don’t use notebooks. My opinion is that notebooks are built for data exploration. They offer an environment which is interactive, supports on-demand graphing and is great for tapping out quick code snippets. This is the world of the Data Scientist.
As Data Engineers, our requirements differ somewhat. Our pipelines have to be robust, error-free and, above all, must not stumble when faced with ‘bad’ data. Building or engineering such pipelines is an entirely different discipline, though not that different from what would be expected of a Software Engineer. Code should be well written, follow standard best practices and include unit tests. These are disciplines to which, I feel, notebooks are not well suited.
So when I write bespoke code in Python, say Spark applications to be deployed into Databricks, I build and unit test it in VS Code. But this blog isn’t about how to build PySpark applications. Watch out for that in a future blog! This blog is about automating their deployment using Azure DevOps.
Deployment Scenario
It’s common practice to version control software assets in a repository, and in Azure this is done using the Azure DevOps Repos feature. As an Azure Data Engineer, I use this to manage all my Data Factory pipelines, Synapse SQL scripts, Python functions and Python Spark applications.
Let’s say I have a project in git for a PySpark application that processes JSON formatted data. Here it is: generic-curatejson-sparkapp. Nothing too special there. I have a tests folder that contains unit tests, then a storage folder with some sample data for use by the unit tests (in software testing parlance, these would be referred to as test fixtures). I even have an azure-pipelines.yml file, which I use to trigger automatic execution of the unit tests each time changes are committed to the main branch.
The file that I want to focus on here is curate_json.py. This is a Python file, written using Spark, which is to be deployed and executed in a Databricks cluster. In case you’re interested, it’s triggered from an Azure Data Factory pipeline. It’s part of an ingestion pipeline that lands JSON formatted data into a bronze landing folder in an Azure ADLS storage account. The PySpark script running in Databricks then flattens the JSON, and the processed data is curated as Delta in a silver folder. All standard data engineering stuff really.
But again, that’s not what we’re interested in today. The question is, how do I get curate_json.py deployed into Databricks? You can see right there where I want to put the file. It lives in the DBFS of Databricks. The full path to it would be dbfs:/FileStore/pyspark-script/curate_json.py
Yours would look something like this too.
Without an automated deployment mechanism, it’s easy enough to upload this into the DBFS manually. Click the upload button and navigate to the file sitting on your laptop. However, any self-respecting data or software engineer will tell you that this is not the way to do it!
Let me show you the proper way.
Azure DevOps Release Pipelines
We need a mechanism that ensures that we only ever deploy the latest version committed to the main branch of our repo. This is standard CI/CD practice. That is, we want to automate the deployment of curate_json.py from the main branch of the repo to its target DBFS location in a Databricks workspace using a release pipeline.
Here is my release pipeline. I called it SMG Databricks. Remember, in Azure DevOps there are a couple of different kinds of pipeline: release pipelines and build pipelines. (Aside: just to add to the confusion, in Azure there are also Data Factory pipelines!) The release pipeline is the third option under the blue rocket icon. (Build pipelines are used to build your assets. More on those in yet another future blog!)
The million-dollar question, of course, is how does the release pipeline work? Well, before we get into the details, here is a brief summary:
- We’ll need some credentials to access our Databricks workspace. These should be stored in an Azure Key Vault as a secret.
- We’ll need to use the Databricks CLI.
- Authentication will use a .databrickscfg config file, populated with details of the target Databricks workspace and associated credentials (from the key vault).
- Finally, a PowerShell script needs to be written to issue CLI commands to copy and manage files in Databricks.
Let’s get into some of the details.
Download AKV Secret
Ok. Here is the first thing you need to do. When you create your Databricks workspace, you also need to create an account that DevOps can use to authenticate with. This must be a Service Principal account. You can do this in the Databricks UI as shown here.
Along with the service principal, you will also get a client secret. You’ll need both for the next step.
But first, secure the credentials: take at least the client secret and save it under a secret name in an Azure Key Vault. You’ll now be able to refer to it by that secret name, which you will use when creating the first deploy task in your release pipeline.
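If you prefer to script this step rather than click through the portal, here is a minimal sketch using the Azure CLI from PowerShell. The vault name, secret name and secret value are placeholders of my own, so substitute whatever you actually use:

```powershell
# Store the service principal's client secret in Key Vault.
# Vault name, secret name and value are placeholders - substitute your own.
az keyvault secret set `
    --vault-name "my-keyvault" `
    --name "databricks-sp-client-secret" `
    --value "<client-secret-value>"

# Read it back to confirm (note: this prints the value in plain text).
az keyvault secret show `
    --vault-name "my-keyvault" `
    --name "databricks-sp-client-secret" `
    --query "value" -o tsv
```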
This first task should look like this. Simply enter the name of your subscription, key vault and secret.
Ok, onto the next deploy task.
Install Databricks CLI
If you haven’t used the Databricks CLI yet, now is a good time to learn! It lets you configure and administer a Databricks workspace from a command line, including remotely from your laptop or, as in this case, from an Azure DevOps VM running in the cloud. This next deploy task installs the Databricks command line tool onto that DevOps machine. We do this using a Bash deploy task.
Go ahead and create and populate it as shown.
You can see that we are actually using a bash script from a publicly available GitHub repo to install the CLI. That’s it. On to the next step.
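The screenshot doesn’t name the script, but Databricks publishes an installer in its databricks/setup-cli GitHub repo, which is most likely what’s being used here. If you’d rather keep every step in PowerShell, the same installer can be run from a PowerShell deploy task, roughly like this (a sketch, assuming a Linux build agent with bash available):

```powershell
# Download the public Databricks CLI installer and run it.
# Hosted Azure DevOps agents allow passwordless sudo, which the installer
# may need in order to write into /usr/local/bin.
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh" -OutFile "install.sh"
sudo bash ./install.sh

# Confirm the CLI is now on the PATH.
databricks --version
```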
Create .databrickscfg
When you use the Databricks CLI on your local development laptop, you first have to authenticate with the target workspace. You do that using the command databricks auth login. Ordinarily, this will prompt you to manually authenticate via a browser. We can’t do that here. Instead, we take advantage of the fact that the Databricks CLI reads its workspace and credential details from a config file. For automated login, we create and populate this config file as part of the release pipeline itself. It’s simple, though be careful to follow the same format you see here.
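As a rough illustration, a PowerShell deploy task that writes the config file could look like the sketch below. The workspace URL and client ID are placeholders, the Key Vault secret is assumed to have been mapped to a DATABRICKS_SP_SECRET environment variable by the earlier task, and the exact keys depend on how you authenticate (a personal access token would use a single token entry instead of the client_id/client_secret pair):

```powershell
# Sketch: write ~/.databrickscfg so the Databricks CLI can authenticate
# non-interactively with the service principal's credentials.
# Host and client_id are placeholders; DATABRICKS_SP_SECRET is assumed to be
# mapped from the Key Vault secret downloaded in the first deploy task.
$configPath = Join-Path $HOME ".databrickscfg"

$config = @"
[DEFAULT]
host          = https://adb-1234567890123456.7.azuredatabricks.net
client_id     = 00000000-0000-0000-0000-000000000000
client_secret = $env:DATABRICKS_SP_SECRET
"@

Set-Content -Path $configPath -Value $config
Write-Host "Wrote Databricks CLI config to $configPath"
```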
Now onto the last step. The most fun part!
Deploy Scripts if Different
This is the step that does all the real work. The basic idea is that this step fetches the curate_json.py script from the main branch of the repo and then copies it to the target DBFS path.
I have included a few nice-to-have features:
- Only upload if different
- The script should only be uploaded if it differs from that already deployed.
- A trial-only mode.
- Performs a full set of checks, but does not actually deploy the script. Always good to perform a pre-flight check.
- Show differences
- Serves as a last-minute visual to confirm all is as expected.
- Back-up existing script
- As engineers, we are always thinking about plan B. When it comes to deployments, this means rollback. If the wheels fall off because of an unexpected error, we want an easy way to roll back. For this, we back up the existing script on DBFS before overwriting it with the new one, making it easy to roll back to the previous version.
Here is the deployment task.
Well, you can see where the script goes, but not all of it. By the way, the deploy task is a PowerShell task. Let’s take a closer look at each area; a rough sketch of the whole script follows the breakdown below.
- Area 1
- A simple global variable. I leave it set to $true except when I’m sure I want to deploy a change.
- Area 2 & 3
- Read the existing contents of the script from DBFS and then the one in our repo. Notice, by the way, that we apply a trim() to the content. We noticed that on occasion an empty blank line appeared at the end of the content, causing difficulties for the equality check, so we trim it away.
- Area 4
- If the content does not match, output the differences.
- Area 5
- These are the commands that backup the existing script in DBFS as well as deploy the latest one.
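Since the screenshot only shows part of the script, here is a rough sketch of what the whole thing can look like, organised around the areas described above. This is not the author’s exact code: the function name, the .bak backup convention and the paths are illustrative, and it assumes the CLI and config file set up by the previous tasks.

```powershell
# Area 1: trial-only mode. Leave set to $true until you are sure you want to deploy.
$trialOnly = $true

function DeployIfDifferent([string]$localPath, [string]$dbfsPath) {

    # Areas 2 & 3: read the currently deployed script from DBFS and the repo copy.
    # Both are trimmed, as a stray blank line at the end otherwise breaks the comparison.
    $deployed = (databricks fs cat $dbfsPath 2>$null | Out-String).Trim()
    $local    = (Get-Content -Path $localPath -Raw).Trim()

    if ($deployed -eq $local) {
        Write-Host "$localPath is unchanged - nothing to deploy."
        return
    }

    # Area 4: the content differs, so show the differences as a last visual check.
    Write-Host "Differences for $localPath :"
    Compare-Object ($deployed -split "`n") ($local -split "`n") |
        Format-Table -AutoSize | Out-String | Write-Host

    if ($trialOnly) {
        Write-Host "Trial-only mode: skipping deployment of $localPath."
        return
    }

    # Area 5: back up the existing script (if there is one), then overwrite it.
    if ($deployed -ne "") {
        databricks fs cp $dbfsPath "$dbfsPath.bak" --overwrite
        Write-Host "Backed up previous version to $dbfsPath.bak"
    }
    databricks fs cp $localPath $dbfsPath --overwrite
    Write-Host "Deployed $localPath to $dbfsPath"
}

DeployIfDifferent "curate_json.py" "dbfs:/FileStore/pyspark-script/curate_json.py"
```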
The very last line calls the DeployIfDifferent() function. Here you just see a single script being deployed. In reality, for any data engineering project, you may have many Python Spark applications. Simply add them to the bottom, using the one shown as an example!
We hope that you’ve enjoyed this blog and that it will help you build automated CI/CD pipelines as part of your Azure Databricks data engineering project! To learn more about how ProCogia can help your company evolve its data pipelines, check out our Data Engineering solutions.