ABC of Data Engineering
AI, Big Data, and Cloud for Data Engineering
Data Engineering is a concept whereby data is taken from various sources and made available for BI & Analytics, Data Science, and Machine Learning (ML). Simple enough, right? Blog complete. Thanks for reading my blog!
But wait a second. Is it really that simple? Is Data Engineering only concerned with making data available for further insights? Well, to answer this, let’s do a deep dive into what Data Engineering is all about.
In recent times, we’ve been hearing a lot about the Data Engineering evolution and how Data Engineering solutions are reshaping the data landscape. With more projects and more jobs, there’s an increasing need to understand exactly what Data Engineering is. To gain a better understanding, let’s use an example of how a start-up company’s data system evolves into a full-fledged, well-oiled data machine.
Imagine a start-up company with one simple web application. They have two sources of data: a CRM and a transactional database through which data is sourced and stored. At the end of a defined period (let’s say a week) a data analyst can extract data from the database, make a report in Excel, and present it to the Analytics team. The team can then carry out further analysis on the Excel file and present their findings to the business to report the overall health of the organization. In this example, everything works fine: the Analytics team is happy, as are the business users.
As the start-up expands its operations, data accumulates month on month, increasing both the volume and number of sources in the organization. The Analytics team are asked to answer KPI questions outlined by the various business units and analyse how the metrics are changing over the defined period. But the current architecture cannot fully support these requests, and it would be very laborious for the data analyst to derive all this information manually. So, things need to be automated. And this is where Data Engineering comes into play.
With automation, ETL data pipelines can be introduced. This will unburden the data analyst and allow an ETL Data Engineer to automate the flow of the data and produce reports. ETL stands for:
- Extract – pull the data from different heterogeneous sources
- Transform – clean, profile, map, and standardize the data
- Load – push the data into a target database such as MySQL
To perform ETL Data Engineering, a Data Engineer can write some code in Python, Spark, or SQL, and schedule it to run in batches at the required frequency. The analytics user can then connect this target database, in this case MySQL, to a BI tool such as Power BI to create compelling data visualizations such as graphs, charts, and maps.
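To make the three ETL steps concrete, here is a minimal sketch in Python. The field names, sample rows, and table name are all hypothetical, and SQLite stands in for the MySQL target so the example is self-contained:

```python
import sqlite3

# Hypothetical raw CRM export -- messy casing, stray whitespace, stringly-typed numbers.
raw_rows = [
    {"name": " Alice ", "country": "uk", "amount": "120.50"},
    {"name": "Bob", "country": "US", "amount": "99.99"},
]

def extract():
    """Extract: in a real pipeline this would pull from the CRM API or source database."""
    return raw_rows

def transform(rows):
    """Transform: trim, standardize, and cast types."""
    return [
        (r["name"].strip(), r["country"].upper(), float(r["amount"]))
        for r in rows
    ]

def load(rows, conn):
    """Load: push the cleaned rows into the target database (SQLite stands in for MySQL)."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT name, country, amount FROM sales").fetchall())
```

In a real pipeline, a scheduler (cron, Airflow, or similar) would run this on the required batch frequency, and the BI tool would query the target table directly.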
Thanks to ETL, data processing and consumption are now much improved and fully automated. And this isn’t just about automation. The company is also becoming more data-driven, and different teams have started consuming the data to make better decisions, a direct result of the data being centralized.
As volumes of data further increase, retrieval time for reports also increases. Queries and jobs start taking hours to execute, and the ETL pipelines become slow-moving. The main issue is that the company is using a transactional database for analytics, a job it simply isn’t optimized for. At this point, Data Warehousing is required.
Data Warehouse solutions curate the data and act as a central repository for all the data sources. At the same time, the Data Warehouse segregates the data into different subject areas to support all the company’s analytical needs.
Consider a Walmart store as an analogy for a Data Warehouse: shoes are kept in one section, clothing in another, furniture in another, and so forth. With Data Warehouse solutions, first you centralize the data, then you segregate it. The Data Warehouse is architected to handle complex analytical queries efficiently, which means any slow-moving pipelines will run faster. Common Data Warehouse technologies include Teradata, Snowflake, and Azure Synapse, among many others.
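To illustrate the "centralize, then segregate" idea, here is a toy star schema: one fact table of sales plus a dimension table whose departments play the role of subject areas. The table and column names are invented for illustration, and SQLite stands in for a real warehouse engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Illustrative star schema: a central fact table, segregated by a department dimension.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, department TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, qty INTEGER, revenue REAL);

INSERT INTO dim_product VALUES (1, 'Running shoes', 'Shoes'),
                               (2, 'T-shirt',       'Clothing'),
                               (3, 'Bookshelf',     'Furniture');
INSERT INTO fact_sales  VALUES (1, 2, 120.0), (2, 5, 75.0), (1, 1, 60.0);
""")

# A typical analytical query: total revenue per department (subject area).
for row in conn.execute("""
    SELECT d.department, SUM(f.revenue) AS total_revenue
    FROM fact_sales f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.department
    ORDER BY total_revenue DESC
"""):
    print(row)
```

A real warehouse runs this kind of aggregate-and-join query over billions of rows; the schema design is what lets it do so efficiently.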
As the business matures, the demand for statistical models, Data Science, ML, and predictive analytics increases. For a Data Scientist to effectively carry out such work, the data must be curated in a Data Warehouse.
At this point, the Data Engineer needs to work closely with the Data Scientist and make new data elements available for them to carry out their tasks. To achieve this, the Data Engineer must create a Data Lake.
One of the main ways a Data Lake differs from a Data Warehouse is that the ETL (extract, transform, load) process becomes an ELT (extract, load, transform) process. In a Data Lake, you don’t just store structured data: you can dump all kinds of raw data, including semi-structured and unstructured data, without pre-processing it or enforcing any schema. With a Data Lake, you extract the data and load it as-is; the transformation takes place later and is carried out by the Data Scientist.
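A minimal sketch of the ELT pattern, using a local directory as a stand-in for a real lake (which would typically be object storage such as S3 or ADLS). The event shapes and file names are invented; note that the two records don’t even share a schema, and nothing complains until read time:

```python
import json
import tempfile
from pathlib import Path

# A local folder stands in for the Data Lake's raw zone.
lake = Path(tempfile.mkdtemp()) / "raw" / "events"
lake.mkdir(parents=True)

# Extract + Load: dump the raw events exactly as they arrived -- no cleaning, no schema.
events = [
    {"user": "alice", "action": "click", "meta": {"page": "/home"}},
    {"user": "bob", "action": "purchase", "amount": 42.0},  # different shape, still accepted
]
(lake / "2024-01-01.json").write_text("\n".join(json.dumps(e) for e in events))

# Transform (much later, by the Data Scientist): schema is applied on read, not on write.
records = [json.loads(line) for line in (lake / "2024-01-01.json").read_text().splitlines()]
purchases = [r for r in records if r["action"] == "purchase"]
print(purchases)
```

This "schema-on-read" behaviour is the key contrast with a warehouse, which rejects anything that doesn’t match its tables at load time.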
So, now the company has a Data Warehouse and a Data Lake in place, what is all the fuss about Big Data? Sometimes this term is used unnecessarily. To label any data as Big Data, it must satisfy the 5 V’s, which are:
- Volume – This might seem obvious, and the term Big Data is relative, but it is generally any data with millions of transactions every second/minute
- Value – The value and insights Big Data can provide
- Variety – It can vary in nature such as structured, semi-structured and unstructured
- Veracity – The truth and accuracy of the data
- Velocity – Big Data is generated constantly, in real time
Companies dealing with Big Data need Big Data Engineering. Consider banks, whose credit transactions number in the millions, or Netflix, which collects millions of transactions every second.
These Big Data businesses need Data Streaming. Up to this point, what was essentially happening is called batch processing: data is taken from the source, processed, and stored in a Data Warehouse or Data Lake, with new data arriving once an hour, maybe once a day.
With Data Streaming, data arrives every second. The Big Data Engineer ensures the data is processed and then made available for the team to analyse. This happens through the pub/sub (publish/subscribe) mechanism. A publisher continuously publishes the data, and a subscriber reads the data and pushes it further down the pipeline. The main takeaway of the pub/sub mechanism is that the framework decouples the data sources from the consumers.
So instead of pushing the data directly from source to target systems, data is divided into separate topics, and each topic is accessed by the consumer according to their specific data needs. This way, whenever data comes through the source it gets queued in the topic, and consumers can consume the data from these topics at their own pace. This type of communication between systems is commonly known as asynchronous communication. And one widely used technology that makes it happen is Apache Kafka.
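The topic mechanics can be sketched with a toy in-process broker. This is only an illustration of the idea: a real broker like Kafka persists topics durably and distributes them across machines, while here one Python queue per topic plays that role, with invented topic names and messages:

```python
from collections import defaultdict
from queue import Queue

# Toy stand-in for a broker such as Kafka: one queue per topic.
topics = defaultdict(Queue)

def publish(topic, message):
    """The publisher only writes to a topic -- it never talks to consumers directly."""
    topics[topic].put(message)

def consume(topic):
    """Each consumer reads its topic at its own pace."""
    messages = []
    while not topics[topic].empty():
        messages.append(topics[topic].get())
    return messages

publish("payments", {"user": "alice", "amount": 10})
publish("payments", {"user": "bob", "amount": 25})
publish("clicks", {"page": "/home"})

payments = consume("payments")  # this consumer only sees the topic it subscribes to
clicks = consume("clicks")
print(payments)
print(clicks)
```

The decoupling is visible in the code: `publish` and `consume` share nothing but the topic name, so sources and consumers can be added or removed independently.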
Now the company has a real-time system in place with minimal latency. But do you think the RDBMS/Data Warehouses can cater for Big Data? Or that the cute little ETL pipelines will be able to process this data? The answer is a resounding “No!”
Such high-volume, high-velocity data demands systems that can perform distributed computing. A distributed computing environment consists of multiple software components on multiple computers that run as a single system. Imagine a company where one employee takes care of HR, Finance, Sales, Marketing, and every other department. (That’s a lot of work, right?) Now imagine you hire 10 more people and distribute the workload amongst them. This is what happens in a distributed computing environment such as Apache Spark or Hadoop: you’re basically asking your data processing pipelines to run in parallel and carry out computation across many servers concurrently.
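The "10 more employees" idea can be sketched with Python’s standard thread pool. This is only a single-machine miniature of what Spark does across a cluster, and the workload (summing squares) is a stand-in for real processing:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for real work on a partition of the data (parsing, aggregating, joining...).
    return sum(x * x for x in chunk)

data = list(range(1_000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]  # split the workload 10 ways

# Ten workers instead of one "employee": each processes its chunk concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)  # combine the partial results, like a reduce step
print(total == sum(x * x for x in data))  # same answer as doing it all serially
```

Spark applies the same split–process–combine pattern, but the "workers" are executors on separate servers, which is what lets it scale past a single machine.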
With this last evolution, businesses can now build almost any kind of data system on their own. It’s worth noting that we’ve provided very high-level, generic information here, and there are lots of other pieces glued together in this type of architecture. For instance, we haven’t mentioned Delta Lake, Lakehouses, Cloud Computing, and many more elements. But we’ll be covering those in future blogs, so please watch this space.