Organizations that produce large volumes of data are increasingly investing in better ways to discover, analyze, and extract key insights from that data. These organizations regularly run into challenges in the form of data ponds and often struggle to make the right dataset available to their primary users, i.e. data analysts and data scientists.
This project aims to provide a template for defining, ingesting, transforming, analyzing, and showcasing data using the Azure data platform. We use the Azure Cosmos DB SQL API as the storage layer for the data catalog. Azure Blob Storage serves as the de facto store for all semi-structured data (e.g. JSON, CSV, Parquet files). Azure Data Factory (ADF) v2 handles orchestration, with Azure Databricks providing the compute for all transformations.
The front-end interface is an ASP.NET Core web app which reads catalog definitions and creates ADF entities using the ADF .NET Core SDK. Data lineage is visualized with vis.js networks.
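As a rough illustration of the lineage view, the sketch below builds the kind of `nodes` and `edges` lists that a vis.js network consumes (`id`/`label` for nodes, `from`/`to`/`arrows` for edges). It is written in Python only for brevity; the function name and the dataset names are hypothetical and are not taken from this project's code.

```python
import json


def build_lineage_graph(dependencies):
    """Turn (upstream, downstream) name pairs into vis.js-style nodes and edges."""
    names = []
    for upstream, downstream in dependencies:
        for name in (upstream, downstream):
            if name not in names:
                names.append(name)  # keep first-seen order so ids are stable

    ids = {name: index + 1 for index, name in enumerate(names)}
    nodes = [{"id": ids[name], "label": name} for name in names]
    edges = [
        {"from": ids[upstream], "to": ids[downstream], "arrows": "to"}
        for upstream, downstream in dependencies
    ]
    return {"nodes": nodes, "edges": edges}


# Hypothetical dataset names mirroring the two sources described in this README.
lineage = build_lineage_graph([
    ("sensor_metadata_sql", "joined_readings"),
    ("timeseries_json", "joined_readings"),
    ("joined_readings", "rest_endpoint"),
])
print(json.dumps(lineage, indent=2))
```

The resulting JSON could be handed to a vis.js `Network` on the page; how the web app actually derives these pairs from the catalog is not shown here.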
In this solution, our catalog definition consists of two data sources:
a) Time Series JSON files from IoT sensors
b) SQL Database table containing sensor metadata
Our data pipeline simply extracts the metadata from SQL Database into tabular form, joins it with time series data and finally publishes it to a REST endpoint.
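The join step can be sketched in plain Python as follows. In the actual solution the transformation runs on Azure Databricks, so this is only a minimal illustration of the logic; the field names (`sensor_id`, `location`, `unit`, `timestamp`, `value`) and the function name are assumptions, not the project's schema.

```python
import json


def join_readings_with_metadata(readings, metadata_rows, key="sensor_id"):
    """Inner-join time series readings with sensor metadata on a shared key."""
    metadata_by_key = {row[key]: row for row in metadata_rows}
    joined = []
    for reading in readings:
        metadata = metadata_by_key.get(reading[key])
        if metadata is None:
            continue  # inner join: skip readings that have no metadata row
        joined.append({**metadata, **reading})
    return joined


# Hypothetical rows standing in for the SQL table and the IoT JSON files.
metadata_rows = [
    {"sensor_id": "s1", "location": "plant-a", "unit": "C"},
    {"sensor_id": "s2", "location": "plant-b", "unit": "C"},
]
readings = [
    {"sensor_id": "s1", "timestamp": "2019-01-01T00:00:00Z", "value": 21.5},
    {"sensor_id": "s3", "timestamp": "2019-01-01T00:00:00Z", "value": 19.0},
]

joined = join_readings_with_metadata(readings, metadata_rows)
payload = json.dumps(joined)  # body that would be sent to the REST endpoint
print(payload)
```

Publishing is left out on purpose, since the endpoint and its authentication are specific to each deployment.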
The pipeline can be triggered manually through the web app's REST API. For dynamic data sources, i.e. time series, an event trigger is created automatically.
Event triggers have performance limitations, so creating more than 100 dynamic data sources is currently not supported.
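The trigger rule and the limit described above could be expressed roughly like this. The function, the `source_kind` values, and the constant name are hypothetical; only the figure of 100 comes from this README.

```python
MAX_DYNAMIC_SOURCES = 100  # limit stated in this README


def choose_trigger(source_kind, existing_dynamic_sources):
    """Return "event" for dynamic (time series) sources, "manual" otherwise."""
    if source_kind != "time_series":
        return "manual"  # static sources are run through the REST API
    if existing_dynamic_sources >= MAX_DYNAMIC_SOURCES:
        # Adding this source would exceed the supported number of dynamic sources.
        raise ValueError(
            f"at most {MAX_DYNAMIC_SOURCES} dynamic data sources are supported"
        )
    return "event"


print(choose_trigger("sql_table", 0))
print(choose_trigger("time_series", 99))
```

A check of this kind, placed before the event trigger is created, would reject the 101st dynamic source up front.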
To configure the ASP.NET Core web app, please follow this document.
To authenticate the web app, we create an Azure AD application. This application must also be assigned the Contributor role on the resource group so that ADF and its entities can be provisioned.
To access the credentials of our data sources, ADF relies on Azure Key Vault. When an ADF resource is provisioned in Azure, a Service Identity is generated automatically. That Service Identity has to be granted the Get permission in the Key Vault access policies.
Please follow this document for deployment to Azure.