
Reference Architecture for Data Vault on Snowflake with Azure

  • Andrew Griffin
  • May 27, 2021
  • 3 min read

Learn from someone else’s experiences implementing Data Vault with Snowflake and Microsoft Azure

If you asked a room of data engineers about building a Data Vault for the Cloud using Snowflake on Microsoft Azure, you could probably spend the next business quarter deciding which was the best suggestion. A slight exaggeration? Maybe, but with a plethora of technology options available, agreeing on the best approach to building solutions is certainly exercising the minds of businesses looking to make the most of their business intelligence and data analytics. So, what would a reference architecture that ensures your Data Vault operates efficiently and securely look like? Datavault CEO Neil Strange addressed this “hot off the press” subject at the 19th meeting of the UK Data Vault User Group, which can be viewed here.

A number of his company’s clients are in the process of building Data Vault solutions based on Snowflake, using Microsoft’s Azure Cloud platform, in conjunction with dbtvault, the open-source tool developed by Datavault’s own engineers. Neil admitted that covering the concepts he currently favours comprehensively could easily take up half a day rather than an hour-long presentation, but he was keen to explain his organisation’s latest ideas and thinking, while accepting that other data engineers might come up with different ways of achieving the same goals.

Neil stressed that the reference architecture for such a configuration must support the whole lifecycle of Business Intelligence work – from development to deployment and production. He applies DevOps and automation principles across three planes of the project: the platform, the application and the data. As new features are evolving all the time, he recommends a modular approach with clean interfaces, allowing any one element to be removed or updated going forwards, so you won’t have to keep starting from scratch.

Datavault uses a desktop VM (virtual machine) configured on Linux for ease (although clients favouring Windows can be catered for), facilitating the use of Python and dbt alongside dbtvault. Airflow is used for orchestration, while Neil is a big fan of Terraform, which allows for the easy deployment and testing of infrastructure. Testing that compares the data at the start of the process with the data at the end is also highly recommended, allowing for troubleshooting and ensuring consistency in results. Furthermore, he recommends that you only use test data – not real data – for the testing.

Neil then talked through the requirements of the production environment and the compute options. He prefers to run several applications inside the V-NET (virtual network): Airflow for the orchestration, dbt/dbtvault to carry out the ETL (extract, transform, load) and Python to manage the PSA (persistent staging area) and drop zone. For streaming, data is staged in Azure Blob Storage, using a suitable streaming ingestion solution such as Event Hubs.

Datavault recommends using a PSA inside Snowflake to create a Data Lake. A drop zone is used to feed into the Data Lake, which aids security and provides working areas to process the data destined for storage within the PSA, enabling good data governance and lineage. Failed loads can then be quarantined.

Apache Airflow manages the dbt runs, using appropriate tags so that dbt and Snowflake process the right data, and it handles recovery of any load failures, re-runs and housekeeping routines. Airflow’s excellent control panel and statistics reporting are a major benefit, Neil believes.
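To make that orchestration pattern a little more concrete, the sketch below shows a minimal Airflow DAG that triggers tag-scoped dbt runs via the BashOperator. It is purely illustrative rather than the exact setup described in the talk: the DAG name, project path, schedule and tag names (hub, link, satellite) are all hypothetical placeholders.

```python
# Minimal, illustrative Airflow DAG that triggers tag-scoped dbt runs.
# All names, paths and tags are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="data_vault_load",          # hypothetical DAG name
    start_date=datetime(2021, 5, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Load hubs, links and satellites in sequence, selecting models by dbt tag.
    load_hubs = BashOperator(
        task_id="load_hubs",
        bash_command="cd /opt/dbt_project && dbt run --select tag:hub",
    )
    load_links = BashOperator(
        task_id="load_links",
        bash_command="cd /opt/dbt_project && dbt run --select tag:link",
    )
    load_sats = BashOperator(
        task_id="load_satellites",
        bash_command="cd /opt/dbt_project && dbt run --select tag:satellite",
    )

    load_hubs >> load_links >> load_sats
```

In practice, the failure recovery, re-runs and housekeeping Neil mentions would be layered on top of a DAG like this, for example through Airflow retries and dedicated clean-up tasks.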
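Similarly, the Python management of the drop zone and PSA described above might, in very rough outline, look like the following sketch using the snowflake-connector-python library. The account, credentials, stage and table names are placeholders, and a real implementation would add the quarantine and lineage handling discussed in the talk.

```python
# Rough sketch: load newly arrived files from a drop zone (Azure Blob,
# exposed to Snowflake as an external stage) into a PSA table.
# All connection details and object names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",       # placeholder account identifier
    user="loader_user",         # placeholder service user
    password="********",
    warehouse="LOAD_WH",
    database="RAW",
    schema="PSA",
)

try:
    cur = conn.cursor()
    # Copy staged files into the PSA table; rows that fail to parse are
    # skipped here so the offending files can be quarantined afterwards.
    cur.execute("""
        COPY INTO PSA.CUSTOMER_PSA
        FROM @DROP_ZONE_STAGE/customer/
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    print(cur.fetchall())       # per-file load results for inspection
finally:
    conn.close()
```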
All told, creating a standardised approach to delivering Data Vault projects in the Cloud by using Snowflake, Azure and dbt makes life easier. Neil emphasised that he is very keen to receive feedback, opinions and suggested improvements from practitioners running their own variations of this approach.
