Data Vault Performance & Constraints on Snowflake
- Hannah Dowse
- Nov 2, 2023
- 4 min read
How to make your Data Vault sing!
To be, or not to be, on Snowflake… Patrick Cuba explains how to maximise your Data Vault performance on Snowflake. His book "The Data Vault Guru" marked Patrick out as one of the world's leading thought leaders on the pragmatic and practical considerations of building a Data Vault. The book came out three years ago, and since then Patrick has joined Snowflake and specialised in Data Vault implementations on a platform that has changed the world of data warehousing. Over the past couple of years he has continued to write numerous blogs on the subject, and he agreed to present to this month's Data Vault User Group, having been a keynote speaker at the first DVUG Conference held in the UK in September 2023, where he gave his thoughts on the current state and future of data modelling.

Patrick showed ways to improve the performance of a Data Vault on Snowflake, and as any user knows, you pay for what you use, so there are potential savings to be made from understanding many of his pointers. He covered topics such as immutable files and mutable tables, whether to hash or not to hash, the micro-partitions the database is built around, and what T-shirt size to choose when sizing a Snowflake virtual warehouse. And since indexes are not supported on the platform, Patrick explained how to query content efficiently, especially from large tables.

Patrick explained that Snowflake's compute services (which attract the charges through the credit purchase model the US computing giant employs) execute jobs and queries on virtual warehouses that can be scaled vertically, and horizontally for concurrency, to reduce potentially expensive queuing. As he pointed out, Snowflake charges for the first minute regardless of how long the compute takes, and for every subsequent second while active. Patrick gave a concrete example: a query against a 61-million-record table runs in around 50 seconds on an XSmall virtual warehouse. By using Snowflake's "Secret Sauce" and upsizing the warehouse from an XSmall to a 4X-Large, it is possible to cut that to just 5.5 seconds thanks to dynamic pruning, but that would cost 128 credits compared to one. There is a better way: move away from static pruning with no clustering and use a Point-in-Time (PIT) table to dynamically prune the micro-partitions where the records are likely to live (a sketch of this query pattern appears below). Using the join filter on the XSmall virtual warehouse cuts the query time to 4.5 seconds, faster even than the 4X-Large using the same join filter. The PIT table from the Data Vault informs Snowflake's zone maps where to find the micro-partitions for a query.

Loading a Data Vault satellite compares against the active record per parent key, and this proves the best-performing comparison method thanks to the anti-semi-join, which takes advantage of dynamic pruning and a probabilistic algorithm (also sketched below). Another improvement comes from clustering by load order, which reduced the micro-partition count as well as the clustering depth and overlap. An experiment pitting a traditional PIT against a Sequence Number Only PIT (SNOPIT) showed the former running in 19.63 seconds against 15.84 seconds for the SNOPIT, compared with a run time of more than eight minutes without a PIT table at all. He also explained that Snowflake favours a right-deep join tree, which in dimensional modelling terms means satellites around a hub or link play the role of dimensions around a fact, and table statistics are always up to date.
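To put the warehouse-sizing trade-off into rough numbers, here is a minimal sketch using Snowflake's published per-hour credit rates (XSmall = 1 credit, 4X-Large = 128 credits) and the per-second billing with a one-minute minimum that Patrick described. The warehouse name is hypothetical.

```sql
-- Hypothetical warehouse name; credit rates are Snowflake's published per-hour figures.
CREATE WAREHOUSE IF NOT EXISTS dv_query_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60;          -- suspend after 60 s idle so credits stop accruing

-- Upsizing the "T-shirt size" is a single statement, but the credit rate scales with it:
ALTER WAREHOUSE dv_query_wh SET WAREHOUSE_SIZE = 'X4LARGE';

-- Billing is a 60-second minimum, then per second while active:
--   XSmall,   50   s query -> 60 s * (  1 credit / 3600 s) ~= 0.017 credits
--   4X-Large,  5.5 s query -> 60 s * (128 credits / 3600 s) ~= 2.13  credits
```

The 4X-Large answer comes back roughly nine times faster in the example Patrick gave, but at roughly 128 times the credit burn rate, which is why squeezing the query onto an XSmall with a PIT table is so attractive.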
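The PIT-driven pruning pattern can be sketched roughly as below. All object and column names (hub_customer, sat_customer_details, sat_customer_address, pit_customer) are assumptions for illustration, not taken from Patrick's examples; the point is that the small PIT table supplies exact keys and load timestamps, letting Snowflake build join filters at run time and skip satellite micro-partitions that cannot match.

```sql
-- Rough sketch of querying satellites through a PIT table (names assumed).
SELECT h.customer_id,
       d.customer_name,
       a.postcode
FROM   pit_customer p                              -- narrow table: one row per key per snapshot
JOIN   hub_customer h
       ON h.hub_customer_hk = p.hub_customer_hk
JOIN   sat_customer_details d
       ON d.hub_customer_hk = p.hub_customer_hk
      AND d.load_date       = p.sat_customer_details_ldts
JOIN   sat_customer_address a
       ON a.hub_customer_hk = p.hub_customer_hk
      AND a.load_date       = p.sat_customer_address_ldts
WHERE  p.snapshot_date = '2023-10-31';             -- the PIT supplies the exact load dates,
                                                   -- so join filters can prune satellite
                                                   -- micro-partitions at run time
```

Because the PIT is small and the equality joins carry exact timestamps, the query benefits from the dynamic pruning Patrick described even on an XSmall warehouse.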
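A satellite load that compares only against the current record per parent key is commonly written as a left anti-join. The sketch below is one way to express that in Snowflake SQL, with illustrative staging and satellite names, followed by the kind of load-order clustering Patrick mentioned.

```sql
-- Sketch only: insert staged rows whose hashdiff differs from the current record per key.
INSERT INTO sat_customer_details (hub_customer_hk, load_date, hashdiff, customer_name, postcode)
SELECT  s.hub_customer_hk, s.load_date, s.hashdiff, s.customer_name, s.postcode
FROM    stg_customer_details s
LEFT JOIN (
        -- current (latest) record per parent key
        SELECT hub_customer_hk, hashdiff
        FROM   sat_customer_details
        QUALIFY ROW_NUMBER() OVER (PARTITION BY hub_customer_hk ORDER BY load_date DESC) = 1
) cur
  ON  cur.hub_customer_hk = s.hub_customer_hk
  AND cur.hashdiff        = s.hashdiff
WHERE cur.hub_customer_hk IS NULL;        -- anti-join: keep only new or changed records

-- Optionally keep the satellite clustered by load order; note that auto-clustering
-- is a billed service, so the cost should be weighed against the pruning benefit.
ALTER TABLE sat_customer_details CLUSTER BY (load_date);
```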
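To illustrate the structural difference behind the PIT versus SNOPIT timings, a traditional PIT carries a (often binary hash) key plus a load timestamp per satellite, while a SNOPIT carries only compact integer sequence numbers that point at the satellite rows. The DDL below is an assumed example, not Patrick's actual tables.

```sql
-- Illustrative DDL only; column names and types are assumptions.
CREATE TABLE pit_customer (                     -- traditional PIT
    snapshot_date             DATE,
    hub_customer_hk           BINARY(20),       -- hash key repeated per satellite join
    sat_customer_details_ldts TIMESTAMP_NTZ,
    sat_customer_address_ldts TIMESTAMP_NTZ
);

CREATE TABLE snopit_customer (                  -- Sequence Number Only PIT (SNOPIT)
    snapshot_date             DATE,
    hub_customer_sno          NUMBER(38,0),
    sat_customer_details_sno  NUMBER(38,0),     -- small, predictable join columns
    sat_customer_address_sno  NUMBER(38,0)
);
```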
Patrick showed the results from his experiment with clustered and non-clustered data loadings, using binary hash keys, natural keys, and text hash keys for snapshot PIT tables and incremental PIT tables against a SNOPIT. He did warn that auto-clustering does not come for free on Snowflake, and stressed that the size and complexity of natural key column values should be considered. An incremental PIT, where keys are loaded as they become available, requires trimming and orchestration and may need rebuilds, while a snapshot PIT loads the satellites and configures the PIT window, is simple to set up, and deals with the problems of ghost skews.

In conclusion, Patrick said that for the best load and query performance a natural-key Data Vault was head and shoulders above the rest and ensures the smallest storage footprint. But he stressed that you should consider natural key compression, and that surrogate hash keys encapsulate the BKC (business key collision code) and multi-tenant ID. He emphasised the importance of understanding the shape of your data: PIT tables will improve query performance, join filters optimise querying, and better results will be achieved using dynamic tables. Finally, understanding join-key distribution is essential, as binary hash keys enforce architectural behaviours. He also recommends you consider streams and tasks for new data (sketched at the end of this post). Using SNOPITs will ensure join columns are predictable and will speed up network traffic between the remote storage and the applications using it. Patrick has also written a series of blogs on the above, which can be found on
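Streams and tasks for picking up new data might look something like the following sketch, with hypothetical object names: the stream tracks changes on a staging table, and the task only runs when the stream actually has data.

```sql
-- Sketch only; warehouse, stream, task and table names are assumptions.
CREATE OR REPLACE STREAM stg_customer_details_strm ON TABLE stg_customer_details;

CREATE OR REPLACE TASK load_sat_customer_details
  WAREHOUSE = dv_load_wh
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('STG_CUSTOMER_DETAILS_STRM')
AS
  INSERT INTO sat_customer_details (hub_customer_hk, load_date, hashdiff, customer_name, postcode)
  SELECT hub_customer_hk, load_date, hashdiff, customer_name, postcode
  FROM   stg_customer_details_strm;      -- consuming the stream advances its offset

ALTER TASK load_sat_customer_details RESUME;
```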