How To Hear What The Data is Telling You
- Rhys Hanscombe

- Jun 5, 2023
- 7 min read
How many times have you heard senior corporate executives say your business does not have the time or the resources to build a data warehouse?
And even if your enterprise has one that functions – or one that needs updating to reflect increased digitalisation, or a business increasingly moving to the Cloud – does it incorporate a Data Vault to increase value through good data analytics?
Too often the proposed solutions are simply too big or too complex – yet the answers are often found in the numbers, even more so than those familiar with data warehouses and the Data Vault method might think.
Doug Needham is known as the Data Guy, currently working with DataOps.live, and he has worked with relational databases since 1990.
His presentation to the Data Vault User Group focused on ‘Hearing What the Data Is Telling You’.
He explained how a better understanding of “the Math” that lies behind relational databases can unlock the answers to the above problems for many corporations and organisations – by concentrating on a data enrichment platform.
The results should be “greater than average,” according to Doug.
He stressed how Data Vault founder Dan Linstedt had explained that much of the data architecture behind the process is based on mathematical concepts.
Find mathematical concepts ‘hidden in plain sight’
Some of them are hidden in plain sight and Doug went on to show how they can be applied.
He started out by explaining the definition of a set and the common language of set theory – set builder notation – which he used throughout.
In database terms, a “relation” is the structure of a collection of elements – an element is a name paired with a domain or data type. All elements have unique names.
A “relvar” is a set of n-tuples, where each element is the implemented value that conforms to the attribute specification – a tuple being a finite sequence of elements.
He then explained the concept of a relational data model as the set of relations in each domain of relations.
Within this set exist all of relational algebra, relational calculus and SQL development, along with most relational data modelling.
The set of foreign keys in a data model satisfies the property that a foreign key of table M is equal to the primary key of table N.
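To make the set language concrete, here is a minimal Python sketch – the table and column names are purely illustrative, not taken from Doug's talk – showing a relvar as a collection of tuples and the foreign-key property as a simple subset test.

```python
# A minimal sketch of the set-theoretic view of relations.
# Table and column names ("orders", "customers", ...) are illustrative.
orders = [  # relvar: a set of tuples conforming to the heading
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
]
customers = [
    {"customer_id": 10, "name": "Ada"},
    {"customer_id": 11, "name": "Grace"},
]

def foreign_key_holds(child, fk_col, parent, pk_col):
    """True if every foreign-key value in the child relvar appears among the
    primary-key values of the parent relvar (a plain set-containment test)."""
    fk_values = {row[fk_col] for row in child}
    pk_values = {row[pk_col] for row in parent}
    return fk_values <= pk_values  # subset test, straight from set theory

print(foreign_key_holds(orders, "customer_id", customers, "customer_id"))  # True
```

The subset test is exactly the property stated above: every foreign-key value of table M must appear among the primary-key values of table N.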
Doug believes an entity relationship diagram (ERD) is the most important artefact created as part of any system using data in a database.
He also explained how an ERD is also a data structure graph (DSG) and what the difference is between a topology and a graph.
That is important because different kinds of math can be carried out on a graph compared to a topology.
When data moves through an enterprise there are several steps it has to go through to get from one application to various layers in the data enrichment platform.
That data flow diagram (DFD) is actually one edge in a graph – so when factoring in the concept of distance, the platform's ETL architecture diagram is, in fact, a topology.
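As a rough illustration of the ERD-as-graph idea – again with made-up table names – the model can be loaded into a graph library and queried for distances between tables, which is the point at which the structure starts behaving like a topology rather than just a picture.

```python
# A small sketch (assumed table names) of an ERD treated as a data
# structure graph: tables are nodes, foreign-key relationships are edges.
import networkx as nx

erd = nx.Graph()
erd.add_edges_from([
    ("customer", "order"),
    ("order", "order_line"),
    ("product", "order_line"),
])

# Once edges carry a notion of distance (e.g. how many hops a record travels
# through the ETL layers), metric and topological questions become possible.
print(nx.shortest_path_length(erd, "customer", "product"))  # 3 hops
```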
Doug prefers the term data enrichment platform precisely because so many senior executives baulk at the very mention of a data warehouse.
Identifying important business keys is key
A data engineering or DataOps team can manage the creation and running of a data enrichment platform – capable of presenting the data through a business intelligence (BI) tool as part of the platform’s function.
The platform can be based on a dimensional model built around two archetypal kinds of table – facts and dimensions – with a few distinct tables of each, depending on the enterprise’s reporting and analytical requirements.
To enhance the data enrichment platform, identifying the important business keys is fundamental if the solution is to incorporate a successful Data Vault 2.0 implementation.
The business keys are the keys the enterprise uses to identify its records in the source systems, whether they are account numbers or customer IDs.
The problem is that those keys are not the primary keys of their tables, and most primary keys are meaningless away from the source systems they are kept in.
Again, maths is at hand to help here, Doug says: you need to work out the relationships between those tables and infer what those relationships mean.
Reverse engineering the data model converts it into a graph structure – and from that graph, how can the business keys in your application data model be identified?
Doug’s answer is that any table that is one sigma away from the mean of the degree distribution should hold the business keys for that application.
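One way to read that heuristic – treating “one sigma away” as at least one standard deviation above the mean degree – is sketched below with invented degree counts; the real figures would come from reverse-engineering your own schema.

```python
# A hedged sketch of the degree-distribution heuristic: tables whose degree in
# the ERD graph sits at least one standard deviation above the mean are
# candidates for holding the business keys. Table names and counts are invented.
import statistics

degrees = {"customer": 5, "order": 7, "order_line": 2,
           "product": 4, "audit_log": 1, "lookup_status": 1}

mean = statistics.mean(degrees.values())
sigma = statistics.pstdev(degrees.values())

candidates = [t for t, d in degrees.items() if d >= mean + sigma]
print(candidates)  # e.g. ['order'] — the highly connected tables worth profiling first
```

Highly connected tables tend to be the ones the rest of the schema hangs off, which is why they are the natural place to start looking for business keys.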
Data has to be transformed because the approaches used in dealing with data differ within each business.
Use information theory to identify the keys
Business users need data optimised to answer their analytical questions, while the source systems’ data repositories are optimised for performance and availability.
So in the search for the business keys, a simple equation from information theory offers what Doug calls the “entropy of the column”.
When run against multiple groupings of sets of rows, it will identify the likely business keys.
The formula ∆Ctn/∆Rt – the change in the count of distinct values of column n against the change in the number of rows – should churn out a number for each column; any that are close to unity are candidate business keys, says Doug.
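One plausible reading of that ratio – my interpretation, not Doug's exact formula – is to compare the count of distinct values in each column against the number of rows profiled; columns whose ratio sits near 1 behave like keys. The rows below are invented for illustration.

```python
# A minimal interpretation of the ΔC/ΔR idea: for each column, compare the
# count of distinct values against the number of rows profiled.
# Column names and values are illustrative only.
rows = [
    {"account_no": "A-100", "status": "open",   "region": "EU"},
    {"account_no": "A-101", "status": "open",   "region": "EU"},
    {"account_no": "A-102", "status": "closed", "region": "US"},
]

def key_likelihood(rows, column):
    """Ratio of distinct values to total rows; close to 1.0 suggests a key."""
    distinct = len({r[column] for r in rows})
    return distinct / len(rows)

for col in rows[0]:
    print(col, round(key_likelihood(rows, col), 2))
# account_no 1.0   <- likely business key
# status 0.67, region 0.67  <- descriptive attributes
```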
With a selection of, say, 15 candidate keys, a question can then be posed to the business as to which are the most important to them.
From there, if you apply some ontology work – as advocated by the likes of Bruce McCartney – to those candidate keys, the data enrichment you get by adding some natural language enrichment should be significant.
You must ask for the business users’ views
All that progress is based on looking at the data model, speaking to a few application developers and data profiling a few tables – reducing the target list from literally hundreds with some simple maths – before asking the business users for their view. If your business has half a dozen source systems or more, imagine the possibilities if this approach were applied to all of them. From there, you need to find some tables with at least two business keys in the same row.

Data has to be transformed because different applications are designed only to focus on the requirements dictated by the business to the development team. However, a mathematical tool from category theory – the functor – describes how to transform one mathematical object into another, allowing you to map groups to sets, categories to groups and other objects from one sort to another. So data can be migrated from one structure to another: having identified the business keys and, from them, the business-important relationships, we can gather all the contextual information about the hubs and links and put it into satellites, as per the Data Vault method.

The resulting metadata can be added to an automation tool, but that still leaves you with the cost of crunching those numbers. Yet again, column entropy analysis can be used to identify smaller clusters of columns within a satellite – reducing a 40-column satellite to, say, groups of 20, 15 and 5 columns that share a similar rate of change – which reduces the possible number of queries.
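A rough sketch of that last step, with invented change rates: columns are bucketed into separate, narrower satellites according to how often their values change between loads.

```python
# A sketch (assumed change-rate figures) of splitting a wide satellite by rate
# of change: columns that change at a similar rate go into the same satellite,
# cutting the data rewritten and queried per load.
change_rate = {                # fraction of loads in which the column changes
    "address": 0.02, "phone": 0.03, "email": 0.04,
    "balance": 0.90, "last_login": 0.95,
    "date_of_birth": 0.0, "gender": 0.0,
}

def bucket(rate):
    """Assign a column to a satellite bucket by its rate of change."""
    if rate == 0.0:
        return "sat_static"
    if rate < 0.10:
        return "sat_slow"
    return "sat_fast"

satellites = {}
for col, rate in change_rate.items():
    satellites.setdefault(bucket(rate), []).append(col)

print(satellites)
# {'sat_slow': ['address', 'phone', 'email'],
#  'sat_fast': ['balance', 'last_login'],
#  'sat_static': ['date_of_birth', 'gender']}
```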
Using maths to manage hubs and links
And when it comes to links – say between two hubs – maths can again remove the need to transform the data, create a dimensional model, add things to a fact table and visualise the results.

The rate of change can be defined for each hub as ∆X and ∆Y, which with a bit of manipulation becomes dX/dY. Links, as Doug explained earlier, are tuples, so larger links that join multiple hubs together can also be reduced to a relatively simple equation:

∂x/∂t + ∂y/∂t + ∂z/∂t
These tuples, which represent relationships, also mean that a link between two hubs can be represented as a bipartite graph.
So every link in a Data Vault structure can be thought of as an n-partite graph – with ‘n’ representing the number of hubs that are related together in that link.
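A small sketch of the two-hub case, with made-up customer and product keys: each link row becomes an edge between the two partitions, and standard bipartite-graph tooling applies.

```python
# A sketch of a two-hub link viewed as a bipartite graph, using illustrative
# customer/product hub keys. Each link row is an edge between the partitions.
import networkx as nx

link_rows = [("C1", "P9"), ("C1", "P7"), ("C2", "P9"), ("C3", "P4")]

g = nx.Graph()
g.add_nodes_from({c for c, _ in link_rows}, bipartite=0)  # customer hub keys
g.add_nodes_from({p for _, p in link_rows}, bipartite=1)  # product hub keys
g.add_edges_from(link_rows)

print(nx.is_bipartite(g))   # True
print(dict(g.degree()))     # e.g. {'C1': 2, 'P9': 2, ...} — relationship counts per key
```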
Doug believes that if the business key is a customer and the 40 attributes are all demographic, then treating those hubs – each with one satellite, or several – with data science inside the Data Vault could identify where your marketing efforts are succeeding.
Data science benefits from clustering
So with just a bit of clustering, data science can bring results without any additional data structures to assist with the data analytics.
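As a toy example of that claim – the demographic values below are invented, and KMeans is just one convenient choice of clustering algorithm – the satellite's attributes can be clustered directly, with no extra data structures built for the purpose.

```python
# A hedged sketch of clustering directly on a customer satellite's demographic
# attributes. All values are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

# each row: [age, annual_spend, visits_per_month] for one customer hub key
demographics = np.array([
    [22, 300, 2], [25, 350, 3], [24, 320, 2],     # younger, lower spend
    [51, 2200, 8], [48, 2500, 9], [55, 2100, 7],  # older, higher spend
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(demographics)
print(labels)  # two customer segments, e.g. [0 0 0 1 1 1]
```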
Further still, just as the probability of a die landing on one is 1/6, each of your product hub keys has the same probability of being chosen at any one time – one over the number of rows.
A bit of work leveraging the various traits associated with that hub then gives the different probabilities under different circumstances, e.g. seasonal patterns.
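The dice analogy might look something like this in practice, with invented link rows: a uniform prior over the product hub keys, refined into per-season estimates once the observed relationships are counted.

```python
# A toy sketch of the dice analogy: with n product hub keys and no other
# information, each has probability 1/n; observed link rows refine that per season.
# Keys and seasons are illustrative only.
from collections import Counter

link_rows = [("P1", "summer"), ("P1", "summer"), ("P2", "summer"),
             ("P1", "winter"), ("P3", "winter"), ("P3", "winter")]

products = {p for p, _ in link_rows}
print({p: 1 / len(products) for p in products})        # uniform prior: 1/3 each

summer = [p for p, season in link_rows if season == "summer"]
counts = Counter(summer)
print({p: counts[p] / len(summer) for p in products})  # seasonal estimate
```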
Doug is convinced the old adage that data scientists spend 80% of their time identifying and preparing data – and just 20% analysing it – could alter radically with better use of quite common mathematics.
So if data scientists understand data structures from a mathematician’s perspective and data architects and data engineers communicate with them through the language of maths, your business’s overall productivity could benefit considerably.
“The laws of nature are written in the language of mathematics – the symbols are triangles, circles, and other geometrical figures without whose help it is impossible to comprehend a single word.”
– M Kline

Doug’s own ‘Performance Law’ states that what does not get organised with a data structure must get organised with application code. With many business intelligence tool vendors requiring a dimensional model for reporting, combinatorics explains the potential number of queries that can be run against the dimensional tables a BI application requires – with d dimensions, for example, there are 2^d − 1 non-empty combinations of dimensions to group by.

So if you are at the stage where the volume of data has grown to the point where transforming it multiple times has become untenable, you need to revisit the problem using maths, to allow data scientists to analyse those data structures. Data scientists require similarity, context and relationships – so in some ways they talk about the same things as data architects.

So Doug recommends that if your corporate executives decide they don’t have time to build a data warehouse, ask whether you can build an enrichment platform that supports data science by transforming data using functors; that supports probability and graph analysis, set-theoretic clustering, topological analysis and n-dimensional dynamic data representation; and that produces data products able to support combinatorial reporting by connecting to an ontological knowledge graph, so that the load on any front-end data analysis can be reduced by applying the laws of economic supply and demand.

The answer, to borrow a phrase, is: “Do the math, stupid!”

To look at Doug Needham’s book, click here.
Author: ANDREW GRIFFIN