
Bringing streaming data into Data Vault in (near) real-time

  • Andrew Griffin
  • Jun 21, 2021
  • 3 min read

Bruce McCartney shows how streaming can complement a Data Vault architecture

Data is becoming available in larger and more complex sets – and increasingly in real-time.


That is increasing the demands on data analytics teams, and putting more pressure on organisations’ data platforms to turn that information around accurately and meaningfully in “real-time.”


Bruce McCartney has a long history as a programmer and data architect, is a qualified Data Vault 2.0 practitioner and trainer, and is a long-time disciple of Dan Linstedt’s work. He works as both a consultant and instructor from his base in Calgary, Canada.


His presentation to the UK Data Vault Users Group on ‘Streaming Data into the Data Vault in Near Real-Time’ was therefore very timely.


You can listen to Bruce’s full presentation and explanations by clicking here.

What is driving the trends in real-time data integration and analytics? Put simply, the Internet of Things, with the IDC (International Data Corporation) forecasting some 150 billion devices connected across the globe by 2025 – nearly all creating data in real-time. That real-time data figure was put at 15 per cent of the “Datasphere” four years ago – in another four years it is expected to double again.

Couple that with the increasing business use of more complex algorithms, machine learning and artificial intelligence (ML and AI), and the data demand graph is pushed ever higher.


So why does your data analytics need to be able to operate in real-time, and what are the best methods to achieve it?


Successful businesses in the digital age are very good at monetising, managing and measuring information to create competitive advantage, a discipline dubbed “infonomics”. Analysing data in real-time not only identifies critical decision-timing by consumers, but also helps quantify the Business Value Index of that information, as explained by author Douglas Laney.


Bruce believes the most important factor in determining a business’s approach to analysing real-time data is the cost of achieving lower latency in its operation.

The cost of analysing data within 12 to 24 hours of it arriving in the source system is very different from the cost of doing so within one to 12 hours.


And should you want, or need, to do it in under five minutes, or even in under 10 seconds, the goalposts move again considerably.

In all four scenarios, the consistency, accuracy, availability and cost of that data are key, although the development of software to ease the process is making it cheaper.

So what is the best approach if you decide that real-time is what the business needs?


As in many areas of life, it is about expectations… and they need to be set carefully, both for the latency of the system and for the instruments used to deliver it.

The BI (Business Intelligence) architecture Bruce explored requires a data lake to act as a permanent staging area (PSA), or landing area: a store for the raw data in its many forms, whether CSV, XML or JSON, feeding the data warehouse.

Processing data into your Data Vault can be done in the conventional way, with full or incremental batches, or in the more modern way, with micro-batches.
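
To make the micro-batch idea concrete, here is a minimal sketch in Python that polls a landing area every few minutes and loads only the files that have arrived since the last run. The folder paths, table name and five-minute interval are illustrative assumptions, not part of Bruce’s material.

```python
# Minimal micro-batch sketch: every few minutes, pick up only the files that
# landed since the previous run and push them into a staging table.
# Paths, the table name and the 5-minute interval are illustrative assumptions.
import time
from pathlib import Path

LANDING_DIR = Path("/data-lake/landing/orders")        # hypothetical PSA / landing area
WATERMARK_FILE = Path("/data-lake/state/orders.watermark")
INTERVAL_SECONDS = 300                                  # 5-minute micro-batches

def load_to_staging(files):
    """Placeholder for the real load (e.g. a COPY/INSERT into a staging table)."""
    for f in files:
        print(f"loading {f} into STG_ORDERS")

def run_micro_batches():
    while True:
        last_run = float(WATERMARK_FILE.read_text()) if WATERMARK_FILE.exists() else 0.0
        # Only files newer than the watermark belong to this micro-batch.
        new_files = [f for f in sorted(LANDING_DIR.glob("*.json"))
                     if f.stat().st_mtime > last_run]
        if new_files:
            load_to_staging(new_files)
            WATERMARK_FILE.write_text(str(max(f.stat().st_mtime for f in new_files)))
        time.sleep(INTERVAL_SECONDS)
```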

You only need to look at the likes of LinkedIn and Uber to see how streams can scale, with trillions of messages generated each day for thousands of consumers and producers.


Bruce explained why he is a big fan of using Kafka to take care of the streaming, software used by some of the biggest enterprise data operations around the world.
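
For anyone who has not worked with it, consuming a Kafka topic from Python looks roughly like the sketch below, using the confluent-kafka client. The broker address, topic name and group id are placeholders; in the kind of pipeline Bruce describes, each message would be written to the landing area or staging layer rather than printed.

```python
from confluent_kafka import Consumer

# Placeholder connection settings; point these at your own cluster and topic.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "dv-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # In a real pipeline this record would be landed in the data lake or
        # staging layer rather than printed.
        print(msg.key(), msg.value())
finally:
    consumer.close()
```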


There are plenty of options in the cloud for data storage, and have been for some time now. The most important developments have come on the compute side, with solutions offering lower levels of abstraction, while some of the former ETL vendors have created packages with additional features. Bruce particularly likes Kafka’s KSQL to take care of the SQL required.
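
As a hedged illustration of what “taking care of the SQL” can look like, the snippet below submits a CREATE STREAM statement to a ksqlDB server over its REST interface. The server address, topic and column definitions are assumptions for the example.

```python
import requests

# Hypothetical stream over an "orders" topic; the columns, topic name and
# server address are assumptions, not from Bruce's presentation.
statement = """
    CREATE STREAM orders_stream (order_id VARCHAR, customer_id VARCHAR, amount DOUBLE)
    WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
"""

response = requests.post(
    "http://localhost:8088/ksql",                      # default ksqlDB REST listener
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    json={"ksql": statement, "streamsProperties": {}},
)
response.raise_for_status()
print(response.json())
```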


Most importantly, if you go down the streaming route, Data Vault 2.0 will support the most demanding streaming you can construct. The loading mechanisms Data Vault uses mean that out-of-sequence data, or any reprocessing of input data, will result in the same logical output, while parallel loading from multiple sources ensures the processes cannot interfere with each other.
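
That determinism comes from the insert-only, hash-keyed loading pattern: a hub row’s key is a deterministic hash of the business key, so replaying or re-ordering the same input always yields the same row. Below is a minimal sketch of the idea, with a made-up customer hub and an in-memory dictionary standing in for the database table.

```python
import hashlib
from datetime import datetime, timezone

# In-memory stand-in for HUB_CUSTOMER; a real load would use an
# INSERT ... WHERE NOT EXISTS (or equivalent) against the database.
hub_customer = {}

def hub_hash_key(business_key: str) -> str:
    """Deterministic hash key from the business key (MD5 here, a common DV 2.0 choice)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def load_hub_customer(business_key: str, record_source: str):
    hk = hub_hash_key(business_key)
    # Insert-only: an existing key is simply skipped, so reprocessing or
    # out-of-sequence arrival cannot change the logical result.
    if hk not in hub_customer:
        hub_customer[hk] = {
            "customer_id": business_key,
            "load_dts": datetime.now(timezone.utc),
            "record_source": record_source,
        }

# Replaying the same message, in any order, leaves the hub unchanged.
load_hub_customer("C-1001", "orders-topic")
load_hub_customer("C-1001", "orders-topic")
assert len(hub_customer) == 1
```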


In conclusion, Bruce showed how a Data Vault 2.0 architecture helps meet the increasing business requirements for real-time data.
