
Data Quality and Development Agility

  • Hannah Dowse
  • Apr 9, 2024
  • 3 min read


Bad data is estimated to cost the world $1 trillion annually, and with the volume of data continuing to rise unabated, ever more valuable resources are being wasted: data investment is reported to have risen from $259 billion in 2021 to $348 billion in 2023, while the number of days lost to data downtime nearly doubled from 81 million in 2022 to 150 million just twelve months later. According to Gartner, the total annual cost of bad data came to nearly $13m per enterprise organisation in 2021 alone.

But data quality tools are not the answer to the data quality problem, according to Hung Dang, founder and CEO of Y42, the turnkey data orchestration platform. During his presentation to this month’s Data Vault User Group, Hung revealed how frustration with the many tools needed to build out a proper data infrastructure had inspired him to create the Y42 orchestration solution, which enables clients to design production-ready data pipelines on a Snowflake or Google BigQuery data warehouse.

Bad data is the enemy, Hung warned: ungoverned and malformed data not only damages productivity, it can lead to misinformed decisions, and the societal and ethical impact of invalid AI models is rising. More than half of the organisations questioned in Gartner’s survey admitted they do not trust their data, and nearly 80% intend to invest in data quality solutions, creating a $2 billion market for data quality tools in 2024.

Bad data can only enter a pipeline in one of two ways

Hung stressed that Y42’s ability to govern both pipeline updates and source updates, preventing bad data from ever going live, makes it unique. Bad data has two potential pathways into a data pipeline: through code updates with broken logic, and through source updates that allow it to enter production. The limitation of the prevailing paradigm is that, in many organisations, data quality problems have to occur in production before they can be discovered at all.
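As a rough, hypothetical sketch of what guarding those two pathways can look like (plain Python, not any specific Y42 feature, with made-up field names): a simple data contract blocks malformed source updates, and a fixture-based test catches broken transformation logic before either reaches production.

# Hypothetical illustration of the two ways bad data enters a pipeline,
# and a guard for each. This is a minimal sketch, not Y42 code.

# Pathway 1: a source update delivers malformed records.
# Guard: validate incoming rows against a simple data contract before loading.
CONTRACT = {"order_id": int, "amount": float, "currency": str}

def validate_source(rows: list[dict]) -> list[dict]:
    """Reject any row that is missing a field or has the wrong type."""
    bad = [r for r in rows
           if any(not isinstance(r.get(col), typ) for col, typ in CONTRACT.items())]
    if bad:
        raise ValueError(f"{len(bad)} malformed row(s) blocked before production: {bad}")
    return rows

# Pathway 2: a pipeline update ships broken transformation logic.
# Guard: run the new logic against a known fixture and assert the expected result.
def total_revenue(rows: list[dict]) -> float:
    return sum(r["amount"] for r in rows)

def test_total_revenue() -> None:
    fixture = [{"order_id": 1, "amount": 10.0, "currency": "EUR"},
               {"order_id": 2, "amount": 5.5, "currency": "EUR"}]
    assert total_revenue(validate_source(fixture)) == 15.5

if __name__ == "__main__":
    test_total_revenue()
    print("both guards passed")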

Y42 offers an alternative to the usual pattern of deploying code changes that appear valid and hoping they are correct once the defined operations run on the live datasets. In that pattern the actual outcome cannot be validated up front, so problems are only identified and fixed downstream. A product that allows those operations to be tested and validated offline, by building the new version of the dataset before it is published, means an update can be written and verified before it ever goes live.

In the beginning, the demand for data centred on simple pipelines fed by a few sources, whether that was sales, operations or marketing data fed by the earliest forms of social media. But the world of data and data analytics is changing constantly, and in this brave new world the desire and the need is to deploy faster while maintaining rigorous quality for the information leveraged in applications and key automated decision-making.

The danger lies in not following best practices: tooling and processes that support ad-hoc analytics collapse under the exponential complexity of broader data use cases and the requirements of SLAs. Hung is a firm believer that by adopting DataOps best practices, good governance and an agile approach to data projects, it is possible to achieve scalable data infrastructure and processes. The difference between the two approaches in engineering hours and cost is considerable: teams deploy three to five times faster while data quality is maintained at 99.9%.
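The offline "build, validate, then publish" approach described above can be sketched in miniature, assuming hypothetical table names and an in-memory SQLite database standing in for a Snowflake or BigQuery warehouse.

# Minimal sketch of the write-audit-publish idea: materialise the new dataset
# version offline, validate it, and only then make it the live table.
# Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 10.0), (2, 5.5)])

# 1. Write: build the candidate version under a staging name, not the live name.
con.execute("""
    CREATE TABLE orders_staging AS
    SELECT order_id, amount FROM raw_orders WHERE amount IS NOT NULL
""")

# 2. Audit: validate the candidate offline, before anything goes live.
row_count, null_ids = con.execute(
    "SELECT COUNT(*), SUM(order_id IS NULL) FROM orders_staging"
).fetchone()
assert row_count > 0, "candidate dataset is empty - refusing to publish"
assert null_ids == 0, "candidate dataset has NULL keys - refusing to publish"

# 3. Publish: only a validated candidate is renamed to the live table.
con.execute("DROP TABLE IF EXISTS orders")
con.execute("ALTER TABLE orders_staging RENAME TO orders")
print("published", con.execute("SELECT COUNT(*) FROM orders").fetchone()[0], "rows")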

‘Using a Data Vault resolves many modelling issues’

Hung believes employing a Data Vault within your data platform is the best way to ensure you have a scalable data modelling methodology, because it was developed specifically to address agility and scalability issues. In the same vein, using Git for version control is a major advantage: it allows for collaboration, branching and merging, traceability, speed and efficiency, and it provides a built-in measure of disaster recovery, since every clone is a backup of all data and history. He also recommends using environments in the data warehouse: each acts as a separate instance with its own data and configuration, so changes can be managed and tested safely before being deployed to the production environment. Version-controlling data today requires hefty CI/CD scripts and duplicated storage and compute, compared with version-controlling code via Git.

Market research has also found that more than one in two data teams rank data quality as a top priority going forward, and Y42 aims to drive the number of data incidents towards zero to cure the world’s trillion-dollar bad data problem. Resolving such data quality problems with supercharged continuous integration and continuous deployment will not only help businesses deploy faster, but also reduce the operating time of a data warehouse by 30-60%, Hung predicted.
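As a generic illustration of the Data Vault pattern Hung describes (not Y42's implementation, and with hypothetical column names), the sketch below splits a source record into a hub row that carries only the hashed business key and a satellite row that carries descriptive attributes; new sources or attributes add satellites rather than forcing a remodel of the hub, which is where the agility comes from.

# Simplified, generic sketch of Data Vault-style loading: business keys are
# hashed into a hub, descriptive attributes land in a satellite keyed by the
# same hash. Column names and the hashing convention are illustrative only.
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Deterministic hash of the business key(s), used as the hub key."""
    return hashlib.md5("||".join(business_keys).upper().encode()).hexdigest()

def load_customer(record: dict, source: str) -> tuple[dict, dict]:
    """Split one source record into a hub row (identity) and a satellite row (context)."""
    load_ts = datetime.now(timezone.utc).isoformat()
    hub_row = {
        "customer_hk": hash_key(record["customer_id"]),
        "customer_id": record["customer_id"],
        "record_source": source,
        "load_ts": load_ts,
    }
    sat_row = {
        "customer_hk": hub_row["customer_hk"],
        "name": record["name"],
        "email": record["email"],
        "load_ts": load_ts,
    }
    return hub_row, sat_row

hub, sat = load_customer({"customer_id": "C-42", "name": "Ada", "email": "ada@example.com"}, "crm")
print(hub["customer_hk"] == sat["customer_hk"])  # hub and satellite share the same key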
