How testing can speed up delivery of business value with Data Vault
- Andrew Griffin
- May 17, 2023
- 4 min read
Arguably the most important part of delivering a data warehouse project – certainly in terms of delivering business value and rapid deployment – comes from the strength of the testing built into any Data Vault 2.0 solution.
Thorough testing will ensure your project produces the insights that lead to value, increased efficiency and a better return on investment. If done correctly, testing will also speed up delivery and ensure your team's project works in an agile way.
Datavault data engineer Chris Fisher explained the finer points of ensuring that testing before deployment is both comprehensive and accurate. In his presentation to the latest Data Vault User Group meeting, he explained that ‘Using Testing to Deliver Business Value with Data Vault’ was achievable for both technical stakeholders (the IT team) and non-technical ones (business users).
The key is employing UAT (user acceptance testing) and BDD (behaviour-driven development) to test both applications and business logic, ensuring user requirements are incorporated and making for an easier sign-off process.
Why Python behave is a ‘very important tool’
Chris revealed he is a big fan of the Python behave tool, which is very useful for integration testing because it expresses requirements in plain English (using the Gherkin feature-file format).
That allows testing specifications to be written in both technical and non-technical language.
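As an illustration – this is a hypothetical feature file, not one from Chris's presentation – a behave specification for a Data Vault hub load might read:

```gherkin
Feature: Loading the customer hub
  Scenario: New business keys create hub records
    Given a staging table containing 3 new customer business keys
    When the hub loading script runs
    Then the customer hub should contain 3 records
    And re-running the script should not create duplicate records
```

Business users can read and agree the scenario directly, while developers implement a matching Python step function for each Given/When/Then line.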
Subjecting a data warehouse development to the same testing principles used in software development produces working software and data while managing user expectations.
The biggest benefits of any testing regime come from using automation to remove as many time-consuming manual tasks as possible, since those tasks carry the potential for human error.
Errors can stall the whole process, leaving systems that either fail or produce results based on bad or poor-quality data, rendering them potentially worthless.
As well as being unreliable, manual testing will also fail to exercise the full functionality and usability of the application, and risks leaving redundant code behind.
Chris gave a full account of the Test-Driven Development (TDD) lifecycle before explaining the three levels of the testing pyramid, based on his own experience, starting with:
Unit testing – bottom up style approach, isolating each part of the coded program to ensure each unit is correct
Integration testing – tests that multiple units work together correctly: table joins, aggregations and cloud infrastructure, including the raw and business vaults
End-to-end testing – often skipped because it costs more for seemingly little reward, but it checks the system from start to finish and gives you a fully auditable Data Vault solution.
Overall, TDD inspires confidence in the code – and users want to be able to believe in their data too. SQL is also code: it can be tested using IDEs and linters, while its behaviour can be tested on the data sets it applies to.
The three stages of testing
Chris, who has helped run a number of projects for Datavault clients, including one government organisation which required very high standards of data security, then went through the three different stages of testing.
This included key automation, utilising Python’s unittest and pytest libraries, and data transformation tools that typically provide column-level tests and assertions.
He also explained the business value of unit testing: it promotes clean coding standards, identifies bugs in code and data early, and builds trust among developers and users.
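To make this concrete, here is a minimal pytest-style sketch – the `hub_hash_key` function and its normalisation rules are illustrative assumptions, not code from the presentation, though MD5 hash keys over cleaned business keys are a common Data Vault pattern:

```python
import hashlib

def hub_hash_key(business_key: str) -> str:
    """Illustrative hub hash key: trim and upper-case the business key,
    then take an MD5 digest (a common Data Vault 2.0 convention)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def test_hash_key_is_deterministic():
    # The same business key must always yield the same hash key.
    assert hub_hash_key("CUST-001") == hub_hash_key("CUST-001")

def test_hash_key_normalises_case_and_whitespace():
    # Dirty source data should not create duplicate hub records.
    assert hub_hash_key(" cust-001 ") == hub_hash_key("CUST-001")

def test_hash_key_length():
    # MD5 hex digests are always 32 characters.
    assert len(hub_hash_key("CUST-001")) == 32
```

Tests like these run in milliseconds on every commit, catching key-generation bugs long before they reach the warehouse.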
Integration testing can be implemented at multiple points in any Data Vault layer structure, allowing for testing of data loading scripts automatically, templates for hubs, links and satellites, and mart layer logic for star schemas.
The latter is best done with Python behave, and some data transformation tools have referential integrity tests built in. Integration testing will also ensure all architectural components can co-operate with each other.
Another advantage is that integration tests are non-destructive: they can run in a local development environment as well as in production, and as part of a continuous integration/continuous delivery (CI/CD) pipeline.
Python behave will help satisfy user requirements – Chris gave examples of technical and non-technical terminology that could be used with the tool, and illustrated the lifecycle of integration testing.
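A referential integrity check of the kind mentioned above can be sketched in a few lines. The schema here is a made-up in-memory example (a customer hub and one satellite), not a real client model, but the pattern – satellites must never contain hash keys absent from their parent hub – is standard Data Vault:

```python
import sqlite3

# Illustrative in-memory warehouse: one hub and one satellite.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY);
    CREATE TABLE sat_customer_details (customer_hk TEXT, name TEXT);
    INSERT INTO hub_customer VALUES ('hk1'), ('hk2');
    INSERT INTO sat_customer_details VALUES ('hk1', 'Alice'), ('hk2', 'Bob');
""")

def orphaned_satellite_rows(conn) -> int:
    """Referential-integrity test: count satellite rows whose hash key
    has no matching parent row in the hub."""
    return conn.execute("""
        SELECT COUNT(*) FROM sat_customer_details s
        LEFT JOIN hub_customer h ON s.customer_hk = h.customer_hk
        WHERE h.customer_hk IS NULL
    """).fetchone()[0]

# A passing integration test: every satellite row has a parent hub row.
assert orphaned_satellite_rows(conn) == 0
```

Because the check is a read-only query, it is safe to run against development and production alike – exactly the non-destructive property Chris highlighted.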
Integration testing’s business value
Ultimately, the business value of integration testing is that users get what they want, with the flexibility to change user requests, ensuring the process conforms to agile methods.
As well as adding a layer of auditability, testing allows whole marts to be released in a matter of weeks – not months – cutting the cost of new features and reducing development time.
End-to-end testing in a data warehouse is cumbersome, especially with very large data sets, but if a persistent staging area (PSA) is maintained, it should reconcile with the Raw Vault.
So if Raw Vault tables are recreated and reconciled against the PSA, successful tests signify no data loss, giving developers and users far more confidence in reports, knowing the underlying data is sound.
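That reconciliation can be expressed as a simple set comparison. The tables and column names below are invented for illustration; the principle is that every row recovered from the Raw Vault must match the PSA exactly, in both directions:

```python
import sqlite3

# Toy example: a PSA table and the same data recreated from the Raw Vault.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE psa_customer (customer_id TEXT, name TEXT);
    CREATE TABLE rv_customer (customer_id TEXT, name TEXT);
    INSERT INTO psa_customer VALUES ('C1', 'Alice'), ('C2', 'Bob');
    INSERT INTO rv_customer  VALUES ('C1', 'Alice'), ('C2', 'Bob');
""")

def reconciles(conn, psa_table: str, rv_table: str) -> bool:
    """End-to-end check: the PSA and the table rebuilt from the Raw Vault
    must contain exactly the same rows (no loss, no invention)."""
    psa_rows = set(conn.execute(f"SELECT * FROM {psa_table}"))
    rv_rows = set(conn.execute(f"SELECT * FROM {rv_table}"))
    missing = psa_rows - rv_rows      # rows lost on the way into the vault
    unexpected = rv_rows - psa_rows   # rows the vault holds but the source never sent
    return not missing and not unexpected

assert reconciles(conn, "psa_customer", "rv_customer")
```

For genuinely large tables, comparing row counts and per-table checksums instead of full row sets follows the same logic at far lower cost.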
Again, such auditability of data will help prove compliance, leaving the business able to gain greater and more accurate insights.
That leads to better decision-making as errors are eliminated quickly and safely – in both development and production environments, all adding to the business value.
In conclusion, the testing benefits for the developer are:
Confidence in code
Less stressful deployment
Cleaner more modular code
Integrates features more easily
Continuous delivery of working code, and
Faster development – even if projects get more complex.
Testing boosts agile sprints if ‘planned properly’
Chris advised that incorporating testing into agile sprints requires planning – tests should be included in the same way documentation is.
By testing as you go, following the testing lifecycles Chris outlined, you avoid leaving it all to the end and the risk of having a major failure to resolve.
Then with the right acceptance criteria, features or business rules can be shown to be correct when they all pass the tests.
Once all tests have passed, the feature should be ready for delivery, making future sprints less stressful.
As your data warehouse scales, confidence in new features and refactors will grow, creating trust – and the rate of delivery will remain consistent, and may even increase.
Organisations often ignore testing because it is feared it will slow projects down. As Chris explained, the opposite is actually true.