Editor’s note: this article was originally published on the Iteratively blog on December 18, 2020.
You know the old saying, “Garbage in, garbage out”? Chances are you’ve heard that phrase in relation to your data hygiene. But how do you fix the garbage that is bad data management and quality? Well, it’s tricky, especially if you don’t have control over the implementation of tracking code (as is the case with many data teams).
However, just because data leads don’t own their pipeline from data design to commit doesn’t mean all hope is lost. As the bridge between your data consumers (product managers, product teams, and analysts, namely) and your data producers (engineers), you can help develop and manage data validation that will improve data hygiene all around.
Before we get into the weeds, when we say data validation we’re referring to the process and techniques that help data teams uphold the quality of their data.
Now, let’s look at why data teams struggle with this validation, and how they can overcome its challenges.
First, why do data teams struggle with data validation?
There are three main reasons data teams struggle with data validation for analytics:
- They often aren’t directly involved with the implementation of event tracking code and troubleshooting, which leaves data teams in a reactive position to address issues rather than in a proactive one.
- There often aren’t standardized processes around data validation for analytics, which means that testing is at the mercy of inconsistent QA checks.
- Data teams and engineers rely on reactive validation techniques rather than proactive data validation methods, which doesn’t stop the core data-hygiene issues.
Any of these three challenges is enough to frustrate even the best data lead (and the team that supports them). And it makes sense why: poor-quality data isn’t just expensive (IBM has estimated that bad data costs the U.S. economy about $3 trillion a year). Across the organization, it also erodes trust in the data itself and costs data teams and engineers hours of productivity spent squashing bugs.
The moral of the story? No one wins when data validation is put on the back burner.
Thankfully, these challenges can be overcome with good data validation practices. Let’s take a deeper look at each pain point.
Data teams often aren’t in control of the collection of data itself
As we said above, the main reason data teams struggle with data validation is that they aren’t the ones carrying out the instrumentation of the event tracking in question (at best, they can see there’s a problem, but they can’t fix it).
This leaves data analysts and product managers, as well as anyone who is looking to make their decision-making more data-driven, saddled with the task of untangling and cleaning up the data after the fact. And no one—and we mean no one—recreationally enjoys data munging.
This pain point is particularly difficult for most data teams to overcome because few people on the data roster, outside of engineers, have the technical skills to do data validation themselves. Organizational silos between data producers and data consumers make this pain point even more sensitive. To relieve it, data leads have to foster cross-team collaboration to ensure clean data.
After all, data is a team sport, and you won’t win any games if your players can’t talk to each other, train together, or brainstorm better plays for better outcomes.
Data instrumentation and validation are no different. Your data consumers need to work with data producers to put in place and enforce data management practices at the source, including testing, that proactively detect issues with data before anyone is on munging duty downstream.
This brings us to our next point.
Data teams (and their organizations) often don’t have set processes around data validation for analytics
Your engineers know that testing code is important. Not everyone enjoys doing it, but making sure that your application runs as expected is a core part of shipping great products.
Turns out, making sure analytics code is both collecting and delivering event data as intended is also key to building and iterating on a great product.
So where’s the disconnect? The practice of testing analytics data is still relatively new to engineering and data teams. Too often, analytics code is thought of as an add-on to features, not core functionality. This, combined with lackluster data governance practices, can mean that it’s implemented sporadically across the board (or not at all).
Simply put, this is often because folks outside the data team don’t yet understand how valuable event data is to their day-to-day work. They don’t know that clean event data is a money tree in their backyard, and that all they have to do is water it (validate it) regularly to make bank.
To make everyone understand that they need to care for the money tree that is event data, data teams need to evangelize all the ways that well-validated data can be used across the organization. While data teams may be limited and siloed within their organizations, it’s ultimately up to these data champions to break down the walls between them and other stakeholders and ensure the right processes and tooling are in place to improve data quality.
To overcome this wild west of data management and ensure proper data governance, data teams must build processes that spell out when, where, and how data should be tested proactively. This may sound daunting, but in reality, data testing can snap seamlessly into the existing Software Development Life Cycle (SDLC), tools, and CI/CD pipelines.
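As a rough illustration of how analytics testing can ride along with ordinary unit tests in CI, here is a minimal sketch. Everything in it is hypothetical: `RecordingTracker` is a stand-in test double, `complete_checkout` is made-up feature code, and the event name and properties are invented for the example. It is not any vendor’s SDK.

```python
class RecordingTracker:
    """Hypothetical test double: records events instead of sending them."""

    def __init__(self):
        self.events = []

    def track(self, name, properties):
        # Copy the properties so later mutation cannot change what was recorded.
        self.events.append((name, dict(properties)))


def complete_checkout(tracker, cart_total, item_count):
    """Hypothetical feature code that fires exactly one analytics event."""
    tracker.track(
        "Checkout Completed",
        {"cart_total": cart_total, "item_count": item_count},
    )


def test_checkout_emits_expected_event():
    """Fails the build if the event name or its properties drift."""
    tracker = RecordingTracker()
    complete_checkout(tracker, cart_total=42.5, item_count=3)
    assert tracker.events == [
        ("Checkout Completed", {"cart_total": 42.5, "item_count": 3}),
    ]
```

A test runner such as pytest would collect the `test_` function alongside the rest of the suite, so a renamed event or a dropped property surfaces in the pull request rather than in a dashboard weeks later.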
Clear processes and instructions for both the data team designing the data strategy and the engineering team implementing and testing the code will help everyone understand the outputs and inputs they should expect to see.
Data teams and engineers rely on reactive rather than proactive data testing techniques
In just about every part of life, it’s better to be proactive than reactive. This rings true for data validation for analytics, too.
But many data teams and their engineers feel trapped in reactive data validation techniques. Without solid data governance, tooling, and processes that make proactive testing easy, event tracking often has to be implemented and shipped quickly to be included in a release (or retroactively added after one ships). This forces data leads and their teams to use techniques like anomaly detection or data transformation after the fact.
Not only does this approach not fix the root issue of your bad data, but it costs data engineers hours of their time squashing bugs. It also costs analysts hours of their time cleaning bad data and costs the business lost revenue from all the product improvements that could have happened if data were better.
Rather than be in a constant state of data catch-up, data leads must help shape data management processes that include proactive testing early on, and tools that feature guardrails, such as type safety, to improve data quality and reduce rework downstream.
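To make the idea of a type-safety guardrail concrete, here is a minimal sketch of an event defined as a typed object rather than a free-form dictionary. The `SignupCompleted` event, its fields, the `Plan` enum, and `to_payload` are all assumptions made up for illustration; real tooling typically generates this kind of wrapper from a tracking plan.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class Plan(Enum):
    """Hypothetical set of allowed values for the `plan` property."""
    FREE = "free"
    PRO = "pro"


@dataclass(frozen=True)
class SignupCompleted:
    """Hypothetical event definition; field names are illustrative only."""
    user_id: str
    plan: Plan
    referral_code: Optional[str] = None

    def __post_init__(self):
        # Runtime checks back up the type hints for callers that skip static analysis.
        if not isinstance(self.user_id, str):
            raise TypeError("user_id must be a string")
        if not isinstance(self.plan, Plan):
            raise TypeError("plan must be a Plan member")


def to_payload(event):
    """Turn the typed event into a plain dict ready to hand to a tracking client."""
    properties = asdict(event)
    properties["plan"] = event.plan.value  # send the string value, not the enum
    return {"event": "Signup Completed", "properties": properties}
```

With a definition like this, a misspelled property name fails at construction time, and a static type checker such as mypy can flag a wrong type before the code ever ships. That is the sense in which the guardrail is proactive.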
So, what are proactive data validation measures? Let’s take a look.
Data validation methods and techniques
Proactive data validation means embracing the right tools and testing processes at each stage of the data pipeline:
- In the client with tools like Amplitude to leverage type safety, unit testing, and A/B testing.
- In the pipeline with tools like Amplitude, Segment Protocols and Snowplow’s open-source schema repo Iglu for schema validation, as well as other tools for integration and component testing, freshness testing, and distributional tests.
- In the warehouse with tools like dbt, Dataform, and Great Expectations to leverage schematization, security testing, relationship testing, freshness and distribution testing, and range and type checking.
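To give a feel for what a few of the checks listed above do, here is a minimal standard-library sketch of schema validation (required properties, type checks, and a range check) plus a freshness test. The schema format, the `SONG_PLAYED_SCHEMA` event, and the function names are assumptions for illustration; this does not use the APIs of any of the tools named above, which handle these checks far more thoroughly.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema: property name -> validation rule.
SONG_PLAYED_SCHEMA = {
    "user_id": {"type": str, "required": True},
    "duration_seconds": {"type": int, "required": True, "min": 0},
    "source": {"type": str},
}


def validate_event(properties, schema):
    """Return a list of problems; an empty list means the event is valid."""
    problems = []
    for name, rule in schema.items():
        if name not in properties:
            if rule.get("required", False):
                problems.append(f"missing required property: {name}")
            continue
        value = properties[name]
        expected = rule["type"]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if not isinstance(value, expected) or (
            isinstance(value, bool) and expected is not bool
        ):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: {value} is below minimum {rule['min']}")
    # Properties that are not in the schema are usually typos or untracked changes.
    for name in properties:
        if name not in schema:
            problems.append(f"unexpected property: {name}")
    return problems


def is_fresh(latest_event_time, now, max_lag=timedelta(hours=1)):
    """Freshness test: True if the newest event arrived within max_lag of now."""
    return now - latest_event_time <= max_lag
```

For example, `validate_event({"duration_seconds": -5, "sorce": "radio"}, SONG_PLAYED_SCHEMA)` reports the missing `user_id`, the negative duration, and the misspelled `sorce` property in one pass, which is the kind of feedback that is far cheaper to get before the data lands in the warehouse.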
When data teams actively maintain and enforce proactive data validation measures, they can ensure that the data collected is useful, clear, and clean and that all data stakeholders understand how to keep it that way.
Furthermore, challenges around data collection, process, and testing techniques can be difficult to overcome alone, so it’s important that leads break down organizational silos between data teams and engineering teams.
How to change data validation for analytics for the better
The first step toward functional data validation practices for analytics is recognizing that data is a team sport that requires investment from data stakeholders at every level, whether it’s you, as the data lead, or an individual engineer implementing lines of tracking code.
Everyone in the organization benefits from good data collection and data validation, from the client to the warehouse.
To drive this, you need three things:
- Top-down direction from data leads and company leadership that establishes processes for maintaining and using data across the business
- Data evangelism at all layers of the company so that each team understands how data helps them do their work better, and how regular testing supports this
- Workflows and tools to govern your data well, whether this is an internal tool, a mix of tools like Segment Protocols or Snowplow and dbt, or, even better, governance built into your analytics platform, such as Amplitude.
Throughout each of these steps, it’s also important that data leads share wins and progress toward great data early and often. This transparency will not only help data consumers see how they can use data better but also help data producers (e.g., your engineers doing your testing) see the fruits of their labor. It’s a win-win.
Overcome your data validation woes
Data validation is difficult for data teams because the data consumers can’t control implementation, the data producers don’t understand why the implementation matters, and piecemeal validation techniques leave everyone reacting to bad data rather than preventing it. But it doesn’t have to be that way.
Data teams (and the engineers who support them) can overcome data quality issues by working together, embracing the cross-functional benefits of good data, and utilizing the great tools out there that make data management and testing easier.