Identity Resolution: Data Warehouse vs. Customer Data Platform

Learn how identity resolution occurs in data warehouses vs. customer data platforms and which one is right for your business.

August 15, 2022
Arpit Choudhury
Founder, astorik

Everybody wants a single source of truth for customer data, but what it entails depends on who you’re asking.

Sure, the data warehouse is a “single store” for customer data collected across multiple sources; however, in the absence of identity resolution, the data is only half-true. Building a unified view of customer activity from the data is anything but trivial—those tasked with it can attest to the complexities involved in getting it right.

Moreover, the definition of identity resolution also varies from business to business—for certain industries, solving for identity resolution is a subset of a broader entity resolution problem.

Identity resolution, as the name suggests, concerns the identity of a person: an individual user or customer, who is one of several entities that a business deals with. Others include accounts, products, suppliers, vendors, partners, and resellers.

In this guide, though, I want to focus on identity resolution itself: the systems where it takes place, the differences between automated and manual identity resolution, and the benefits of deterministic over probabilistic matching.

Identity resolution: Where and how it happens

Identity resolution, as you probably already know, is the process of unifying user (or customer) records that are captured across multiple sources (or touchpoints).

But where does this process take place? Who performs the unification? How is the data captured and stored? And what are the prerequisite data points to make it all possible?

It’s important to have answers to these questions before investing in an identity resolution endeavor.

Data warehouse (DWH)

Bill Inmon, known as the father of the data warehouse, recently wrote an article titled “What A Data Warehouse Is Not” in which he debunks popular myths about what a data warehouse is. It’s a fascinating read, and I highly recommend it if you want a deeper understanding of what’s happening in the world of data warehousing.

The data warehouse, in its typical form, is a cloud database that stores customer data from disparate sources and is used for analytic workloads.

Before identity resolution can happen, data from first-party sources (apps, websites, or smart devices) has to be made available in the data warehouse, which is typically done using an internal or external customer data infrastructure (CDI) solution. What data is collected and how it is stored both matter, because identity resolution relies on a set of identifiers (IDs) that are used to match and merge user records originating from multiple sources.

Writing the unification code

The process of unifying or merging records starts once the requisite data is available in the warehouse. This is typically done by analysts who understand the datasets well and are adept at writing SQL queries that perform complex joins across tables. The results are persisted as new tables or materialized views, which then serve as the source of truth for analysis and activation.
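To make the idea concrete, here is a minimal sketch of such a unification query, run against an in-memory SQLite database from Python. The table names (app_users, web_signups, support_tickets), their columns, and the sample rows are all hypothetical. SQLite has no materialized views, so a plain CREATE TABLE ... AS SELECT stands in for one here; a real warehouse would use its own syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source tables, each landed in the warehouse by a different pipeline.
CREATE TABLE app_users (user_id TEXT, email TEXT);
CREATE TABLE web_signups (anonymous_id TEXT, email TEXT);
CREATE TABLE support_tickets (ticket_id TEXT, email TEXT);

INSERT INTO app_users VALUES ('u1', 'ana@example.com'), ('u2', 'bo@example.com');
INSERT INTO web_signups VALUES ('anon-7', 'Ana@Example.com'), ('anon-9', 'cy@example.com');
INSERT INTO support_tickets VALUES ('t-100', 'ana@example.com'), ('t-101', 'bo@example.com');

-- Join every source to app_users on a case-normalized email and persist the result.
CREATE TABLE unified_users AS
SELECT u.user_id,
       u.email,
       w.anonymous_id,
       COUNT(t.ticket_id) AS ticket_count
FROM app_users AS u
LEFT JOIN web_signups AS w ON LOWER(w.email) = LOWER(u.email)
LEFT JOIN support_tickets AS t ON LOWER(t.email) = LOWER(u.email)
GROUP BY u.user_id, u.email, w.anonymous_id;
""")

unified = conn.execute(
    "SELECT user_id, email, anonymous_id, ticket_count "
    "FROM unified_users ORDER BY user_id"
).fetchall()

for row in unified:
    print(row)
```

Note that the signup for cy@example.com never makes it into unified_users because no app user shares that email. Deciding what to do with such unmatched records is part of what makes this work non-trivial.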

Probabilistic vs. deterministic matching

When identifiers such as email, mobile number, device ID, and user ID are missing, or cannot be joined accurately for other reasons, one has to resort to what is referred to as probabilistic matching. It relies on indirect signals rather than on unique identifiers, which are typically personally identifiable information (PII).

Also known as fuzzy matching, probabilistic matching compares a combination of user properties such as name, location, operating system, and IP address, and merges two records when the potential match receives an acceptable score.
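A rough sketch of how such a score could be computed is shown below. The field names, the weights, and the 0.8 threshold are all assumptions made for illustration; production systems tune these against labeled data and often use more sophisticated models.

```python
from difflib import SequenceMatcher

# Hypothetical weights per signal; they sum to 1.0 so the score stays in [0, 1].
WEIGHTS = {"name": 0.5, "city": 0.2, "os": 0.1, "ip": 0.2}
THRESHOLD = 0.8  # hypothetical cut-off for an "acceptable" score


def match_score(a: dict, b: dict) -> float:
    """Combine a fuzzy name comparison with exact matches on the other signals."""
    name_similarity = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()
    ).ratio()
    score = WEIGHTS["name"] * name_similarity
    for field in ("city", "os", "ip"):
        if a[field] == b[field]:
            score += WEIGHTS[field]
    return score


def is_probable_match(a: dict, b: dict) -> bool:
    return match_score(a, b) >= THRESHOLD


web_visitor = {"name": "Jon Smith", "city": "Austin", "os": "iOS", "ip": "203.0.113.7"}
crm_contact = {"name": "John Smith", "city": "Austin", "os": "iOS", "ip": "203.0.113.7"}
stranger = {"name": "Maria Garcia", "city": "Austin", "os": "iOS", "ip": "198.51.100.2"}

print(is_probable_match(web_visitor, crm_contact))
print(is_probable_match(web_visitor, stranger))
```

The first pair clears the threshold despite the spelling difference in the name, which is exactly the flexibility, and the risk, of this approach.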

In simple terms, probabilistic matching is more flexible but is not 100% accurate. It makes sense to employ it for critical use cases such as fraud detection where the datasets are large and complex; however, it’s not recommended if your goal is to build data-powered personalized experiences.

Deterministic matching is more accurate simply because there’s no “guesswork” involved—it’s a 0 or 1 scenario based on the available identifiers. The benefits of this approach are covered below.
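By contrast, a deterministic check can be sketched as a plain equality test on normalized identifiers. The record schema and the list of identifier fields below are assumptions, and a real system would also need rules for conflicting identifiers.

```python
# Hypothetical identifier fields, checked in order.
IDENTIFIER_FIELDS = ("user_id", "email", "phone", "device_id")


def normalize(field: str, value: str) -> str:
    """Trim whitespace everywhere and lowercase emails before comparing."""
    value = value.strip()
    return value.lower() if field == "email" else value


def is_deterministic_match(a: dict, b: dict) -> bool:
    """Return True only when both records carry the same value for some identifier."""
    for field in IDENTIFIER_FIELDS:
        left, right = a.get(field), b.get(field)
        if left and right and normalize(field, left) == normalize(field, right):
            return True
    return False


signup = {"email": " Ana@Example.com ", "device_id": "d-1"}
purchase = {"email": "ana@example.com", "phone": "+15550100"}
lookalike = {"email": "ana@example.org", "device_id": "d-2"}

print(is_deterministic_match(signup, purchase))
print(is_deterministic_match(signup, lookalike))
```

There is no score and no threshold: two records either share an identifier or they do not.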

I’m hoping that you now have a fair understanding of how identity resolution is handled in the data warehouse. It’s time to understand how it’s done by CDPs.

Read my guide with Amplitude on Behavioral Data & Event Tracking to learn more about laying your data foundation.


Customer data platform (CDP)

I wanted to link to an article describing what a CDP is not (here’s what a CDP is), but unfortunately, I couldn’t find one. So let me first mention that a CDP is not a CDI, nor is it a CRM.

In essence, a customer data platform is, well, a platform on top of customer data infrastructure—the platform enables folks to segment and sync audiences with third-party tools using a visual interface.

So where does identity resolution take place and how?

Generally speaking, it takes place at the time of, or soon after, data collection. Under the hood, a CDP stores a copy of the data and, in an automated fashion, performs deterministic matching based on the supplied identifiers.
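The following is a simplified sketch of what resolution at collection time could look like; it is not how any particular CDP is implemented. The IdentityResolver class, its method names, and the event shape are hypothetical. Each incoming event carries an anonymous ID, a user ID, or both, and any identifiers seen together are attached to the same profile.

```python
from collections import defaultdict


class IdentityResolver:
    """Toy in-memory resolver that stitches identifiers as events arrive."""

    def __init__(self):
        self._profile_of = {}              # identifier -> profile key
        self._events = defaultdict(list)   # profile key -> event names

    def ingest(self, event_name, anonymous_id=None, user_id=None):
        ids = [i for i in (user_id, anonymous_id) if i]
        if not ids:
            raise ValueError("an event needs at least one identifier")
        # Reuse a profile already linked to any supplied identifier.
        known = [self._profile_of[i] for i in ids if i in self._profile_of]
        profile = known[0] if known else ids[0]
        # If the identifiers pointed at different profiles, merge them.
        for other in known[1:]:
            if other != profile:
                self._events[profile].extend(self._events.pop(other, []))
                for ident, key in list(self._profile_of.items()):
                    if key == other:
                        self._profile_of[ident] = profile
        for i in ids:
            self._profile_of[i] = profile
        self._events[profile].append(event_name)
        return profile

    def events_for(self, identifier):
        return list(self._events[self._profile_of[identifier]])


resolver = IdentityResolver()
resolver.ingest("page_view", anonymous_id="anon-1")
resolver.ingest("signup", anonymous_id="anon-1", user_id="u-42")
resolver.ingest("login", user_id="u-42")

history = resolver.events_for("u-42")
print(history)
```

The anonymous page view ends up on the same profile as the later login because the signup event supplied both identifiers at once. When two existing profiles are merged, this sketch simply concatenates their events rather than re-sorting them by time.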

As mentioned earlier, personally identifiable information (PII) plays a key role in enabling deterministic matching, which offers a high level of accuracy. Having one integrated system that both collects the data and performs the unification is what makes a CDP appealing.

Some CDP vendors have taken the probabilistic route and tout their offerings as superior. Instead of detailing the downsides of probabilistic matching, I’d like to highlight some of the key benefits of deterministic matching.

Deterministic identity resolution: Key benefits

Personalization is the holy grail for SaaS and ecommerce businesses, but when it goes wrong or is ill-timed, it can be more detrimental than no personalization at all.

Deterministic identity resolution not only ensures accurate personalization at scale but also enables businesses to be more privacy-friendly and adhere to regulations more strictly. Allow me to unpack this.

Personalization

Since deterministic identity resolution takes place only when the system can identify user records based on identifiers provided directly by the user (typically email or phone number), personalization efforts are highly unlikely to reach the wrong person.

Additionally, timeliness is ensured since CDPs are able to automatically perform identity resolution at the time of data collection.

A simple use case that applies to most SaaS businesses is to send a highly personalized welcome email to users—almost immediately after they sign up—that also takes into account other user attributes such as location, industry, or preferences.

SaaS businesses typically allow a user to create multiple accounts or workspaces, but sending the same standard welcome email to an existing user makes little sense. Deterministic identity resolution, coupled with predefined segmentation and real-time syncing, can ensure that the user is not treated as a new user and that the communication they receive reflects that.
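As a small illustration, the decision could look like the sketch below. The function name, the message keys, and the idea of keeping known emails in a set are all assumptions; in practice this lookup would run against resolved profiles rather than an in-memory set.

```python
def pick_onboarding_message(signup_email: str, known_emails: set) -> str:
    """Choose a message based on whether the email already maps to a known user."""
    normalized = signup_email.strip().lower()
    if normalized in known_emails:
        return "new_workspace_for_existing_user"
    known_emails.add(normalized)
    return "welcome_new_user"


known = {"ana@example.com"}
first = pick_onboarding_message("Ana@Example.com", known)
second = pick_onboarding_message("bo@example.com", known)
print(first, second)
```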

A broader example that applies to pretty much all industries is notifying users when they log in to their account on a new device or from an unrecognized location. Since the system already has the user ID associated with specific IP addresses and device IDs, it can immediately recognize an unfamiliar combination and notify the user in real time.
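Here is one way this check might be sketched, assuming the system keeps a set of previously seen device IDs and IP addresses per user ID. The LoginMonitor class and the reason strings are hypothetical, and a real implementation would likely use coarser location data than a raw IP address.

```python
from collections import defaultdict


class LoginMonitor:
    """Remembers the devices and IP addresses seen for each user ID."""

    def __init__(self):
        self._devices = defaultdict(set)
        self._ips = defaultdict(set)

    def check(self, user_id: str, device_id: str, ip_address: str) -> list:
        """Return the reasons to notify the user; an empty list means no alert."""
        reasons = []
        first_login = user_id not in self._devices
        if not first_login:  # nothing to compare against on the very first login
            if device_id not in self._devices[user_id]:
                reasons.append("new_device")
            if ip_address not in self._ips[user_id]:
                reasons.append("new_ip")
        self._devices[user_id].add(device_id)
        self._ips[user_id].add(ip_address)
        return reasons


monitor = LoginMonitor()
alerts = [
    monitor.check("u-42", "laptop-1", "203.0.113.7"),   # first login
    monitor.check("u-42", "laptop-1", "203.0.113.7"),   # same device, same IP
    monitor.check("u-42", "phone-9", "203.0.113.7"),    # new device
    monitor.check("u-42", "phone-9", "198.51.100.2"),   # new IP
]
print(alerts)
```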

Privacy-friendly

Nobody needs a lesson in why a privacy-friendly approach is critical for businesses—the ramifications of not adhering to GDPR or CCPA can be brutal.

With deterministic matching, brands can be certain that if a user has opted out of receiving communication or wants to be forgotten, they are accurately identified across downstream systems (email, SMS, advertising channels, and so on), so that their preferences are honored and, where requested, their data is wiped clean from everywhere.
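The sketch below shows the shape of such a propagation, assuming a resolved profile that holds every identifier for the user and downstream systems that are each keyed by one of those identifiers. The system names, the SYSTEM_KEY mapping, and the dictionaries standing in for external tools are all hypothetical; real deletions would go through each vendor's own API.

```python
# Hypothetical mapping from downstream system to the identifier it is keyed by.
SYSTEM_KEY = {"email_tool": "email", "sms_tool": "phone", "ads_audience": "user_id"}


def forget_user(profile: dict, systems: dict) -> dict:
    """Delete the user's record from every system; report whether one was found."""
    results = {}
    for name, records in systems.items():
        identifier = profile.get(SYSTEM_KEY[name])
        results[name] = records.pop(identifier, None) is not None
    return results


# A resolved profile links all of the identifiers that belong to one person.
profile = {"user_id": "u-42", "email": "ana@example.com", "phone": "+15550100"}

downstream = {
    "email_tool": {"ana@example.com": {"subscribed": True}, "bo@example.com": {"subscribed": True}},
    "sms_tool": {"+15550100": {"opted_in": True}},
    "ads_audience": {"u-42": {"segment": "trial"}, "u-77": {"segment": "paid"}},
}

deleted = forget_user(profile, downstream)
print(deleted)
```

Without the resolved profile, the email tool, the SMS tool, and the ad audience would each need to be searched with a different and possibly incomplete identifier.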

Achieving this level of compliance in the absence of a CDP with deterministic identity resolution capabilities is far from trivial and can result in multiple violations along the way.

Which form of identity resolution is right for you?

The goal of this guide is to provide an overview of how identity resolution is achieved in different environments under different constraints, and hopefully, I’ve managed to do that.

These tips and suggestions are best suited to product, growth, and marketing use cases, primarily at B2B SaaS companies. This piece is also not meant to conclude that one approach is better than the other. Depending on your constraints, managing identity resolution in the data warehouse using fuzzy matching might work better for your business after all.

Learn more about identity resolution in the Amplitude CDP by speaking with a product expert.

About the Author
Arpit is growing databeats (databeats.community), a B2B media company, whose mission is to beat the gap between data people and non-data people for good.
