The Hidden Cost of Duplicate Users: A Data Engineering Playbook

Duplicate user profiles silently kill lifecycle program performance. The four common causes, the math on the cost, and a clean fix.

Most product-led SaaS companies have a duplicate user problem they do not know they have. The lifecycle program runs as if every profile in Customer.io represents a unique person. In reality, a meaningful fraction — often 8 to 15 percent — represents the same person captured under multiple identities, or multiple people merged into a single profile. The cost is invisible at first because the symptoms look like normal underperformance: lower-than-expected open rates, inflated unsubscribe rates, segment counts that drift from the warehouse. The cause is identity resolution, and the fix is a data engineering problem, not a marketing one.

TL;DR

  • Duplicate user profiles are a near-universal problem in PLG SaaS, typically affecting 8-15% of the active base.
  • Four causes account for almost all duplicates: missing anonymous-to-authenticated stitching, email-based identity collisions, multi-device unmerged profiles, and account-level versus user-level confusion in B2B.
  • The cost compounds: inflated send volume, lower deliverability, distorted segments, broken attribution, degraded AI Decisioning input quality.
  • The fix is a one-time data engineering project plus an ongoing identity resolution discipline. Both belong in the schema, not the campaign platform.

How duplicates form

The four mechanisms, in rough order of frequency:

  1. Anonymous-to-authenticated stitching fails. A user lands on the marketing site, gets an anonymous_id, browses for two weeks, and signs up. If the identify call at signup does not include the prior anonymous_id, the CDP creates a new authenticated profile and orphans the anonymous one. Now there are two profiles for the same person — and any pre-signup behavior is lost to the lifecycle program.
  2. Email-based identity collisions. Some teams use email as the primary identifier. Then a user changes their email, and Customer.io creates a new profile. Or two family members share an email, and a single profile represents two people. Or a typo at signup creates a profile that the corrected signup attempt never merges with.
  3. Multi-device profiles never merge. A user signs up on web, downloads the mobile app, and authenticates with the same account. If the SDKs are not configured to send the same user_id from every device, the same user shows up as two profiles. This is more common than teams realize because mobile and web instrumentation are usually owned by different engineers.
  4. Account-level versus user-level confusion in B2B. A workspace admin invites teammates. The admin's profile has account-level events; the teammates have user-level events. If the schema does not distinguish the two, the lifecycle team ends up building segments that unknowingly mix accounts and users.
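
To make the first mechanism concrete, here is a minimal sketch of a toy profile store. ToyProfileStore and its track and identify methods are hypothetical stand-ins, not any CDP's real API; the only point is to show how an identify call without the prior anonymous_id leaves two profiles behind.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Profile:
    profile_id: int
    user_id: Optional[str] = None
    anonymous_ids: Set[str] = field(default_factory=set)
    events: List[str] = field(default_factory=list)


class ToyProfileStore:
    """Hypothetical, heavily simplified model of identity resolution."""

    def __init__(self) -> None:
        self.profiles: List[Profile] = []

    def _new_profile(self) -> Profile:
        profile = Profile(profile_id=len(self.profiles) + 1)
        self.profiles.append(profile)
        return profile

    def _by_user_id(self, user_id: str) -> Optional[Profile]:
        return next((p for p in self.profiles if p.user_id == user_id), None)

    def _by_anonymous_id(self, anonymous_id: str) -> Optional[Profile]:
        return next(
            (p for p in self.profiles if anonymous_id in p.anonymous_ids), None
        )

    def track(self, anonymous_id: str, event: str) -> None:
        # Pre-signup activity is attached to an anonymous profile.
        profile = self._by_anonymous_id(anonymous_id) or self._new_profile()
        profile.anonymous_ids.add(anonymous_id)
        profile.events.append(event)

    def identify(self, user_id: str, anonymous_id: Optional[str] = None) -> Profile:
        profile = self._by_user_id(user_id)
        if profile is None and anonymous_id is not None:
            # Stitching: reuse the anonymous profile instead of creating a new one.
            profile = self._by_anonymous_id(anonymous_id)
        if profile is None:
            profile = self._new_profile()
        profile.user_id = user_id
        if anonymous_id is not None:
            profile.anonymous_ids.add(anonymous_id)
        return profile


def profiles_after_signup(stitch: bool) -> int:
    store = ToyProfileStore()
    store.track("anon-123", "viewed_pricing")
    store.track("anon-123", "read_docs")
    store.identify("user-42", anonymous_id="anon-123" if stitch else None)
    return len(store.profiles)


broken_count = profiles_after_signup(stitch=False)   # orphaned anonymous profile
stitched_count = profiles_after_signup(stitch=True)  # one profile with full history
print(broken_count, stitched_count)
```

A real CDP also has to handle the case where both identifiers already belong to different profiles; this sketch deliberately ignores that.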

Identity resolution failures look like marketing problems but they are data architecture problems. The campaign platform is the place where you discover them, not the place to fix them.

The cost, with rough math

Assume a SaaS company with 200,000 active users in Customer.io and a 10% duplicate rate: 20,000 extra profiles, meaning 20,000 real users are each counted twice and the workspace holds 220,000 profiles.

  • Inflated send volume. Every campaign sends to 220,000 profiles instead of 200,000. At scale, that is a 10% premium on profile-based pricing, plus 10% of every email's deliverability budget consumed by duplicates that will never engage with both copies.
  • Lower aggregate deliverability. Duplicate profiles often have stale or wrong email addresses. They drag down sender reputation. Every campaign that goes to the duplicates degrades the inbox placement for the legitimate sends.
  • Distorted segments. A "highly engaged" segment of 50,000 might really be 45,000 unique users plus 5,000 duplicates of users already in the segment. The lifecycle team builds campaigns to a segment size that is partly fictitious.
  • Broken attribution. Conversions on one profile do not credit pre-conversion behavior on the other. Attribution models underweight the channels that drove early-funnel engagement because the early-funnel signal is on the orphaned anonymous profile.
  • Degraded AI Decisioning quality. Customer.io's AI Decisioning treats each profile as an independent decision unit. Duplicates mean the model is making decisions about the same person twice, with partial behavioral history on each profile.

The compounded cost across these is hard to quantify precisely, but it routinely amounts to a high single-digit percentage of total program ROI. For a Series B PLG company spending $30K per month on lifecycle infrastructure, plus ad spend and team costs, that is a meaningful number.
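
The arithmetic behind the send-volume and segment bullets can be reproduced in a few lines. This is a rough sketch using only the illustrative figures from this section (200,000 active users, a 10% duplicate rate, a 50,000-profile segment with 5,000 duplicates); the variable names are made up for the example.

```python
real_users = 200_000
duplicate_rate = 0.10  # extra profiles as a fraction of real users

extra_profiles = round(real_users * duplicate_rate)
total_profiles = real_users + extra_profiles

# Premium paid on profile-based pricing and on every send.
send_premium = extra_profiles / real_users

# Share of all stored profiles that are duplicates (smaller than 10%,
# because the denominator includes the duplicates themselves).
duplicate_share = extra_profiles / total_profiles

segment_profiles = 50_000
segment_duplicates = 5_000
segment_unique_users = segment_profiles - segment_duplicates

print(f"profiles per send:   {total_profiles:,}")
print(f"send premium:        {send_premium:.1%}")
print(f"share of profiles:   {duplicate_share:.1%}")
print(f"real segment size:   {segment_unique_users:,}")
```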

How to detect duplicates in your data

Three queries will tell you whether you have a duplicate problem and how bad it is:

Query 1: Count distinct user_ids per email. In your warehouse, group by email and count distinct user_id. Any email with more than one user_id is a candidate duplicate. Look at the distribution. If 5% or more of emails have multiple user_ids, you have a meaningful problem.
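
A minimal sketch of Query 1, run against an in-memory SQLite database so it is self-contained. The users table, its columns, and the sample rows are assumptions for illustration; adapt the SQL to your warehouse dialect. Lowercasing and trimming the email before grouping is an extra assumption, added so that case and whitespace variants count as the same address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [
        ("u1", "ana@example.com"),
        ("u2", "ana@example.com"),   # same email, second user_id
        ("u3", "bo@example.com"),
        ("u4", "Cy@example.com"),
        ("u5", "cy@example.com "),   # case and whitespace variant
    ],
)

DUPLICATE_EMAILS_SQL = """
SELECT LOWER(TRIM(email)) AS email_key,
       COUNT(DISTINCT user_id) AS n_user_ids
FROM users
GROUP BY LOWER(TRIM(email))
HAVING COUNT(DISTINCT user_id) > 1
ORDER BY email_key
"""

duplicate_emails = conn.execute(DUPLICATE_EMAILS_SQL).fetchall()
total_emails = conn.execute(
    "SELECT COUNT(DISTINCT LOWER(TRIM(email))) FROM users"
).fetchone()[0]

# Share of distinct emails that map to more than one user_id.
duplicate_email_share = len(duplicate_emails) / total_emails

print(duplicate_emails)
print(f"{duplicate_email_share:.0%} of emails have multiple user_ids")
```

Shared inboxes will also show up here, so treat the result as a list to review rather than a list to merge blindly.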

Query 2: Count orphaned anonymous_ids. In Segment or your CDP, count anonymous profiles with at least 30 days of activity that never received an identify call. These are pre-signup users who either never converted or whose identify call failed.
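
Query 2 could look roughly like the sketch below. It assumes a hypothetical events table with anonymous_id, event_type, and ts columns, and it reads "at least 30 days of activity" as a span of 30 or more days between the first and last event. Both are assumptions; your CDP's export schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (anonymous_id TEXT, event_type TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("a1", "page", "2024-01-01"),
        ("a1", "page", "2024-02-15"),      # 45 days of activity, never identified
        ("a2", "page", "2024-01-01"),
        ("a2", "identify", "2024-02-10"),  # stitched at signup
        ("a2", "page", "2024-02-20"),
        ("a3", "page", "2024-01-01"),
        ("a3", "page", "2024-01-05"),      # only 4 days of activity
    ],
)

ORPHANED_ANONYMOUS_SQL = """
SELECT anonymous_id
FROM events
GROUP BY anonymous_id
HAVING SUM(CASE WHEN event_type = 'identify' THEN 1 ELSE 0 END) = 0
   AND julianday(MAX(ts)) - julianday(MIN(ts)) >= 30
ORDER BY anonymous_id
"""

orphaned = [row[0] for row in conn.execute(ORPHANED_ANONYMOUS_SQL)]
print(orphaned)
```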

Query 3: Check cross-device coverage. For users who have engaged on both web and mobile, check whether they appear as one profile or two in Customer.io. If you find two-profile users, mobile and web instrumentation are not unified.
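
One way to sketch Query 3, assuming you can export profile activity with a trusted person-level key from your backend. The profile_activity table and the person_key column are hypothetical names; the idea is to restrict to people seen on both platforms and count how many profiles each one has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_activity (profile_id TEXT, person_key TEXT, platform TEXT)"
)
conn.executemany(
    "INSERT INTO profile_activity VALUES (?, ?, ?)",
    [
        ("p1", "k1", "web"),
        ("p1", "k1", "mobile"),  # one profile across both platforms
        ("p2", "k2", "web"),
        ("p3", "k2", "mobile"),  # same person, two profiles
        ("p4", "k3", "web"),     # web only, out of scope for this check
    ],
)

CROSS_DEVICE_SQL = """
SELECT person_key, COUNT(DISTINCT profile_id) AS n_profiles
FROM profile_activity
WHERE platform IN ('web', 'mobile')
GROUP BY person_key
HAVING COUNT(DISTINCT platform) = 2
ORDER BY person_key
"""

cross_device = conn.execute(CROSS_DEVICE_SQL).fetchall()
split_people = [key for key, n_profiles in cross_device if n_profiles > 1]

print(cross_device)
print("people split across profiles:", split_people)
```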

The fix: identity resolution as a schema decision

The fix is not in Customer.io. Customer.io can merge profiles after the fact, but the merge logic in the campaign platform is downstream of the cause. The fix is in the schema and the CDP.

  1. Standardize the identity model. Decide on a primary identifier — usually user_id, generated server-side at signup. Email is not a primary identifier. Decide on the relationship between user_id, account_id, and anonymous_id. Document it.
  2. Fix the identify call at signup. The single highest-leverage fix. At signup, the identify call must include the prior anonymous_id. This is the call that stitches pre-signup behavior to the new authenticated profile.
  3. Standardize cross-device identity. Mobile SDKs and web SDKs must use the same user_id after authentication. This is usually a one-line configuration change per SDK, but it requires that mobile and web teams coordinate.
  4. Run a one-time deduplication. After the schema is fixed, run a deduplication pass on existing profiles. Customer.io supports profile merges via its API.
  5. Monitor continuously. Add a weekly check to data ops: count duplicates created in the last 7 days. If the number is non-zero, the schema fix did not fully take, and there is still a leak somewhere.
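
Steps 4 and 5 can be sketched together. The example below assumes each exported profile carries a hypothetical person_key (the canonical user_id chosen in step 1) and a creation date, keeps the oldest profile in each group as the primary, and flags duplicates created within the last 7 days. It only produces a plan; the actual merge call against Customer.io's API is intentionally left out and should follow their documentation.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical export of profiles; field names are assumptions.
profiles = [
    {"profile_id": "p1", "person_key": "u1", "created": date(2023, 3, 1)},
    {"profile_id": "p2", "person_key": "u1", "created": date(2024, 5, 28)},
    {"profile_id": "p3", "person_key": "u2", "created": date(2024, 1, 10)},
    {"profile_id": "p4", "person_key": "u3", "created": date(2024, 2, 2)},
    {"profile_id": "p5", "person_key": "u3", "created": date(2024, 2, 9)},
]


def merge_plan(profiles):
    """Return (primary_id, secondary_id) pairs, oldest profile as primary."""
    groups = defaultdict(list)
    for profile in profiles:
        groups[profile["person_key"]].append(profile)

    plan = []
    for key in sorted(groups):
        members = sorted(groups[key], key=lambda p: p["created"])
        primary = members[0]
        for secondary in members[1:]:
            plan.append((primary["profile_id"], secondary["profile_id"]))
    return plan


def new_duplicates(profiles, today, window_days=7):
    """Duplicate profiles created within the last `window_days` days."""
    created = {p["profile_id"]: p["created"] for p in profiles}
    cutoff = today - timedelta(days=window_days)
    return [sec for _, sec in merge_plan(profiles) if created[sec] >= cutoff]


plan = merge_plan(profiles)
recent = new_duplicates(profiles, today=date(2024, 6, 1))

print(plan)
print("new duplicates this week:", recent)
```

Keeping the oldest profile as the primary is a design choice for the sketch, not a rule; some teams prefer the profile with the most complete attributes.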

What to do next

Run the three detection queries first. If you find duplicates above 5% of active users, the fix is worth the project. If you find them below 5%, monitor and prioritize other work first.

Key takeaways

  • 8-15% duplicate rate is typical and almost always invisible.
  • The four causes are anonymous stitching, email collisions, multi-device, and B2B account-versus-user.
  • The cost compounds across send volume, deliverability, segments, attribution, and AI Decisioning quality.
  • The fix is in the schema and CDP, not the campaign platform.