The Incrementality Problem: Proving AI's Revenue Impact in Sales Intel with Deepak Gupta

AI Summary

AI-generated summary, reviewed by editors

Deepak Gupta, a Google expert, exposes the 'incrementality problem' in AI sales intelligence. Discover why proving AI's true revenue impact is crucial for enterprise adoption. Learn how to build auditable, defensible AI systems that deliver real ROI, avoiding common measurement errors and ensuring your AI investments truly pay off.

Enterprise spending on generative AI hit roughly $37 billion in 2025, up from $11.5 billion the year before, a 3.2x year-over-year jump that has placed AI inside almost every commercial workflow that touches revenue. By Deloitte's count, around 74 percent of organizations now hope to grow revenue through AI initiatives, but only about 20 percent say they actually are. McKinsey's data on EBIT impact tells a similar story: meaningful enterprise-wide bottom-line gains from AI remain rare, with only around 6 percent of respondents qualifying as high performers. The gap between what these systems are claimed to drive and what they can be proven to drive has become the central engineering and governance question of the next five years.

Deepak Gupta is a Senior Software Engineer at Google with more than a decade of experience building large-scale machine learning systems. His current work centers on AI-powered sales intelligence platforms used by global enterprise sales organizations: systems that ingest a wide range of campaign and account features, score them in real time, and surface revenue opportunities to sellers. The harder problem inside that work is not generating recommendations. It is proving that the recommendations actually caused the revenue.

We spoke with Deepak about why incrementality measurement has become the structural engineering problem inside enterprise AI, what separates a defensible system from one that just looks effective, and how he thinks about building infrastructure that survives an audit.

The phrase "incrementality problem" gets used a lot now. What does it mean inside an engineering team?

Incrementality is the gap between what your system did and what would have happened anyway. If a sales rep closes a major deal after acting on an AI recommendation, the system gets credit in the dashboard. But maybe that deal was going to close regardless. Maybe the rep would have found that account through normal pipeline review. The interesting number is not the conversion. It is the difference between the conversion rate when the system is on and the conversion rate when it is off, on comparable populations, measured cleanly. That difference is the only number that actually tells you whether the AI is doing work.

Most teams skip this and rely on observational metrics: engagement, click-through, last-touch attribution. Those numbers always look good because the system gets credit for outcomes it did not cause. The reason this matters now is that enterprises are scaling AI platforms based on those numbers, committing budget and restructuring sales teams. If the underlying lift is half what the dashboard says, you have built an organization on a measurement error. That is not hypothetical. It is what the McKinsey EBIT data is showing.
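As an editorial illustration of the on-versus-off comparison described above, here is a minimal sketch in Python. It is not Gupta's implementation; the function name `incremental_lift` and the example counts are hypothetical, and it assumes a simple randomized holdout with a normal-approximation confidence interval.

```python
from math import sqrt
from statistics import NormalDist


def incremental_lift(treated_conv, treated_n, control_conv, control_n, confidence=0.95):
    """Absolute lift in conversion rate (treatment minus control) with a Wald interval.

    Assumes accounts were randomly assigned and the two groups are independent.
    """
    p_t = treated_conv / treated_n
    p_c = control_conv / control_n
    lift = p_t - p_c
    # Standard error of a difference between two independent proportions.
    se = sqrt(p_t * (1 - p_t) / treated_n + p_c * (1 - p_c) / control_n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return lift, (lift - z * se, lift + z * se)


# Hypothetical counts: 120 of 1,000 treated accounts convert vs. 90 of 1,000 held out.
lift, (low, high) = incremental_lift(120, 1000, 90, 1000)
print(f"lift={lift:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

In this made-up example a dashboard would report a 12 percent conversion rate, while the defensible number is the three-point difference and the interval around it.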

You came into this work from an ML systems background. How did causal measurement become the part of the job that mattered most?

The platforms we build for enterprise sales process a wide range of campaign and account features in real time, integrate with auction data, search trend signals, and internal CRM systems, and surface ranked recommendations to a global seller population. The model side is well-understood. You can hire experienced ML engineers, train good models, ship them. That is not where these projects fail. They fail when leadership asks how much revenue the system is actually generating, and the team cannot answer with a defensible number.

What I learned over the last few years is that the incrementality layer is its own engineering discipline. You need controlled experiments built into the platform from day one, not retrofitted. You need holdout populations large enough for statistical power but small enough that you are not leaving revenue on the table. You need a measurement pipeline an independent reviewer can audit and reproduce. The tooling around this is far less mature than the modeling tooling, which means the engineering team has to build a lot of it. That is the part of the work that decides whether the platform survives its second budget cycle.
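The holdout-sizing trade-off he describes can be made concrete with a standard two-proportion power calculation. The sketch below is illustrative only: `accounts_per_arm` is a hypothetical helper, the baseline rate and target lift are invented, and real sales outcomes are usually noisier than a simple conversion flag.

```python
from math import ceil
from statistics import NormalDist


def accounts_per_arm(baseline_rate, min_lift, alpha=0.05, power=0.8):
    """Approximate accounts needed in each arm to detect an absolute lift.

    Uses the usual normal-approximation formula for a two-sided
    two-proportion test with equal arm sizes.
    """
    p_control = baseline_rate
    p_treated = baseline_rate + min_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_lift ** 2)


# Hypothetical inputs: 9% baseline conversion, smallest lift worth detecting is 3 points.
print(accounts_per_arm(0.09, 0.03))
# Halving the detectable lift roughly quadruples the required population.
print(accounts_per_arm(0.09, 0.015))
```

Because the required size grows with the inverse square of the lift, the choice of the smallest lift worth detecting largely determines how much revenue the holdout forgoes.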

What does the architecture for that look like? Where does causal measurement live in the stack?

It has to live below the application layer, in the same infrastructure that handles model serving and recommendation ranking. If measurement is a separate system bolted on after the fact, you cannot guarantee the holdouts are honest. The serving system has to know which accounts are in the treatment group and which are in the control, and it has to enforce that assignment consistently across every surface where recommendations get exposed. If a seller in the holdout group gets the recommendation through a different channel because the assignment did not propagate, your experiment is contaminated and you do not even know it.

The pattern that works is to treat the experimental assignment as a first-class identifier in the data model, alongside account ID and campaign ID. Every event flowing through the pipeline carries it. Every model that scores an opportunity respects it. Every report that aggregates outcomes filters by it. That sounds straightforward, but it requires substantial coordination across teams that historically did not need to think about experimental design. ML teams have to accept that their model output gets gated by an assignment they did not control. Reporting teams have to accept that their dashboards now have a second set of numbers, the lift numbers, that often look smaller than the headline numbers and have to be presented honestly.
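One way to picture assignment as a first-class identifier is a deterministic hash of the account and experiment IDs, attached to every event. This is a sketch under stated assumptions, not the platform's actual design: `assign_arm`, `RecommendationEvent`, `serve`, and the 10 percent holdout fraction are all hypothetical.

```python
import hashlib
from dataclasses import dataclass

HOLDOUT_FRACTION = 0.10  # assumed share of accounts kept in control


def assign_arm(account_id, experiment_id, holdout_fraction=HOLDOUT_FRACTION):
    """Deterministically map an account to 'control' or 'treatment'.

    Hashing (experiment, account) gives every serving surface the same
    answer without a shared lookup table.
    """
    digest = hashlib.sha256(f"{experiment_id}:{account_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return "control" if bucket < holdout_fraction else "treatment"


@dataclass(frozen=True)
class RecommendationEvent:
    account_id: str
    campaign_id: str
    experiment_id: str
    arm: str  # travels with the event so reports can filter on it
    score: float


def serve(account_id, campaign_id, experiment_id, score):
    """Score is always logged; the recommendation is only shown to treatment."""
    arm = assign_arm(account_id, experiment_id)
    event = RecommendationEvent(account_id, campaign_id, experiment_id, arm, score)
    return event, arm == "treatment"


event, shown = serve("acct-42", "camp-7", "exp-2025-q1", 0.83)
print(event.arm, shown)
```

Keeping the gate inside the serving function, rather than in each channel, is what prevents a held-out account from receiving the recommendation through a side door.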

Where does this break? What goes wrong when teams add this discipline to a platform already in production?

The biggest failure mode is what people in the experimentation field call divergent delivery. You think you are running a clean A/B test, but the system underneath is making targeting decisions that correlate with the treatment assignment. The treatment group ends up systematically different from the control group, not because of randomization, but because the model is steering itself toward populations it thinks will convert. You measure a huge lift. You celebrate. Six months later, an independent audit team reruns the analysis with proper randomization and finds the lift was a fraction of what was reported.
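A common first check for this kind of drift is to compare the two groups on a covariate measured before the experiment started. The sketch below computes a standardized mean difference; the function name, the sample spend figures, and the 0.1 threshold (a widely used rule of thumb) are assumptions for illustration rather than anything described in the interview.

```python
from math import inf, sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in means scaled by the pooled standard deviation.

    Intended for covariates measured BEFORE assignment, such as prior-year
    spend. Under honest randomization this should sit close to zero.
    """
    diff = mean(treated) - mean(control)
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0 if diff == 0 else inf
    return diff / pooled_sd


# Hypothetical pre-period spend (in $k) for accounts that ended up in each group.
treated_spend = [20, 22, 24, 26, 28]
control_spend = [10, 12, 14, 16, 18]

smd = standardized_mean_difference(treated_spend, control_spend)
if abs(smd) > 0.1:
    print(f"imbalance detected (SMD={smd:.2f}): delivery may be correlated with assignment")
```

A large value does not say how the groups diverged, only that the comparison can no longer be read as a randomized one.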

The second failure mode is sample size pressure. Enterprise sales is a low-event domain compared to consumer advertising. You are working with a finite, relatively small population of accounts globally, not the population scale of consumer products. Detecting a real lift on a metric with high natural variance requires longer test windows than product managers want to wait. The temptation is to declare results too early, or to widen the metric definition until the lift looks significant. Both are how you end up with a system that passes internal review and fails an external one.
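The cost of declaring results early can be shown with a small simulation: run many A/A tests (no true lift) and stop the first time an interim look crosses the significance threshold. This is an editorial sketch with invented parameters; `false_positive_rate` is a hypothetical helper and the conversion rate, arm size, and number of looks are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n_sims, n_per_arm, looks, rate=0.1, alpha=0.05, seed=0):
    """Share of A/A tests declared 'significant' when checked `looks` times."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    checkpoints = [n_per_arm * (i + 1) // looks for i in range(looks)]
    hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = seen = 0
        for checkpoint in checkpoints:
            # Both arms share the same true rate, so any "lift" is noise.
            for _ in range(checkpoint - seen):
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            seen = checkpoint
            pooled = (conv_a + conv_b) / (2 * seen)
            se = sqrt(2 * pooled * (1 - pooled) / seen)
            if se > 0 and abs(conv_a - conv_b) / seen / se > z_crit:
                hits += 1
                break  # the team stops and declares a win
    return hits / n_sims


single_look = false_positive_rate(n_sims=400, n_per_arm=500, looks=1)
ten_looks = false_positive_rate(n_sims=400, n_per_arm=500, looks=10)
print(f"one look: {single_look:.2f}, ten looks: {ten_looks:.2f}")
```

With one look the false-positive rate should land near the nominal 5 percent; with repeated looks and early stopping it rises well above that, which is one reason an independent rerun can shrink a reported lift.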

Who actually audits these systems? Is this the engineering team's job, or is there an outside check?

Inside large organizations, there is usually a separate function that exists to evaluate revenue claims independently of the team that built the system. The naming varies, but the principle is the same. They re-run the analysis, validate the methodology, and certify the lift number that gets reported to executive leadership and used in budget decisions. If your system cannot survive that review, it does not matter how good the engineering inside the platform is. The number that ends up on the slide is the audited number, not the one your team computed.

That external audit function is starting to look a lot like what financial auditors do for revenue recognition. The methodology has to be reproducible. The data has to be preserved. The assumptions have to be documented. You cannot point to a model output saying an account is a major opportunity and have that count. You have to show the controlled experiment, the statistical test, the confidence interval, and the methodological choices that justify the claim. The EU AI Act is pushing this further. By the time the high-risk obligations are fully enforced in 2026, the systems that get to keep operating in regulated sectors will be the ones whose owners can answer the audit question on demand.

The industry is also moving on attribution itself, with open-source causal inference frameworks, geo-holdout testing services, and a wave of new measurement vendors. How does that change what an internal engineering team needs to build?

Some of it commoditizes the basics. If you just need geo-level holdout testing for a marketing channel, vendors do that well, and you do not need to build it in-house. What does not commoditize is integration with your own data and decision systems. The attribution vendors do not know which accounts your sales team is talking to, what stage of the pipeline they are in, or what the prior interaction history looks like. That context lives in your CRM, your auction data, and your activity logs. Connecting it to the experimental framework is engineering work no external tool will do for you.

The other thing the vendor landscape does not solve is operational integration. Running an incrementality test on a static marketing campaign is one problem. Running continuous incrementality measurement on an AI platform that is generating recommendations at high volume, where the platform is being updated frequently and the population of users and accounts is shifting, is a different problem. You need a measurement infrastructure that can keep pace with the platform's rate of change. The teams that get this right early will have a structural advantage when the audit pressure intensifies.

What's still unsolved? What are you working on next?

The hardest open problem is sequential decision-making under causal constraints. Enterprise sales is not a single-shot recommendation. It is a sequence of touches over weeks or months, where each interaction changes the state of the relationship, and the AI is making suggestions at each step. Standard A/B testing assumes the treatment is independent of the outcome path. In a sequential system, that breaks. The recommendation at step three depends on what happened at steps one and two, which were themselves treatment-affected. Measuring the causal effect of the entire sequence is genuinely hard, and the tooling for it is early.

The work I want to do over the next few years is in that sequential layer. How do you design experiments that respect the temporal structure of sales engagement? How do you build infrastructure that attributes revenue to a sequence of decisions rather than a single touchpoint, in a way an auditor can still verify? Those are not modeling questions. They are infrastructure questions, and they sit at the intersection of experimentation, ML systems, and governance. The teams that solve them will define what defensible enterprise AI looks like for the next decade. That is the platform problem I am trying to be useful on.
