Your AI underwriting model is only as good as the data feeding it - and that's the problem


Rohan Mahajan | Tartan


May 18, 2026

10 Mins

Risk in AI underwriting - stale data

Table of Contents

What AI underwriting models actually need to work

The stale data problem in credit underwriting

The insurance version of the same problem

Why this is getting more urgent now

What verified, real-time employment data changes

The infrastructure question


India's BFSI sector is in the middle of a genuine AI moment. 56% of BFSI leaders anticipate significant AI integration in credit underwriting and risk modelling in the near term. 

Bajaj Finance has already originated substantial loan volumes through AI-assisted underwriting. Banks are building LLM-powered credit stacks. The ambition is real, the investment is real, and in several cases, the early results are real.

But there is a problem sitting underneath all of this that does not get discussed enough at the CTO or CDO level - and it is the kind of problem that only becomes fully visible after a model has been live long enough to disappoint.

The problem is the data.

Not the model architecture. Not the feature engineering. Not the choice of algorithm. 

The raw input data that the underwriting model is trained on and making decisions from - where it comes from, how current it is, how verified it is, and how much of it is still arriving through channels that were designed for a world before machine learning existed.

What AI underwriting models actually need to work

An AI credit or underwriting model is, at its core, a pattern recognition engine. It finds correlations between applicant characteristics and repayment behaviour, and uses those correlations to score new applicants. The quality of its predictions is directly bounded by the quality of the data it is trained on and the quality of the data it receives at inference time.

For a salaried lending model, the most predictive signals are employment status, income level, income stability, tenure with current employer, and designation. These are not exotic data points. Every HR system in the country maintains them as a matter of routine. The problem is how that data reaches the lender.

In most cases today, it arrives as a document. A salary slip the applicant downloaded and uploaded. An employment letter issued three months ago. A bank statement that shows net salary credits but tells the lender nothing about whether the person is still employed. 

The lender's model ingests these documents - sometimes via OCR, sometimes via manual entry - and makes a decision.

That decision is based on a snapshot of data that is weeks or months old, sourced from a document the applicant chose to share, in a format that may have been tampered with or simply may not reflect the current employment reality. 

The warning that scaling AI models will be constrained by foundational and governance issues means exactly this in practice. The model is sophisticated. The data pipeline feeding it is not.

The stale data problem in credit underwriting

Here is the scenario that plays out more often than most risk teams publicly acknowledge.

An applicant applies for a personal loan in October. They upload a salary slip from August showing ₹85,000 per month. The model scores them positively. The loan is approved and disbursed. In November, it emerges that the applicant was laid off in September - after the salary slip was issued, before the loan application was submitted.

The model did not fail. The data failed the model. Every signal the model received was technically accurate as of the date on the documents. None of it was accurate as of the date the decision was made.
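The gap in this scenario can be made explicit with a simple freshness check at decision time. The sketch below is illustrative only: the function names, the 30-day threshold, and the exact calendar dates are assumptions, not something the scenario specifies.

```python
from datetime import date

# Hypothetical policy threshold; a real lender would set its own.
MAX_DOCUMENT_AGE_DAYS = 30


def document_age_days(document_date: date, decision_date: date) -> int:
    """Days between the date on the document and the date of the decision."""
    return (decision_date - document_date).days


def is_stale(document_date: date, decision_date: date,
             max_age_days: int = MAX_DOCUMENT_AGE_DAYS) -> bool:
    """True when the document is older than the allowed age at decision time."""
    return document_age_days(document_date, decision_date) > max_age_days


# Illustrative dates for an August salary slip and an October application.
slip_date = date(2025, 8, 31)
decision_date = date(2025, 10, 15)

age = document_age_days(slip_date, decision_date)  # 45 days
stale = is_stale(slip_date, decision_date)         # True under a 30-day limit
```

A check like this only flags that the evidence is old. It cannot say what changed in the interval, which is why document age alone is not a substitute for current data.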

This is not a contrived edge case. Employment transitions - exits, layoffs, role changes, salary revisions - happen continuously. In a high-velocity lending environment processing thousands of applications a day, a meaningful percentage of applicants will have had a material employment change between the date of their most recent payslip and the date of their application. 

A model that cannot access current, verified employment data will systematically misprice this risk. Not because it is a bad model. Because it is working with old information.
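A rough back-of-the-envelope calculation shows how this scales. The sketch below assumes a constant, independent monthly probability of a material employment change. That is a simplification, and the 2% rate, 60-day payslip age, and 5,000 applications a day are made-up inputs chosen only to illustrate the arithmetic.

```python
def prob_change_since_payslip(monthly_change_rate: float,
                              payslip_age_days: int) -> float:
    """Probability of at least one material employment change since the payslip.

    Assumes a constant monthly change rate and treats a month as 30 days.
    """
    months_elapsed = payslip_age_days / 30
    return 1 - (1 - monthly_change_rate) ** months_elapsed


def expected_affected_per_day(applications_per_day: int,
                              monthly_change_rate: float,
                              payslip_age_days: int) -> float:
    """Expected number of daily applications scored on outdated employment data."""
    p = prob_change_since_payslip(monthly_change_rate, payslip_age_days)
    return applications_per_day * p


# Hypothetical inputs: 2% monthly change rate, payslip 60 days old.
p_change = prob_change_since_payslip(0.02, 60)          # about 0.0396
affected = expected_affected_per_day(5000, 0.02, 60)    # about 198 per day
```

Under these assumed inputs, roughly 4% of applications would carry outdated employment information. The real figure depends entirely on the lender's own portfolio and document-age distribution.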

"A model trained on verified, real-time employment data and a model trained on applicant-submitted documents are not two versions of the same thing. They are solving fundamentally different problems - one of them just doesn't know it yet."

The insurance version of the same problem

In group health insurance, the underwriting and pricing decision is made at the time the corporate policy is issued. The insurer assesses the covered population - headcount, age distribution, salary bands, industry type - and prices the premium accordingly.

The data for that assessment comes from the corporate client's HR team, usually as a one-time export at onboarding. From that point, the insurer's picture of the covered population drifts. New employees join. Others leave. Salaries change. The composition of the group shifts. But the insurer's data does not update in real time - it updates when the corporate HR team remembers to send an endorsement file, which rarely happens more often than monthly and is often quarterly.

The consequence is an underwriting model that is pricing risk on a population that no longer exists as described. Premiums are miscalculated. Claims ratios deviate from projections. Renewal pricing conversations become contentious because neither party has clean data on what the covered population actually looked like during the policy period.
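The drift described above can be measured by comparing the census the policy was priced on with the current roster. This is a minimal sketch: the employee IDs, the function name, and the drift ratio definition are all hypothetical, and a real comparison would also track salary and age-band changes.

```python
from typing import Dict, Set, Union


def census_drift(onboarding_ids: Set[str],
                 current_ids: Set[str]) -> Dict[str, Union[Set[str], float]]:
    """Compare the priced population with the population actually covered today."""
    joiners = current_ids - onboarding_ids   # covered now, never priced
    leavers = onboarding_ids - current_ids   # priced, no longer covered
    # Share of the original census affected by a join or an exit.
    drift_ratio = (
        (len(joiners) + len(leavers)) / len(onboarding_ids)
        if onboarding_ids else 0.0
    )
    return {"joiners": joiners, "leavers": leavers, "drift_ratio": drift_ratio}


# Made-up employee IDs for illustration.
priced_on = {"E1", "E2", "E3", "E4"}
covered_now = {"E2", "E3", "E4", "E5", "E6"}

drift = census_drift(priced_on, covered_now)
# joiners: E5 and E6, leavers: E1, drift_ratio: 0.75
```

With endorsement files arriving monthly or quarterly, this comparison can only be run after the fact. A live feed would let the same check run continuously.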

Again - not a model problem. A data currency problem.

Why this is getting more urgent now

Two things are happening simultaneously that make this problem more acute now than it was two years ago.

First, AI underwriting models are making faster decisions with less human review. The entire value proposition of AI in credit is speed - underwriting in minutes rather than days, automated approvals, reduced ops overhead. But faster decisioning with stale data does not produce better outcomes. It produces the same bad outcomes, faster, at higher volume. The risk of systematic errors scales with the throughput of the model.

Second, the Digital Personal Data Protection Act has changed the game on consent: consent must now be granular in purpose definition, revocable, and traceable, and data minimisation must be strictly applied. Applicant-submitted documents - salary slips emailed over, bank statements uploaded to a portal - exist in a grey zone under this framework.

Data pulled directly from an employer's HRMS via a consent-based API, with explicit authorisation and a full audit trail, is on solid ground. As regulatory scrutiny of AI decisioning increases, the provenance of the input data matters more, not less.
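One way to picture purpose-bound, revocable consent with data minimisation is as a record that gates every data pull. The sketch below is an illustration of the idea, not a compliance implementation and not any vendor's API. The class, field names, and purpose string are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical consent artefact; each field doubles as an audit trail entry."""
    applicant_id: str
    purpose: str                      # granular purpose definition
    allowed_fields: FrozenSet[str]    # data minimisation: only these may be read
    granted_at: datetime
    revoked_at: Optional[datetime] = None

    def is_active(self, at: datetime) -> bool:
        """Consent is valid from grant until revocation, if any."""
        if at < self.granted_at:
            return False
        return self.revoked_at is None or at < self.revoked_at


def minimise(record: Dict[str, Any], consent: ConsentRecord,
             purpose: str, at: datetime) -> Dict[str, Any]:
    """Return only the consented fields, or refuse if consent does not apply."""
    if consent.purpose != purpose or not consent.is_active(at):
        raise PermissionError("no active consent for this purpose")
    return {k: v for k, v in record.items() if k in consent.allowed_fields}


consent = ConsentRecord(
    applicant_id="A-001",
    purpose="personal_loan_underwriting",
    allowed_fields=frozenset({"status", "monthly_salary_inr"}),
    granted_at=datetime(2025, 10, 15, 9, 0),
)
hrms_record = {"status": "active", "monthly_salary_inr": 85000, "manager": "X"}
shared = minimise(hrms_record, consent, "personal_loan_underwriting",
                  datetime(2025, 10, 15, 9, 5))
# shared contains status and monthly_salary_inr only
```

An uploaded salary slip has no equivalent of `allowed_fields` or `revoked_at`: once the document is shared, everything on it is shared.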

What verified, real-time employment data changes

When a lender or insurer's underwriting model has access to live, API-sourced employment data - pulled directly from the employer's HRMS at the moment of the application, with the applicant's consent - several things change at once.

Employment status is confirmed as of today, not as of the last payslip date. Current salary is verified against the employer's payroll records, not a document the applicant provided. Tenure is calculated precisely. Department and designation are confirmed. Any recent changes - a salary revision, a role change, a notice period - are visible.

The model is now working from ground truth. Not a document that approximates ground truth. Not a bank statement that implies income without confirming employment. The actual, current state of the applicant's employment relationship, sourced from the system the employer uses to manage it.
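As a rough illustration, a live employment snapshot could be turned into model inputs along these lines. The `EmploymentSnapshot` shape, the status values, and the feature names are assumptions made for this sketch, not a documented schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict


@dataclass
class EmploymentSnapshot:
    """Hypothetical employment record as of the moment of application."""
    status: str               # assumed values: "active", "on_notice", "exited"
    monthly_salary_inr: int
    date_of_joining: date
    as_of: date


def tenure_months(snapshot: EmploymentSnapshot) -> int:
    """Completed months between joining date and the snapshot date."""
    doj, as_of = snapshot.date_of_joining, snapshot.as_of
    months = (as_of.year - doj.year) * 12 + (as_of.month - doj.month)
    if as_of.day < doj.day:
        months -= 1  # the current month is not yet complete
    return max(months, 0)


def underwriting_features(snapshot: EmploymentSnapshot) -> Dict[str, Any]:
    """Derive model inputs from the snapshot rather than from a document."""
    return {
        "is_employed": snapshot.status in {"active", "on_notice"},
        "on_notice": snapshot.status == "on_notice",
        "monthly_salary_inr": snapshot.monthly_salary_inr,
        "tenure_months": tenure_months(snapshot),
    }


snapshot = EmploymentSnapshot(
    status="on_notice",
    monthly_salary_inr=85000,
    date_of_joining=date(2022, 3, 15),
    as_of=date(2025, 10, 10),
)
features = underwriting_features(snapshot)
# tenure_months is 42; on_notice is True, a signal no payslip would carry
```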

The downstream impact on model performance is significant. AI-powered underwriting could reduce default rates by 25% - but that figure assumes quality input data.

A model making faster decisions on better-verified data will outperform a model making faster decisions on applicant-submitted documents, on every metric that matters to a risk head: default rate, approval rate among genuinely eligible applicants, and the ability to explain decisions to a regulator.

The infrastructure question

Getting real-time employment data into an underwriting pipeline requires connecting to the HRMS platforms that Indian employers use - and there are more than 50 of them in active use across the country. Darwinbox, GreytHR, Keka, SAP SuccessFactors, Oracle HCM, Zoho People, and dozens of others. Each with its own API structure, authentication method, and data schema.

Building direct integrations with each of these is not a realistic path for a lender or insurer whose core competency is risk, not API infrastructure. The more tractable approach is a unified employment data API - a single integration point that covers the full diversity of HRMS platforms and returns standardised, normalised employment data regardless of which system the employer uses.
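The normalisation idea can be sketched as one adapter per source system, each mapping its own payload into a shared schema. The two payload shapes below are invented for illustration and do not represent the actual API of any HRMS vendor or of any unified API product.

```python
from typing import Any, Callable, Dict

Normalised = Dict[str, Any]


def from_vendor_a(payload: Dict[str, Any]) -> Normalised:
    """Hypothetical vendor A: flat payload, monthly pay as a string."""
    return {
        "employee_id": payload["empId"],
        "status": payload["employmentStatus"].lower(),
        "monthly_salary_inr": int(payload["monthlyCtc"]),
    }


def from_vendor_b(payload: Dict[str, Any]) -> Normalised:
    """Hypothetical vendor B: nested payload, annual pay, boolean status."""
    employee = payload["employee"]
    return {
        "employee_id": employee["id"],
        "status": "active" if employee["is_active"] else "exited",
        "monthly_salary_inr": round(payload["compensation"]["annual"] / 12),
    }


# Registry of adapters; supporting a new HRMS means adding one entry.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Normalised]] = {
    "vendor_a": from_vendor_a,
    "vendor_b": from_vendor_b,
}


def normalise(source: str, payload: Dict[str, Any]) -> Normalised:
    """Return the same schema regardless of which system the employer uses."""
    try:
        adapter = ADAPTERS[source]
    except KeyError:
        raise ValueError(f"unsupported HRMS source: {source}") from None
    return adapter(payload)


record_a = normalise("vendor_a", {
    "empId": "E42", "employmentStatus": "ACTIVE", "monthlyCtc": "85000",
})
record_b = normalise("vendor_b", {
    "employee": {"id": "E42", "is_active": True},
    "compensation": {"annual": 1020000},
})
# Both calls yield the same normalised record.
```

The underwriting model only ever sees the normalised record. Multiplying the adapter work by the number of HRMS platforms in use is what makes building this in-house unattractive.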

This is what Tartan's HyperSync provides. A consent-driven, real-time employment data layer covering 80+ HRMS platforms - built specifically for BFSI use cases where data currency, verifiability, and audit trail are non-negotiable. Lenders and insurers integrating HyperSync can pull verified employment and income data at the point of application, feed it directly into their underwriting models, and make decisions on current ground truth rather than documents of unknown age and provenance.

The AI underwriting models being built across India's BFSI sector are, in many cases, genuinely sophisticated. The data infrastructure feeding them is, in most cases, not keeping up. Closing that gap is not an AI problem - it is a data pipeline decision. And it is one that risk heads and CDOs can make today, without waiting for the next model retraining cycle to find out how much the stale data has been costing them.
