Detecting fraud using Tartan Payslip OCR

Pay stubs, also known as payslips, are frequently used by lenders as a form of income verification when evaluating your creditworthiness. As a working employee, or someone who has worked in the past, you have likely encountered one before. Typically, pay stubs provide information about an employee's earnings for a specific period, including details on tax deductions, insurance premiums, social security numbers, and more. They may be available in either paper or digital formats and are distributed by email or by post.

Lenders today typically receive scanned or digital copies of these payslips and then enter the details manually into their systems to issue loans and similar products. Such processes are time-consuming, tedious, and prone to frequent errors. Scraping these payslips automatically could cut that time to a few seconds and delight customers with faster loan processing.

In this blog, we will examine a few techniques for automating the extraction of information from payslips (referred to as Payslip OCR or Payslip PDF extract) using Optical Character Recognition (OCR) and converting it into structured data. We will also explore the common challenges involved in developing a reliable OCR system integrated with machine learning and deep learning models.

Extracting text from Payslips using OCR

OCR is a computer algorithm that can convert images of typed or handwritten text into a text format. While there are several free and open-source OCR tools available on platforms like GitHub, such as Tesseract, Ocropus, and Kraken, they have certain limitations. For example, Tesseract is accurate at extracting organised text but struggles with unstructured data, while other OCR tools have limitations based on fonts, language, alignment, templates, etc. An ideal OCR for extracting information from payslips should be able to extract all the necessary fields despite any or all of these limitations.

Before setting up an OCR, let's examine the standard fields we need to extract from a payslip document.

Gross salary
Net salary
Bank account
Employee name
Employee number
Employee address
Salary period
Date of birth
Employer name
Employer address
Days worked
Hours worked
In / out service date
Hourly rate
Tax rate
Date of issue
PAN card
Aadhaar card

OCR operates blindly: it recognizes neither the type of document it is extracting data from nor the fields and identifiers mentioned within it. This must be kept in mind before setting up an OCR engine and examining its output. Free and open-source OCR engines (Tesseract, for example) come with their own limitations and require a lot of post-processing to pick up missed but relevant fields and arrange them in a structured way. Machine learning and deep learning models play a critical role in this process, as they can intelligently identify the location of the fields within the document and extract all the necessary values.
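To make the post-processing step concrete, here is a minimal sketch that turns raw OCR text into a few of the structured fields listed above using regular expressions. The label patterns, the function name extract_fields, and the sample text are all assumptions made for illustration; real payslips vary widely by employer and template, which is exactly why learned models are usually needed on top of rules like these.

```python
import re

# Hypothetical label patterns; real payslip layouts and labels differ per employer.
FIELD_PATTERNS = {
    "employee_name": re.compile(r"Employee\s+Name\s*[:\-]\s*(.+)", re.IGNORECASE),
    "gross_salary": re.compile(
        r"Gross\s+(?:Salary|Pay)\s*[:\-]\s*(\d[\d,]*(?:\.\d+)?)", re.IGNORECASE
    ),
    "net_salary": re.compile(
        r"Net\s+(?:Salary|Pay)\s*[:\-]\s*(\d[\d,]*(?:\.\d+)?)", re.IGNORECASE
    ),
    # A PAN has 10 characters: five letters, four digits, one letter.
    "pan": re.compile(r"\b([A-Z]{5}[0-9]{4}[A-Z])\b"),
}
AMOUNT_FIELDS = {"gross_salary", "net_salary"}


def extract_fields(ocr_text):
    """Map raw OCR text to a dict of fields; fields that are not found are None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match is None:
            result[name] = None
            continue
        value = match.group(1).strip()
        if name in AMOUNT_FIELDS:
            # Drop thousands separators before converting to a number.
            value = float(value.replace(",", ""))
        result[name] = value
    return result


# Made-up OCR output used only to exercise the sketch.
SAMPLE_TEXT = """ACME Pvt Ltd
Employee Name: Asha Rao
PAN: ABCDE1234F
Gross Salary: 85,000.00
Net Salary : 72,350.50
"""

fields = extract_fields(SAMPLE_TEXT)
```

Rules like these break as soon as a label is worded differently or the OCR misreads a character, so they are best treated as a baseline to compare a trained model against.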

Drawbacks, challenges and more…

When scanning pay slips, we may face various issues such as capturing them at incorrect angles or in dim lighting conditions. Additionally, once they are captured, it is crucial to verify whether they are genuine or fraudulent. This section will explore these critical challenges and propose potential solutions to address them.

Fraudulent and blurred image checks

It is critical for companies and employees to verify that payslips are genuine.

Here are some key indicators that an image may be fake:

Watch out for backgrounds that are bent or distorted.
Be cautious of low-quality images.
Check for signs of edited or blurred text.
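One of these indicators, low-quality or blurred images, can be screened automatically. A common heuristic is the variance of the Laplacian: a sharp image has strong edges and therefore a high variance, while a blurred one does not. The sketch below is a pure-Python illustration on a grayscale image given as a list of rows; the function names and the threshold of 100.0 are assumptions that would need tuning on real scans. It does not detect distorted backgrounds or edited text.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a list of rows of grayscale values and must be at least 3x3.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            laplacian = (
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
            responses.append(laplacian)
    return pvariance(responses)


def looks_blurry(gray, threshold=100.0):
    """Flag an image as blurry when edge energy falls below a hypothetical threshold."""
    return laplacian_variance(gray) < threshold


# Synthetic examples: a smooth gradient (no edges) and sharp vertical stripes.
smooth_gradient = [[x * 10 for x in range(8)] for _ in range(8)]
sharp_stripes = [[255 if x % 2 else 0 for x in range(8)] for _ in range(8)]
```

In practice the same measure is usually computed with an image library on the full scan, and a flagged payslip is sent for a rescan or manual review rather than rejected outright.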

Defective scanning

The most common issue encountered when performing OCR is low accuracy on distorted or blurred scans. OCR tools have a high accuracy rate when working with high-quality, aligned images, producing searchable and editable text with ease, but results become inaccurate on lower-quality scans. To address this problem, it's crucial to apply techniques such as image transforms and de-skewing, which help align the image correctly and improve OCR accuracy.
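As an illustration of de-skewing, the sketch below estimates the correction angle with a projection-profile search: it tries a range of candidate rotations and keeps the one that packs the ink pixels into the fewest, densest rows, which is what horizontal text lines look like. It is a simplified pure-Python version that works on a list of (x, y) ink coordinates; the function names, the 10-degree search range, and the 0.5-degree step are assumptions, and the sign convention depends on how your image library orients the y axis.

```python
import math


def rotate_points(points, degrees):
    """Rotate (x, y) points about the origin by the given angle in degrees."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in points]


def profile_score(points, degrees):
    """Sum of squared row counts after rotation; higher means tighter text rows."""
    counts = {}
    for _, y in rotate_points(points, degrees):
        row = round(y)
        counts[row] = counts.get(row, 0) + 1
    return sum(n * n for n in counts.values())


def estimate_deskew_angle(points, max_angle=10.0, step=0.5):
    """Return the candidate rotation (in degrees) that best aligns the text rows."""
    steps = int(round(max_angle / step))
    candidates = [i * step for i in range(-steps, steps + 1)]
    return max(candidates, key=lambda angle: profile_score(points, angle))


# Synthetic page: three horizontal "text lines", then skewed by 5 degrees.
straight_lines = [(x, y) for y in (10, 30, 50) for x in range(100)]
skewed_page = rotate_points(straight_lines, 5.0)

correction = estimate_deskew_angle(skewed_page)  # rotation that undoes the skew
```

Rotating the scan by the returned angle before running OCR brings the text lines back to horizontal, which is the alignment OCR engines handle best.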

Why should you consider automating your payslip process?

Manual data capture means entering payslip data field by field by hand, which takes an average user around 111 seconds per document. The number of keystrokes per employee varies with individual effort, and mistakes lead to corrections and missed fields. Assuming a consistent processing rate of 78 KPM (keystrokes per minute) and factoring in the FTE (Full Time Equivalent) load, one person can process around 30 to 40 payslips per hour. However, time must also be allotted for making corrections and checking for missed fields.

Additionally, there are hidden charges and indirect costs associated with the work, such as the need for employees to take breaks and the resulting loss of productivity. When errors occur, there is also a time-consuming process of searching for the necessary details. For a business that handles more than 10,000 documents per month, managing corrections, synchronising data systems, and following up with vendors can all become challenging tasks.
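The figures quoted above can be checked with simple arithmetic. The sketch below uses only the numbers from the text (111 seconds per document and 10,000 documents per month); the function names are illustrative, and the result ignores breaks, corrections, and rework, so real effort would be higher.

```python
SECONDS_PER_HOUR = 3600


def payslips_per_hour(seconds_per_payslip):
    """Documents one person can key in per hour at a steady pace."""
    return SECONDS_PER_HOUR / seconds_per_payslip


def monthly_entry_hours(documents_per_month, seconds_per_payslip):
    """Pure data-entry hours per month, before breaks and corrections."""
    return documents_per_month * seconds_per_payslip / SECONDS_PER_HOUR


rate = round(payslips_per_hour(111), 1)            # 32.4 payslips per hour
hours = round(monthly_entry_hours(10_000, 111), 1)  # 308.3 hours per month
```

At 111 seconds each, the rate comes to about 32 payslips per hour, which sits inside the 30 to 40 range quoted earlier, and 10,000 documents amount to roughly 308 hours of uninterrupted typing per month.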

In addition to the direct FTE load, there is also an indirect FTE load that is solely attributed to rework. This includes wages that have to be paid for redoing work, as well as the slow processing times associated with manual payslip data entry. While a company may choose to manually structure the data, the process of retyping fields and organising information from thousands of payslips can ultimately decrease efficiency, resulting in increased expenses.

All these challenges can be resolved with Intelligent OCR data capture solutions. While the initial cost of investment for the software may be high, the benefits of automated data entry far outweigh those of manual data entry. With various subscription models, Tartan provides a competitive advantage to its users, eliminating the need for a fixed pricing commitment or a specific number of payslips to be processed.

Automated payslip data extraction software offers several benefits, including the ability to convert payslips into a variety of electronic formats such as JSON, PDF, and CSV. The software also provides easy payslip classification, scanning, and accurate data entry. Additionally, payslip data management, optimization, and automation are made possible through the use of Tartan APIs. Users can store their data on the cloud and be assured that it is kept encrypted for security.
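To show what structured output in JSON and CSV can look like, here is a minimal sketch using only Python's standard library. It is not Tartan's API; the function names, field names, and sample record are assumptions for illustration.

```python
import csv
import io
import json


def to_json(record):
    """Serialise one extracted payslip record as pretty-printed JSON."""
    return json.dumps(record, indent=2, sort_keys=True)


def to_csv(records):
    """Serialise a list of payslip records as CSV text with a header row."""
    # Use the union of all keys so records with missing fields still fit.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up record standing in for the output of an extraction step.
record = {"employee_name": "Asha Rao", "gross_salary": 85000.0, "net_salary": 72350.5}

json_text = to_json(record)
csv_text = to_csv([record])
```

JSON suits API responses and nested data, while CSV is convenient for spreadsheets and bulk imports into existing loan systems.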

Conclusion

Industry statistics predict that the global data extraction market will reach $4.9 billion by 2027 and continue to grow at a CAGR of 11.8%. By implementing automated payslip processing solutions, businesses can save valuable time and focus on their core activities. Tartan's automated payroll processing software helps to expedite processing times and prevent delays. Users can enjoy quality conversions and save payslips in a variety of file formats, which makes income documentation and e-verification convenient. Additionally, the OCR engine automatically extracts payslip data, organises it, and checks it for duplication errors, inaccuracies, and fraud.

Tartan is a full-stack verification API and payroll connectivity company that provides real-time identity, address, and income verification. It automates consumer verification and KYC to evaluate financial health. The company uses 150+ verified data points for accurate risk assessments to expedite customer verification for lending companies.

Would you like to automate your payslip processing today? Sign up for a free demo with Tartan and experience an immediate boost in productivity!