How To Extract Structured Data From Mortgage Documents Using OCR & AI

Guide for mortgage operations teams.
Vova Pylypchatin
CTO @ MortgageFlow

Advances in OCR (Optical Character Recognition) and AI (Artificial Intelligence) have made it possible to automate the extraction of highly accurate structured data from mortgage documents.

Since every stage of the loan lifecycle requires extensive document analysis, it pays to know how OCR and AI can help automate this process.

In this post, we’ll cover:

  • What structured data extraction means
  • How you can use structured data in mortgage operations
  • Workflow to extract data from mortgage documents using OCR & AI
  • How to implement this workflow into your operations
  • What OCR/AI providers are available to choose from

What Structured Data Extraction Means

Automated structured data extraction is a process that takes unstructured data, such as a mortgage document, as input and produces structured data as output: a set of data points such as borrower name, age, and assets.

Structured means that for each type of document processed, you receive the same set of data points consistently. This data can then be used by other software systems.
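As a minimal sketch of what "the same set of data points" could look like in code, the Python snippet below defines a fixed schema for one document type. The `PayStub` class and its field names are hypothetical and chosen only for illustration; a real schema depends on the documents and data you need.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PayStub:
    """Hypothetical schema: every processed pay stub yields these same fields."""
    borrower_name: Optional[str] = None
    employer_name: Optional[str] = None
    pay_period_end: Optional[str] = None  # ISO date string, e.g. "2024-01-31"
    gross_pay: Optional[float] = None


# Two different pay stubs produce records with an identical set of keys,
# even when a value could not be found (it is simply None).
stub_a = PayStub("Jane Doe", "Acme Corp", "2024-01-31", 4200.00)
stub_b = PayStub(borrower_name="John Roe", gross_pay=3100.50)

record_a = asdict(stub_a)
record_b = asdict(stub_b)
```

Because both records share the same keys, downstream software can read them without knowing what the original document looked like.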

A familiar example of data extraction in the mortgage industry is a Loan Officer reviewing borrower documents (unstructured data) and manually inputting required information into forms in a Loan Origination System (LOS).

In this context:

  • Borrower Documents represent the unstructured data.
  • Forms in your LOS are the consumers of the structured data.
  • The Loan Officer acts as a human data extractor.

Automated structured data extraction is the same process but done by software instead of humans.

How You Can Use Structured Data in Mortgage Operations

Structured data from mortgage documents alone won’t make much impact on your mortgage operations. It’s how you use this data to automate processes that will.

You usually need structured data as input to automate a process.

And if you don’t have structured data, you either:

  • Manually review documents and enter data into the software (e.g., completing LOS forms)
  • Or don’t automate the process because the effort required to extract data doesn’t justify the benefits.

So, having structured data makes it possible for you to:

  • Automate data entry for mortgage software you already use
  • Build new automated workflows that weren’t viable before

Here are a few examples of tasks you can automate by having structured data from mortgage documents:

  • Complete forms in LOS
  • Validate data from docs against an application in LOS (stare and compare)
  • Calculate income and debt-to-income ratio
  • Identify red flags like late payments, charge-offs, etc. in the credit report
  • Verify that the borrower has the necessary funds for the down payment and closing costs
  • Ensure the property value on the appraisal report aligns with the loan amount
  • Identify liens or disputes from title documents
  • Identify discrepancies in documents
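To make two of the tasks above concrete, here is a small Python sketch of a debt-to-income calculation and a "stare and compare" check. It assumes the inputs have already been extracted into plain numbers and dictionaries; the function names and field names are hypothetical, and real income calculations follow investor or agency guidelines that this sketch does not cover.

```python
from typing import Dict, Sequence, Tuple


def debt_to_income_ratio(monthly_debts: Sequence[float], gross_monthly_income: float) -> float:
    """Return total monthly debt payments divided by gross monthly income."""
    if gross_monthly_income <= 0:
        raise ValueError("gross_monthly_income must be positive")
    return sum(monthly_debts) / gross_monthly_income


def find_mismatches(extracted: Dict[str, object], application: Dict[str, object]) -> Dict[str, Tuple[object, object]]:
    """Stare and compare: return fields whose values differ between the two sources."""
    return {
        key: (extracted[key], application[key])
        for key in extracted.keys() & application.keys()
        if extracted[key] != application[key]
    }


# Example: 2,000 in monthly debts against 8,000 in gross monthly income.
dti = debt_to_income_ratio([1500.0, 350.0, 150.0], 8000.0)  # 0.25

# Example: the income on the document differs from the income on the application.
mismatches = find_mismatches(
    {"borrower_name": "Jane Doe", "gross_monthly_income": 8000},
    {"borrower_name": "Jane Doe", "gross_monthly_income": 8500},
)
```

Here `mismatches` contains only the `gross_monthly_income` field, which is the item a processor would otherwise have to spot by eye.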

How to Automate Data Extraction from Mortgage Documents

Below is a six-step process for automated mortgage document processing and data extraction.

1. Pull Files from Upstream Integration

To extract data, your document processing system must first receive the documents.

Thus, the process begins with your system pulling documents for processing from various sources.

Common sources include:

  • Loan Origination Systems (LOS)
  • Emails
  • FTP Servers
  • Dropbox Folders
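One way to keep these sources interchangeable is to put each behind the same small interface. The sketch below is an assumption about how that could be structured, not a description of any vendor's API: `FileSource`, `FolderSource`, and `pull_all` are hypothetical names, and only a folder-based source is shown.

```python
from pathlib import Path
from typing import Iterator, List, Protocol


class FileSource(Protocol):
    """Hypothetical interface that every upstream integration implements."""

    def pull(self) -> Iterator[Path]:
        ...


class FolderSource:
    """Pulls PDF files from a local or synced folder (for example, a Dropbox folder)."""

    def __init__(self, folder: Path) -> None:
        self.folder = folder

    def pull(self) -> Iterator[Path]:
        # Sorted so files are processed in a stable order.
        yield from sorted(self.folder.glob("*.pdf"))


def pull_all(sources: List[FileSource]) -> List[Path]:
    """Collect files from every configured source into one processing queue."""
    files: List[Path] = []
    for source in sources:
        files.extend(source.pull())
    return files
```

An LOS, email, or FTP source would be another class with the same `pull` method, so the rest of the workflow does not need to know where a file came from.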

2. Split Files into Documents with OCR & AI

Each document should be processed with a specialized processor to ensure consistent, structured, and accurate data.

The problem is that File ≠ Document.

A single PDF file can contain multiple documents.

For example, a correspondent lender loan package PDF might include more than 20 distinct documents.

So, it's crucial to identify and separate the individual documents within each file before processing.

An OCR & AI model trained for document classification and file splitting can be used. Such a model takes a PDF file as input and outputs a list of individual documents.
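As a rough illustration of the splitting step, suppose a classification model has already predicted a document type for every page in the file. The Python sketch below groups consecutive pages with the same label into one document. The labels and the `DocumentSpan` type are hypothetical, and this simple grouping would merge two back-to-back documents of the same type, which a trained splitter model is meant to handle better.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class DocumentSpan:
    doc_type: str
    first_page: int  # 0-based, inclusive
    last_page: int   # 0-based, inclusive


def split_file(page_labels: List[str]) -> List[DocumentSpan]:
    """Group consecutive pages with the same predicted type into one document."""
    spans: List[DocumentSpan] = []
    page = 0
    for doc_type, group in groupby(page_labels):
        length = len(list(group))
        spans.append(DocumentSpan(doc_type, page, page + length - 1))
        page += length
    return spans


# A five-page file that contains three documents.
spans = split_file(["urla", "urla", "pay_stub", "w2", "w2"])
```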

3. Split Documents into Pages

The more pages you're trying to process with a single processor, the lower the accuracy you can expect.

So, to process lengthy mortgage documents like the URLA (Form 1003) with high accuracy, you need to split them even further by page.

This will let you use even more specialized processors trained to extract data from a single page.

4. Extract Data with OCR & AI

Once the system knows which documents are in the borrower file, it can route each one to a specialized processor to extract structured data.

Sometimes, you might need to apply multiple processors per page to extract data that a single processor can't.

For example, non-text data like signatures and checkboxes are better extracted by a different processor than the one you use to extract text data.
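The routing idea can be sketched as a registry that maps a document type and page number to one or more processors. Everything below is a stand-in: the two processor functions only imitate what real text and signature models would return, and the registry keys are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A processor takes page content and returns extracted fields.
Processor = Callable[[str], Dict[str, object]]


def extract_text_fields(page: str) -> Dict[str, object]:
    # Stand-in for a text extraction model.
    return {"text_length": len(page)}


def detect_signature(page: str) -> Dict[str, object]:
    # Stand-in for a signature or checkbox detection model.
    return {"is_signed": "[signature]" in page}


# Registry keyed by (document type, 1-based page number within the document).
PROCESSORS: Dict[Tuple[str, int], List[Processor]] = {
    ("urla", 1): [extract_text_fields],
    ("urla", 2): [extract_text_fields, detect_signature],
}


def extract_page(doc_type: str, page_number: int, page: str) -> Dict[str, object]:
    """Run every processor registered for this page and merge their outputs."""
    fields: Dict[str, object] = {}
    for processor in PROCESSORS.get((doc_type, page_number), []):
        fields.update(processor(page))
    return fields


result = extract_page("urla", 2, "abc [signature]")
```

Keeping the registry as data makes it easy to add a second processor to a page later without touching the extraction loop.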

5. Review and Correct Extracted Data

Sometimes, AI can't extract fields from documents with enough accuracy.

In this case, you need to loop in humans to review the data and correct it if it is wrong.

Usually, AI document processing products offer out-of-the-box Human-In-The-Loop (HITL) interfaces to handle this workflow.

Human-In-The-Loop Platform by Google Document AI
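A common way to decide what goes to a human is a confidence threshold. The sketch below assumes each extracted field comes with a confidence score between 0.0 and 1.0; the 0.9 default and the field names are illustrative assumptions, and the right threshold depends on your documents and risk tolerance.

```python
from typing import Dict, Tuple

# Each extracted field maps to (value, confidence between 0.0 and 1.0).
Fields = Dict[str, Tuple[str, float]]


def split_by_confidence(fields: Fields, threshold: float = 0.9) -> Tuple[Fields, Fields]:
    """Separate fields that can be accepted from fields that need human review."""
    accepted = {name: item for name, item in fields.items() if item[1] >= threshold}
    needs_review = {name: item for name, item in fields.items() if item[1] < threshold}
    return accepted, needs_review


accepted, needs_review = split_by_confidence(
    {
        "borrower_name": ("Jane Doe", 0.99),
        "loan_amount": ("350000", 0.62),
    }
)
```

In this example only `loan_amount` lands in the review queue, so the reviewer sees one field instead of the whole document.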

6. Push Data into Downstream Integrations

Once the data is extracted and low-confidence items have been reviewed and corrected, you can push it downstream.

You can feed this data into other systems to automate mortgage lending operations.

Here are some common destinations:

  • LOS
  • Underwriting Systems
  • Income Analysis Systems
  • Credit Analysis Systems
  • Compliance Software
  • Analytics and Reporting Tools
  • Mortgage Data Warehouse
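Each destination usually expects its own field names, so pushing data downstream often starts with a mapping step. The sketch below shows that step only; the field IDs in `LOS_FIELD_MAP` are made up for illustration and do not correspond to any specific LOS.

```python
from typing import Dict

# Hypothetical mapping from extracted field names to a destination's field IDs.
LOS_FIELD_MAP: Dict[str, str] = {
    "borrower_name": "Borrower.FullName",
    "gross_monthly_income": "Income.GrossMonthly",
}


def to_los_payload(extracted: Dict[str, object]) -> Dict[str, object]:
    """Keep only the fields the destination understands, renamed to its IDs."""
    return {
        LOS_FIELD_MAP[name]: value
        for name, value in extracted.items()
        if name in LOS_FIELD_MAP
    }


payload = to_los_payload(
    {"borrower_name": "Jane Doe", "gross_monthly_income": 8000, "employer_name": "Acme Corp"}
)
```

The resulting `payload` would then be sent through whatever API or import mechanism the destination system provides.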

How to Implement Automated Data Extraction Workflow

You saw the workflow to extract data from mortgage documents in the section above.

But knowing the workflow is different from implementing it in your operations.

Below, I outline how to approach building your automated data extraction workflow.

1. Understand What Data, Documents and Integrations You Need

Start by defining what data you need to extract and where to use it.

Once you know what data you need, list the documents you need to get this data.

And once you know which documents you need, find out how you'll get them.

By the end of this step, you should have:

  • A list of the data you need
  • List of documents you'll process
  • List of up-stream integrations
  • List of down-stream integrations
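One lightweight way to capture this inventory is a mapping from each data point to the documents it can come from, from which the document list follows. The entries below are hypothetical examples, not a recommended set.

```python
from typing import Dict, List

# Hypothetical inventory: each data point you need and the documents it can come from.
DATA_TO_DOCUMENTS: Dict[str, List[str]] = {
    "gross_monthly_income": ["pay_stub", "w2"],
    "available_funds": ["bank_statement"],
    "property_value": ["appraisal_report"],
}


def documents_to_process(data_to_documents: Dict[str, List[str]]) -> List[str]:
    """Derive the unique, sorted list of document types you need to support."""
    return sorted({doc for docs in data_to_documents.values() for doc in docs})


DOCUMENTS = documents_to_process(DATA_TO_DOCUMENTS)
```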

2. Get Splitter and Extraction Models

The next step is to find models based on the list you created in the previous step.

You'll need 2 types of models.

The first is the one that splits files into document types you need to support.

The second is the one that can extract data from each document type.

To get these models, you have 2 options:

  • Train your own model (for example, with Google Document AI, Amazon Textract, Azure Form…)
  • Rent a pre-trained model (for example, DocSumo, Super.ai, Ocrolus, etc.)

You can find more details about the differences between these options in the section below.

By the end of the step, you should have the following:

  • Trained document extraction models
  • Trained file splitter model

3. Piece it all Together

Once you have models, the next step is implementing the data extraction workflow.

  1. Implement upstream integration to get the files.
  2. Use the splitter model to split files into documents.
  3. Split documents into pages if the vendor of your choice doesn't already do it.
  4. Feed the split documents into the extraction model.
  5. Push extracted data into the downstream integrations.

By the end of this step, you should have an end-to-end extraction workflow, from getting raw files to pushing extracted data into downstream integrations.
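The end-to-end flow can be sketched as one function that chains the steps, with each step passed in as a callable. This is only a skeleton under simplifying assumptions: files, documents, and pages are represented as plain strings, human review is left out, and the lambdas in the usage example stand in for real integrations and models.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    pull_files: Callable[[], Iterable[str]],
    split_into_documents: Callable[[str], List[str]],
    split_into_pages: Callable[[str], List[str]],
    extract: Callable[[str], Dict[str, object]],
    push: Callable[[Dict[str, object]], None],
) -> int:
    """Run the steps in order and return the number of pages processed."""
    processed = 0
    for file in pull_files():
        for document in split_into_documents(file):
            for page in split_into_pages(document):
                push(extract(page))
                processed += 1
    return processed


# Usage with stand-ins: "|" separates documents and "," separates pages.
pushed: List[Dict[str, object]] = []
count = run_pipeline(
    pull_files=lambda: ["a1,a2|b1"],
    split_into_documents=lambda file: file.split("|"),
    split_into_pages=lambda document: document.split(","),
    extract=lambda page: {"page": page},
    push=pushed.append,
)
```

Passing the steps in as callables lets you swap a vendor or model for one step without rewriting the others.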

4. Review, Correct and Up-train

The last step is to fine-tune and up-train your models to improve accuracy.

That's especially true for extracting data from mortgage documents, as few providers have pre-trained models for mortgage documents.

So unless you find a provider that already has pre-trained models for every document type you need to support, there will be a period where you'll need to invest more time into up-training.

The process will involve reviewing and correcting data for documents with low accuracy.

You can either:

  • Use your own workforce
  • Use managed labelling services from providers (Ocrolus, Super.AI)

By the end of this step, you should have a document data extraction system that processes most documents with high accuracy. Only in rare cases should a human need to step in and correct items that have low confidence.
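To see where up-training is most needed, you can compare what the model extracted with what reviewers ended up with. The sketch below computes field-level accuracy per document type from such review records; the record layout is an assumption, and real tooling may expose corrections in a different shape.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One reviewed field: (document type, extracted value, value after human review).
Review = Tuple[str, str, str]


def accuracy_by_document_type(reviews: List[Review]) -> Dict[str, float]:
    """Share of reviewed fields per document type that needed no correction."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for doc_type, extracted, corrected in reviews:
        totals[doc_type] += 1
        if extracted == corrected:
            correct[doc_type] += 1
    return {doc_type: correct[doc_type] / totals[doc_type] for doc_type in totals}


accuracy = accuracy_by_document_type(
    [
        ("w2", "52000", "52000"),
        ("w2", "Jnae Doe", "Jane Doe"),
        ("pay_stub", "4200", "4200"),
    ]
)
```

Document types with the lowest accuracy are the natural candidates for the next round of labelling and up-training.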

What AI/ML Model Providers are Available for Mortgage Document Extraction

Quite a few AI document-processing products & tools are available on the market.

Their main difference is the degree to which they work for mortgage document extraction out of the box.

And it comes down to how many steps of the 6-step process they cover:

  • Do they have upstream integrations with your mortgage software?
  • Do they have downstream integrations with your mortgage software?
  • Do they have a pre-trained model to split files into mortgage documents?
  • Do they have pre-trained models for mortgage documents?
  • Do they split longer documents by pages and train models for each?
  • Do they have human-in-the-loop services included?

Some of the solutions cover all 6 steps. Other solutions cover none.

The less customization you need, the higher the cost per document you can expect.

The more you invest to get it working, the lower the cost per document.
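This trade-off can be expressed as a simple break-even calculation between an option with a high setup cost and a low per-document cost and an option with the opposite profile. All numbers in the sketch below are hypothetical placeholders, not real vendor pricing.

```python
def total_cost(setup_cost: float, cost_per_document: float, documents: int) -> float:
    """Total cost of processing a given number of documents with one option."""
    return setup_cost + cost_per_document * documents


def break_even_volume(
    setup_custom: float,
    per_doc_custom: float,
    setup_ready_made: float,
    per_doc_ready_made: float,
) -> float:
    """Document volume at which the custom option becomes the cheaper one."""
    if per_doc_ready_made <= per_doc_custom:
        raise ValueError("the ready-made option must cost more per document")
    return (setup_custom - setup_ready_made) / (per_doc_ready_made - per_doc_custom)


# Hypothetical numbers: 50,000 of engineering plus 0.25 per document,
# versus no setup cost and 0.75 per document.
volume = break_even_volume(50_000, 0.25, 0, 0.75)  # 100000.0 documents
```

Below the break-even volume the ready-made option is cheaper overall; above it, the upfront investment pays off.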

Featured Providers

Here is a list of providers that you can use to automate data extraction from mortgage documents. It is not an exhaustive list; these are the providers that, in my opinion, are the most relevant for mortgage document data extraction.

Low-level:

💡 Low-level solutions are the ones that need the most engineering involvement to make them work for mortgage documents. But they tend to have the lowest cost per document.

Mid-level:

💡 Mid-level solutions are usually built on top of one or multiple low-level solutions and remove some complexity in implementation. Most come with pre-trained models relevant to the mortgage industry and have up/down-stream integrations with popular mortgage software.

Specialised:

💡 Specialised solutions are usually built on top of mid-level solutions. They take them further by providing out-of-the-box automation using the data they extract.

What’s next?

I hope you enjoyed this piece and that it gave you some insight into how to use OCR & AI to extract structured data from mortgage documents.

If you’d like to stay on top of the latest mortgage tech and how it can be applied to mortgage operations, consider joining our mortgage technology newsletter.

Written by
Vova Pylypchatin
CTO @ MortgageFlow

I’m a software consultant with a background in software engineering. Currently, I run a mortgage software consulting and development company that builds custom tools and automation solutions for mortgage lenders.