Data Action Layer May 2, 2024

The 5 steps in our ML model training process

by Keerti Hariharan

To fully solve the problem of unstructured data processing for financial services firms, platforms have to move beyond template-based and OCR-only solutions. AI and machine learning allow us to build automated workflows that can understand the content within a document, even if the format varies. Machine learning is based on statistical models that are trained to recognize patterns and identify relationships from data, allowing them to make predictions or decisions without explicit programming. These models serve as the backbone of machine learning-based applications like Alkymi, enabling systems to automate tasks, make decisions and extract meaningful information from complex datasets.

In financial services, this challenge is compounded by a lack of standardization. The documents frequently processed by our customers, like Capital Notices or Brokerage Account Statements, may come in thousands of different formats. These formats may vary in layout, structure, and style, with different arrangements of text, tables, or other elements. Each fund or broker may have its own format: its name may appear in the document header in one instance, but in the introductory paragraph in another. The same sender might even present their data differently every time they send a document.

Alkymi builds machine learning models dedicated to solving our customers’ unstructured document data challenges. How do we build these models, and how do they understand document formats they’ve never seen before? We break down our process into 5 steps below.

Alkymi's 5-step approach to building high quality machine learning models

Step 1: Define the problem

Before we build and train a model, we need to understand our customers’ document and workflow challenges. What do their documents look like? What data do they need to extract? How should the data be represented and organized? How does the structured data move downstream? Is there a relationship between different data elements? Our customers range from investment managers and asset management firms to insurance companies and beyond. Each customer's workflow and data requirements are unique and specific to their business.

Step 2: Develop a data-driven strategy

Our data scientists draw on an extensive array of machine learning models and tools that can be tailored to each unique use case. Based on a detailed analysis of the documents and workflow, our team decides the best approach for each scenario in collaboration with the customer. Options include layout transformer-based ML models, proprietary OpenAI large language models, in-house algorithms that locate elements on a page using regular expressions, anchoring, and LLM-based tools, or whichever combination of these tools and models best suits the workflow.
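To make the regular expression and anchoring option concrete, here is a minimal sketch, not Alkymi's in-house implementation. The anchor phrases, the amount pattern, and the function name `extract_amount` are all hypothetical choices made for illustration. The idea is to first find a label that reliably sits next to the value (the anchor), then read the value that immediately follows it.

```python
import re

# Hypothetical anchor phrases that might precede a capital call amount.
ANCHOR = re.compile(r"(?:capital call amount|amount due)\s*[:\-]?\s*", re.IGNORECASE)

# A simple amount pattern: optional dollar sign, digits with optional commas,
# optional decimal part. Real documents would need a more careful pattern.
AMOUNT = re.compile(r"\$?\s*([0-9][0-9,]*(?:\.[0-9]+)?)")


def extract_amount(text):
    """Return the amount that directly follows an anchor phrase, or None."""
    anchor = ANCHOR.search(text)
    if anchor is None:
        return None
    # Only accept an amount that starts right where the anchor ends.
    match = AMOUNT.match(text, anchor.end())
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


print(extract_amount("Capital Call Amount: $1,250,000.00"))
print(extract_amount("Please remit the amount due - 75,000 by June 1"))
```

A rule like this is easy to audit but only works when the anchor wording is predictable, which is one reason it is listed alongside model-based options rather than instead of them.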

Step 3: Document annotation

Using a representative sample of documents from our customer, our team meticulously labels the location of each data element in every document using our own platform. This is an iterative process of labeling, analysis, and customer feedback that ensures clarity around their requirements, minimal ambiguity, and, most importantly, consistency across labeled documents. With a consistent set of high-quality, clearly labeled documents in a training set, model training can begin.

Step 4: Model training

Much of what has been done until this point could be considered preparation. We employ a rigorous approach to training each machine learning model focused on building for generalization. In this context, generalization refers to the ability of the model to perform effectively on unseen data, including formats it did not encounter during training. We define a successful model in part by its ability to effectively generalize, because unstructured documents, like Capital Notices or Brokerage Statements, come in a myriad of formats, rendering it impossible to train on 100% of the formats that will appear during production.

There are several approaches that our Data Science team leverages to ensure high quality models, including:

  • Stratified sampling: With this method, a percentage of sample documents from each available format is excluded from training. For example, given 10 formats, we create a training dataset comprising 90% of the documents from each format and set aside the remaining 10% from each format for testing. These excluded documents are known as the holdout or test set: the documents that the ML model never sees during training. Instead, they are used to evaluate the quality of the model after training. A successful test indicates that the model performs well on the formats used in training.
  • Random sampling: In this method, the training dataset comprises only 9 of the 10 available formats. The documents belonging to the 10th format are used to evaluate the quality of the model against new formats. In practice, “random sampling” often means selecting the documents for testing at random. The goal is for the training datasets to look as similar to production workloads as possible, to build confidence around production accuracy. When available, Alkymi may also incorporate additional document sets not provided by our customer to help build confidence in the model.
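The two holdout strategies above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not our production code: each document is assumed to be a dictionary with a hypothetical `format` key, and the function names, the 10% test fraction, and the fixed random seed are choices made for the example. The first function holds out a share of every format; the second holds out one whole format, chosen at random, to test behavior on a format the model has never seen.

```python
import random
from collections import defaultdict


def stratified_split(docs, test_fraction=0.1, seed=0):
    """Hold out a fraction of the documents from every format."""
    rng = random.Random(seed)
    by_format = defaultdict(list)
    for doc in docs:
        by_format[doc["format"]].append(doc)

    train, test = [], []
    for fmt in sorted(by_format):
        group = by_format[fmt][:]
        rng.shuffle(group)
        # At least one test document per format. Note that a format with a
        # single document would then contribute nothing to training.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def held_out_format_split(docs, seed=0):
    """Hold out every document of one randomly chosen format."""
    rng = random.Random(seed)
    formats = sorted({doc["format"] for doc in docs})
    held_out = rng.choice(formats)
    train = [d for d in docs if d["format"] != held_out]
    test = [d for d in docs if d["format"] == held_out]
    return train, test, held_out


# Toy data: 10 formats with 20 documents each.
docs = [{"id": f"{fmt}-{i}", "format": fmt} for fmt in range(10) for i in range(20)]

train, test = stratified_split(docs)
print(len(train), len(test))

train, test, held_out = held_out_format_split(docs)
print(len(train), len(test), held_out)
```

The first split answers "does the model work on formats it was trained on?", while the second answers "does it generalize to a new format?", which is why both are useful.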

The quality of the training dataset or datasets is critical to training the model. During the selection of documents, our team identifies any documents that are outliers. These may be documents that would unexpectedly skew the model or that simply are not representative of the general body of documents. During production, our validation system flags such documents for review by the user.

Model training goes through multiple iterations of training and QA, integrating subject matter expert and customer feedback at every step.

Step 5: Deployment and ongoing monitoring

Once Alkymi’s model is trained and validated, it is ready for deployment into production. From this point onward, we conduct ongoing maintenance by proactively monitoring model performance, capturing model confidence scores, and providing re-trainings to improve specific fields or formats. Additionally, any manual reviews performed by customers are analyzed and incorporated into tailored re-trainings. As our customers provide more data to Alkymi through the manual review process, the quality of the model continues to increase.
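As a rough illustration of how confidence scores can drive monitoring, the sketch below flags low-confidence predictions for manual review and summarizes how often each field is flagged. Everything specific here is an assumption for the example: the 0.85 threshold, the `field` and `confidence` keys, and the function names are hypothetical and do not describe Alkymi's platform.

```python
from collections import defaultdict

REVIEW_THRESHOLD = 0.85  # hypothetical cutoff, chosen only for illustration


def needs_review(prediction, threshold=REVIEW_THRESHOLD):
    """A prediction goes to manual review when its confidence is below the cutoff."""
    return prediction["confidence"] < threshold


def review_rate_by_field(predictions, threshold=REVIEW_THRESHOLD):
    """Share of predictions per field that fall below the confidence cutoff."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for p in predictions:
        totals[p["field"]] += 1
        if needs_review(p, threshold):
            flagged[p["field"]] += 1
    return {field: flagged[field] / totals[field] for field in totals}


predictions = [
    {"field": "fund_name", "confidence": 0.97},
    {"field": "fund_name", "confidence": 0.91},
    {"field": "due_date", "confidence": 0.62},
    {"field": "due_date", "confidence": 0.90},
]
print(review_rate_by_field(predictions))
```

A field with a persistently high review rate would be a natural candidate for a targeted re-training.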

At Alkymi, training machine learning models isn’t simply about algorithms and data—it’s about understanding our customers’ needs, leveraging the right tools and techniques, and delivering tangible results.
