Tech Corner June 17, 2025

Inside Look at Powering Intelligence, How ML Models Are Trained

by Keerti Hariharan


Financial services firms are inundated with unstructured data: investment documents arrive in thousands of different formats. Traditional approaches like template-based tools or basic OCR quickly fall short when documents vary widely in structure, style, and layout. These rigid systems can’t keep pace with the complexity or scale required to support operational efficiency in private markets and beyond. To truly automate unstructured data workflows, platforms must apply machine learning, enabling systems to interpret content, extract insights, and adapt to new formats without manual rules.

At the core of this capability are machine learning models trained to recognize patterns and relationships within complex datasets. These models learn from labeled examples, improving over time and generalizing to document formats they’ve never seen before. This allows intelligent applications to automate tasks, make decisions, and extract meaningful insights from complex, unstructured datasets.

In financial services, this challenge is compounded by a lack of standardization. The documents our customers frequently process, like Capital Notices or Brokerage Account Statements, may come in thousands of unique formats, varying in layout, structure, and style, with different arrangements of text, tables, and other elements. Each fund or broker may have its own format: its name may appear in the document header in one instance but in the introductory paragraph in another. The same sender might even present their data differently every time they send a document. That’s why Alkymi builds machine learning models specifically designed to handle the complexity of unstructured financial documents, at scale and with precision.

Alkymi's 5-step approach to building high-quality machine learning models

At Alkymi, we train machine learning models to tackle the toughest unstructured data challenges in financial services. So how do these models learn to understand thousands of document formats they’ve never seen before? Here’s a breakdown of our approach:

Step 1: Define the problem

Before we build and train a model, we need to understand our customers’ document and workflow challenges.

  • What do their documents look like?
  • What data do they need to extract?
  • How should the data be represented and organized?
  • How does the structured data move downstream?
  • Is there a relationship between different data elements?

Our customers span the financial services industry. Each customer's workflow and data requirements are unique and specific to their business.

Step 2: Develop a data-driven strategy

Our data scientists leverage an extensive array of machine learning models and tools that can be tailored to each unique use case. Based on a detailed analysis of the documents and workflow, the team collaborates closely with the customer to determine the most effective approach.

Depending on the scenario, options include:

  • Layout transformer-based ML models
  • Proprietary OpenAI large language models
  • LLM-based tools
  • In-house algorithms using regular expressions to locate specific elements on a page
  • Anchoring techniques
  • A combination of these methods, chosen to best support the workflow
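To make the regex-and-anchoring idea concrete, here is a minimal sketch of locating a value by first finding a nearby anchor phrase. The document text, field names, and patterns are invented for illustration and are not Alkymi's actual rules:

```python
import re

# Hypothetical extracted page text from a capital notice.
page_text = """
Capital Call Notice
Fund: Example Growth Partners III, L.P.
Amount Due: $1,250,000.00
Due Date: July 15, 2025
"""

def extract_after_anchor(text, anchor, value_pattern):
    """Find an anchor phrase, then capture the value that follows it."""
    match = re.search(anchor + r"[:\s]*" + value_pattern, text)
    return match.group(1) if match else None

amount = extract_after_anchor(page_text, r"Amount Due", r"\$([\d,]+\.\d{2})")
due_date = extract_after_anchor(page_text, r"Due Date", r"([A-Z][a-z]+ \d{1,2}, \d{4})")

print(amount)    # 1,250,000.00
print(due_date)  # July 15, 2025
```

Anchoring like this is robust when the label text is stable even if its position on the page is not, which is one reason it can complement layout-aware ML models.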

Step 3: Document annotation

Using a representative sample of documents from the customer, our team meticulously labels the location of each data element for every document using our proprietary platform. This is an iterative process that involves labeling, analysis, and incorporating customer feedback to ensure clarity around their requirements, minimal ambiguity, and most importantly, consistency across labeled documents.

With a consistent set of high-quality, clearly labeled documents in a training set, model training can begin.
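Conceptually, each labeled example pairs a data element with its location on the page. A minimal sketch of what such an annotation record might look like, with an accompanying consistency check (the schema and field names here are illustrative assumptions, not Alkymi's proprietary format):

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """One labeled data element: its field name, value, and page location."""
    field: str   # e.g. "amount_due"
    value: str   # the text exactly as it appears in the document
    page: int    # zero-based page index
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates

# Hypothetical labels for a single capital notice.
labels = [
    Annotation("fund_name", "Example Growth Partners III, L.P.", 0, (72, 140, 420, 158)),
    Annotation("amount_due", "$1,250,000.00", 0, (72, 180, 210, 198)),
]

# Consistency check: every required field is labeled exactly once.
required = {"fund_name", "amount_due"}
labeled = {a.field for a in labels}
assert labeled == required
```

Automated checks like the one at the end help enforce the labeling consistency the article describes, before a single training run is launched.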

Step 4: Model training

We employ a rigorous approach to training each machine learning model, focused on generalization. In this context, generalization refers to the model’s ability to perform effectively on unseen data, including formats it did not encounter during training. We define a successful model in part by its ability to generalize, because unstructured documents, like Capital Notices or Brokerage Statements, come in a myriad of formats, making it impossible to train on 100% of the formats that will appear in production.

Our Data Science team leverages several approaches to ensure high-quality models, including:

  • Stratified sampling: With this method, a percentage of documents from each available format is excluded from training. For example, given 10 formats, we create a training dataset comprising 90% of the documents from each format and exclude the remaining 10% of each format for testing. These excluded documents are known as the holdout or test set: the documents the ML model never sees during training. Instead, they are used to evaluate the quality of the model after training. A successful test confirms that the model performs well on the formats it was trained on.
  • Random sampling: In this method, the training dataset comprises only 9 of the 10 available formats. The documents belonging to the 10th format are used to evaluate the quality of the model against new formats; as the name suggests, the documents reserved for testing are selected at random. The goal is for the training dataset to look as similar to production workloads as possible, building confidence in production accuracy. When available, Alkymi may also incorporate additional document sets not provided by our customer to help build confidence in the model.

The quality of the training dataset or datasets is critical to training the model. During the selection of documents, our team identifies any outliers: documents that would unexpectedly skew the model or that simply are not representative of the general body of documents. Our validation system flags such documents for review by the user during production.

Model training goes through multiple iterations of training and QA, integrating subject matter expert and customer feedback at every step.

Step 5: Deployment and ongoing monitoring

Once Alkymi’s model is trained and validated, it is ready for deployment to production. From this point onward, we conduct ongoing maintenance: proactively monitoring model performance, capturing model confidence scores, and providing re-trainings to improve specific fields or formats. Additionally, any manual reviews performed by customers are analyzed and incorporated into tailored re-trainings. As our customers provide more data to Alkymi through the manual review process, the quality of the model continues to improve.
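A confidence-driven review loop of the kind described above might look roughly like this. The threshold, field names, and scores are illustrative assumptions, not Alkymi's actual configuration:

```python
# Hypothetical extraction results with model confidence scores.
extractions = [
    {"field": "fund_name", "value": "Example Growth Partners III, L.P.", "confidence": 0.97},
    {"field": "amount_due", "value": "$1,250,000.00", "confidence": 0.62},
    {"field": "due_date", "value": "July 15, 2025", "confidence": 0.91},
]

REVIEW_THRESHOLD = 0.85  # illustrative cutoff; tuned per workflow in practice

def route(extractions, threshold=REVIEW_THRESHOLD):
    """Split extractions into auto-accepted results and manual-review items.
    Reviewed corrections can later be folded back into retraining data."""
    accepted = [e for e in extractions if e["confidence"] >= threshold]
    review = [e for e in extractions if e["confidence"] < threshold]
    return accepted, review

accepted, review = route(extractions)
print([e["field"] for e in review])  # ['amount_due']
```

Routing only low-confidence fields to humans keeps the manual workload small while generating exactly the corrected examples that are most valuable for the tailored re-trainings the article mentions.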

At Alkymi, training machine learning models isn’t simply about algorithms and data; it’s about understanding our customers’ needs, leveraging the right tools and techniques, and delivering tangible results.
