Tech Corner October 6, 2021

The Right Tech for Your Structured & Unstructured Data

by Harald Collet


It’s accepted wisdom in the financial services industry that data-driven firms are winning in most markets. One category of valuable business data that has proven particularly difficult to process is data locked inside documents and emails. The complex nature of data creation, sharing, and formatting doesn’t make this process easy. Partially to blame are the three types of data: structured, semi-structured, and unstructured. This last category makes up the majority of data—and is the most difficult to extract.

Unstructured data poses a significant challenge for financial services firms. The variety, volume, and velocity of this data make it difficult to extract information from PDF documents, emails, and scanned images. Once the data is found, it needs to be transformed into process-ready formats that drive insights and enable further action.

In the end, the competitive advantage comes down to who can make this raw unstructured data actionable the fastest without sacrificing accuracy or traceability. Until recently, there were only two options for doing this: hiring large numbers of workers to manually search through unstructured information or investing in a costly customized IT-centric automation solution. But now, there are several data extraction and process automation tools that can help businesses cut through data noise and harness the data's potential.

Get to know your data

As previously mentioned, data can be divided into three categories: structured, semi-structured, and unstructured. Businesses usually deal with all three types; however, the process of sourcing, collecting, and processing can be quite different for each. Therefore, to select the right solution, one must first understand the difference between the data types. Here’s a quick overview:

Structured data

Often referred to as quantitative data, structured data exists in pre-defined, neat formats and usually consists of numbers and text. Structured data can be stored in an organized way, typically in an SQL database or Excel spreadsheet. Such data repositories possess relational keys and can easily be mapped into pre-designed fields. This is the easiest type of data to search and analyze. Examples of this data type include banking transactions, health records, claims forms, and more.
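To make "relational keys" and "pre-designed fields" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are illustrative assumptions, not taken from any real system.

```python
import sqlite3

# In-memory database; the schema and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        holder     TEXT NOT NULL
    );
    CREATE TABLE transactions (
        txn_id     INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES accounts(account_id),
        amount     REAL NOT NULL,
        booked_on  TEXT NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [(1, "Acme Fund"), (2, "Birch Capital")],
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (10, 1, 250.0, "2021-10-01"),
        (11, 1, -75.5, "2021-10-02"),
        (12, 2, 1000.0, "2021-10-02"),
    ],
)


def balance_by_holder(connection):
    """Join the two tables on their relational key and total the amounts."""
    rows = connection.execute("""
        SELECT a.holder, SUM(t.amount)
        FROM transactions AS t
        JOIN accounts AS a ON a.account_id = t.account_id
        GROUP BY a.holder
        ORDER BY a.holder
    """).fetchall()
    return dict(rows)


balances = balance_by_holder(conn)
print(balances)
```

Because every value already sits in a named, typed field, a single query answers the question. No interpretation of free text is needed.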

Semi-structured data

Semi-structured data is a little less clear cut. It generally isn’t kept in a relational database but has organizational properties that make it relatively easy to analyze. For example, XML data is semi-structured, as are documents stored in JavaScript Object Notation (JSON) format. Key-value stores and graph databases also tend to be semi-structured.

Email messages provide good examples of semi-structured data. The content of the email is unstructured, as it consists of running text, but emails do possess structured aspects such as the name and email address of sender and recipient, time sent, and other elements. Another example is a digital photograph taken by a smartphone. Although the image is unstructured, the photo would be stamped with a date and time, geotag, and device ID. You could even tag the photo to give it structure by adding descriptors such as “house” or “bicycle.”
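The email example can be sketched with Python's standard email package. The message below is invented for illustration; the point is only that the headers parse cleanly into fields while the body stays free text.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# A made-up message used purely as sample input.
RAW_EMAIL = """\
From: Jane Doe <jane@example.com>
To: Ops Team <ops@example.com>
Subject: Capital call notice
Date: Wed, 06 Oct 2021 09:30:00 +0000

Hi team, please find the Q3 capital call details attached. Payment is due in ten days.
"""


def split_email(raw):
    """Return (structured header fields, unstructured body text)."""
    msg = message_from_string(raw)
    structured = {
        "sender": parseaddr(msg["From"])[1],
        "recipient": parseaddr(msg["To"])[1],
        "subject": msg["Subject"],
        "sent_at": parsedate_to_datetime(msg["Date"]),
    }
    # The body has no schema: it is running text that still needs interpretation.
    body = msg.get_payload().strip()
    return structured, body


structured, body = split_email(RAW_EMAIL)
print(structured["sender"], structured["sent_at"].date())
print(body)
```

The sender, recipient, and timestamp come out ready to store in a database column. Finding the due date or the amount inside the body is a different kind of problem.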

Unstructured data

Data that neither follows an organizational structure nor adheres to a specific format is defined as unstructured. Unstructured data is typically stored in NoSQL databases, applications, data warehouses, or data lakes because it can take on any shape or form.

Not having a structure makes unstructured data much more difficult to search through and analyze. Humans have had to step in to extract value from images, Word files, PDFs, and other sources, manually transforming the information from raw data into process-ready data. Until now…

How do you access and analyze the different kinds of data?

The good news is that technological solutions have emerged to help you extract these different types of data from documents, web pages, social media, and, of course, traditional databases. But you need to be careful to select the right tool. Some are adept at dealing with structured data but fall short when confronted with unstructured or even semi-structured data. Other, more advanced solutions can handle all three.

Let’s take a look at the different tools: robotic process automation (RPA), optical character recognition (OCR), and intelligent data processing (IDP) with workflow automation.

Robotic process automation

Robotic process automation (RPA) is software that has been around for about a decade but has only become mainstream in the last few years. It uses business logic and structured inputs from humans to automate business processes. Using RPA tools, a company can build software robots, or “bots,” to automate rules-based digital processes. RPA use cases range from something as simple as generating an automatic response to an email to as complex as deploying thousands of bots, each programmed to automate a specific task within a lengthy process spanning teams and systems.

RPA works great with structured data and is used for automating repeatable, simple tasks. In addition, some vendors offer intelligent automation, which might be able to process some forms of semi-structured data effectively using artificial intelligence (AI). But when it comes to unstructured data, RPA bots will likely reach their limits.
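A rules-based bot can be pictured as a function that applies fixed business logic to fields it expects to find. The sketch below is not any vendor's RPA product; the field names, thresholds, and queue names are hypothetical.

```python
# Fields the hypothetical bot needs before it can apply its rules.
REQUIRED_FIELDS = {"vendor", "amount", "currency"}


def route_invoice(record):
    """Apply fixed rules to a structured record and return a queue name."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        # The rules cannot fire without their inputs, so a person takes over.
        return "manual_review"
    if record["currency"] != "USD":
        return "fx_desk"
    if record["amount"] > 10_000:
        return "manager_approval"
    return "auto_pay"


structured_input = {"vendor": "Acme", "amount": 420.0, "currency": "USD"}
unstructured_input = {"text": "Please pay Acme four hundred twenty dollars."}

print(route_invoice(structured_input))
print(route_invoice(unstructured_input))
```

The same request expressed as a sentence carries all the information a person needs, yet the bot has nothing to match its rules against. That gap is where rules-based automation runs out.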

Optical character recognition (OCR)

Optical character recognition (OCR) is a technology developed to detect the characters (and thus words and sentences) of text found in printed books, photographs, or other types of documents. OCR converts the “images”—which we know as numbers and letters of the alphabet—into digital characters “readable” by computers.

Financial services firms have traditionally used OCR to transform the information printed on paper records such as passports, invoices, and bank statements into digital data for easy storage, search, and retrieval and for use in machine-based analysis and processing. A more developed iteration of OCR, intelligent character recognition (ICR), uses advanced technologies like machine learning and computer vision to recognize handwriting.

OCR is a good solution to automate the processing of semi-structured data because it identifies text within the documents—both numbers and letters—which allows it to extract all the necessary information, transform it into machine-readable form, and process it.

In financial services, OCR can extract data from checks to capture the account information, the handwritten dollar amount, and the signature. In mortgage processing, it can convert multiple printed forms into digitally accessible and editable ones. In insurance, OCR can automate claims processing by converting the many different documents into digital data. And in any invoice-capturing process, OCR can be used to transform data from printed invoices into digital assets that can be processed faster.
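Once OCR has turned a page into text, the fields still have to be located. The sketch below skips the OCR step itself and starts from text that an OCR engine might plausibly return for a simple invoice; the sample text, labels, and patterns are assumptions for illustration.

```python
import re
from decimal import Decimal

# Hypothetical OCR output for a one-page invoice.
OCR_TEXT = """\
INVOICE
Invoice No: INV-20431
Date: 2021-10-06
Total Due: $1,284.50
"""

# One pattern per field; these only work when the labels appear as expected.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total Due:\s*\$([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Pull labeled values out of OCR text; missing fields come back as None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    if fields["total"] is not None:
        # Decimal avoids floating-point rounding on currency amounts.
        fields["total"] = Decimal(fields["total"].replace(",", ""))
    return fields


fields = extract_fields(OCR_TEXT)
print(fields)
```

This works because the invoice is semi-structured: the labels are predictable even though the layout is not a database. A vendor that writes "Amount payable" instead of "Total Due" would need another pattern.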

Intelligent data workflow automation

An emerging class of solutions goes beyond RPA and OCR to do end-to-end collection, extraction, and normalization of relevant data from a broad range of sources and formats to feed it to either a human worker for validation or a system to be processed further. It’s more than just intelligent data processing or workflow automation; it’s both.
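The hand-off between extraction, normalization, and human validation can be sketched as follows. This is a simplified illustration, not a description of any specific product: the Extraction type, the confidence scores, and the 0.9 threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    source: str        # where the value was found, kept for traceability
    field: str
    raw_value: str
    confidence: float  # assumed score in [0, 1] from an upstream extractor


def normalize_amount(raw):
    """Turn a string such as '$1,284.50' into a plain number."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return round(float(cleaned), 2)


def route(extractions, threshold=0.9):
    """Normalize each value, then split results into automatic and review queues."""
    auto, review = [], []
    for item in extractions:
        record = {
            "source": item.source,
            "field": item.field,
            "value": normalize_amount(item.raw_value),
        }
        target = auto if item.confidence >= threshold else review
        target.append(record)
    return auto, review


sample = [
    Extraction("statement.pdf", "nav", "$1,284.50", 0.97),
    Extraction("notice_email.eml", "call_amount", " 25,000 ", 0.62),
]
auto_queue, review_queue = route(sample)
print(auto_queue)
print(review_queue)
```

Keeping the source alongside every value is what lets a reviewer trace a number back to the document it came from.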

This type of technology can be used as a stand-alone solution to effectively and accurately address the processing and management of any data. It can also be paired with RPA to tackle the “first mile” of automation by transforming the data so that processes can be fully automated end to end.

Some vendors even offer solutions designed specifically with the business user in mind. An intuitive interface lets users hit the ground running, and no-code tools enable them to build their own automations. These vendors empower users to fully own their processes and alleviate the pressure on IT.

We get your workflows flowing

Data volume will only continue to increase, and manual data processing will no longer be sustainable. If you’re ready to unlock your data’s full potential while empowering your employees to build new products, deliver better services, and improve processes using clean data, then give Alkymi a try.
