AI Powered Automatic Classification: The Challenges in Managing Data in Clinical Trials

AI powered systems are adept at reading 1000s of documents and automatically classifying them into the right categories.

Published in Product Coalition, Dec 12, 2022

Sorting through and organizing high volumes of unstructured documents can be time-consuming and painful. Organizations that receive documents through multiple channels (paper, email, electronic fax, FTP, etc.) need an efficient and convenient way to sort through all of their documents and data streams, identify the documents related to specific processes, and handle them accordingly.

Life science companies operate in a highly regulated, data- and document-intensive environment. They must continue to innovate while maintaining tight compliance with governmental regulations such as the FDA’s 21 CFR Part 11, and they must handle vast amounts of data and documents. Inefficient, paper-based processes can hamper both tasks.

80% of healthcare data is in an unstructured format, and most organizations have trouble extracting insights from it. Clinical trials in particular generate vast amounts of complex, unstructured data. Cleaning, organizing, and managing this data is a persistent challenge for clinical trial organizations. In addition, they must maintain a compliant record of the data for regulatory and reporting purposes.

Some clinical trial sites still use paper. Having the data presented in a standard structure will help speed new discoveries.

In this article we discuss some of the challenges of dealing with unstructured data in clinical trials and regulatory submissions, and how AI powered automatic classification can help solve them.

Challenges and Opportunities in Clinical Trial Data Processing

Clinical trials for a drug are typically conducted in many countries and each country may have many sites. The trial documents originating from these sites can be in many formats.

Many trial sites still keep paper-based documentation. These documents can arrive in emails or as nested email attachments, be shipped on paper, scanned, placed in a file share, uploaded to a portal, or faxed. Email is one of the most important ways these documents are sent back to the study partner.

The way these documents are sent leads to many challenges:

Misfiled documents
Missing documents
Duplicate documents
Documents with errors
Documents with missing / blank pages
Non-searchable documents (because they are paper documents or scans of paper documents)
Documents in obscure formats

FDA reviewers spend far too much valuable time simply reorganizing large amounts of data submitted in varying formats.

All of these create significant delays in the trial process. During COVID-19, the cost of trial delays was as much as $8 million per day, and almost 95% of trials were delayed by more than a month.

Categorizing Documents for Regulatory Submission

One of the most important parts of a clinical trial is submitting the trial documents in an organized format for FDA review. Regulatory submission involves transforming these documents into a common format, classifying them into the right categories, and extracting relevant information.

Clinical trial documents need to be placed into the right categories / subcategories, with specific metadata extracted, for regulatory submission to the FDA.

For example, one life sciences company generates 2 million documents of various types (including paper documents) per year. These documents need to be classified into 130 nested categories, and more than 40 entities need to be extracted from them to prepare for regulatory submission. Imagine being able to categorize 1000s of documents and extract data from them.

More than 57% of trial documents are misfiled or missing, a problem associated with manual processes for sharing and classifying documents.

Challenges With Organizing Documents for Regulatory Submission

To be able to process these documents correctly and put them in the right bucket for regulatory submission, companies traditionally resort to manual classification. This could be done in-house or can be outsourced depending on the size of the organization. Despite taking a great deal of time, manual classification is error-prone, costly, and inefficient.

**A document could take 20 mins or more to read and classify manually**

Manual document classification suffers from two major constraints:

Excessive time taken: The time required to classify and process documents can be significant.
Inconsistent / subjective: Differences and biases in approach can affect document classification, leading to subjective and incorrect results.

It takes about 15–30% of a person's time to search for and locate a document manually, and another 50% to search for the information within it. For example, on average a document could take 20 mins or more to read and classify. With a large number of documents, it takes a significant amount of time to read, process, and classify them into the right categories.

Companies are looking for ways to reduce the time needed to process these documents and to minimize the potential for human error. This is where intelligent automatic classification / extraction can be of tremendous help.

Intelligent Document Classification

AI powered document classification enables the user to upload different kinds of documents in bulk and classify them into their respective types / categories.

Document classification can be a huge bottleneck: a typical trial across multiple sites receives a large number of documents of many types to process. Being able to ingest 1000s of documents and automate the process of reading and classifying them is a significant benefit to clinical research associates.

AI technologies can help identify and classify the type of document and extract key information to assist in the insight generation process.

For example, let’s say the clinical research associate receives several documents by email. Those documents could be Form 1572, Site Staff Qualification supporting information, an Investigator Curriculum Vitae, and so on. These clinical documents need to be read and classified into their respective categories (like Site Management > Site Setup > Form 1572), placed in the processing queue, and assigned to the right team member for review and completion. In addition, the system needs to be smart enough to flag any documents with erroneous or missing pages. If 1000 documents are sent, all 1000 are read and sorted into the right categories.
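The read, classify, and flag flow described above can be sketched in a few lines. This is only an illustration: simple keyword overlap stands in for the trained AI model, and the `KEYWORDS` table, the `classify` function, and the category path for the curriculum vitae are hypothetical names and values, not part of any real product.

```python
import re

# Hypothetical keyword hints per category path.
# In a real system a trained model would replace this lookup table.
KEYWORDS = {
    "Site Management / Site Setup / Form 1572": {"1572", "statement", "investigator"},
    "Site Management / Site Setup / Investigator Curriculum Vitae": {"curriculum", "vitae", "education"},
}


def classify(pages):
    """Pick the category whose keywords overlap most with the document text.

    `pages` is a list of strings, one per page (for example OCR output).
    """
    words = set(re.findall(r"[a-z0-9]+", " ".join(pages).lower()))
    scores = {label: len(words & hints) for label, hints in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Page numbers (1-based) of pages with no text at all.
    blank_pages = [i + 1 for i, page in enumerate(pages) if not page.strip()]
    return {
        "category": best if scores[best] > 0 else "Unclassified",
        "blank_pages": blank_pages,
        # Send to a person when nothing matched or a page looks blank.
        "needs_review": scores[best] == 0 or bool(blank_pages),
    }


result = classify(["Form FDA 1572 Statement of Investigator", ""])
print(result["category"], result["blank_pages"], result["needs_review"])
```

The returned dictionary carries both the proposed category and the reasons a reviewer should look at the document, which is what lets the queue route clean documents straight through and hold back the rest.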

The example above was simple, but in reality these classifications are nested. The highest-level category is a content zone; each zone can have many sections, and each section can have many artifacts. For example, a document can belong to the zone Site Management, the section Site Setup, and the artifact / folder Form FDA 1572. This detailed multi-level categorization is very important for regulatory submission. So we are looking at a nested categorization of over 130 categories, which is a complex problem for a human but not as complex for an AI system.
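One way to picture the zone / section / artifact nesting is as a small tree. The sketch below is a minimal illustration: only the Site Management / Site Setup / Form FDA 1572 path comes from this article, and the other entries, along with the names `TAXONOMY`, `Category`, `is_valid`, and `all_categories`, are made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical slice of a zone > section > artifact taxonomy.
# Only the "Form FDA 1572" path is taken from the article; the rest is illustrative.
TAXONOMY = {
    "Site Management": {
        "Site Setup": ["Form FDA 1572", "Investigator Curriculum Vitae"],
    },
    "Regulatory": {
        "Trial Approval": ["Regulatory Submission"],
    },
}


@dataclass(frozen=True)
class Category:
    zone: str
    section: str
    artifact: str

    def path(self) -> str:
        return f"{self.zone} / {self.section} / {self.artifact}"


def is_valid(category: Category) -> bool:
    """Return True only if every level of the path exists in TAXONOMY."""
    sections = TAXONOMY.get(category.zone, {})
    return category.artifact in sections.get(category.section, [])


def all_categories() -> list[Category]:
    """Flatten the nested taxonomy into one label per leaf artifact."""
    return [
        Category(zone, section, artifact)
        for zone, sections in TAXONOMY.items()
        for section, artifacts in sections.items()
        for artifact in artifacts
    ]


form_1572 = Category("Site Management", "Site Setup", "Form FDA 1572")
print(form_1572.path(), is_valid(form_1572))
```

Flattening the tree with `all_categories()` gives the full label set a classifier has to choose from, while `is_valid` guards against a prediction that mixes levels from different branches.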

AI powered systems can read through 1000s of documents and classify them into the right bucket. This lets the user review a document in 2 mins instead of the 20 mins it used to take, which is a great time savings.
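To get a feel for what the difference between 20 mins and 2 mins means at volume, here is a back-of-the-envelope calculation. It assumes the 2 million documents per year mentioned earlier and treats both times as per-document averages; the function name `review_hours` is just a label for this sketch.

```python
def review_hours(documents, minutes_per_document):
    """Total review effort in hours for a batch of documents."""
    return documents * minutes_per_document / 60


DOCUMENTS_PER_YEAR = 2_000_000  # volume quoted earlier in the article

manual = review_hours(DOCUMENTS_PER_YEAR, 20)    # fully manual reading and classification
assisted = review_hours(DOCUMENTS_PER_YEAR, 2)   # reviewing an AI-proposed classification

print(f"manual:   {manual:,.0f} hours")
print(f"assisted: {assisted:,.0f} hours")
print(f"saved:    {manual - assisted:,.0f} hours")
```

Under these assumptions the manual effort is roughly 667,000 hours per year against roughly 67,000 hours with assisted review, a difference of about 600,000 hours.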

Intelligent Document Extraction

In addition to classification, information often needs to be extracted from the document. For example, the system needs to be able to extract the investigator name, document title, document type, signature presence, signature date, expiration date, license date, etc.
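As a rough sketch of what field extraction produces, the example below pulls a few of these fields out of plain text with regular expressions. The label strings, the ISO date format, and the names `PATTERNS` and `extract_fields` are assumptions made for the illustration; real forms vary widely, and an AI powered extractor would learn the layout rather than rely on fixed patterns.

```python
import re
from datetime import datetime

# Hypothetical field labels; real documents will not be this uniform.
PATTERNS = {
    "investigator_name": re.compile(r"Name of Investigator:\s*(.+)"),
    "signature_date": re.compile(r"Signature Date:\s*(\d{4}-\d{2}-\d{2})"),
    "expiration_date": re.compile(r"Expiration Date:\s*(\d{4}-\d{2}-\d{2})"),
}


def extract_fields(text):
    """Return a dict of extracted fields; a missing field is None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        value = match.group(1).strip() if match else None
        if value and name.endswith("_date"):
            # Store dates as date objects so they can be compared later.
            value = datetime.strptime(value, "%Y-%m-%d").date()
        fields[name] = value
    # Crude proxy (an assumption): a dated signature line counts as "signed".
    fields["signature_present"] = fields["signature_date"] is not None
    return fields


sample = "Name of Investigator: Jane Doe\nSignature Date: 2022-11-30\n"
print(extract_fields(sample))
```

Returning `None` for fields that were not found, rather than skipping them, makes it easy to store every document in the same database shape and to query for documents with missing values.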

Extracting key information from clinical documents and storing it in a database for further downstream processing and intelligent searches.

One of the important benefits of an AI powered classification system is its ability to learn from its mistakes and get better over time.

Example of the Form 1572 where information about the principal investigator has to be extracted

Before any of this can happen it’s important to standardize all the trial documents into one format.

Standardizing the Variety of Documents Received

There are two important steps to be done before the automatic classification:

First, get access to the documents. They could be in a folder, a portal, a paper stack, fax reports, a document management system, images in an EDC or EHR system, or any other location. The first step is to be able to access these trial documents automatically from any source on a timely basis.

Standardize all documents to PDF format. Report any errors or missing pages.

Second, before automatic classification, every document must be made fully accessible and searchable, whatever format it was in originally. Since there is a variety of documents and some of them are on paper, it's important to transform all document types into PDF format for subsequent processing.

Paper documents, images, and fax reports need OCR to make them searchable.

This makes it possible to search content that used to exist only on paper. During this process, flag documents that are empty, expired, or contain errors for follow-up.
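The flagging step can be sketched as a simple check that runs once a document has been converted and its text is available. The `StandardizedDocument` class, the flag names, and the `follow_up_flags` function are hypothetical; the sketch also assumes that OCR or text extraction has already produced one string per page.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class StandardizedDocument:
    name: str
    page_texts: list                      # one string per page, from OCR or text extraction
    expiration_date: Optional[date] = None


def follow_up_flags(doc, today):
    """Return a list of reasons this document needs human follow-up."""
    flags = []
    if not doc.page_texts or all(not text.strip() for text in doc.page_texts):
        flags.append("empty")             # no pages, or no text on any page
    elif any(not text.strip() for text in doc.page_texts):
        flags.append("blank_page")        # at least one page has no text
    if doc.expiration_date is not None and doc.expiration_date < today:
        flags.append("expired")
    return flags


license_doc = StandardizedDocument("license.pdf", ["Medical license", " "], date(2022, 1, 1))
print(follow_up_flags(license_doc, today=date(2022, 12, 12)))
```

Passing `today` in as a parameter, instead of reading the clock inside the function, keeps the check repeatable when the same batch is reprocessed later.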

Summary

The type, volume, and delivery channels of clinical trial documents sent to the study partner create a set of challenges for categorizing them properly and extracting data from them in time for regulatory submission.

An intelligent automatic document classification and extraction system has 3 primary goals:

Automatic classification and routing: Automatically read the source documents, determine what type of document each one is, and route it to the right category or folder.
Language identification: Since trials are done in many countries, it's important for the system to be able to identify the language of a document.
Metadata extraction: Automatically extract relevant metadata about the document to aid in submission and support intelligent searches.
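For the language identification goal, a toy heuristic shows the idea: count how many very common words of each language appear in the text. This is a deliberately simple stand-in, not how a production system would do it; the tiny `STOPWORDS` lists and the `identify_language` function are illustrative assumptions, and a real system would use a trained language identification model.

```python
import re
from collections import Counter

# Tiny, hand-picked lists of very common words per language (illustrative only).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "es": {"el", "la", "de", "y", "que", "en", "los"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "von"},
    "fr": {"le", "la", "et", "les", "des", "est", "du"},
}


def identify_language(text):
    """Return the language code with the most stopword hits, or "unknown"."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    scores = {
        lang: sum(counts[word] for word in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(identify_language("The investigator signed the form and the date is missing"))
```

Knowing the language early lets the system route a document to the right OCR settings and the right reviewers before classification and extraction run.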

The key is to be able to do this at scale. These AI powered systems / models are adept at reading 1000s of documents and automatically classifying them into the right categories, thus helping to eliminate manual effort and shorten the time to regulatory submission.