
Let’s be honest: what comes to mind when you think about clinical research? Cleaning messy patient data, with blood work reports, lab results, and blood pressure readings dumped into a spreadsheet, right? But is that what live settings actually look like? Not at all.
In reality, unstructured data dominates: clinical notes, audio recordings, and radiology images are unavoidable, and they carry some of the most credible insights. The biggest challenge lies in processing, structuring, and extracting meaning from these heterogeneous sources.
Think of unstructured data as a doctor’s handwritten notes, digital case reports, scanned reports, free-text clinical records, and so on. From a clinician’s perspective, interpreting tonnes of this messy collection of valuable, sensitive patient data on top of an already hectic routine is a daunting task. Let’s dive deep into the nuances of clinical research data processing.
Clinicians dealing every day with these different categories of unstructured data face several pain points. Interpreting all the multifaceted records at once is exhausting and inefficient, and the repetitive routine of juggling between documents erodes their productivity.
Consider a scenario: for a clinical trial protocol, a patient eligibility assessment must be carried out to recruit patients matching the trial requirements. Setting aside the technical details of the matching process itself, the frontline challenge lies in data interpretation and evaluation. Shortlisting the suitable patients for the right clinical trial requires domain knowledge and reasoning skills.
Say the patient information includes flat files (CSV documents), general practitioner notes (PDF documents), imaging data, audio notes/voice recordings, and so on.

The common categories of information present in the general practitioner’s notes
Let’s talk about ways to handle the unstructured data in this scenario. General practitioner notes contain the most unstructured clinical information and are also the most closely aligned with the trial eligibility assessment task.
Traditionally, this process is handled by the clinical experts in the following ways:
- Manual review
- Screening list/checklist (not often in a specific format, still manual evaluation required)
- Keyword-based information mapping
But all of these are inaccurate, time-consuming, and inefficient. As the number of patient records increases, the complexity grows, requiring more experts to review and evaluate patients against clinical trials. Recent breakthroughs in generative AI have reformed these conventional approaches with impressive reasoning capabilities.
Large language models (LLMs) inherently possess strong capabilities in text generation, summarisation, classification, parsing and information extraction, semantic search, and more. But LLMs have certain limitations.
LLMs are trained on a very large, versatile corpus that forms their knowledge base, but that knowledge is static and cannot be updated in real time: anything more recent than the training data is simply unknown to the model. Next is the hallucination issue: LLMs relying solely on their pre-trained knowledge sometimes generate fabricated or inaccurate information.
Retrieval Augmented Generation (RAG) mitigates these issues by retrieving information from external sources. It improves accuracy through verified, up-to-date information, reducing hallucinations by grounding the output in precise facts. Because the content is retrieved from verified sources, it also increases transparency and trust in the results. It is inexpensive and time-efficient as well, since it leverages existing LLMs and augments them with external knowledge sources, often removing the need for extensive retraining.
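The core idea can be illustrated with a minimal sketch: retrieve the most relevant documents for a query, then build a grounded prompt for the LLM. The word-overlap scoring and prompt template below are illustrative stand-ins, not an actual production implementation.

```python
# Minimal sketch of the retrieve-then-generate idea behind RAG.
# Word-overlap scoring stands in for a real embedding-based search,
# and the prompt template is an illustrative assumption.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (a cheap stand-in
    for semantic search over embeddings)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_augmented_prompt(query: str, documents: list[str]) -> str:
    """Ground the LLM's answer in retrieved, verifiable context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

notes = [
    "Patient diagnosed with breast cancer in 2024.",
    "Mild asthma in childhood, resolved.",
    "Blood pressure 118/75 mmHg recorded 25-Nov-2024.",
]
prompt = build_augmented_prompt("What cancer was the patient diagnosed with?", notes)
```

Because the answer is constrained to the retrieved context, the model cannot easily drift into fabricated facts.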
How does SrotasIQ efficiently handle the problem of unstructured data?
The SrotasIQ platform is equipped with advanced AI algorithms, intelligently designed to address the ever-challenging problem of clinical trial patient eligibility assessment. It efficiently handles unstructured information through the RAG approach, which is one part of the whole enterprise solution.
Let’s take the same example: processing the general practitioner’s notes. Before we jump into RAG, the first barrier is the format of the notes: a PDF document or scanned notes (which require OCR processing). Then comes the type of information present inside the GP document, primarily:
- Free text
- Images
- Forms
- Tables, and more
Here’s the catch. Form information extraction is straightforward, and images are not directly needed for the eligibility assessment. Free text and tables hold the most crucial information.
Example:
- Past Medical History: Breast cancer (as above); mild asthma (childhood, resolved)
- Current Medication: Ondansetron tablets 8 mg, as required for nausea
- Recent Medication: Carboplatin/Paclitaxel chemotherapy (last cycle 15-Nov-2024); granulocyte colony-stimulating factor, as per protocol
- Blood Pressure: 118/75 mmHg (last recorded 25-Nov-2024)
- Body Measurements: Height 162 cm; Weight 58 kg; BMI 22.1 (weight loss of 8 kg since diagnosis)
- Recent Laboratory Results (20-Nov-2024): Haemoglobin 11.2 g/dL; Platelets 185,000/μL; Neutrophils 3.2 × 10⁹/L; Creatinine 0.9 mg/dL; ALT 45 U/L; Bilirubin 1.1 mg/dL
The first step is to extract all the readable information from the general practitioner’s notes. For a scanned document, OCR processing is an additional required step.
Preprocessing is then mandatory: it removes page numbers and junk such as repetitive headers and similar irrelevant content.
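The cleaning step might be sketched as follows. The page-number pattern and the header string here are illustrative assumptions, not the platform’s actual cleaning rules.

```python
import re

# A sketch of the preprocessing step: dropping page-number lines and
# repeated per-page headers before chunking. Patterns are illustrative.

def preprocess(pages: list[str], header: str) -> str:
    cleaned_lines = []
    for page in pages:
        for line in page.splitlines():
            stripped = line.strip()
            if re.fullmatch(r"Page \d+( of \d+)?", stripped):
                continue  # page-number artefact
            if stripped == header:
                continue  # repetitive per-page header
            if stripped:
                cleaned_lines.append(stripped)
    return "\n".join(cleaned_lines)

pages = [
    "Riverside GP Practice\nPast Medical History: Breast cancer\nPage 1 of 2",
    "Riverside GP Practice\nBlood Pressure: 118/75 mmHg\nPage 2 of 2",
]
text = preprocess(pages, header="Riverside GP Practice")
```

In practice the header is usually detected automatically (e.g. lines repeated on every page) rather than passed in by hand.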
With the above processed information, follow the core steps of RAG with re-ranking:
- Chunking the text information
- Generating embeddings for the chunks & storage
- Semantic search & retrieval
- Re-ranking
- Generation
Before we step into RAG, the cluttered information is organised with an LLM, which structures it into sections. From the above example, the details can be grouped into medical concept tiers. This preserves the relevancy among the details while still distinguishing each section appropriately.
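The target shape of that structuring step might look like the sketch below. In the real pipeline an LLM does the grouping; here simple section-header matching mimics it, and the tier names and prefixes are illustrative assumptions.

```python
# Illustrative stand-in for the LLM-driven structuring step: grouping
# extracted statements into medical concept tiers. Tier names and the
# header prefixes are assumptions; a real pipeline would use an LLM.

TIERS = {
    "history": ["past medical history"],
    "medication": ["current medication", "recent medication"],
    "vitals": ["blood pressure", "body measurements"],
    "labs": ["recent laboratory results"],
}

def group_into_tiers(lines: list[str]) -> dict[str, list[str]]:
    grouped: dict[str, list[str]] = {tier: [] for tier in TIERS}
    for line in lines:
        lowered = line.lower()
        for tier, prefixes in TIERS.items():
            if any(lowered.startswith(p) for p in prefixes):
                grouped[tier].append(line)
    return grouped

lines = [
    "Past Medical History: Breast cancer",
    "Current Medication: Ondansetron 8mg",
    "Blood Pressure: 118/75 mmHg",
]
tiers = group_into_tiers(lines)
```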
Chunking the text data is the first and most crucial step in the RAG pipeline. The chunking strategy matters, as it decides the size of each chunk and is a key factor in how much context is preserved between chunks.
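A common strategy is a sliding window with overlap, so that context carries across chunk boundaries. The window and overlap sizes below are illustrative, not the values used in production.

```python
# A sketch of sliding-window chunking with overlap: the last words of one
# chunk reappear at the start of the next, preserving context across
# boundaries. Window/overlap sizes here are illustrative.

def chunk_text(words_per_chunk: int, overlap: int, text: str) -> list[str]:
    words = text.split()
    step = words_per_chunk - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break
    return chunks

text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(words_per_chunk=4, overlap=1, text=text)
```

Larger chunks keep more context per chunk but retrieve more irrelevant text; larger overlap reduces information loss at boundaries at the cost of storage.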
Just as important as chunking is the next step: generating embeddings for the chunked data. This step captures the semantic meaning of the data, which is essential for similarity search and retrieval. We use a high-performance, multifunctional embedding model to generate the embeddings.
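To show the mechanics without a trained model, the toy sketch below embeds text as a bag-of-words vector and compares vectors with cosine similarity. This is purely illustrative; it is not the embedding model the platform uses.

```python
import math
from collections import Counter

# A toy bag-of-words "embedding" plus cosine similarity, to illustrate
# why similar texts land close together in vector space. A real pipeline
# would use a trained neural embedding model instead.

def embed(text: str, vocab: list[str]) -> list[float]:
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

vocab = ["breast", "cancer", "asthma", "pressure"]
v1 = embed("breast cancer history", vocab)
v2 = embed("diagnosed with breast cancer", vocab)
v3 = embed("mild asthma resolved", vocab)
```

Semantically related chunks (v1, v2) score higher than unrelated ones (v1, v3), which is exactly the property the retrieval step relies on.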
Re-ranking adds robustness to the retrieved results by improving their relevancy to the query: it re-evaluates and reorders the initial candidates returned by the dense embedding search.
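The two-stage pattern can be sketched as follows: a cheap first stage returns candidates, and a finer scorer reorders them. Both scoring functions here are simplistic stand-ins (a cross-encoder model would normally do the second stage).

```python
# A sketch of retrieve-then-rerank. Stage one retrieves candidates cheaply;
# stage two rescores them with a finer (here: phrase-aware) relevance
# function, standing in for a cross-encoder re-ranker.

def first_stage(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Cheap candidate retrieval (word overlap as a stand-in for
    dense embedding search)."""
    q = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q & set(c.lower().split())),
        reverse=True,
    )[:k]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Finer scorer: word overlap plus a bonus for an exact phrase match."""
    def score(c: str) -> float:
        base = len(set(query.lower().split()) & set(c.lower().split()))
        bonus = 5.0 if query.lower() in c.lower() else 0.0
        return base + bonus
    return sorted(candidates, key=score, reverse=True)

chunks = [
    "history of mild asthma",
    "blood pressure stable",
    "breast cancer diagnosed 2024",
]
ranked = rerank("breast cancer", first_stage("breast cancer", chunks))
```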
Tabular (structured) information is processed and stored as a single chunk. When data is retrieved for a query, the entire table content is returned, preserving both the structured nature of the information and its relevancy.
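Keeping the whole table as one chunk might look like the sketch below; the pipe-separated serialisation is an illustrative choice, not the platform’s actual format.

```python
# A sketch of serialising a whole table into a single chunk so that
# related values (test, value, unit) stay together through retrieval.
# The pipe-separated layout is an illustrative formatting choice.

def table_to_chunk(headers: list[str], rows: list[list[str]]) -> str:
    lines = [" | ".join(headers)]
    lines += [" | ".join(row) for row in rows]
    return "\n".join(lines)

lab_chunk = table_to_chunk(
    ["Test", "Value", "Unit"],
    [
        ["Haemoglobin", "11.2", "g/dL"],
        ["Platelets", "185,000", "/uL"],
        ["Creatinine", "0.9", "mg/dL"],
    ],
)
```

Because the table is never split across chunks, a query about any one lab value retrieves the full panel with its units intact.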

This illustration depicts, at a high level, the architectural map of how SrotasIQ services handle unstructured general practitioner notes.
At SrotasIQ, we build solutions that are scalable and adaptable to recent technologies. Each step, from chunking to retrieval, is crafted with the best technique to ensure the accuracy and relevancy of the retrieved data with respect to the search query, as well as content preservation. This makes life easier, simpler, and more efficient for every stakeholder running massive clinical trials.
SrotasIQ offers a versatile set of services to solve a diverse set of issues in the clinical trial patient evaluation process.
- TrialMatch - Delivers an accurate clinical trial retrieval service powered by a user-friendly search experience. Its elegant, lightweight interface offers intuitive filtering options that help users quickly navigate and discover the most relevant trials tailored to their needs. Unlike traditional keyword-based retrieval, TrialMatch uses a custom-built mechanism designed for precision and relevance.
- Site Feasibility - An agent-based data analysis platform that performs patient feasibility analysis based on the user’s natural language query. It analyses the user input deeply to understand intent and responds with accurate information.
The platform is GDPR compliant, ensuring the safety, security, and trust of patient data. By supporting FHIR, HL7, and other interoperable healthcare data formats, data exchange, integration, and interpretation become more efficient and standardised.
At Srotas Health, we’re committed to leveraging AI innovations that streamline clinical research. By implementing AI-driven site identification, sponsors and CROs can optimise trial performance, paving the way for more rapid, cost-effective, and patient-centric clinical studies.
Authored by: Karthik S., Dr Heeba Altaf, Ramji Balasubramanian, Vikram Parimi, Suman Bhaskaran
Thanks for reading. Check back soon for more updates!