Develop pipeline that utilizes phased variants to detect circulating tumor DNA in cancer patients

Company Information

Detecting circulating tumor DNA (ctDNA) with high sensitivity and specificity is crucial for early cancer diagnosis and monitoring. A client had started a proof-of-concept (POC) pipeline in Python that used phased variants (PVs) for ctDNA detection, but needed expertise to evaluate and enhance it for practical application. ProCogia was brought on board to refine and expand the POC pipeline, drawing on its expertise in data science, bioinformatics, and data consultancy to build a robust, feature-rich PV pipeline.

The Challenge

The existing POC pipeline demonstrated potential but required significant enhancements to meet clinical needs. Key challenges included improving the pipeline’s sensitivity and specificity, removing reliance on external software for PV identification, and incorporating new features to reduce false positives and accurately estimate tumor fractions. Additionally, the pipeline needed to be optimized for processing efficiency and equipped with tools for rigorous evaluation using real patient samples.

ProCogia’s Approach

Evaluation of POC

The client had built a proof-of-concept (POC) pipeline in Python in early 2020, but development was paused until late 2021. ProCogia was first tasked with evaluating the POC pipeline using serial dilution samples and real patient samples. Our analysis showed that the POC pipeline detected ctDNA with higher sensitivity and specificity than an existing pipeline that relied solely on single-nucleotide variants (SNVs).

Develop PV Pipeline

ProCogia was then tasked with developing a new PV pipeline in Python that would incorporate many new features and build on a recent publication that showcased the utility of PVs for ctDNA detection.

Develop New Module

The POC pipeline relied on VCF files output by separate software to identify patient-specific PVs. To remove this dependency, ProCogia developed a new module that identifies PVs directly from the read alignment (SAM/BAM) file using Python and pysam.
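
As a rough illustration of the underlying idea, the sketch below groups SNVs that are observed together on the same read into candidate PVs. It is not the delivered module: reads are represented as plain Python tuples rather than pysam alignment records, and the names find_phased_variants and min_reads are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical SNV representation: (chromosome, position, ref base, alt base).
# In a real pipeline these would be derived from aligned reads in a SAM/BAM file.


def find_phased_variants(reads_snvs, min_reads=2):
    """Return SNV pairs seen together on at least `min_reads` reads.

    `reads_snvs` is an iterable holding one collection of SNVs per read.
    """
    pair_counts = Counter()
    for snvs in reads_snvs:
        # Sort so that each pair has a single canonical ordering.
        for pair in combinations(sorted(set(snvs)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_reads}


reads = [
    [("chr1", 100, "A", "T"), ("chr1", 112, "G", "C")],
    [("chr1", 112, "G", "C"), ("chr1", 100, "A", "T")],
    [("chr1", 100, "A", "T")],  # a single SNV cannot form a pair
]
phased = find_phased_variants(reads)
```

Counting supporting reads per pair keeps the sketch simple; a production module would also need to handle read pairs, base qualities, and PVs made of more than two SNVs.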

Additional Features

Filter SNVs by positional base quality to reduce false positive calls.

Discard potential germline SNVs by comparing allelic frequencies to matched-normal samples.

Discard PVs that do not overlap target regions (defined in BED format), using a binary search tree algorithm.

Discard artifactual PVs that are present in matched-normal samples.

Identify and track the number of unique DNA molecules supporting PVs to estimate tumor fraction.
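
The target-region check can be sketched as follows. This is a minimal example under stated assumptions, not the pipeline's code: it uses binary search over a sorted list (Python's bisect module) in place of a tree structure, it assumes the regions for one chromosome have already been merged so that they do not overlap each other, and the function names build_index and overlaps_target are hypothetical.

```python
from bisect import bisect_right


def build_index(regions):
    """Sort (start, end) regions for one chromosome.

    BED intervals are 0-based and half-open. The regions are assumed
    (for this sketch) not to overlap one another.
    """
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    return starts, regions


def overlaps_target(index, pv_start, pv_end):
    """Return True if the half-open span [pv_start, pv_end) overlaps a region."""
    starts, regions = index
    # Find the last region that starts before the PV span ends.
    i = bisect_right(starts, pv_end - 1) - 1
    # That region overlaps only if it also ends after the PV span starts.
    return i >= 0 and regions[i][1] > pv_start


targets = build_index([(300, 400), (100, 200)])
keep = overlaps_target(targets, 150, 160)
discard = overlaps_target(targets, 200, 300)
```

Each lookup takes logarithmic time in the number of regions, which matters when many PVs are checked against a large panel. One index per chromosome would be needed in practice.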

Ongoing implementation:

A module to fine-tune the pipeline’s parameters on a set of training samples.

Python class objects to store SNV and PV data, which reduced the time for identifying PVs by an order of magnitude.

A module to perform Monte Carlo sampling of the data to evaluate background noise and estimate p-values for each PV.
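
To make the Monte Carlo step concrete, here is a minimal sketch of an empirical p-value. The background model is an assumption made for illustration only (each molecule independently shows the PV by chance at a fixed rate); the source does not describe the model the pipeline uses, and the function and parameter names are hypothetical.

```python
import random


def monte_carlo_p_value(observed, n_molecules, background_rate,
                        n_sim=10_000, seed=0):
    """Estimate how often background noise alone yields at least
    `observed` supporting molecules out of `n_molecules`."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    hits = 0
    for _ in range(n_sim):
        # Simulate one background-only sample under the assumed model.
        simulated = sum(rng.random() < background_rate
                        for _ in range(n_molecules))
        if simulated >= observed:
            hits += 1
    # The add-one correction keeps the estimate from being exactly zero.
    return (hits + 1) / (n_sim + 1)


p_value = monte_carlo_p_value(observed=3, n_molecules=500,
                              background_rate=0.001, n_sim=2_000)
```

A small p-value suggests that the observed support for a PV is unlikely to come from background noise alone. The resolution of the estimate is limited by the number of simulations.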

The Results

We delivered a stable, fully unit-tested and documented PV pipeline with new features that improve sensitivity and specificity in ctDNA detection.

Lab samples were used to confirm that real-world data produced results similar to those of the PV pipeline’s Monte Carlo simulations.

The PV pipeline was then optimized for processing ctDNA samples, achieving an 85% reduction in run time.

ProCogia continues to upskill the client’s internal team and apply best practices for developing and unit testing Python code.

Services Used

Data Consultancy

We provide Data Consultancy to help organizations optimize their investment in people, processes, and technology.

Data Science

Using a blend of mathematics, software tools, business intelligence, and algorithms, we can draw insights and patterns from your raw data, allowing you to make intelligent data-driven decisions.

Bioinformatics

We deliver scientific results that drive clinical and translational research decisions. Our Bioinformatics team has extensive experience designing, optimizing, executing and analyzing pre-clinical and clinical research projects using next-generation sequencing technologies.
