Introduction
Proteins are molecular machines: inside each living cell, they are the primary movers, cutters, and builders of the stuff of life. Therefore, if the biologist’s job is to understand how life functions, they would do well to pay attention to proteins. Indeed, like watching machines on a construction site, knowing which proteins are active can tell us a lot about what a cell is up to and, importantly, give us insight into what is going wrong in malfunctioning cells, such as those found in cancer and other diseases. One of the primary ways biologists do this is by “listening in” on messenger RNA, the little transcribed snippets of DNA that are on their way to create a protein. Over the last 20 years, RNA sequencing (“RNA-Seq”) has evolved into a workhorse technique, allowing us to take a snapshot of all the messenger RNA in a sample, and so of the proteins its cells are preparing to make.
However, RNA-Seq has limitations. In particular, traditional methods like bulk RNA-Seq measure gene transcription (our proxy for protein production) in millions of cells at once. If we have a tumor biopsy, for example, bulk RNA-Seq cannot distinguish the proteins being created by the tumor from those being created by the surrounding cells to fight it. Because the complex sample is first turned into a slurry, a kind of molecular milkshake, bulk RNA-Seq treats the sample as a single entity, even when it’s composed of many different cell types. In tissues such as tumors, where multiple cell types (cancer cells, immune cells, stromal cells) interact, bulk RNA-Seq provides an average gene expression profile across all those cells, making it difficult to pinpoint which transcripts are coming from which specific cell type.
Imagine being at a crowded party and being unable to pick out individual voices—confusion abounds. Now, imagine that you have a special tool that lets you separate out the voices by identifying which conversations are coming from which people. A number of modern methods have arisen to listen in on gene transcription in individual cells. Many, like single-cell RNA-Seq, are powerful but costly laboratory techniques. But wouldn’t it be great if some fancy number-crunching could let us use bulk RNA-Seq to infer what individual cells are up to?
Enter bulk RNA-Seq deconvolution, a computational technique for breaking down bulk RNA-Seq data into individual “voices,” or cell types, helping us figure out which genes are being expressed by specific cells, even though they were all mixed together in the first place. Deconvolution is particularly important in cancer research, where understanding the composition of the tumor and the cells around it—including immune cells like T cells and macrophages—can be crucial for determining a patient’s prognosis and response to therapies.
Different deconvolution methods use different algorithms and generally fall into two main categories: supervised, which use some prior knowledge about which genes are likely expressed by which cells, and unsupervised, which do not. Supervised methods are either reference-based or enrichment-based. Reference-based methods use known gene expression from pure cell populations to estimate the proportions of each cell type in a bulk sample. Enrichment-based methods, on the other hand, do not estimate proportions; they assign each cell type a score based on how strongly its marker genes are expressed in the sample. These two types of supervised methods often have a tradeoff, with enrichment-based methods doing better at differentiating broad cell categories but performing poorly at differentiating more fine-grained cell types.
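To make the reference-based idea concrete, here is a minimal sketch in Python. It treats a bulk sample as a weighted sum of pure cell-type signatures and recovers the weights with non-negative least squares. The gene values, cell types, and the `estimate_proportions` helper are all made up for illustration; real tools such as CIBERSORT and EPIC use their own, more sophisticated models, so read this as the general principle rather than any particular tool's algorithm.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical reference signatures (illustrative numbers, not real data).
# Rows are genes, columns are cell types.
cell_types = ["T cell", "B cell", "fibroblast"]
signatures = np.array([
    [10.0,  1.0,  0.5],   # gene A: mostly expressed by T cells
    [ 0.5, 12.0,  1.0],   # gene B: mostly expressed by B cells
    [ 1.0,  0.5, 15.0],   # gene C: mostly expressed by fibroblasts
    [ 5.0,  5.0,  5.0],   # gene D: expressed equally everywhere
])


def estimate_proportions(signatures, bulk):
    """Solve bulk ~ signatures @ weights with weights >= 0, then scale to sum to 1."""
    weights, _residual_norm = nnls(signatures, bulk)
    total = weights.sum()
    if total == 0:
        raise ValueError("No cell type explains the bulk profile.")
    return weights / total


# Simulate a noise-free bulk sample with known composition.
true_proportions = np.array([0.6, 0.3, 0.1])
bulk = signatures @ true_proportions

estimated = estimate_proportions(signatures, bulk)
for name, value in zip(cell_types, estimated):
    print(f"{name}: {value:.2f}")
```

Because the simulated sample has no noise, the estimate matches the known composition almost exactly. Real bulk data is noisy and the signatures are imperfect, which is part of why fine-grained cell types are so much harder to separate.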
A recent deconvolution “bakeoff” paper by White et al. compares many popular deconvolution methods, such as CIBERSORT and EPIC, along with dozens of community-supplied methods. The paper benchmarks these bulk RNA-Seq deconvolution algorithms systematically, using both in vitro (lab-generated) and in silico (computer-simulated) admixtures of cancer, immune, and stromal cells.
A key innovation of the study was the creation of these controlled mixtures, where the exact proportions of different cell types were known. This “ground truth” allowed the authors to evaluate how well each deconvolution method could predict the cellular composition of these mixtures, providing an objective comparison of their performance. The study focused on two levels: predicting coarse-grained cell types (e.g., major populations like T cells, B cells, and fibroblasts) and fine-grained sub-populations (e.g., different types of T cells like memory or regulatory T cells).
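As a rough illustration of how ground truth makes this kind of comparison possible, the sketch below scores one method's predictions against known proportions using the Pearson correlation for each cell type. The numbers are invented, and the choice of Pearson correlation is an assumption made for this example, not a claim about the exact metric the authors used.

```python
import numpy as np

# Hypothetical data: rows are admixtures, columns are cell types
# (T cell, B cell, fibroblast). Values are illustrative only.
truth = np.array([
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.40, 0.40, 0.20],
])
predicted = np.array([
    [0.55, 0.33, 0.12],
    [0.25, 0.45, 0.30],
    [0.15, 0.20, 0.65],
    [0.38, 0.42, 0.20],
])


def per_cell_type_correlation(truth, predicted):
    """Pearson correlation between known and predicted proportions, one value per cell type."""
    n_cell_types = truth.shape[1]
    return np.array([
        np.corrcoef(truth[:, j], predicted[:, j])[0, 1]
        for j in range(n_cell_types)
    ])


scores = per_cell_type_correlation(truth, predicted)
print(scores)
```

Scoring each cell type separately, rather than pooling everything into one number, is what lets a benchmark say that a method handles broad populations well but struggles with a particular subtype.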
The results were fascinating. Many of the deconvolution methods, including popular tools like CIBERSORTx and MCP-counter, performed well in predicting broad cell populations. However, predicting finer cell subtypes, such as memory or naïve CD8+ T cells, proved more challenging. Interestingly, the study demonstrated that deep learning-based approaches, such as Aginome-XMU, showed great promise in improving predictions for these more nuanced cell types.
The study also highlighted a few persistent challenges in the field. For example, deconvolution methods struggled with distinguishing between closely related cell types, such as different subsets of T cells. Sensitivity and specificity varied widely across methods, especially when it came to predicting the presence of rare immune cells like macrophages. Despite these challenges, the authors suggest that ensemble methods, which combine the predictions of several deconvolution tools, can offer improved accuracy across cell types.
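The ensemble idea can be sketched in a few lines. The example below averages the proportions predicted by three hypothetical tools for a single sample and rescales the result to sum to one. The method names and values are placeholders, and simple averaging is only one possible way to combine tools; it also assumes every tool reports proportions on the same scale, which is not true of enrichment-based scores.

```python
import numpy as np

# Hypothetical predictions for one sample from three tools
# (T cell, B cell, fibroblast). Names and values are placeholders.
predictions = {
    "method_a": np.array([0.50, 0.30, 0.20]),
    "method_b": np.array([0.60, 0.25, 0.15]),
    "method_c": np.array([0.55, 0.35, 0.10]),
}


def ensemble_mean(predictions):
    """Average per-cell-type predictions across tools and rescale to sum to 1."""
    stacked = np.vstack(list(predictions.values()))
    combined = stacked.mean(axis=0)
    return combined / combined.sum()


combined = ensemble_mean(predictions)
print(combined)
```

The intuition is that different tools tend to make different mistakes, so averaging can cancel some of them out.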
Conclusion: Moving Toward More Robust Deconvolution Methods
This new study underscores the importance of computational deconvolution in modern cancer research and highlights both the progress and limitations in the field. While bulk RNA-Seq deconvolution remains a crucial tool for dissecting tumor microenvironments, the study’s findings suggest there’s room for innovation, especially in improving the detection of rare and closely related cell types. The success of deep learning approaches in the DREAM Challenge behind this benchmark points to exciting new directions for computational biology. As these methods evolve, we may soon see even more accurate tools for understanding the intricate cellular composition of tumors, leading to better diagnostics and therapies for cancer patients.