R Vulnerability Due to Serialization Process

Author

Brian Carter

Data Consultancy
Open Source Services

Introduction to R Vulnerability

HiddenLayer researchers identified a vulnerability in the R language that has security experts worried due to the widespread use of the language in data science and analytics. This vulnerability affects all R versions prior to 4.4.0 and the risk is widespread among the tens of thousands of extension libraries available on CRAN, Bioconductor, and GitHub.

The vulnerability is based on the interaction between R's lazy evaluation of promises and its serialization process. The researchers demonstrated that a malicious actor could hide code in a serialized object that would execute when the object is referenced in the environment. This mechanism is fundamental to the way R works: R packages are compiled through this serialization process, and simply referencing an object from a package could execute malicious code.

Since this vulnerability was first reported, the R community has seemed largely unfazed by the news. Some argue that this is a feature of the R language, not a bug, and that developers should be aware of the risks associated with loading serialized objects from untrusted sources. Others have called for a more robust solution to the problem, such as a more secure serialization process or a more secure way to load packages. Still others have identified novel ways malicious actors can hide code in R packages.

This article provides an overview of these vulnerabilities, how they could be exploited, and the risks that security experts should keep in mind. At ProCogia, we are developing ways to evaluate R installations for these vulnerabilities to protect our clients from malicious code.

Serialization

The R Data Serialization (RDS) format is a binary format used to store serialized R objects. R developers are familiar with it because it is a common way to save and share R objects or environments. A single object is serialized and saved to disk with saveRDS(), and one or more named objects with save(); conversely, the files are deserialized and loaded into memory with readRDS() or load().

				
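# Create the output directory if it does not already exist
dir.create("data", showWarnings = FALSE)

# Serialize a vector to disk, then read it back
x <- 1:10
saveRDS(x, "data/x.rds")

y <- readRDS("data/x.rds")
print(y)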

				
[1]  1  2  3  4  5  6  7  8  9 10

HiddenLayer researchers argue that serialization is a prime target for attackers because serialized objects are routinely loaded by scripts and automated processes. The RDS format is also fundamental to the way R packages are compiled and loaded. When a developer compiles an R package, its objects are serialized into a .rdb file, and an accompanying .rdx file holds metadata (an index) describing where each object sits within the .rdb file. When a user loads the package, the index is used to locate objects in the .rdb file, and each object is deserialized much like an RDS file.
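To see these files on disk, the following sketch lists the lazy-load database of an installed package. It uses the stats package only because it ships with every R installation; any installed package name could be substituted.

```r
# Locate the R/ directory of an installed package (stats ships with R).
pkg_r_dir <- system.file("R", package = "stats")

# A lazy-load database normally appears as <pkg>.rdb plus <pkg>.rdx.
db_files <- list.files(pkg_r_dir, pattern = "\\.rd[bx]$")
print(db_files)
```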

[Figure: dplyr package]

The researchers confirmed that they were unable to execute code directly through the deserialization process in R. However, they suspected that, when combined with R's lazy evaluation strategy, they could execute a system command by creating a promise that could be serialized within these .rdb and .rdx files.

The promise of an exploit

R evaluates function arguments and certain bindings lazily; that is, it delays the evaluation of a value until it is first referenced, through a mechanism called a promise. A promise is an object that points to an expression that has not yet been evaluated. Developers can observe this process explicitly using the delayedAssign() function. The following example defines an object y in the global environment that points to an expression using an undefined object x. This code runs without error, even though x is not defined when the promise is created.

				
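# y becomes a promise; x + 1 is not evaluated yet, so x need not exist
delayedAssign("y", x + 1)
x <- 10
# The first reference to y forces the promise
print(y)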

				
[1] 11

The researchers at HiddenLayer were able to exploit this mechanism by burying a promise within the serialized package files. The promised expression would execute as soon as the package was referenced. These findings were concerning because there was no way to detect the malicious code until the package was loaded, and the attack required no explicit function calls from the user.
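The timing of that execution can be illustrated with a harmless stand-in. The sketch below is not the HiddenLayer exploit; it only shows that a promise's expression runs on first reference, and runs once. The names state and trap are hypothetical, and the "payload" merely increments a counter.

```r
# Benign stand-in for a payload: it only increments a counter.
state <- new.env()
state$runs <- 0

# Create a promise; the braces are NOT evaluated at this point.
delayedAssign("trap", {
  state$runs <- state$runs + 1
  "evaluated"
})

before <- state$runs        # promise exists but has not been forced
value <- trap               # first reference forces the promise
after_first <- state$runs
again <- trap               # later references reuse the cached value
after_second <- state$runs

print(c(before = before, after_first = after_first, after_second = after_second))
```

No function is called on trap at any point; merely reading it triggers the expression, which is why a promise smuggled into a serialized file needs no cooperation from the user.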

What next?

The R Core Team has released a patch for the vulnerability in R 4.4.0, but this has not fully addressed the underlying issue, because executable content in serialized files is a feature of the R language, not a bug that can simply be fixed. The problem with R's serialized file formats is that they directly represent the internal state of your R environment. When you run save(list = ls(), file = "filepath.rda"), you are saving your entire environment, including any executable code. This is a powerful feature for sharing processes across environments and sessions, and it is very useful for collaborative work. However, it remains an open door for malicious actors, and simply upgrading to R 4.4.0 will not protect you.

To demonstrate this vulnerability, consider the following code, which saves a function to disk with save(). If I bundle this .rda file in my package and load it, the function will be loaded into memory and will run when it is called. This is a simple example, but have you evaluated all of the .rds and .rda files in your packages to see if they contain executable code?

				
fun <- function() {
  system("whoami")
  system("uname -msrn")
}

save(fun, file = "data/fun.rda")
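One cautious way to examine such a file is to load it into a throwaway environment rather than the global environment, and then check which objects are functions. The sketch below is a minimal illustration under stated assumptions: the file is created on the spot with harmless contents, and demo_file and sandbox are hypothetical names.

```r
# Create a harmless demo file so the sketch is self-contained.
demo_file <- tempfile(fileext = ".rda")
fun <- function() "a real payload would run here"
numbers <- 1:3
save(fun, numbers, file = demo_file)

# Load into a throwaway environment instead of the global environment.
sandbox <- new.env(parent = emptyenv())
loaded_names <- load(demo_file, envir = sandbox)

# Flag which loaded objects are functions. Reading a value forces any
# promise, so on R < 4.4.0 this should only be done in a disposable session.
is_fun <- vapply(
  loaded_names,
  function(nm) is.function(get(nm, envir = sandbox)),
  logical(1)
)
print(loaded_names[is_fun])
```

Loading into a separate environment does not make an untrusted file safe; it only keeps the loaded objects from silently landing in, or masking names in, the working environment while they are reviewed.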

Other exploits

A clever programmer can trick you into running malicious code by disguising it as a more common function. My favorite recent example was demonstrated and distributed on GitHub. The package contains a simple R function, saved in an .rda file, that loads into memory disguised as quit(). This file could be hidden within an R package and loaded into the environment without being noticed. How many of your automated workflows and pipelines include a quit() call to end a session?

				
quit <- function (...) {
  cmd = if (.Platform$OS.type == 'windows') 'calc.exe' else
    if (grepl('^darwin', version$os)) 'open -a Calculator.app' else
      'echo pwned\\!'
  system(cmd)
}

save(quit, file = "data/calculator.rda")
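A disguise like this can be caught by comparing the names in a serialized file against the names R already defines. The following sketch assumes a harmless demo file built on the spot; src, sandbox, and masked are hypothetical names, and only the base environment is checked, so a fuller audit would also compare against attached packages.

```r
# Build a harmless demo file containing an object named like a base function.
demo_file <- tempfile(fileext = ".rda")
src <- new.env()
src$quit <- function(...) "not the real quit()"
src$harmless <- 42
save(list = c("quit", "harmless"), file = demo_file, envir = src)

# Load into an isolated environment; load() returns the object names.
sandbox <- new.env(parent = emptyenv())
loaded_names <- load(demo_file, envir = sandbox)

# Any loaded name that also exists in base R would mask the real function.
base_names <- ls(baseenv(), all.names = TRUE)
masked <- intersect(loaded_names, base_names)
print(masked)
```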

Hidden Functions

The above examples place deserialized code in the global environment, where a vigilant developer would detect it by observing objects as they are loaded in the Environment pane. However, R includes a mechanism for hiding objects: any object whose name begins with a dot is treated as hidden. There are legitimate reasons to do this; functions of this type are often included in an .Rprofile to configure an environment for the user. However, the same mechanism can hide code from the user. The following code creates a function that will be invisible in the Environment pane and will not be returned when the user runs ls() with the default arguments.

				
.invisible_functions <- function() {
  system("ls -ll")
}

# optional parameter to return all objects in the environment

ls(all.names = TRUE)

				
[1] ".invisible_functions" ".main"                "fun"
[4] "has_annotations"
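The difference between the two listings gives a quick way to surface hidden objects. A minimal sketch, using a scratch environment with hypothetical object names rather than the global environment:

```r
# Scratch environment with one ordinary and one dot-prefixed function.
env <- new.env()
env$visible_fun <- function() "shown by ls()"
env$.hidden_fun <- function() "omitted by ls()"

# Names that appear only when all.names = TRUE are the hidden ones.
hidden <- setdiff(ls(env, all.names = TRUE), ls(env))
print(hidden)
```

The same comparison works on globalenv() or on an environment populated by load().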

Namespace Events

R includes features that execute arbitrary code when a user loads or attaches a package. Package developers use these hooks to set options and set up an environment for the package. This feature is used by the most popular packages in the R ecosystem. For example, the dplyr package includes a file called zzz.R that executes a number of functions when the package is loaded or attached.

				
.onLoad <- function(libname, pkgname) {
  ns_dplyr <- ns_env(pkgname) 
 
  op <- options() 
  op.dplyr <- list( 
    dplyr.show_progress = TRUE 
  ) 
  toset <- !(names(op.dplyr) %in% names(op)) 
  if (any(toset)) options(op.dplyr[toset]) 
 
  .Call(dplyr_init_library, ns_dplyr, ns_env("vctrs"), ns_env("rlang")) 
 
  # TODO: For `arrange()`, `group_by()`, `with_order()`, and `nth()` until vctrs 
  # changes `vec_order()` to the new ordering algorithm, at which point we 
  # should switch from `vec_order_radix()` to `vec_order()` so vctrs can remove 
  # it. 
  env_bind( 
    .env = ns_dplyr, 
    vec_order_radix = import_vctrs("vec_order_radix") 
  ) 
 
  run_on_load() 
 
  invisible() 
}
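To find out which of these hooks a package defines, its namespace can be queried for the standard hook names. The sketch below uses stats because it is already loaded in a default R session. This matters: asNamespace() loads a namespace that is not yet loaded, which itself runs .onLoad, so for an untrusted package this check belongs in a disposable session.

```r
# Standard namespace hook names recognised by R.
hook_names <- c(".onLoad", ".onAttach", ".onUnload", ".onDetach")

# stats is already loaded by default, so this does not trigger a new load.
ns <- asNamespace("stats")

# Check for each hook in the namespace itself (no inheritance from parents).
has_hook <- vapply(hook_names, exists, logical(1), envir = ns, inherits = FALSE)
print(has_hook)
```

Where a hook exists, get(".onLoad", envir = ns) returns the function so its body can be read before the package is trusted.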

Mitigating the risk with R 4.4.0

The R Core Team has released version 4.4.0 addressing the vulnerability demonstrated in the HiddenLayer findings. But as many developers have pointed out, the solution to this vulnerability is not to fix the RDS file type, but to be vigilant about the code and objects you are loading into your environment. Basically, we would advise clients to treat RDS files like any other software found online: don't download and load files from untrusted sources.
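As a first check, a script can confirm whether the running R version includes the 4.4.0 fix. This is only a sketch: it says nothing about the other risks described in this article, and the variable name patched is hypothetical.

```r
# TRUE when the running R is at least 4.4.0, the release with the promise fix.
patched <- getRversion() >= "4.4.0"

if (patched) {
  message("R ", getRversion(), " includes the 4.4.0 deserialization fix.")
} else {
  message("R ", getRversion(), " predates 4.4.0; consider upgrading.")
}
```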

But how does an organization protect itself given the realities of the R ecosystem? Consider the following:

CRAN includes over 20,000 packages
Bioconductor includes over 2,000 packages
GitHub hosts an unknown number of packages
Nearly all of these include sample data in RDS format

How many organizations are fully evaluating these packages prior to loading them into their repositories? Many of our clients point their Package Manager application at one of the many CRAN mirrors without any evaluation of the packages they are loading. Their data scientists simply load whatever package they want into their workflows. CRAN is an incredible resource that evaluates packages for documentation and code quality, but it does not evaluate the security of the code. Even more packages are sourced from repositories outside CRAN, such as GitHub or Bitbucket. How many of these types of packages are currently on your systems?

Has your company ever made a thorough inventory of the R packages your developers are using? HiddenLayer researchers demonstrated how malicious code can be hidden in an RDS file, but there are easier ways to deliver code to your system through the R language. Ideally, your organization should have a process for evaluating community-developed R packages much the way that you evaluate other software packages. You can provision these packages within a central repository that is monitored and updated regularly. Systems with sensitive data should require a curated list of packages that have been evaluated for security.
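A basic inventory can be started from within R itself. The sketch below is a starting point only, not a security evaluation: it lists installed packages and counts the serialized files shipped in each package directory. The helper count_serialized and the column names are hypothetical, and the counts include package metadata that R itself stores as .rds files under Meta/.

```r
# One row per installed package across all library paths.
pkgs <- installed.packages()
inventory <- data.frame(
  package = unname(pkgs[, "Package"]),
  version = unname(pkgs[, "Version"]),
  library = unname(pkgs[, "LibPath"]),
  stringsAsFactors = FALSE
)

# Count .rds, .rda and .RData files found anywhere under a package directory.
count_serialized <- function(pkg, lib) {
  length(list.files(
    file.path(lib, pkg),
    pattern = "\\.(rds|rda|rdata)$",
    recursive = TRUE,
    ignore.case = TRUE
  ))
}

inventory$serialized_files <- mapply(
  count_serialized, inventory$package, inventory$library,
  USE.NAMES = FALSE
)

# Show the packages that ship the most serialized files.
print(head(inventory[order(-inventory$serialized_files), ]))
```

Lazy-loaded code and data live in .rdb and .rdx files, which this pattern deliberately does not match, so a fuller audit would cover those as well.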

ProCogia’s approach

ProCogia is developing a system for evaluating your environment and mitigating any effects of this vulnerability. We have a tool that will profile all packages used on your system. Each package will be loaded into a quarantined environment, and its RDS files will be loaded individually and examined. We look for hidden functions, promises, and other executable code. We will load, attach, require, and unload all packages and observe the system for any system processes, and we will rerun unit tests to observe any side effects.

We can take an inventory of all other RDS files in your system and evaluate them similarly. We can provide your team with a database of safe objects that can be locked down as a Package Manager repository. You can then safely return to your work with the knowledge that your system has been evaluated and is safe from deserialization attacks.

Conclusion

Upgrading to Version 4.4.0 is still necessary for protecting you from the self-executing promises identified by HiddenLayer, but this is only the first step to protect your organization. We can support your transition to a safer analytic environment by fully updating your software systems with the latest and safest versions and dependencies. We will provision a repository of R packages that have been fully evaluated for safety that includes a database of findings and tools to monitor packages as they are updated and added.

Also, we help mitigate the downstream consequences of updating your software: we can provide guidance on alternative packages, refactor Shiny applications incompatible with these updates, and develop comprehensive unit tests to ensure that subsequent upgrades are more straightforward.

The R language has a remarkable track record for safety and security, but it is not immune to vulnerabilities. Finding a partner that specializes in the R ecosystem can help you navigate the complexities of modern analytics while protecting your organization from unwelcome risks.
