Introduction
My team specializes in SAS to R migrations. SAS is an established company that offers a wide variety of technologies to its customers, so every migration requires some flexibility in our approach. But every SAS migration requires my team to be fluent in the SAS macro system.
SAS users love their macros for the same reason R developers love their functions: both serve the same purpose of automating workflows to make them more efficient and less error-prone. I think that a good macro programmer can transition to writing good R functions with relatively little fuss. This blog post documents my experience guiding this transition by highlighting the similarities and important differences between macros and functions.
SAS Macros
The best description of SAS macros I’ve found is that macros are programs that write other programs. They are programming tools that generate a piece of code incorporating a set of parameters. This is more efficient than programming multiple iterations that differ only in a few details. But essentially, there is no difference under the hood between making manual changes to the program and using macros, because macros are just shortcuts.
The macro system in SAS includes two types, the macro object and the macro variable. Macro objects call some sort of subroutine that is user-defined or bundled with SAS. They are easy to spot because all of them have the % prefix. Macro variables are parameters that can be used in SAS programs and are identified using the & prefix. These macro variables can be anything: character strings, data set variable names, entire data sets, or even other macros.
The following example shows how a macro variable can be added to the environment using the %LET macro and then used in a data step. In this case, %LET is a macro subroutine that assigns a value to the &macro_var variable.
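A minimal sketch of what this might look like (the dataset and variable names are illustrative):

%LET macro_var = 2024;

DATA subset;
    SET mydata;
    WHERE year = &macro_var;
RUN;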
A well-constructed SAS program often employs a section of macro variables used to parameterize the workflow. These parameters can be used to pull data from a given source, subset those data based on a particular criterion, and write a report to the appropriate location. The whole process can be updated in the future by simply editing this section. It is all very handy and limits the opportunity for errors!
SAS users will be comforted to know that this is very similar in R. Many production R programs will include a similar set of parameters. These can be loaded automatically at startup or defined explicitly inside the program. Once defined, these variables can be used in R much the same way they can be used in SAS.
SAS Code
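A sketch of what such a parameter section might look like in SAS (the paths and names mirror the R version below):

%LET year = 2024;
%LET month = JUNE;
%LET programmer = BRIANCARTER;
%LET input = /users/&programmer./SAS/&year.;
%LET output = /users/&programmer./SAS/&year./&month.;

LIBNAME in "&input.";
LIBNAME out "&output.";

DATA out.&month.;
    SET in.data_for_&year.;
    WHERE month = "&month.";
RUN;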
R Code
The glue package is a handy set of tools from the tidyverse that allows a program to interpolate expressions within character strings. If variables or expressions are included in curly braces, glue will evaluate them and insert the output into the final string.
year <- 2024
month <- "JUNE"
programmer <- "BRIANCARTER"
input <- glue::glue("/users/{programmer}/SAS/{year}")
output <- glue::glue("/users/{programmer}/SAS/{year}/{month}")
readRDS(glue::glue("{input}/data_for_{year}.rds")) |>
  dplyr::filter(month == !!month) |>
  saveRDS(glue::glue("{output}/{month}.rds"))
User-defined macros
The most powerful application of the macro system is the ability for SAS programmers to develop custom macros for any given task. Earlier in my career, I would run a series of models for a research project, pull and organize the estimates, and then use PROC REPORT to generate a pretty table for publication. Rather than develop de novo code each time, I could write a general-purpose macro that did the work for me.
When I work with SAS programmers migrating to R, I like to point out that the general anatomy and construction of a custom R function is very similar to a SAS macro. A user-defined SAS macro begins with the %MACRO command, which creates a named object whose macro variables serve as parameters in the body of the macro to automate some task. The macro is closed with the %MEND command.
R functions follow a very similar structure: a user defines a named function() object that includes an arbitrary number of arguments that parameterize the body of the function. Any object the user wishes to retain from the function is included in return().
SAS Code
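A sketch of what such a macro might look like, using PROC PHREG (the parameters mirror the R version below):

%MACRO model(dat=, event=, time=, categorical=, continuous=);
    PROC PHREG DATA=&dat.;
        CLASS &categorical.;
        MODEL &time.*&event.(0) = &categorical. &continuous. / TIES=BRESLOW RL;
    RUN;
%MEND model;

%model(dat=df, event=event, time=time, categorical=exposure, continuous=covariate);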
R Code
fun <- function(dat, event, time, categorical, continuous) {
  require(survival)
  # define my model formula
  f <- formula(
    paste("Surv(", time, ",", event, ") ~ ",
          paste(c(categorical, continuous), collapse = "+"))
  )
  # fit the model
  fit <- coxph(f, data = dat, ties = "breslow")
  # pull the confidence intervals
  limits <- confint(fit) |> exp() |> data.frame()
  # format my output
  out <- data.frame(estimates = exp(coef(fit)),
                    limits)
  names(out) <- c("HazardRatio", "HRLowerCL", "HRUpperCL")
  # return a single object
  return(out)
}
fun(dat = df,
    event = "event",
    time = "time",
    categorical = "exposure",
    continuous = "covariate")
Loading required package: survival
                    HazardRatio HRLowerCL HRUpperCL
exposureNot Exposed    1.625042 0.3829869  6.895170
covariate              1.159639 0.9735426  1.381307
Conditional programming
A basic use of the SAS data step is IF-ELSE conditional logic for cleaning and deriving variables, but there is a complementary %IF-%ELSE for conditionally evaluating a macro's flow. These give the programmer the option to build a single macro that behaves differently based on the input parameters, and they correspond closely to if()-else() within an R function.
SAS Code
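Sketched in SAS, the macro might take a subset= flag and branch on it with %IF (all names are illustrative):

%MACRO model(dat=, event=, time=, categorical=, continuous=, subset=0);
    PROC PHREG DATA=&dat.;
        CLASS &categorical.;
        MODEL &time.*&event.(0) = &categorical. &continuous. / TIES=BRESLOW RL;
        ODS OUTPUT ParameterEstimates=estimates;
    RUN;
    %IF &subset. = 1 %THEN %DO;
        DATA estimates;
            SET estimates;
            WHERE Parameter = "&categorical.";
        RUN;
    %END;
%MEND model;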
R Code
fun <- function(dat, event, time, categorical, continuous, subset) {
  require(survival)
  # define my model formula
  f <- formula(
    paste("Surv(", time, ",", event, ") ~ ",
          paste(c(categorical, continuous), collapse = "+"))
  )
  # fit the model
  fit <- coxph(f, data = dat, ties = "breslow")
  # pull the confidence intervals
  limits <- confint(fit) |> exp() |> data.frame()
  # format my output
  out <- data.frame(estimates = exp(coef(fit)),
                    limits)
  names(out) <- c("HazardRatio", "HRLowerCL", "HRUpperCL")
  # subset = 1 keeps only the exposure row
  if (subset == 1) {
    out <- out["exposureNot Exposed", ]
  }
  # return a single object
  return(out)
}
fun(dat = df,
    event = "event",
    time = "time",
    categorical = "exposure",
    continuous = "covariate",
    subset = 1)
Evaluation of macro arguments
Both SAS macros and R functions allow users to include arguments that are never used. This is referred to as lazy evaluation and is often mentioned in discussions of R functions. Lazy evaluation simply means that an object is not evaluated until it is explicitly required. Although I’ve never heard the term applied to SAS macros, the behavior is the same. A quick demonstration of lazy evaluation in SAS and R is below. Although both include arguments for x and y, only x is referenced, so the program will run without error if we fail to provide a value for y.
SAS Code
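A sketch of the SAS side (the macro name is illustrative):

%MACRO fun(x=, y=);
    %LET i = %SYSEVALF(&x.**2);
    %PUT the value i is &i.;
%MEND fun;

%fun(x=2);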
R Code
fun <- function(x, y) {
  i <- x^2
  cat(glue::glue("the value i is {i}"))
}
fun(x = 2)
the value i is 4
Default values are included in SAS macros to provide options to the end user. In my past life as a data analyst, I was responsible for running models and formatting output for several lead researchers. Each of these investigators had her own preferences for formatting output; in particular, each had her own preference for p-values. Rather than make post hoc manual adjustments to the output, I could simply program these preferences into my macros. The default option would leave the p-values as they came from SAS; however, simply changing an option could fix them to the required style.
R provides an identical mechanism for default function arguments. If the user does not wish to change a default value, they do not have to explicitly reference it in the function call.
SAS Code
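One way to sketch this in SAS is to branch on a style= parameter with a default value (the formats and names are illustrative):

%MACRO pvals(dat=, style=None);
    DATA formatted;
        SET &dat.;
        %IF &style. = Mia %THEN %DO;
            FORMAT pvalues e10.;
        %END;
        %ELSE %IF &style. = Vicky %THEN %DO;
            pvalues = ROUND(pvalues, 0.0001);
        %END;
    RUN;
%MEND pvals;

%pvals(dat=df);             /* default: p-values unchanged */
%pvals(dat=df, style=Mia);  /* scientific notation */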
R Code
df <- data.frame(pvalues = c(0.0128945, 0.001, 0.0682))

p <- function(dat, style = "None") {
  if (style == "Mia") {
    dat <- dat |>
      dplyr::mutate(pvalues = format(pvalues, scientific = TRUE, digits = 3))
  }
  if (style == "Vicky") {
    dat <- dat |>
      dplyr::mutate(pvalues = round(pvalues, digits = 4))
  }
  return(dat)
}
p(df) # using default value
pvalues
1 0.0128945
2 0.0010000
3 0.0682000
p(df, "Mia") # p-values for Mia
pvalues
1 1.29e-02
2 1.00e-03
3 6.82e-02
p(df, "Vicky") # p-values for Vicky
pvalues
1 0.0129
2 0.0010
3 0.0682
So how are functions and macros different?
Until this point, SAS macros and R functions have seemed pretty similar in their structure and use; however, there are notable differences under the hood that tend to cause problems for SAS users. As discussed above, a SAS macro is simply a programming hack: it takes a set of parameters as input to write a program that is executed when called. Under the hood, this is no different than simply writing the same DATA and PROC steps repeatedly.
The consequence of this is that anything created by a SAS macro will persist in the programming environment after execution. In the example below, I’ve written a macro that simply creates three subsets of an input dataset. Afterwards, we can run PROC DATASETS to list all the objects in the WORK library, and you can see that these subsets are available to use.
SAS Code
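A sketch of the macro and the PROC DATASETS call (the names mirror the R version below):

%MACRO subsets(dat=);
    DATA jan;   SET &dat.; WHERE month = "January";  RUN;
    DATA feb;   SET &dat.; WHERE month = "February"; RUN;
    DATA march; SET &dat.; WHERE month = "March";    RUN;
%MEND subsets;

%subsets(dat=df);

PROC DATASETS LIBRARY=work;
RUN; QUIT;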
R Code
The R function here is identical to the SAS macro, but after running it, none of the intermediate data frames remain in memory.
df <- data.frame(
  month = c("January", "February", "March", "April", "May",
            "June", "July", "August", "September", "October",
            "November", "December")
)

fun <- function(dat) {
  jan <- dat |>
    dplyr::filter(month == "January")
  feb <- dat |>
    dplyr::filter(month == "February")
  march <- dat |>
    dplyr::filter(month == "March")
}
fun(df)
ls() # ls() displays all the objects in the environment
[1] "df" "fun"
This demonstrates a fundamental difference in how R and SAS work under the hood. R functions are not just a programming trick; they are self-contained objects that create a unique operating environment to safely process data without side effects. Side effects change the state of the program environment, either by adding or removing objects from memory or by changing the value of an object. In short, any object created and used within an R function ceases to exist after execution unless explicitly returned to the global environment.
I have found that this often frustrates SAS programmers who are accustomed to macros as programming shortcuts. Oftentimes it is useful for a macro to create a bunch of new datasets and output for reference later in the program. A good macro will use parameters to label all this output, and SAS will keep it nicely organized. When they try to replicate this in an R function, they don’t understand why their data aren’t returned.
It takes a little practice to get into the new mindset, but this is an important feature of R that I have come to rely on. An R function will only return what I explicitly return. R functions will not accidentally overwrite existing objects; they will not accidentally change a system option; and they will not litter system memory with unneeded garbage.
Return values
R functions are designed to return only a single object. Users can return this object implicitly by just calling it as the last step of the function or explicitly using the return() function.
Implicit return
fun <- function(x) {
  x * 5
}
fun(x = 5)
[1] 25
Explicit return
fun <- function(x) {
  return(x * 5)
}
fun(x = 5)
[1] 25
Return multiple objects
SAS users are accustomed to everything inside the macro being available after the macro runs. This is really useful! I would always construct my macros to provide me with a lot of output, even if I didn’t necessarily need it. I would run models and format the output into a table, but I would also want to save all the raw output from the models in case I needed to go back and review it.
We can do the same thing in R if we just organize our output into a list object and return that. An R list is a heterogeneous collection of objects. I like to think of it as a data bucket that I can throw anything into. I love this feature of R because it forces me to organize my output and drop the junk that I don’t need.
fun <- function(dat, y, x) {
  f <- formula(paste0(y, "~", x))
  fit <- lm(f, data = dat)
  results <- summary(fit)$coef |>
    data.frame()
  final <- list(results = results, model_fit = fit)
  return(final)
}
foo <- fun(mtcars, "mpg", "cyl")
names(foo)
[1] "results" "model_fit"
A second way to get multiple objects out of a function is less preferred because it breaks R’s no-side-effects policy. Users can use the <<- assignment operator to force the creation of a variable in the parent environment. In principle, programmers can use this operator to output multiple objects from a function, but this is considered poor practice. As a general rule, a function without any side effects is a safe program that can be used across contexts. That said, there are times when using the <<- operator is helpful. For example, I often use <<- to create a counter that records how many times a function gets called in a process. This is useful for logging, reproducing errors, and general housekeeping of my processes.
counter <- function() {
  # initialize the counter in the global environment on first use
  if (!exists("i", envir = globalenv())) i <<- 0
  i <<- i + 1
  print(i)
}

someFunction <- function() {
  ### do something useful
  counter()
}
someFunction() # -> [1] 1
someFunction() # -> [1] 2
someFunction() # -> [1] 3
print(i)
[1] 3
Conclusions and best practices
In my experience, advanced SAS programmers accustomed to working in a macro environment will find the transition to R functions to be fairly intuitive once they have overcome the general differences between the two languages. When I started my career as a SAS programmer, my manager told me that the best way to learn macro programming was to do a lot of macro programming and I’d offer the same advice for R functions. Don’t be afraid to jump right in!
In conclusion, I’d like to offer a few tips for functional programming that SAS users should keep in mind when making their transition:
- Have a plan before you start typing: I like to think backwards. I explicitly define the details of my function output before outlining the steps needed to produce that output. Outlining those steps will tell me all the inputs that I need. I often write these steps out with # comments before programming the function. I’ve learned the hard way that working without a thorough plan will only produce a mess.
- Simple is better than complex: Follow the Unix philosophy with your functions. Each function should do exactly one thing well. It is better to string together many simple functions than to add features to one big, complicated function. I’ve written 9000-line functions; they are rarely generalizable, can’t be reused, and are full of bugs that are difficult to find.
- Functions should be self-contained: Every input required for a function should be an argument to the function. Don’t rely on a function to just pull values from the global environment; explicitly name those parameters in your function definition.
- Think about the end user: A good function will include helpful messages for its users. Adding message(), warning(), and stop() calls to your functions will keep you aware of the limits of your function; a minimal sketch follows this list. You’ll use this function a year later and be glad you provided some guardrails.
- Documentation saves time: The roxygen2 package will create skeleton documentation for each of your functions. The format mirrors the R documentation bundled with every package, and it only takes a few minutes to fully document your function. You will never regret taking the time to do this.
- Don’t be afraid to jump right in: The best way to learn functional programming is to jump right in. Framing your workflows as repeatable functions is a mindset that can be learned. The more you practice, the more intuitive automation becomes.
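As promised above, here is a minimal sketch of the guardrails idea; the function and its checks are hypothetical rather than taken from any particular package:

summarize_column <- function(dat, column) {
  # stop() halts execution with an informative error
  if (!is.data.frame(dat)) {
    stop("`dat` must be a data frame, not a ", class(dat)[1])
  }
  if (!column %in% names(dat)) {
    stop("Column `", column, "` not found in `dat`")
  }
  # warning() flags a problem but allows execution to continue
  if (anyNA(dat[[column]])) {
    warning("Missing values found in `", column, "`")
  }
  # message() reports progress without interrupting the workflow
  message("Summarizing `", column, "`")
  summary(dat[[column]])
}

summarize_column(mtcars, "mpg")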
Thank you for reading and keep an eye out for our next post in our SAS to R Guide series where we will be taking a look at the SAS Data Step.