Introduction
I spent most of my career as a lonely R programmer in a SAS group that was perpetually failing its R migration. There are always difficulties when an organization wants to change course, and in this case the major pushback came from non-technical study management staff who still needed SAS for their day-to-day workflows.
These staff managed all of the raw study data coming into our research program. They imported messy questionnaire data from vendors and prepared it for the scientific staff to analyze. The SAS DATA step was their primary domain and they were understandably sensitive to any changes that might risk our science.
I often thought that if I could simply translate their macros and educate them about vectors, there would be no more excuses and we could cut the cord to SAS. That strategy was never successful. I came to see that, for large swaths of data professionals, the DATA step is a safe place to do vital work.
I want to discuss three aspects of the SAS DATA step that form a mental model of data processing particular to SAS users and why this makes it difficult for them to transition to modern open-source tools.
- The tabular SAS data set is the only data structure available in SAS.
- These data sets store data in only two types.
- These data sets are processed differently than data in any other data processing language.
The SAS Data Set
The SAS data set is proprietary and is essentially the only data structure available in SAS. SAS data sets conform to what most people think of as data: a two-dimensional table in which each row represents an observation and each column a variable measured on that observation. The SAS DATA step locks the user into this single tabular structure, and all procedures use it as input.
Unlike in SAS, tabular data are just one of many structures needed in a typical R workflow. The fundamental unit of analysis in R is the vector, a homogeneous set of values of any length. Matrices and arrays expand the vector into a multidimensional structure. Heterogeneous collections of vectors can be organized into data frames. Lists are collections of all sorts of data: vectors, matrices, data frames, even other lists. A cursory search will reveal additional data structures for more specialized applications.
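To make these structures concrete, here is a minimal base R sketch. The object names (ages, m, df, study) and their values are made up for illustration and are not from any real study.

```r
# A vector: homogeneous values of any length
ages <- c(34, 51, 27)

# A matrix: a vector arranged in two dimensions
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame: equal-length vectors that may differ in type
df <- data.frame(id = c("a", "b", "c"), age = ages)

# A list: an arbitrary collection, including other lists
study <- list(ages = ages, grid = m, data = df, meta = list(site = "A"))

# Show the top-level structure of the list
str(study, max.level = 1)
```

Only the data frame resembles a SAS data set; the other objects have no direct DATA step counterpart.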
SAS users may initially feel a little overwhelmed by this menagerie of objects. I often recommend that they start their introduction to R through the tidyverse, a dialect of R that broadly emulates the SAS DATA step because it keeps the focus on a single data frame while still processing the data as vectors. I would emphasize that it is important to extend the learning process to include traditional base R in order to take advantage of the other data structures necessary for a typical R workflow.
Data Types
The SAS DATA step has exactly two data types: character and numeric. On its face, this seems like a limitation, but it has its advantages. This simple system includes reasonable defaults for coercion and type conversion, so users rarely worry about these details. This is in contrast to R, which includes six atomic vector types that can be expanded and adapted across packages and use cases. The table below roughly maps the basic vector types in R to their corresponding SAS versions.
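As a quick illustration of the six atomic types, the sketch below asks R to report the type of one example value for each. The values are arbitrary placeholders of my own.

```r
# One example value for each of R's six atomic vector types
atomic <- list(
  logical   = TRUE,
  integer   = 1L,
  double    = 1.5,
  complex   = 1 + 2i,
  character = "a",
  raw       = as.raw(1)
)

# typeof() reports the underlying storage type of each value
types <- vapply(atomic, typeof, character(1))
print(types)

# A vector is homogeneous, so mixing types coerces everything to one type
mixed <- c(1, "2", TRUE)
typeof(mixed)
```

A SAS user would think of the first four as flavors of numeric and would rarely need raw at all.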
Formats
SAS applies formatting attributes to simulate many distinct types of variables. SAS dates, for instance, are simply a numeric value representing the number of days relative to January 1st, 1960. Users treat date values like any other number, but the added format displays them as a date. Dates work similarly in R, only the value is relative to January 1st, 1970. When working with a SAS data set that includes dates, I often start every program with a quick conversion:
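# Columns that hold SAS date values (days since 1960-01-01)
dateVariables <- c("date1", "date2", "date3")
# Convert each column to an R Date using the SAS origin
df[, dateVariables] <- lapply(df[, dateVariables], function(sasDate) {
  as.Date(sasDate, origin = "1960-01-01")
})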
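To see why the origin argument matters, here is a small sketch with made-up SAS date values. It converts them to R dates and computes the fixed offset between the two epochs.

```r
# Hypothetical SAS date values: days since 1960-01-01
sasDate <- c(0, 3653, 21915)

# Tell R which origin the numbers are relative to
rDate <- as.Date(sasDate, origin = "1960-01-01")
print(rDate)

# The SAS and R epochs differ by a fixed number of days
offset <- as.numeric(as.Date("1970-01-01") - as.Date("1960-01-01"))
print(offset)

# Internally, R stores the same dates relative to 1970-01-01
print(as.numeric(rDate))
```

Forgetting the origin shifts every date by that offset, roughly ten years, which is an easy error to miss in a large data set.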
Users can also distinguish continuous vs categorical variables by including a format for a numeric variable. Using PROC FORMAT will provide display values and order to a variable. SAS users take advantage of this to create ordered output or define reference groups in the CLASS statement for many models.
SAS users will find the R factor type to be the closest analog to a SAS format. The factor type is a categorical vector that can be ordered or unordered. Just like in SAS, factor variables can be used for post hoc contrasts, statistical interactions, and ordered results. Unlike in SAS, factor labels and levels are stored with the data; there is no need for a format library, which makes R data more portable.
SAS Code
R Version
df |>
  dplyr::mutate(var1 = factor(var1,
                              labels = c("No", "Yes", "Missing"))) |>
  dplyr::count(var1)
var1 n
1 No 3
2 Yes 2
3 Missing 1
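The reference-group and ordering behavior described above can be sketched in base R as follows. The numeric codes (0, 1, 9) are an assumption of mine for illustration; the original data may be coded differently.

```r
# Hypothetical coded variable: 0 = No, 1 = Yes, 9 = Missing
var1 <- c(0, 1, 0, 9, 1, 0)

# Attach labels to the codes, much like applying a PROC FORMAT format
f <- factor(var1, levels = c(0, 1, 9), labels = c("No", "Yes", "Missing"))
table(f)

# Pick the reference group for a model, similar to ref= in a CLASS statement
f_ref <- relevel(f, ref = "Yes")
levels(f_ref)

# An ordered factor keeps the level order for comparisons and sorted output
f_ord <- factor(var1, levels = c(0, 1, 9),
                labels = c("No", "Yes", "Missing"), ordered = TRUE)
is.ordered(f_ord)
```

Spelling out levels explicitly avoids relying on the default sort order of the codes.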
SAS is often smart enough to identify a number, even when it is coded as a character string. Coercion is the process of converting a variable from one type to another, and developers often include it to make the user experience a bit more seamless. Because SAS has such simple data types, coercion is implemented very intuitively, and SAS users don’t often think about it until they see a gentle warning in their logs. Take the following example. This sample data set uses two character variables, char1 and char2, to store numeric values, and I want to calculate their mean:
SAS Code
SAS is helpful and knows that char1 and char2 are numbers formatted as strings, so it automatically converts them for the mean() function.
R Version
In R, the mean() function does not coerce the character variables to numeric. The underlying assumption is that it is not meaningful to compute a mean for character data, so the user must do the conversion manually. In this case, R has set new to missing.
one <- data.frame(NAME = c("BRIAN", "DESIREE", "MARGOT", "OWEN", "RIPLEY"),
                  char1 = c("1", "2", "3", "1", "10"),
                  char2 = c("2", "3", "4", "5", "6"))
one |>
  dplyr::rowwise() |>
  dplyr::mutate(new = mean(char1, char2))
# A tibble: 5 × 4
# Rowwise:
NAME char1 char2 new
1 BRIAN 1 2 NA
2 DESIREE 2 3 NA
3 MARGOT 3 4 NA
4 OWEN 1 5 NA
5 RIPLEY 10 6 NA
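One way to do the conversion manually is sketched below in base R, using the same sample data. The helper columns num1 and num2 are names I made up for illustration. Note also that mean() averages only its first argument, so a row-wise mean across two columns needs a function such as rowMeans().

```r
one <- data.frame(NAME = c("BRIAN", "DESIREE", "MARGOT", "OWEN", "RIPLEY"),
                  char1 = c("1", "2", "3", "1", "10"),
                  char2 = c("2", "3", "4", "5", "6"))

# Convert the character columns to numeric explicitly
one$num1 <- as.numeric(one$char1)
one$num2 <- as.numeric(one$char2)

# Row-wise mean across the two numeric columns
one$new <- rowMeans(one[, c("num1", "num2")])

print(one[, c("NAME", "char1", "char2", "new")])
```

If a string cannot be parsed as a number, as.numeric() returns NA with a warning, so it is worth checking the result for unexpected missing values.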
As a general rule, R functions are designed to work with a specific type of data: mean() expects numeric data and substr() is meant for character data. However, there are exceptions to the rule, and these often seem like an illogical break in consistency to SAS users. R also supports generic functions that behave differently depending on the data. For instance, the summary() function will provide a numeric distribution for a vector of numbers, or a structured model summary if run on a glm object.
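The behavior of a generic function can be seen in a short sketch. The numeric vector is made up, and the model uses the mtcars data set that ships with R, chosen only as a convenient example.

```r
# summary() on a numeric vector: quartiles, mean, and range
x <- c(1, 2, 3, 4, 10)
s_num <- summary(x)
print(s_num)
class(s_num)

# summary() on a fitted glm: coefficients, deviance, and more
fit <- glm(mpg ~ wt, data = mtcars)
s_glm <- summary(fit)
class(s_glm)

# Same function name, different method chosen by the class of the input
class(x)
class(fit)
```

R picks the method from the class of the object it receives, which is why one function name can return such different results.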
Package developers often incorporate a mixture of generic and type-specific functions because this kind of flexibility is a feature of R that is shared with many programming languages. It takes a bit of getting used to for SAS programmers, but ultimately allows users to spend more time understanding their data and less time thinking about how to understand their data.
DATA Step Mechanics
SAS DATA step processing runs in two sequential phases. In the compilation phase, SAS scans the code for syntax errors and translates it into machine language. It initializes a program data vector (PDV) in memory where it builds the output data set one observation at a time. The PDV will include the final variables with the attributes requested by the DATA step.
The program is then executed as a loop, with each row of the input data set read into the PDV and processed individually. If no errors are found in the observation, the result is written to the final data set. This process repeats until SAS finishes with the last observation or an error is identified.
Variable attributes are defined at two stages of the DATA step. They can be defined explicitly through INPUT, LENGTH, LABEL, or FORMAT statements. Alternatively, SAS can infer these attributes, typically from the initial or final attributes fed to the PDV.
This observation-based processing trains SAS programmers to think of the observation as the basic unit of every SAS data set. As discussed above, the vector is the basic data structure in R. Data frames are simply collections of equal-length vectors, and each step of the processing is a manipulation of these vectors.
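The contrast between the two mental models can be sketched in base R. The data and the variable names (height, weight, bmi_loop, bmi_vec) are hypothetical, and the for loop is only a rough imitation of how the DATA step visits observations.

```r
# Hypothetical study data: height in metres, weight in kilograms
df <- data.frame(height = c(1.60, 1.75, 1.82), weight = c(55, 72, 90))

# A data frame is a list of equal-length column vectors
is.list(df)
lengths(df)

# Observation-based thinking: visit one row at a time
bmi_loop <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  bmi_loop[i] <- df$weight[i] / df$height[i]^2
}

# Vector-based thinking: operate on whole columns at once
bmi_vec <- df$weight / df$height^2

# Both approaches give the same values
all.equal(bmi_loop, bmi_vec)
```

The loop works, but the single vectorized line is the idiom most R code relies on.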
Native R users may find the SAS DATA step to be a very inefficient process, but SAS programmers find it intuitive, and the habit is difficult to break. Here are two examples that SAS users will find commonplace but that work very differently in R.
Example 1: Summarizing Variables
A typical workflow includes deriving new variables based on the values of others. In this case, the user wants to calculate the sum of two variables. Since they are accustomed to SAS’s observation-based processing, they could choose the built-in sum() function to calculate the total, and this works as expected. However, in R, the sum() function is vectorized, which produces a very different result.
SAS Code
The SAS code works as expected: the newvar variable is simply a row-wise sum of var1 and var2.
R Version
In R, the sum() function is vectorized, so the result is the sum of all the values in both columns, repeated on every row. This is not what the user intended.
one |>
dplyr::mutate(newvar = sum(var1, var2))
Name var1 var2 newvar
1 Brian 1 2 37
2 Desiree 2 3 37
3 Margot 3 4 37
4 Owen 1 5 37
5 Ripley 10 6 37
Correct R Version
To get the desired result, the user can force R to treat the data frame similarly to SAS by using dplyr::rowwise(). Remember to call dplyr::ungroup() at the end of each code chunk. Note: dplyr::rowwise() also demonstrates that observation-based processing is painfully slow.
one |>
dplyr::rowwise() |>
dplyr::mutate(newvar = sum(var1, var2)) |>
dplyr::ungroup()
# A tibble: 5 × 4
Name var1 var2 newvar
1 Brian 1 2 3
2 Desiree 2 3 5
3 Margot 3 4 7
4 Owen 1 5 6
5 Ripley 10 6 16
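A vectorized alternative avoids the row-wise loop altogether. The sketch below rebuilds the sample data from the output shown above and uses base R only; newvar2 is a name I added for the second approach.

```r
one <- data.frame(Name = c("Brian", "Desiree", "Margot", "Owen", "Ripley"),
                  var1 = c(1, 2, 3, 1, 10),
                  var2 = c(2, 3, 4, 5, 6))

# Element-wise addition: each row gets the sum of its own two values
one$newvar <- one$var1 + one$var2

# rowSums() does the same across any number of columns
one$newvar2 <- rowSums(one[, c("var1", "var2")])

print(one)
```

One difference to keep in mind: the SAS sum() function ignores missing values, while + returns NA if either value is missing. Calling rowSums() with na.rm = TRUE comes closer to the SAS behavior.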
Example 2: RETAIN Statement
The RETAIN statement is a common feature in the SAS DATA step. It is used to carry forward the value of a variable from one observation to the next. This is useful for creating lagged variables or for accumulating totals. The RETAIN statement works because of the observation-based processing in SAS. For each iteration of the DATA step, the prior value is retained in the PDV rather than reset to missing. SAS programmers use this for all sorts of things, from creating sequential identifiers to imputing missing values.
Let’s look at an example in SAS of how the RETAIN statement can be used to calculate a cumulative total over years:
SAS Code
The SAS code works as expected: the RETAIN statement carries the running total forward from one observation to the next, so each year shows the cumulative total to date.
R Version
These tools exist in R. The cumsum() function, which calculates the cumulative sum of a vector, is a vectorized counterpart to this use of the RETAIN statement. Functions like this are known as window functions and are fully supported in the dplyr package. In other contexts, such as time series work, R provides built-in functions such as lag(x, k), where x is the variable on which to compute and k is the number of units to lag.
one |>
dplyr::mutate(cum_sum = cumsum(N))
year N cum_sum
1 2000 100 100
2 2001 42 142
3 2002 106 248
4 2003 225 473
5 2004 47 520
6 2005 68 588
7 2006 92 680
8 2007 136 816
9 2008 178 994
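The other common uses of RETAIN mentioned earlier (sequential identifiers, lagged values, and carrying a value forward over missing data) also have vectorized counterparts. The sketch below uses base R and a shortened, hypothetical version of the data with one gap added. The column names id, prev_N, and N_filled are my own, and the carry-forward step assumes the first value is not missing.

```r
# Hypothetical data: a few years of counts with one missing value
one <- data.frame(year = 2000:2004, N = c(100, 42, NA, 225, 47))

# Sequential identifier, like a retained counter incremented on each row
one$id <- seq_len(nrow(one))

# Lagged value: shift the vector down by one position
one$prev_N <- c(NA, head(one$N, -1))

# Carry the last non-missing value forward:
# idx holds the position of the most recent non-missing N for each row
idx <- cummax(ifelse(is.na(one$N), 0L, seq_along(one$N)))
one$N_filled <- one$N[idx]

print(one)
```

Each step operates on a whole column at once, with no loop over observations.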
Migrating to a new technology almost always forces people out of their comfort zones and sometimes a new tool will require users to reconceptualize basic parts of the job. The big hurdle for SAS users who want to work in R is moving from the rigid, but comfortable and reassuring, SAS DATA step to work with algorithms that process vectorized data.