Master the Transition from SAS to R: Understanding the SAS DATA Step


A split-screen illustration comparing the SAS DATA step with R programming.

Introduction 

I spent most of my career as a lonely R programmer in a SAS group that was perpetually failing its R migration. There are always difficulties when an organization wants to change course, and in this case the major pushback came from non-technical study management staff who still needed SAS for their day-to-day workflows.  

These staff managed all of the raw study data coming into our research program. They imported messy questionnaire data from vendors and prepared it for the scientific staff to analyze. The SAS DATA step was their primary domain and they were understandably sensitive to any changes that might risk our science.  

I often thought that if I could simply translate their macros and educate them about vectors, there would be no more excuses and we could cut the cord to SAS. That strategy was never successful. I came to see that, for large swaths of data professionals, the DATA step is a safe place to do vital work.  

I want to discuss three aspects of the SAS DATA step that form a mental model of data processing particular to SAS users and why this makes it difficult for them to transition to modern open-source tools.  

  • The tabular SAS data set is the only data structure available in SAS.
  • These data sets store data in only two types.
  • These data sets are processed differently than in any other data processing language.

 

The SAS Data Set 

The SAS data set is proprietary and is essentially the only data structure available in SAS. It conforms to what most people think of as data: a two-dimensional table in which each row represents an observation and each column a variable of that observation. The SAS DATA step locks the user into this single tabular structure, and all procedures use it as an input.

Unlike in SAS, tabular data are just one of many structures necessary for a typical R workflow. The fundamental unit of analysis in R is the vector, a homogeneous set of values of any length. Matrices and arrays expand the vector into multidimensional structures. Heterogeneous collections of vectors can be organized into data frames. Lists are collections of all sorts of data: vectors, matrices, data frames, even other lists. A cursory search will reveal additional data structures for more specialized applications.
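For readers coming from SAS, here is a minimal base R sketch of this menagerie (the variable names are illustrative only):

```r
# A vector: a homogeneous set of values of any length
v <- c(1.5, 2.0, 3.5)

# A matrix extends the vector into two dimensions
m <- matrix(1:6, nrow = 2)

# A data frame: a heterogeneous collection of equal-length vectors
df <- data.frame(id = 1:3, score = v)

# A list can hold anything: vectors, data frames, even other lists
l <- list(values = v, table = df, nested = list(m))
```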

SAS users may initially feel a little overwhelmed by this menagerie of objects. I often recommend that they start their introduction to R through the tidyverse, a dialect of R that broadly emulates the SAS DATA step: it keeps the focus on a single data frame while the underlying processing happens on vectors. I would emphasize, though, that it is important to extend the learning process to traditional base R in order to take advantage of the other data structures necessary for a typical R workflow.

 

Data Types 

The SAS DATA step has exactly two data types: character and numeric. On its face this seems like a limitation, but it has its advantages. This simple system includes reasonable defaults for coercion and type conversion, so users rarely worry about these details. This contrasts with R, which includes six atomic vector types that can be expanded and adapted across packages and use cases. The table below roughly maps the basic vector types in R to their corresponding SAS versions.

SAS and R Atomic Types

  • logical → SAS numeric (stored as 0/1)
  • integer → SAS numeric
  • double → SAS numeric
  • character → SAS character
  • complex, raw → no SAS equivalent

Formats

SAS applies formatting attributes to simulate many distinct types of variables. SAS dates, for instance, are simply a numeric value representing the number of days relative to January 1st, 1960. Users treat date values like any other number, but the added format displays them as a date. Dates work similarly in R, only the value is relative to January 1st, 1970. When working with a SAS data set that includes dates, I often start every program with a quick conversion:

dateVariables <- c("date1", "date2", "date3")

df[, dateVariables] <- lapply(df[, dateVariables], function(sasDate) {
  as.Date(sasDate, origin = "1960-01-01")
})

Users can also distinguish continuous from categorical variables by attaching a format to a numeric variable. PROC FORMAT provides display values and an ordering for a variable, and SAS users take advantage of this to create ordered output or to define reference groups in the CLASS statement of many models.

SAS users will find the R factor type to be the closest analog to a SAS format. A factor is a categorical vector that can be ordered or unordered. Just as in SAS, factor variables can be used for post hoc contrasts, statistical interactions, and ordered results. Unlike SAS, factor labels and levels are hard-coded into the data; there is no need for a format library, which makes R data more portable.

 

SAS Code


R Version

df |>
  dplyr::mutate(var1 = factor(var1,
                              labels = c("No", "Yes", "Missing"))) |>
  dplyr::count(var1)

     var1 n
1      No 3
2     Yes 2
3 Missing 1

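To cover the CLASS-statement use cases above (ordering and reference groups), here is a hedged base R sketch with made-up variable names, using factor() with ordered = TRUE and relevel():

```r
# An ordered factor: levels carry a meaningful order
severity <- factor(c("Mild", "Severe", "Moderate", "Mild"),
                   levels = c("Mild", "Moderate", "Severe"),
                   ordered = TRUE)

# An unordered factor with a chosen reference level for modeling
group <- factor(c("Drug", "Placebo", "Drug"))
group <- relevel(group, ref = "Placebo")
```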
Coercion

SAS is often smart enough to identify a number even when it is coded as a character string. Coercion is the process of converting a variable from one type to another, and it is often a feature included by developers to make the user experience a bit more seamless. Because of the simple data types used in SAS, it is implemented very intuitively, and SAS users rarely think about it until they see a gentle warning in their logs. Take the following example. This sample data set uses character variables, char1 and char2, to store numeric values, and I want to calculate their mean:

 

SAS Code

SAS is helpful here: it knows that char1 and char2 are numbers stored as strings and automatically converts them for the mean() function.


R Version

In R, the mean() function does not coerce character variables to numeric. The underlying assumption is that a mean is not meaningful for character data, so the user must do the conversion manually. In this case, R has set new to missing.

one <- data.frame(NAME = c("BRIAN", "DESIREE", "MARGOT", "OWEN", "RIPLEY"),
                  char1 = c("1", "2", "3", "1", "10"),
                  char2 = c("2", "3", "4", "5", "6"))

one |>
  dplyr::rowwise() |>
  dplyr::mutate(new = mean(char1, char2))

# A tibble: 5 × 4
# Rowwise:
  NAME    char1 char2   new
  <chr>   <chr> <chr> <dbl>
1 BRIAN   1     2        NA
2 DESIREE 2     3        NA
3 MARGOT  3     4        NA
4 OWEN    1     5        NA
5 RIPLEY  10    6        NA

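To reproduce what SAS does implicitly, one explicit fix is to coerce the character columns to numeric before averaging. A minimal base R sketch, reusing the sample data from above:

```r
one <- data.frame(NAME = c("BRIAN", "DESIREE", "MARGOT", "OWEN", "RIPLEY"),
                  char1 = c("1", "2", "3", "1", "10"),
                  char2 = c("2", "3", "4", "5", "6"))

# Coerce the character columns to numeric, then average them row-wise
one$new <- (as.numeric(one$char1) + as.numeric(one$char2)) / 2
```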
As a general rule, R functions are designed to work with a specific type of data: mean() will only work on numbers, and substr() will only work on characters. There are exceptions, however, and this often seems like an illogical break in consistency to SAS users. R also supports generic functions that behave differently depending on the data. For instance, the summary() function will provide a numeric distribution for a vector of numbers, or structured statistical output when run on a glm object.
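A small illustration of this generic dispatch, calling summary() on two different types:

```r
x <- c(2, 4, 6, 8)
summary(x)   # numeric method: Min., quartiles, Mean, Max.

f <- factor(c("No", "Yes", "Yes"))
summary(f)   # factor method: a count for each level
```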

Package developers often incorporate a mixture of generic and type-specific functions, because this kind of flexibility is a feature R shares with many programming languages. It takes some getting used to for SAS programmers, but it ultimately allows users to spend more time understanding their data and less time thinking about how to process it.

DATA Step Mechanics

SAS DATA step processing operates in two sequential phases. In the compilation phase, SAS scans the code for syntax errors and translates it into machine language. It initializes a program data vector (PDV) in memory, where it builds the output data set one observation at a time. The PDV includes the final variables, with the attributes requested by the DATA step.

The program is then executed as a loop, with each row of the input data set read into the PDV and processed individually. If no errors are found in the observation, the result is written to the final data set. This process repeats until SAS finishes the last observation or an error is identified.

Variable attributes are defined at two stages of the DATA step. They can be declared explicitly through statements such as INPUT, LENGTH, LABEL, or FORMAT. Alternatively, SAS can infer these attributes, typically from the initial or final values fed to the PDV.

This observation-based processing trains SAS programmers to think of the observation as the basic unit of every SAS data set. As discussed above, the vector is the basic data structure in R: data frames are simply collections of equal-length vectors, and each step of the processing is a manipulation of those vectors.
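To make the contrast concrete, here is a hypothetical sketch of the two mental models in R: an explicit row-by-row loop (roughly what the DATA step does) versus a single vectorized operation:

```r
df <- data.frame(var1 = c(1, 2, 3), var2 = c(2, 3, 4))

# DATA-step mental model: process one observation at a time
df$total_loop <- NA_real_
for (i in seq_len(nrow(df))) {
  df$total_loop[i] <- df$var1[i] + df$var2[i]
}

# R mental model: one operation on whole vectors
df$total_vec <- df$var1 + df$var2
```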

Native R users may find the SAS DATA step to be a very inefficient process, but SAS programmers find it intuitive, and the habit is difficult to break. Here are two examples that SAS users will find commonplace but that work very differently in R.

 

Example 1: Summarizing Variables

A typical workflow includes deriving new variables based on the values of others. In this case, the user wants to calculate the sum of two variables. Since they are accustomed to SAS’s observation-based processing, they could choose the built-in sum() function to calculate the total, and this works as expected. However, in R, the sum() function is vectorized, which produces a very different result.


 

SAS Code

The SAS code works as expected: newvar is simply the row-wise sum of var1 and var2.


R Version

In R, the sum() function is vectorized, so the result is the single sum of every value in both columns, repeated on every row. This is not what the user intended.

one |>
  dplyr::mutate(newvar = sum(var1, var2))

     Name var1 var2 newvar
1   Brian    1    2     37
2 Desiree    2    3     37
3  Margot    3    4     37
4    Owen    1    5     37
5  Ripley   10    6     37

Correct R Version

To get the desired result, the user can force R to treat the data frame similarly to SAS by using dplyr::rowwise(). Remember to dplyr::ungroup() at the end of each code chunk. Note that dplyr::rowwise() also demonstrates how painfully slow observation-based processing can be.

one |>
  dplyr::rowwise() |>
  dplyr::mutate(newvar = sum(var1, var2)) |>
  dplyr::ungroup()

# A tibble: 5 × 4
  Name     var1  var2 newvar
  <chr>   <dbl> <dbl>  <dbl>
1 Brian       1     2      3
2 Desiree     2     3      5
3 Margot      3     4      7
4 Owen        1     5      6
5 Ripley     10     6     16

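That said, rowwise() is rarely necessary here. The idiomatic (and much faster) fix is to use an operation that is already element-wise, such as +. A base R sketch with the same data:

```r
one <- data.frame(Name = c("Brian", "Desiree", "Margot", "Owen", "Ripley"),
                  var1 = c(1, 2, 3, 1, 10),
                  var2 = c(2, 3, 4, 5, 6))

# + is already vectorized, so no rowwise() is required
# (the dplyr equivalent: one |> dplyr::mutate(newvar = var1 + var2))
one$newvar <- one$var1 + one$var2
```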
Example 2: RETAIN Statement

The RETAIN statement is a common feature of the SAS DATA step. It is used to carry forward the value of a variable from one observation to the next, which is useful for creating lagged variables or accumulating totals. The RETAIN statement works because of the observation-based processing in SAS: on each iteration of the PDV, the prior value is retained rather than initialized to missing. SAS programmers use this for all sorts of things, from creating sequential identifiers to imputing missing values.

Let’s look at a SAS example in which the RETAIN statement is used to calculate a cumulative total over years:

 

SAS Code

The SAS code works as expected: RETAIN carries the running total forward, so cum_sum accumulates the values of N across the years.


R Version

These tools exist in R. The cumsum() function calculates the cumulative sum of a vector and is a vectorized equivalent of this RETAIN pattern. Operations like this are known as window functions and are fully supported in the dplyr package. In other contexts, such as time series work, R provides built-in functions such as lag(x, k), where x is the variable on which to compute and k is the number of units to lag.

one |>
  dplyr::mutate(cum_sum = cumsum(N))

  year   N cum_sum
1 2000 100     100
2 2001  42     142
3 2002 106     248
4 2003 225     473
5 2004  47     520
6 2005  68     588
7 2006  92     680
8 2007 136     816
9 2008 178     994

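As a companion to cumsum(), a lagged variable (another common use of RETAIN) can be sketched in base R by shifting the vector down one position; the data here are a made-up subset for illustration:

```r
one <- data.frame(year = 2000:2004, N = c(100, 42, 106, 225, 47))

# Shift N down one row, filling the first slot with NA
# (dplyr::lag(N, n = 1) does the same inside a mutate())
one$prev_N <- c(NA, head(one$N, -1))

# Year-over-year change, the kind of derivation RETAIN handles in SAS
one$change <- one$N - one$prev_N
```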
Conclusions

Migrating to a new technology almost always forces people out of their comfort zones and sometimes a new tool will require users to reconceptualize basic parts of the job. The big hurdle for SAS users who want to work in R is moving from the rigid, but comfortable and reassuring, SAS DATA step to work with algorithms that process vectorized data.
