Positron and Altair SLC

Introduction

In an earlier post I argued that SAS remains a major tool in the data science community, especially in pharma, finance, and the public sector. R and Python are not rare in these industries, but SAS is still required. Data scientists must choose between juggling multiple IDEs and using SAS Viya, the only platform I know of that integrates SAS, R, and Python into a single IDE. I would certainly prefer an alternative.

Posit has released a public beta of its new IDE, Positron, and I have spent the last couple of months evaluating it. I’d like to share a neat little feature of Positron that hasn’t gotten much attention but may have enormous benefits for the data science community. Positron supports a wide variety of VS Code extensions, one of which is the SAS Language Compiler (SLC) available from Altair.

SAS in Data Analysis

SAS remains an industry standard for data management and analysis, and in many regulated industries it is still more widely used than any open-source analysis tool. The SAS ecosystem is deeply embedded in these industries, and my consulting group works closely with companies that are migrating from SAS to R and Python. One of the main challenges I face is pushback from SAS users who argue that SAS is the best tool for certain data management tasks.

After years of working with these clients, I’ve come to agree that SAS is a superior platform for working with tabular data. Preparing, munging, and managing data is as vital to data analysis as the sexier statistical models, graphics capabilities, and advanced algorithms available in R and Python.

SAS recognizes the value of open-source analytics and has incorporated R and Python into the Viya platform. As far as I know, Viya has been the only commercial product that gives users a single environment for managing SAS-R-Python workflows. This was a step in the right direction for SAS, but it still locks users into a proprietary ecosystem with an arguably clunky interface.

Altair SLC™

Altair SLC™ is an alternative that allows users to develop and run SAS code without buying into the SAS ecosystem. I’ve been using it for several months now and it is a remarkable system. It can read and write .sas7bdat files, execute the full DATA step language, and run all the major procedures. Altair acknowledges that its compiler is only about 90% complete, and I have noticed a few procedure options missing here and there. But my overall impression is that this is a viable alternative to SAS for most of my clients.

Enter Positron

In the multilingual environment of data science, we often find ourselves juggling R and Python daily. Recognizing the need for a more integrated workflow, Posit has introduced a variety of tools that bring both languages into a single environment. Its latest product is the open-beta release of the Positron IDE. Positron provides first-class support for both R and Python, promising tighter integration between the two.

Positron and Altair SLC

I was curious to see whether I could integrate Positron with Altair SLC, and I’m excited to share that the process was straightforward: Positron can install many of the extensions built for VS Code, one of which is Altair SLC.

Integrating Positron with Altair SLC

Configuring this was surprisingly easy. You will first need to download SLC from Altair. Altair offers several licensing options, but there is a free community edition for evaluation, and that is what I have been using. Once SLC is installed and its license is activated, search for SLC in the Positron Extensions tab and click “Install”. Once the extension is installed, you will need to “Start” the SLC session in Positron. You can then open a new SLC Notebook file and start running SAS code.

Altair SLC dashboard

Example Workflow

Before I wrote this post, I wanted to evaluate how this would work in a real-world workflow. I spent most of my career as an epidemiologist, so I decided to download a public dataset from the Centers for Disease Control and Prevention to demonstrate how one would use SAS and R within a single project. I wanted to evaluate a simple epidemiological question: are current and former smoking associated with cancer? I’ve included this workflow in a public GitHub repository for review.

The National Health Interview Survey (NHIS) interviews a nationally representative sample of households to assess demographics, health behaviors, and prevalent diseases. It has been conducted annually since 1957 and is a valuable resource for epidemiologists.

Getting the Data into Positron

The data are provided in ASCII and CSV formats. Each file comes with a SAS program for importing the data into SAS, including formats and labels for all variables.

Running the 01. Import Data.sas file in the SLC Notebook imports these data into the SLC environment with few changes to the code. By default, SLC saves data as a .wpd file, which is not compatible with R. You can change this by naming the sas7bdat engine in the LIBNAME statement, which explicitly requests a .sas7bdat file. Once this change was made, the code provided by NHIS ran without error in Positron.

				
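    /* Folder that holds the downloaded NHIS files */
    %let nhisfolder = /users/briancarter/onedrive/myBlogs/positronBlog/data;
    filename ASCIIDAT "&nhisfolder/adult23.dat";

    /* Name the sas7bdat engine so SLC writes a .sas7bdat file instead of .wpd */
    libname NHIS sas7bdat "&nhisfolder";

    /* &adultds must already be defined (for example with %let) before this step */
    data NHIS.&adultds;
        infile ASCIIDAT;
        /* --- input code --- */
    run;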
				
			

The NHIS has many hundreds of variables and I don’t need them all. Moreover, the raw variables are not always in the form my analysis needs, so I had to do a bit of cleanup and derivation first. 02. Data Subset.sas subsets the NHIS data to a smaller set of variables, and from there I got to work.

The nice thing about SAS is that it is a near-perfect tool for managing tabular data. I am primarily an R user, but data management in R carries added complexity (variable classes, NA values, dplyr versus base R) and requires a lot more expertise than creating a ggplot or running a GLM. I particularly like the IF-THEN-ELSE syntax for recoding variables, and I really like having it in Positron.

My favorite feature of the SAS DATA step works fantastically in Positron-SLC. SAS arrays are a programming shortcut for applying the same operation to a group of similar variables. I can do the same thing in R using lapply() or purrr, but the simplicity of SAS arrays has always appealed to me. So I was excited to run the following code to clean up all the prevalent cancer variables:

				
    /* Original NHIS cancer indicators */
    array orig (*) BLADDCAN_A BLOODCAN_A
        BONECAN_A BRAINCAN_A BREASCAN_A CERVICAN_A
        ESOPHCAN_A GALLBCAN_A LARYNCAN_A
        LEUKECAN_A LIVERCAN_A LUNGCAN_A LYMPHCAN_A
        MELANCAN_A MOUTHCAN_A OVARYCAN_A PANCRCAN_A
        PROSTCAN_A
        STOMACAN_A THROACAN_A THYROCAN_A
        UTERUCAN_A HDNCKCAN_A COLRCCAN_A OTHERCANP_A;

    /* Derived 0/1 indicators, listed in the same order as orig */
    array derived (*) bladder blood bone brain breast cervix
                  esoph gallbladder larynx leukemia
                  liver lung lymphoma melanoma oral ovary
                  pancreas prostate stomach throat
                  thyroid uterine hodgkins crc other;

    /* A value of 1 becomes 1; every other value, including missing, becomes 0 */
    do i = 1 to dim(orig);
        if orig{i} = 1 then derived{i} = 1;
        else derived{i} = 0;
    end;
    drop i; /* the loop counter is not needed in the output data set */
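
For comparison, the R equivalent mentioned above could look roughly like the following sketch. It uses base R's lapply() on a small synthetic data frame containing three of the cancer variables. The values, and the assumption that 1 means "yes", are for illustration only and are not real NHIS records.

```r
# Synthetic stand-in for three NHIS cancer indicators (assumed coding: 1 = yes).
df <- data.frame(
  BLADDCAN_A = c(1, 2, 2, 9),
  LUNGCAN_A  = c(2, 1, 2, 2),
  BREASCAN_A = c(1, 1, NA, 2)
)

# Two parallel character vectors play the role of the two SAS arrays.
orig    <- c("BLADDCAN_A", "LUNGCAN_A", "BREASCAN_A")
derived <- c("bladder", "lung", "breast")

# %in% returns FALSE for NA, so missing values become 0,
# which matches the ELSE branch of the DATA step loop.
df[derived] <- lapply(df[orig], function(x) as.integer(x %in% 1))

print(df)
```

The pairing of original and derived names is positional here, just as it is with the two SAS arrays, so the two vectors must be kept in the same order.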

				
			

Table 1

When I was a SAS programmer, I would create my output tables using PROC REPORT. Over the years I had developed some styling syntax that would make it look OK, but PROC REPORT is a clunky tool and I never liked it.

R provides data scientists with various tools for producing publication-ready tables that can be output in various formats. For Table 1, I used the flextable package to create a table that summarizes the NHIS data. Categorical variables must be converted to factors and each variable should have a label attribute.
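
The preparation step described here (factors plus a label attribute) might look like the minimal sketch below. The data frame is synthetic, and the numeric codes, level names, and label text are assumptions made for illustration; the real recoding lives in the project repository.

```r
# Synthetic stand-in for the analysis data; codes and labels are hypothetical.
df <- data.frame(
  age         = c(34, 61, 72, 45),
  smokestatus = c(1, 2, 3, 1),
  allCancer   = c(0, 1, 1, 0)
)

# Convert the categorical variables to factors with readable levels.
df$smokestatus <- factor(df$smokestatus, levels = 1:3,
                         labels = c("Never", "Former", "Current"))
df$allCancer <- factor(df$allCancer, levels = c(0, 1),
                       labels = c("No cancer", "Any cancer"))

# Attach a "label" attribute to every variable.
var_labels <- c(age = "Age at interview",
                smokestatus = "Smoking status",
                allCancer = "Any cancer diagnosis")
for (v in names(var_labels)) {
  attr(df[[v]], "label") <- var_labels[[v]]
}

# The labels can then be pulled back out as a named character vector.
labels <- sapply(names(df), function(x) attr(df[[x]], "label"))
print(labels)
```

With the data prepared along these lines, the table itself is built as follows.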

				
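    library(dplyr)      # provides select()
    library(flextable)

    # Keep only the variables shown in Table 1
    subset <- df |>
      select(age, smokestatus, allCancer)

    # Collect each variable's "label" attribute to use as row labels
    labels <- sapply(names(subset), function(x) attr(subset[[x]], "label"))

    # Summarize by cancer status and format the result as a flextable
    subset |>
      summarizor(by = "allCancer") |>
      as_flextable(spread_first_col = TRUE, separate_with = "variable") |>
      bold(i = ~!is.na(variable), j = 1, bold = TRUE) |>
      set_caption("Table 1: Descriptive statistics of the NHIS 2023 data") |>
      labelizor(j = "stat", labels = labels)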
				
			

table 1

Figure 1

Many of us got our start in R through its graphics capabilities. There are several packages that produce beautiful graphics, but ggplot2 is the most popular because it is easy to use and dovetails nicely with the tidyverse ecosystem. For Figure 1, I’ve used ggplot2 to create boxplots of the age distribution by cancer prevalence. Cancer affects older people more than younger people, and the boxplots clearly demonstrate this.

				
    library(dplyr)    # provides filter()
    library(ggplot2)
    library(ggpubr)   # provides ggarrange()

    # Age by cancer status among men
    one <- df |>
      filter(sex == "Male") |>
      ggplot(aes(x = allCancer, y = age)) +
      geom_boxplot(fill = "#A4A4A4", color = "black") +
      ggtitle("Men") +
      xlab("") +
      ylab("Age at interview") +
      theme(plot.title = element_text(hjust = 0.5))

    # Age by cancer status among women
    two <- df |>
      filter(sex == "Female") |>
      ggplot(aes(x = allCancer, y = age)) +
      geom_boxplot(fill = "#A4A4A4", color = "black") +
      ggtitle("Women") +
      xlab("") +
      ylab("Age at interview") +
      theme(plot.title = element_text(hjust = 0.5))

    # Place the two panels side by side
    ggarrange(one, two, nrow = 1)

				
			

Table 2

To evaluate whether smoking is associated with cancer, I wrote a function that fits a logistic regression model to the data. The glm() function is bundled with base R and fits a variety of generalized linear models; I used it to evaluate the association between smoking and all cancers, breast cancer, and lung cancer. My function pulls out the appropriate statistics, and I rely on the gt package to format the output into a table.

The gt package is a handy alternative to flextable and I like how it formats my output. The gt package has methods for automatically tabling output from a variety of models, but I chose to subset my output in the function prior to tabling it.
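
The modeling function itself is not shown in this post, so here is a rough sketch of what such a helper could look like. The name fit_smoking_model(), its arguments, the output columns, and the synthetic data are all assumptions for illustration and are not the myModels() function from the repository. This version adjusts for age only and uses Wald confidence intervals.

```r
# Hypothetical helper: logistic regression of a 0/1 outcome on an exposure,
# adjusted for age, returning odds ratios with 95% Wald confidence intervals.
fit_smoking_model <- function(data, outcome, label, exposure) {
  form <- reformulate(c(exposure, "age"), response = outcome)
  fit  <- glm(form, data = data, family = binomial())
  est  <- coef(summary(fit))

  # Keep only the rows that belong to the exposure variable.
  keep <- grepl(exposure, rownames(est), fixed = TRUE)
  beta <- unname(est[keep, "Estimate"])
  se   <- unname(est[keep, "Std. Error"])

  data.frame(
    # The outcome label appears on the first row only.
    Outcome = c(label, rep(NA_character_, sum(keep) - 1)),
    Level   = sub(exposure, "", rownames(est)[keep], fixed = TRUE),
    OR      = exp(beta),
    Lower   = exp(beta - 1.96 * se),
    Upper   = exp(beta + 1.96 * se)
  )
}

# Synthetic data, generated only to show the function running.
set.seed(1)
n <- 500
toy <- data.frame(
  age = round(runif(n, 18, 85)),
  smokestatus = factor(
    sample(c("Never", "Former", "Current"), n, replace = TRUE),
    levels = c("Never", "Former", "Current")
  )
)
toy$allCancer <- rbinom(
  n, 1, plogis(-4 + 0.04 * toy$age + 0.5 * (toy$smokestatus != "Never"))
)

res <- fit_smoking_model(toy, "allCancer", "All Cancer Combined", "smokestatus")
print(res)
```

Leaving the Outcome column as NA after the first row is a design choice: several results can be stacked with rbind() and the repeated labels blanked out before the table is formatted.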

				
    library(dplyr)   # provides mutate()
    library(gt)

    # myModels() is the helper function described above; it is defined in the project repository
    allcancer <- myModels(df, "allCancer", "All Cancer Combined", "smokestatus")
    breast <- myModels(df, "breast", "Breast cancer (women only)", "smokestatus")
    lung <- myModels(df, "lung", "Lung cancer", "smokestatus")

    # Stack the three results, blank out repeated outcome labels, and format with gt
    rbind(allcancer, breast, lung) |>
      mutate(Outcome = tidyr::replace_na(Outcome, "")) |>
      gt() |>
      tab_header(
        title = "Association between smoking status and cancer",
        subtitle = "NHIS Data 2023") |>
      tab_footnote("Models adjust for age and race") |>
      opt_align_table_header(align = "left") |>
      opt_vertical_padding(scale = 0.5)
				
			

Table 2: Association between smoking status and cancer

Conclusion

I moved from SAS to R because R gave me access to the latest statistical models, big data manipulation, and graphics. But I always missed the simplicity and effectiveness of the SAS DATA step. I grew accustomed to doing these tasks in R because moving processes across multiple IDEs was a pain.

Now that I can integrate the Altair SLC with Posit’s new IDE, I can do everything I need to do within a single environment and choose the most appropriate tool for the job. I can’t wait to see what the future holds for Positron and the Altair SLC and I am excited to see if the SLC can be integrated into some of Posit’s server-based products as well.

Stay tuned…
