Yellowstone: A case study of data science

Author

Bioinformatics
Data Science

Sign up for our newsletter

We care about the protection of your data. Read our Privacy Policy.

Recently, I visited Yellowstone National Park, which is well known for its geyser basins, lakes, scenic mountains, and variety of wildlife. As a data scientist, I was amazed by the park officials’ ability to predict the time and duration of geyser eruptions. My extreme curiosity drove me to explore how they did it. After reviewing a few articles, I learned about their prediction process and the data science behind it.

I specifically investigated the prediction of the most famous geyser, Old Faithful. While playing with the data, I found—to my surprise—that R Studio has a built-in data set for Old Faithful eruptions. Below is a sample of the data set.

Old Faithful is currently bimodal. It has two eruption durations, one that lasts more than four minutes and a second that occurs more rarely and lasts about two-and-a-half minutes. The objective is to figure out if it is a long or short eruption—a clear classification problem.

After plotting the data, I saw a linear relationship between the waiting time and duration, which implies that as the waiting time increases, the duration of the eruption increases. For this reason, I chose to employ a simple machine learning algorithm, linear regression. Although it was a classification problem, we can apply regression and obtain more accurate results. To convert them into two classes, we enforce a condition on the regression results. The data set already exists in R, so I performed the predictions using R programming.

Visualization provides a unique perspective and better understanding of the data set. It is clear from the box plot below that the range is around 40-100. Interpreting and understanding the quartiles, median and the range is important here.

Once the model is finalized, we would split the data into training and testing sets. The lm function in R applies the linear model to the data and gives us the estimates (i.e., the value of the intercept/constant).

y = mx + c

ExpectedDuration = m (waiting time) + c

Now, let’s assume the waiting time between eruptions is 75 minutes. We would use the above equation and have:

ExpectedDuration = 0.073901 (75) -1.792739 = 3.749836

Similarly, for any new waiting time, the duration would be predicted in the way shown above. Based on the results, we can conclude that as the waiting time increases, the geyser eruption will have a longer duration. Therefore, it would fall into the long duration class. On the other hand, if the waiting time decreases, the geyser eruption would fall into the short duration class.

In summary, we don’t necessarily need to apply a complex machine learning or deep learning algorithm to solve a data science problem. My interest to know how the geyser eruptions were predicted led me to play around with the data. Deriving these conclusions was organic and came out of sheer curiosity. By performing a simple linear regression model, we have achieved great results. So, applying a complicated machine learning algorithm does not always equate to great results. Sometimes, a machine learning algorithm is not necessary, and a simple algorithm, such as a linear regression, is actually more effective.

Author

Bill Carney

View all posts

Subscribe to our newsletter

Stay informed with the latest insights, industry trends, and expert tips delivered straight to your inbox. Sign up for our newsletter today and never miss an update!

We care about the protection of your data. Read our Privacy Policy.

Keep reading

Dig deeper into data development by browsing our blogs…

A wide landscape digital illustration for a blog titled "Turning AI Potential into Impactful Business Use Cases". The image features a futuristic, glowing blue cityscape representing a data-driven "frontier firm". In the foreground, a translucent human hand interacts with a holographic interface displaying data charts and AI icons, symbolizing the transition from human assistants to autonomous, agent-led operations.

Get in Touch

Let us leverage your data so that you can make smarter decisions. Talk to our team of data experts today or fill in this form and we’ll be in touch.

Take a deeper dive

Locate Us

Follow Us

Contact Us

Take a deeper dive

Locate Us

Follow Us

Contact Us

Yellowstone: A case study of data science

Author

Bill Carney

Table of Contents

Categories

Sign up for our newsletter

Author

Subscribe to our newsletter

Keep reading

Get in Touch