Yellowstone: A case study of data science


Recently, I visited Yellowstone National Park, which is well known for its geyser basins, lakes, scenic mountains, and variety of wildlife. As a data scientist, I was amazed by the park officials’ ability to predict the time and duration of geyser eruptions. My extreme curiosity drove me to explore how they did it. After reviewing a few articles, I learned about their prediction process and the data science behind it.

I specifically investigated the prediction of the most famous geyser, Old Faithful. While playing with the data, I was surprised to find that R itself ships with a built-in data set of Old Faithful eruptions, faithful, which records eruption durations and the waiting times between eruptions. Below is a sample of the data set.
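For readers who want to follow along, here is a quick look at the data in R. The faithful data set ships with base R and contains 272 observations of eruption duration and waiting time, both in minutes:

```r
# Peek at the built-in Old Faithful data set
head(faithful)
#>   eruptions waiting
#> 1     3.600      79
#> 2     1.800      54
#> 3     3.333      74
#> 4     2.283      62
#> 5     4.533      85
#> 6     2.883      55
```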

Old Faithful is currently bimodal: it has two typical eruption durations, one lasting more than four minutes and a rarer one lasting about two and a half minutes. The objective is to figure out whether a given eruption will be long or short, which is a clear classification problem.
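A histogram makes this bimodality easy to see. Here is a minimal sketch (the number of bins is an illustrative choice):

```r
# Two peaks, around 2 and 4.5 minutes, reflect the short and long eruption modes
hist(faithful$eruptions,
     breaks = 20,
     xlab = "Eruption duration (minutes)",
     main = "Old Faithful eruption durations")
```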

After plotting the data, I saw a roughly linear relationship between waiting time and eruption duration: as the waiting time increases, so does the duration of the eruption. For this reason, I chose to employ a simple machine learning algorithm, linear regression. Although this is a classification problem, regression gives us more information, since it predicts the actual duration; we can then convert the predictions into two classes by applying a threshold to the regression output. Since the data set ships with R, I performed the predictions in R.
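The plot I describe can be reproduced with a one-liner; the axis labels are my own additions:

```r
# Waiting time vs. eruption duration; the upward, roughly linear trend
# motivates a simple linear regression
plot(faithful$waiting, faithful$eruptions,
     xlab = "Waiting time (minutes)",
     ylab = "Eruption duration (minutes)",
     main = "Old Faithful: waiting time vs. eruption duration")
```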

Visualization provides a unique perspective and a better understanding of the data set. It is clear from the box plot below that the waiting times range from roughly 40 to 100 minutes. Interpreting and understanding the quartiles, the median, and the range is important here.
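The box plot can be reproduced as follows:

```r
# Box plot of waiting times; the median, quartiles, and the roughly
# 40-100 minute range are all visible here
boxplot(faithful$waiting,
        ylab = "Waiting time (minutes)",
        main = "Old Faithful waiting times")
```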

Before fitting the model, we split the data into training and testing sets. The lm function in R then fits the linear model to the training data and gives us the estimates (i.e., the values of the slope and the intercept).
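Here is a minimal sketch of that workflow; the 80/20 split ratio and the seed are illustrative assumptions on my part:

```r
# Split the data into training and testing sets (80/20 is an assumed ratio)
set.seed(42)
train_idx <- sample(seq_len(nrow(faithful)), size = 0.8 * nrow(faithful))
train <- faithful[train_idx, ]
test  <- faithful[-train_idx, ]

# Fit a simple linear regression: eruption duration as a function of waiting time
model <- lm(eruptions ~ waiting, data = train)
summary(model)  # reports the intercept (c) and slope (m) estimates
```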

y = mx + c

ExpectedDuration = m × (waiting time) + c

Now, let’s assume the waiting time between eruptions is 75 minutes. Plugging it into the equation above with the estimated slope and intercept gives:

ExpectedDuration = 0.073901 × 75 − 1.792739 = 3.749836
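In R, predict() gives the same kind of result directly from the fitted model (the exact value depends on the training split), and the manual calculation below simply plugs in the slope and intercept shown above:

```r
# Model-based prediction for a 75-minute wait
# (the exact value depends on the training split)
predict(model, newdata = data.frame(waiting = 75))

# Manual check with the estimates shown above
0.073901 * 75 - 1.792739  # 3.749836 minutes
```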

Similarly, the duration for any new waiting time can be predicted in the same way. Based on the results, we can conclude that as the waiting time increases, the predicted eruption duration increases, so the eruption falls into the long-duration class. Conversely, a shorter waiting time places the eruption in the short-duration class.
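A simple thresholding rule converts the predicted duration into a class. The cutoff of 3 minutes below is an assumption I chose between the two modes, not a value fixed by the park's method:

```r
# Classify a predicted duration as a long or short eruption
# (the 3-minute threshold is an assumed cutoff between the ~2.5 and
#  ~4.5 minute modes)
classify_eruption <- function(predicted_duration, threshold = 3) {
  ifelse(predicted_duration > threshold, "long", "short")
}

classify_eruption(3.749836)  # "long"
```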

In summary, we don’t necessarily need a complex machine learning or deep learning algorithm to solve a data science problem. My interest in knowing how the geyser eruptions were predicted led me to play around with the data, and deriving these conclusions was organic, born of sheer curiosity. A simple linear regression model achieved good results, so applying a complicated algorithm does not always equate to better outcomes. Sometimes a simple algorithm, such as linear regression, is actually more effective.
