In today’s rapidly evolving landscape, the intersection of sports and technology offers unprecedented opportunities for growth, efficiency, and enhanced experiences. This document explores the collaborative journey between ProCogia and the Vancouver Whitecaps FC, a partnership forged to leverage cutting-edge technological solutions to overcome one of the club’s major hurdles. This collaboration exemplifies how a forward-thinking sports organization can harness innovation to optimize performance, deepen fan connections, and streamline operations.
We will present the challenges our teams faced and the solution we implemented, along with its positive impact on athlete talent evaluation tasks. But what if talent management is just the beginning? We will also discuss how these same data science principles can be applied to other areas, highlighting the broader potential of data-driven decision-making in the world of professional sports.
Data analytics in sports: How did we get here?
Competition in modern sports is getting fiercer. Recent advances in training techniques, dietary schedules, medical treatments, and sensor technology have pushed human ability beyond what was previously possible. One tangible sign of this development is that sports records now stand for shorter periods of time (1).
However, making athletes stronger and more resilient is not the only way to stand out from the competition. Choosing when and how to use team and player strengths is key to winning. With data, statistics, and modeling, coaches and trainers can design strategies to boost athletes’ performance and team synergy by focusing their efforts on the right areas.
The integration of data into sports strategy began with pioneers like Coach Jim Coleman, who introduced statistical analysis to volleyball in the 1960s, developing metrics that helped the USA Men’s team win major titles and are still used globally (2). This data-driven approach was further exemplified in baseball by sportswriter Bill James, whose empirical critiques of traditional player evaluation since the 1970s (3) inspired Sandy Alderson and Billy Beane to use data analytics to build a high-performing, cost-effective team, famously chronicled in “Moneyball” (4) (5).
Thanks to these early innovators, the sports world recognized the profound value of data. Since those early contributions, data analytics in sports has evolved from a niche area into a critical part of success at the highest levels of competition. This development means that simply having access to data is no longer enough. To maximize performance and gain a competitive edge, establishing a dedicated data science department within a sports team has become the industry norm. Data scientists are now essential not only for collecting and organizing vast amounts of data, but also for applying sophisticated techniques to extract insights, ensuring their teams stay at the top.
Despite the clear benefits, sports organizations often face significant hurdles in implementing effective data science strategies.
Pain points
Some of the most critical issues faced in the industry today are high data complexity, ensuring data quality, dealing with limited resources, and communicating effectively across different levels of the organization. This section presents an overview of these pain points to illustrate why organizations frequently encounter substantial impediments when implementing data solutions in their systems.
Data complexity: The rapid evolution of sports data is marked by an exponential increase in both the volume and complexity of recorded information. Data now goes beyond simple markers: teams need to track arrays of metrics captured through wearable sensors, high-speed cameras, and other tracking systems. This presents opportunities as well as challenges since interpreting data requires significant expertise from analysts.
Figure 1. Multiple data sources relevant to an athlete
Data quality: More data also means more room for errors. Inconsistent data formats, missing values, and inaccuracies are issues to look out for. If the underlying data is flawed, any insights and models derived from it, no matter how sophisticated the analytical technique applied, will be unreliable and may lead to wrong decisions.
Integration is another important aspect of data quality. Since the input data comes from varied sources, it often resides in different systems and formats. Integrating these data sources is a major hurdle, but essential for accurate assessments of player data. Consequently, a substantial part of a sports data analyst’s time is often dedicated to the painstaking process of data cleaning, integration, and validation.
Limited resources: The growing recognition of data analytics’ potential within sports organizations has, ironically, created a new set of challenges for data science teams. As various departments, such as coaching staff, player development, marketing, and player scouting, recognize the value of data-driven insights, their demands on the data science team have intensified. This increased demand for specialized reports, analyses, and visualizations puts immense pressure on already busy analysts. The need to cater to diverse stakeholders, each with their own specific questions and priorities, can stretch resources thin. Effectively managing these competing demands and prioritizing requests becomes a crucial task for data science leaders to ensure the team stays productive and focused on the most valuable projects.
Communication: Data analysts need to communicate their findings in a way that is easily understood and actionable by coaches, players, and management. Bridging the gap between technical analysis and practical application is crucial but often difficult. Coaches, players, and managers might not be familiar with statistical terms and have their own priorities when looking at a report. Being able to determine the key factors for each role and correctly communicating them is imperative for fostering a data-driven culture within a sports organization.
ProCogia’s roadmap to effective sports analytics
ProCogia is a global technology consulting firm specializing in data analytics, AI, and digital transformation solutions. With a proven track record of delivering innovative and impactful solutions across various industries, ProCogia is committed to empowering its partners with the tools and insights needed to thrive in a competitive environment.
At ProCogia we have laid out a clear, seven-step path to help organizations overcome their challenges and transform their raw data into winning results. It is adaptable to each client’s unique needs, ensuring they get the most from every data point. Though this approach can be applied across various industries, this paper will specifically explore its impact on sports analytics.
Figure 2. ProCogia MLOps process flow
1. Project Discovery
We kick off every project by clearly defining its scope, exploring both the origins of the data and its ultimate impact. We map the “upstream” factors, pinpointing available data sources and access methods to proactively identify any limitations. Equally important is understanding the “downstream” effects: who will use the project’s results? Coaches? Managers? The players themselves? By knowing the end-users, we can tailor the output format for maximum impact, translating complex analyses into clear, concise language and intuitive visuals that resonate with each audience.
Figure 3. Project Discovery Phase
2. Data Strategy and Preparation
Once we have established the project’s goals and the available data, we select the analytical techniques, models, and specific analyses needed to achieve those objectives. This is also where we define the key performance indicators (KPIs) needed from each player. It is critical to assess whether the existing data is sufficient to generate the desired KPIs. If not, we proactively develop a plan to acquire the required information.
With all data sources available, we begin the data preparation phase. This phase includes both data transformation and data cleaning. As discussed in the Pain Points section, one of the biggest issues within the analysis process is getting the data clean. According to a survey reported by Forbes in 2016 (6), data scientists spend an estimated 60% of their time on data cleaning tasks. This illustrates how daunting, but also how important, this task can be.
Figure 4. Data Preparation Phase
Therefore, meticulous data preparation, encompassing both collection and cleaning, forms the foundation upon which any successful sports analytics project is built, despite the considerable time investment it requires.
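To make this preparation work concrete, here is a minimal sketch of typical cleaning steps in Python with pandas; the column names and values are hypothetical and do not come from any client dataset.

```python
import pandas as pd

# Hypothetical raw extract with the kinds of flaws described above:
# inconsistent column names, stringly-typed numbers, missing values.
raw = pd.DataFrame({
    "Player Name": ["A. Silva", "B. Jones", None],
    "Avg Speed KMH": ["5.1", "4.8", "n/a"],
})

clean = (
    raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))  # harmonize names
       .assign(avg_speed_kmh=lambda d: pd.to_numeric(d["avg_speed_kmh"],
                                                     errors="coerce"))  # coerce types
       .dropna(subset=["player_name"])  # drop rows missing the join key
)
print(clean)
```

Real pipelines add many more steps (deduplication, cross-source reconciliation, validation rules), but the pattern of harmonize, coerce, and filter is the common core.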
3. Project Design
We believe in seamless integration, not disruption. During the project design phase, our approach focuses on minimizing impact on the client’s current systems and infrastructure, working with what they already have. This means leveraging familiar tools, like existing cloud providers and programming languages, so their teams can effectively understand and maintain ProCogia’s solutions. We also carefully analyze the cost-benefit of every architectural decision, empowering our clients with criteria to choose the optimal solution that fits their specific needs and budget.
4. Model Training
The model training phase is one of the most exciting parts of the project. While the groundwork we have laid is essential, the model training phase is when we truly see things come to life. We train or fine-tune a range of models with varying techniques and parameters. During training we also perform data analyses that help draw conclusions in support of the project’s goals.
5. Testing and Validation
During testing and validation, we compare the performance of all the trained models. This step ensures that the proposed solution not only meets the metrics required by the technical task at hand, but also performs consistently and produces output that adheres to the client’s business needs. Multiple models or solution architectures may be assessed and validated side by side, and the one that best addresses the client’s needs is selected for implementation in the production environment.
A crucial aspect of our model validation process is evaluating robustness to imperfect data. Data sets often have missing values, incomplete entries, and noisy or erroneous information. We evaluate each model’s ability to handle imperfections by simulating such scenarios: randomly introducing missing values, corrupting data points with noise, or providing inconsistent inputs. This process ensures that our solutions are dependable in real-world applications.
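A minimal sketch of this kind of perturbation testing follows, assuming numeric feature matrices; the missing-value fraction and noise scale are illustrative parameters, not our actual test suite.

```python
import numpy as np

def perturb(X: np.ndarray, missing_frac: float = 0.05,
            noise_scale: float = 0.1, seed: int = 0) -> np.ndarray:
    """Simulate imperfect data: add Gaussian noise to every reading and
    randomly blank out a fraction of values, so a candidate model's
    robustness can be checked before it is promoted to production."""
    rng = np.random.default_rng(seed)
    noisy = X + rng.normal(0, noise_scale * X.std(), X.shape)  # noisy readings
    mask = rng.random(X.shape) < missing_frac                  # random gaps
    noisy[mask] = np.nan
    return noisy

# Comparing a model's metrics on X versus perturb(X) gives a simple
# measure of how gracefully it degrades under imperfect inputs.
```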
6. Deployment and Integration
Once the champion model is selected, the focus shifts to integrating it into the client’s existing workflows. This involves more than just deploying the model: it requires building wrappers and supporting infrastructure to ensure smooth recurrent model runs. These wrappers will function as intermediaries, translating data into the format the model expects and then interpreting the model’s output into a format usable by other systems.
We carefully design these structures to minimize disruption and maximize compatibility with the client’s current technology stack. The goal is not just to deliver a model, but to create a sustainable, integrated, and scalable solution that empowers the client to leverage the model’s insights continuously and efficiently.
7. Monitoring and Support
Once the project is deployed, it requires ongoing monitoring. This includes tracking input data to detect any significant drift that could impact model performance, as well as quality checks on model outputs to validate their continued relevance and effectiveness.
We complement model monitoring with comprehensive documentation detailing all technical aspects of the project and the business rationale behind every design decision, leaving our clients with a deep understanding of the system. Finally, we implement Continuous Integration and Continuous Deployment (CI/CD) pipelines, including unit tests and automatic deployment to production environments, to streamline updates and maintenance, ensuring project continuity and adaptability in the face of evolving data and business needs.
Figure 5. An example of a monitoring program designed to track the ML solution
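One simple way such input monitoring can be implemented is a two-sample statistical test per feature; the sketch below uses a Kolmogorov-Smirnov test from SciPy, with an illustrative alert threshold rather than a production-tuned one.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference: np.ndarray, current: np.ndarray,
                alpha: float = 0.01) -> bool:
    """Flag distribution drift in a single input feature by comparing
    production data against the training-time baseline."""
    _, p_value = ks_2samp(reference, current)
    return p_value < alpha  # small p-value: distributions likely differ

rng = np.random.default_rng(1)
baseline = rng.normal(0, 1, 1000)   # feature distribution at training time
live = rng.normal(0.5, 1, 1000)     # shifted distribution in production
print(drift_alert(baseline, live))  # True: drift detected, investigate
```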
The end-to-end development approach just described was central to ProCogia’s partnership with the Vancouver Whitecaps. In the next section we describe how our teams collaborated to develop a solution that helped the Whitecaps data science team boost their efficiency.
The Whitecaps Partnership
The Vancouver Whitecaps Football Club is a professional soccer club with a rich history and tradition, based in Vancouver, British Columbia, Canada. The club has competed in Major League Soccer (MLS) since 2011, but its path started much earlier. Since its founding in 1973, the club has collected many key victories and titles, including four Canadian Championship titles with the team’s current iteration, in 2015, 2022, 2023, and 2024 (7).
The Whitecaps’ data science team was facing many of the pain points listed above: a mountain of requests from different stakeholders, a lean staff, and complex problems demanding immediate solutions. ProCogia stepped in, partnering with the Whitecaps to streamline their scouting process with a copilot that could make their reporting process much easier.
The challenge:
The Whitecaps produce player scouting reports that help their coaches prepare strategies before matches, as well as scout new talent to add to the roster. These reports condense a player’s strengths and weaknesses into bite-size portions to help coaches and management make informed decisions. Producing one such report is not an easy task, with each player having around 150 attributes to be analyzed and correctly assessed. Among these parameters are physical aspects (such as average player speed), attacking parameters (such as a player’s expected goals), and defending parameters (such as average defensive duels won). Connecting all these pieces requires significant expertise in how each metric contributes to a player’s value, with several attributes feeding into underlying composite measures and different player roles assigning different importance to each attribute.
The solution:
With the aid of experts from the Whitecaps, ProCogia developed a virtual assistant backed by a Large Language Model (LLM) that could replicate the scouting report process, significantly reducing the time needed to produce such reports and giving the Whitecaps’ data scientists more time to focus on other tasks. The assistant takes the full set of player performance metrics as input and summarizes them into a couple of short, digestible paragraphs.
This assistant was named Raven (Recruitment Analytics Virtual ENgine). It was able to significantly reduce the time required to analyze player profiles and derive insights for the team. In the following sections we outline some of the key aspects of the partnership and the development process.
1. Laying the Foundation: How the Whitecaps Enabled the Raven Project
Before collaborating with ProCogia on the Raven project, the Vancouver Whitecaps had already laid a significant foundation to support AI-driven performance analysis. Over the past several seasons, the club’s data science and football operations teams developed a robust internal infrastructure for evaluating player performance, including a proprietary Scouting Index that blends statistical modelling with domain expertise from coaches, scouts, and video analysts.
The Whitecaps’ internal evaluation model is grounded in both data and football logic. It reflects real-world insights from match analysis and tactical planning, while also applying rigorous statistical principles to ensure accuracy and repeatability. This model now drives player assessments used across scouting, recruitment, match preparation, and player development.
However, applying this sophisticated system at scale had become a growing challenge. With hundreds of players to scout during each transfer window, and one to two matches per week requiring detailed opposition profiles, the volume and velocity of decisions had outpaced the team’s capacity to translate analytics into coach-friendly language manually. The bottleneck was not in the data, but in the translation layer.
That’s where the Raven project came in. With clean, structured data housed in a centralized warehouse, and with years of modeling and rating systems already in place, the final missing piece was a tool that could convert technical output into accessible insights, without losing the nuance behind the numbers.
By integrating Large Language Models into this workflow, the Raven project allowed the Whitecaps to scale their evaluation framework, ensuring that coaches and executives received timely, relevant summaries tailored to each player’s role. Importantly, the LLM doesn’t replace expert judgment; it reflects it. The system simply puts into words what the club’s internal model already understands, allowing key stakeholders to act faster without sacrificing context or quality.
The result is not just an efficiency gain — it’s a strategic unlock. Raven now enables the club to apply its performance insights more consistently across departments, aligning data science with football operations in a way that’s practical, interpretable, and built for speed.
2. Data Preparation
The Whitecaps’ data science team has cultivated a mature data infrastructure. Their data is well organized, eliminating the issues around missing information and inconsistent labeling. Moreover, all relevant player KPIs are already calculated, and all their information is consolidated on a centralized server, streamlining access and facilitating efficient development. This well-organized system provided a solid foundation for the project. The work required for data preparation was mostly limited to extracting existing information.
LLMs, like humans, need context to interpret raw numbers. Is 20 degrees Celsius hot or cold? Is 5 km/h a good or bad player speed? To bridge this gap, we translated raw data into contextualized language. By analyzing the distribution of each player metric, we created descriptive categories (e.g., poor, average, good, excellent) for each performance range. Then, instead of seeing “Player X has an average speed of 5 km/h”, the LLM sees “Player X has an excellent average speed”, which is much easier to interpret.
Figure 6. Contextualizing a player metric using its distribution
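A minimal sketch of this contextualization step, assuming quantile-based bins over the player pool; the labels, values, and `contextualize` helper are illustrative, not Raven’s actual implementation.

```python
import pandas as pd

LABELS = ["poor", "below average", "average", "good", "excellent"]

def contextualize(values: pd.Series) -> pd.Series:
    """Bin a raw metric into descriptive categories using quantiles
    of its distribution across the player pool."""
    return pd.qcut(values, q=len(LABELS), labels=LABELS)

# Hypothetical average speeds (km/h) across a scouting pool
speeds = pd.Series([4.2, 4.4, 4.6, 4.9, 5.0, 5.1, 5.4, 5.8, 6.0, 6.3])
for speed, category in zip(speeds, contextualize(speeds)):
    print(f"average speed of {speed} km/h -> '{category}' average speed")
```

Quantile binning is one simple choice; domain experts may instead set thresholds by hand when a metric’s meaning doesn’t track its raw distribution.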
LLMs greatly benefit from seeing a small number of examples demonstrating the desired input-output relationship. These examples serve as context, allowing the LLM to quickly grasp the task and generate relevant responses, even for novel inputs. This approach of feeding examples is called few-shot prompting (8). Few-shot prompting is particularly useful when dealing with tasks that are difficult to formalize through explicit instructions. It also has an advantage over model fine-tuning, which requires greater capital and time investment.
To provide the LLM with meaningful examples, we asked experts to select representative player profiles that best reflect how each player metric can be interpreted when creating a scouting report. Each profile contains the player’s metrics and the corresponding scouting report, to be passed as examples in the few-shot prompt.
To ensure the LLM accurately interprets the player data, we provide detailed descriptions for each attribute. Abbreviations and technical jargon, like “xG” (Expected Goals), lack intuitive meaning and require context. Therefore, we compiled a comprehensive glossary, created in collaboration with domain experts, mapping each raw attribute name to its full, readily understandable description. This provides the LLM with the necessary background information to effectively analyze and contextualize the data.
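Putting the glossary and the expert-curated examples together, a few-shot prompt might be assembled along the following lines; the glossary entries and the `build_prompt` helper are hypothetical sketches, and the real Raven prompt is more elaborate.

```python
# Hypothetical glossary entries mapping raw attribute names to the
# plain-language descriptions compiled with domain experts.
GLOSSARY = {
    "xG": "Expected Goals: the number of goals a player would be "
          "expected to score given the quality of their chances",
    "prog_passes": "Progressive Passes: completed passes that move the "
                   "ball substantially closer to the opponent's goal",
}

def build_prompt(player_profile: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot prompt: glossary first, then expert-written
    (profile, report) example pairs, then the new profile to summarize."""
    parts = ["You are a soccer scouting assistant. Attribute definitions:"]
    parts += [f"- {name}: {desc}" for name, desc in GLOSSARY.items()]
    for profile, report in examples:
        parts.append(f"\nPlayer profile:\n{profile}\nScouting report:\n{report}")
    parts.append(f"\nPlayer profile:\n{player_profile}\nScouting report:")
    return "\n".join(parts)
```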
3. Model Development
We opted for an LLM as the core engine of our sports analytics assistant because of its ability to generate human-like text. Our goal was to create an assistant that could not just compile numbers, but also communicate insights in a clear, natural, and engaging way for players and coaches.
The Large Language Model was chosen from among options available for in-house hosting, mitigating the liabilities of sending proprietary data over the internet while also taking expected performance into consideration.
The LLM prompt is one of the parameters that can be tuned to influence the output and guide it toward desired responses. The prompt is important because the LLM pays great attention to it, and even minor changes to its contents can greatly impact the generations. At this stage we searched, through a trial-and-error process, for the set of instructions within the prompt that produced the best player summaries.
Player performance metrics must be evaluated in the context of their on-field role. An attacking player’s effectiveness, for example, is not defined by their defensive statistics. Conversely, a defensive player’s primary value is not found in their attacking contributions. To account for this, we have developed a method that informs the LLM of the most relevant metrics for a given player’s role, enabling it to generate more focused and meaningful performance summaries.
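As an illustration, such role awareness can be as simple as a curated mapping from role to metrics; the roles and metric names below are hypothetical stand-ins for the club’s actual taxonomy.

```python
# Hypothetical mapping from on-field role to the metrics that carry
# the most weight when judging a player in that role.
ROLE_METRICS = {
    "attacking_midfielder": ["Key Passes", "Expected Assisted Goals",
                             "Progressive Passes", "Shot-Creating Actions"],
    "centre_back": ["Tackles Won", "Interceptions", "Clearances",
                    "Aerial Duel Success Rate"],
}

def relevant_metrics(player: dict, role: str) -> dict:
    """Keep only the role-relevant metrics before the player profile
    is handed to the LLM."""
    return {m: player[m] for m in ROLE_METRICS[role] if m in player}
```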
To further enhance the LLM’s performance, we developed an adaptive few-shot prompting mechanism. When summarizing a player, instead of overwhelming the LLM with all representative player profiles, the system intelligently searches for and selects only the most relevant examples. By focusing the LLM’s attention on profiles similar to the player being summarized, this targeted approach significantly improves the quality and relevance of the generated output.
Figure 7. Raven’s chain of thought when creating a player summary
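One plausible way to implement this targeted retrieval is a nearest-neighbour search over standardized metric vectors; the cosine-similarity sketch below is illustrative, as the exact retrieval mechanism used in Raven is not described here.

```python
import numpy as np

def select_examples(target: np.ndarray, pool: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k curated example profiles whose
    (standardized) metric vectors are most similar to the target player."""
    target_n = target / np.linalg.norm(target)
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    similarities = pool_n @ target_n          # cosine similarity per profile
    return np.argsort(similarities)[::-1][:k]

rng = np.random.default_rng(0)
pool = rng.normal(size=(20, 150))   # 20 curated profiles, ~150 metrics each
target = rng.normal(size=150)       # the player being summarized
print(select_examples(target, pool))
```

The selected profiles, and only those, are then passed as the few-shot examples when the prompt is assembled.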
4. Model Evaluation
During this stage, many of the generated LLM summaries were given to a group of experts from the Whitecaps team for evaluation. The evaluators assessed the accuracy, length, tone, and wording of the responses produced by the LLM under different prompts, in order to choose the one that best fit their objectives.
In our internal review, Raven’s outputs closely align with scouts’ evaluations of a player’s on-ball ability. It has only been significantly misaligned with our internal interpretation of the data approximately 5% of the time.
Because Raven’s summaries are almost instantaneous, the VWFC was able to double the number of scouting reports they produce. They have also extended the application to opposition scouting, providing summaries of every opponent player for every match across the Whitecaps’ first and second teams. Previously, reports were only produced for a few selected players in first-team matches. With matches often only three days apart, a short turnaround time for insights greatly helps the coaches and players when planning for the next game.
5. Model Deployment
To ensure seamless integration and empower the Whitecaps team to effectively use our solution, we prioritized using familiar tools and technologies. This meant aligning with their existing cloud service provider, leveraging their established infrastructure, and minimizing the need for new platforms. The entire data processing pipeline was built using either SQL queries or Python, languages already familiar to the VWFC data science team.
The LLM assistant is accessible through a dedicated endpoint, providing on-demand access to its AI-powered insights. This on-demand architecture minimizes the operational costs of running the LLM by activating it only when queried, optimizing resource usage and ensuring cost-effectiveness. It is a balanced approach that delivers the power of the LLM when needed, while preserving accessibility and long-term sustainability for the team.
Figure 8. Illustration of the Raven application
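To give a flavour of what such an on-demand endpoint might look like, here is a minimal sketch assuming FastAPI; the route name, payload schema, and stubbed LLM call are all hypothetical, not the actual Raven service.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PlayerMetrics(BaseModel):
    player_name: str
    metrics: dict[str, float]

@app.post("/summarize")
def summarize(payload: PlayerMetrics) -> dict:
    # In the real service this step would contextualize the metrics,
    # build the few-shot prompt, and call the in-house LLM; here we
    # return a placeholder so the sketch stays self-contained.
    return {"summary": f"Scouting summary for {payload.player_name} "
                       f"based on {len(payload.metrics)} metrics."}
```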
To demonstrate how Raven converts complex metrics into a digestible summary, we present a fictional player, Dante Volta, a 31-year-old midfielder playing for the equally fictional team Olympia FC. Table 1 shows some of Dante’s key metrics, which serve as input for Raven’s analysis. The resulting summary of Dante’s performance, generated by Raven, can be found in Figure 9.
Metric | Value |
Play Time – Minutes | 2536 |
Play Time – 90s Equivalent | 31 |
Passes Completed | 1262 |
Pass Attempts | 1550 |
Pass Completion Rate (%) | 69.7 |
Progressive Pass Distance (yards) | 6734 |
Short Pass Completion Rate (%) | 90.1 |
Medium Pass Completion Rate (%) | 74.2 |
Long Passes Completed | 138 |
Long Pass Attempts | 236 |
Long Pass Completion Rate (%) | 59 |
Assists | 17 |
Expected Assisted Goals | 12.7 |
Key Passes | 102 |
Passes Into Penalty Area | 98 |
Crosses Into Penalty Area | 12 |
Progressive Passes | 232 |
Shot-Creating Actions | 225 |
Shot-Creating Actions Per 90 | 7.96 |
Goal-Creating Actions | 30 |
Goal-Creating Actions Per 90 | 0.98 |
Tackles | 37 |
Tackles Won | 27 |
Tackles in Defensive Third | 12 |
Tackles in Middle Third | 20 |
Dribblers Tackled | 33 |
Tackle Success Rate (%) | 51.3 |
Total Blocks | 25 |
Blocked Passes | 27 |
Interceptions | 14 |
Tackles and Interceptions | 53 |
Clearances | 5 |
Total Touches | 2169 |
Touches in Defensive Third | 119 |
Touches in Middle Third | 879 |
Touches in Attacking Third | 1175 |
Touches in Attacking Penalty Area | 150 |
Dribble Attempts | 222 |
Successful Dribbles | 83 |
Successful Dribble Rate (%) | 37.1 |
Total Carries | 1303 |
Progressive Carry Distance (yards) | 3487 |
Progressive Carries | 114 |
Carries Into Final Third | 105 |
Carries Into Penalty Area | 56 |
Total Receptions | 1519 |
Progressive Receptions | 253 |
Fouls Committed | 52 |
Aerial Duels Won | 8 |
Aerial Duel Success Rate (%) | 45 |
Table 1. A sample of key performance metrics for Dante Volta, a fictional soccer player.
Figure 9. Raven’s summary for Dante Volta
Industry Insights – Next steps
During our partnership we noticed that the benefits of sports analytics go beyond evaluating professional players. Beyond what the Vancouver Whitecaps and ProCogia achieved through Raven, there are many other possible applications of machine learning and data science to boost operations. This section outlines some potential project ideas to continue expanding the domain of sports analytics.
One such application is injury prediction and prevention. By integrating historical injury data with training load information and biometric data captured through wearable technology, the teams could develop predictive models to identify players at elevated risk of injury. This would allow the team to proactively implement preventative measures, such as adjusted training schedules or targeted rehabilitation programs, minimizing downtime and maximizing player availability throughout the season.
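As a sketch of what such a model could look like, here is a toy injury-risk classifier trained on synthetic training-load and biometric features; all feature names, values, and labels below are fabricated for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 500
# Hypothetical features: weekly training load (arbitrary units),
# acute:chronic workload ratio, and average nightly sleep (hours).
X = np.column_stack([
    rng.normal(300, 50, n),
    rng.normal(1.0, 0.3, n),
    rng.normal(7.5, 1.0, n),
])
# Synthetic labels: a high workload ratio loosely raises injury risk.
y = (X[:, 1] + rng.normal(0, 0.3, n) > 1.3).astype(int)

model = LogisticRegression().fit(X, y)
risk = model.predict_proba([[320, 1.5, 6.5]])[0, 1]
print(f"Estimated injury probability: {risk:.2f}")
```

A production system would of course draw on real longitudinal data, richer features, and careful validation before informing any training decision.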
Furthermore, other machine learning solutions could be applied to enhance fan engagement and marketing efforts. By analyzing data (including demographics, preferences, and purchase history), distinct fan profiles can be identified, enabling targeted marketing campaigns. Personalized experiences, tailored content, and optimized ticket offers could strengthen fan loyalty and drive revenue growth.
Another promising area is early talent identification and development. While our partnership focused on professional scouting, the approach could be expanded to find and nurture young talent at earlier stages. By analyzing data from youth leagues, academies, and scouting combines, the team could map promising prospects and create personalized development plans to maximize their potential by the time they reach the bigger leagues.
Bibliography
1. Olds, Tim. “Why are sporting records always being broken?” Australasian Science, Vol. 37, No. 5, 2016.
2. “Dr. James Coleman.” International Volleyball Hall of Fame. https://www.volleyhall.org/dr-james-coleman.html [Cited: January 24, 2025.]
3. “Sports Analytics Before Moneyball.” The Jerome and Dorothy Lemelson Center for the Study of Invention and Innovation. https://invention.si.edu/invention-stories/sports-analytics-moneyball [Cited: January 24, 2025.]
4. Miller, Bennett (director). Moneyball. Columbia Pictures, 2011. [Link]
5. Lewis, Michael. Moneyball: The Art of Winning an Unfair Game. W. W. Norton & Company, 2003. ISBN 978-0-393-05765-2. [Link]
6. Press, Gil. “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” Forbes, March 23, 2016. [Link] [Cited: January 28, 2025.]
7. Wikipedia. Wikimedia Foundation. [Link] [Cited: January 29, 2025.]
8. Gadesha, Vrunda. “What is few shot prompting?” IBM, September 25, 2024. [Link] [Cited: January 29, 2025.]
9. “History.” Vancouver Whitecaps FC. [Link] [Cited: January 29, 2025.]