Using machine learning to predict the 2020 NBA Champion
The day the world of sports stopped seems like forever ago. On March 11th, the NBA season was suspended in the wake of the COVID-19 pandemic. As a college student in San Antonio and a part-time employee of the San Antonio Spurs, I was excited to see whether our team could snag a final spot in the playoffs, and I had hoped to work the game when it happened, but that became impossible. Like many other students, I was sent home for our safety. Because of my coursework, I could not study this season’s statistics as thoroughly as I would have liked. During the summer, however, I have stayed productive by learning Python and developing scripts. Two weeks before the restart, I began working on a passion project.
In this project, I use Python and machine learning to predict the 2020 NBA Champions. A linear regression model predicts each team’s number of playoff wins, and the team with the highest predicted total is the predicted champion. At the end of the article, I will summarize the results and make the final prediction.
To restart the season, the NBA and the National Basketball Players Association (NBPA) agreed to meet in Walt Disney World. The players of 22 teams were tested, housed, and quarantined inside one of three hotels on the Disney campus. In these hotels, players have access to bowling, golf, and fishing. The bubble seems to be extremely safe. San Antonio Spurs head coach, Gregg Popovich, stated that “from an intellectual point of view and a medical point of view, I would have to say not probably, but I am safer here” (Young). It is unknown how long Coach Pop will be in the bubble, however, because six teams will leave after an eight-game schedule.
Before the suspension of the season, the standings were as follows:
From the Eastern Conference, the Bucks through the Wizards are in the bubble, while for the Western Conference it’s the Lakers through the Suns. Based purely on the standings, the final two would be the Lakers and the Bucks. However, we never truly know the outcome in professional leagues until the playoffs end. I am going to try my best to beat that uncertainty by using machine learning.
Acquiring the Data
For the first step in my project, I needed to gather my data. Using the Sports Reference API, I imported the NBA teams and their schedules over the desired number of years. Sports Reference provides a Python API that is easy to use and gives access to datasets for multiple collegiate and professional sports. Specifically, the API provides NBA data from as far back as the 1946–1947 season.
However, the base dataset from Sports Reference’s Schedule and Teams classes does not provide playoff wins and losses. To provide playoff data, I referenced Alex Muhr’s GitHub repository for the article “Beating the Odds” (Muhr). His repository detailed how to develop the missing data. I adjusted the code slightly to write the necessary data to two CSV files. One file held data from 1980 to 2019, and the other held data from the 2020 season. The files were kept separate so that the model could be trained on the 1980–2019 seasons and then apply what it learned to 2020. To speed up the process, I produced a few calculated fields in Microsoft Excel: Margin of Victory (MOV), Effective FG% (eFG%), Turnover % (TOV%), and Win-Loss % (W-L%). The eFG% and TOV% fields were also calculated for regular season opponents. I selected these fields because they are standard NBA advanced statistics that were not provided directly by the API. To produce the most accurate playoff predictions, I wanted to have as many relevant fields as possible. Once these final fields were calculated, I had my working datasets.
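The calculated fields can also be reproduced outside of Excel. The sketch below is a minimal Python version, not the spreadsheet formulas themselves. MOV follows the definition used in this article (total points scored minus total points allowed). The article does not spell out the eFG% and TOV% formulas, so the ones here are the commonly published definitions and should be read as an assumption. The dictionary keys (`FG`, `3P`, `FGA`, `FTA`, `TOV`, `PTS`, `OPP_PTS`, `W`, `L`) are hypothetical column names.

```python
def advanced_stats(team):
    """Derive MOV, eFG%, TOV% and W-L% from one team's regular season totals."""
    fg, fg3, fga = team["FG"], team["3P"], team["FGA"]
    fta, tov = team["FTA"], team["TOV"]
    return {
        # Total points scored minus total points allowed, as defined in the article.
        "MOV": team["PTS"] - team["OPP_PTS"],
        # Effective FG%: a made three-pointer counts as 1.5 field goals (assumed formula).
        "eFG%": (fg + 0.5 * fg3) / fga,
        # Turnovers per 100 estimated plays (assumed formula).
        "TOV%": 100 * tov / (fga + 0.44 * fta + tov),
        # Share of games won.
        "W-L%": team["W"] / (team["W"] + team["L"]),
    }


# Made-up totals for illustration only, not a real team's season.
example = {"FG": 3500, "3P": 1000, "FGA": 7500, "FTA": 2000, "TOV": 1200,
           "PTS": 9500, "OPP_PTS": 9000, "W": 50, "L": 32}
stats = advanced_stats(example)
```

Passing an opponent’s totals through the same function gives the opponent eFG% and TOV% fields.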
In the notebook, I loaded the two files as ‘data’ and ‘pred_data’:
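A minimal sketch of that loading step, assuming pandas and two CSV files. The file names in the example call are hypothetical placeholders, since the actual names are not given here.

```python
import pandas as pd


def load_datasets(train_path, pred_path):
    """Read the historical and current-season CSV files into dataframes."""
    data = pd.read_csv(train_path)      # 1980-2019 seasons, used for training
    pred_data = pd.read_csv(pred_path)  # 2020 season, used for prediction
    return data, pred_data


# Example call; both file names are hypothetical:
# data, pred_data = load_datasets("nba_1980_2019.csv", "nba_2020.csv")
```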
Cleaning the Data
To ensure the most accurate predictions, I filtered out any variables with an R² value lower than 0.25. A variable with an R² below 0.25 is not correlated strongly enough with playoff wins to be considered relevant. After the filtering process, the index variables were:
Using the index above, I cleansed both datasets of all other fields. At first, I was shocked to see that the well-watched variables (e.g., points, three-point FG%, personal fouls) were not in the featured index, as the news follows them consistently. Given that the Golden State Warriors’ dynasty has helped lead the league toward a reliance on three-pointers, it seemed reasonable to expect that a similar stat would have more playoff influence. However, the correlation argues otherwise. With the final datasets developed, it was time to explore.
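The R² filter described in this section can be sketched as follows. This is one plausible way to do it, assuming pandas and treating R² as the squared Pearson correlation between each numeric column and playoff wins. The target column name `PLAYOFF_WINS` is a hypothetical placeholder.

```python
import pandas as pd


def select_features(df, target="PLAYOFF_WINS", threshold=0.25):
    """Return the R² of every numeric column that meets the threshold."""
    numeric = df.select_dtypes("number")
    # Squared Pearson correlation of each column with the target.
    r_squared = numeric.corr()[target].drop(target) ** 2
    return r_squared[r_squared >= threshold].sort_values(ascending=False)


# Usage, assuming `data` and `pred_data` are the two loaded dataframes:
# keep = list(select_features(data).index)
# data = data[keep + ["PLAYOFF_WINS"]]
# pred_data = pred_data[keep]
```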
Exploratory Data Analysis
Given the final variables, a few things need to be understood. The first is how each statistic changes from one playoff win total to the next, because that helps explain why one team might win one more playoff game than another. To explore this relationship, I used the Seaborn library to produce a heat map. The heat map shows the mean of each statistic at each number of playoff wins, using data from every team in the dataframe.
Several key observations can be drawn from this chart. First, opponent blocks differ notably across playoff win totals. For example, teams that win more than 12 playoff games are blocked fewer than 360 times in the regular season. However, this statistic is concerning because the results are scattered between seven and twelve playoff wins.
The next notable statistic is regular season wins. The map suggests a strong, positive correlation between regular season wins and playoff wins. For teams with fewer than 45 wins, a single playoff win seems out of reach. To win more than five playoff games, a team likely needs at least 50 regular season wins. Lastly, to achieve more than 13 playoff wins, 55 regular season wins are needed. The reverse is also true: losses show a strong negative correlation with playoff wins. These findings lead me to believe a team’s W-L% will also correlate strongly, since a higher W-L% means a team has more wins relative to its losses.
Lastly, the Margin of Victory (MOV) variable shows a strong positive relationship with playoff wins. The MOV was calculated by subtracting the total regular season points scored against a team from the total points scored by the team. The MOV is important because it captures the gap between points scored and points allowed, which should identify the teams that are best on both offense and defense. As the map indicates, teams with the highest regular season MOV produce the most playoff wins.
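A table like the one behind that heat map can be built by grouping teams on their playoff win total and averaging each statistic. The sketch below assumes pandas and a hypothetical `PLAYOFF_WINS` column. The min-max scaling in the plotting helper is my own assumption, added so that statistics with different units share one color scale.

```python
import pandas as pd


def mean_stats_by_playoff_wins(df, target="PLAYOFF_WINS"):
    """Average every numeric statistic across teams with the same playoff win total."""
    return df.groupby(target).mean(numeric_only=True)


def plot_heatmap(table):
    """Draw the grouped means with Seaborn (imported lazily, only when plotting)."""
    import seaborn as sns
    # Scale each column to the 0-1 range so the colors are comparable across columns.
    scaled = (table - table.min()) / (table.max() - table.min())
    return sns.heatmap(scaled, cmap="coolwarm")
```

Each row of the returned table is one playoff win total, and each column is the mean of one statistic for the teams that reached it.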
The second thing I wanted to explore was a multicollinearity check between all variables. Through Seaborn, I produced a heat-map to cross-examine all variable correlations.
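The same check can be done numerically. The sketch below, assuming pandas, builds the correlation matrix and lists the pairs of variables whose absolute correlation meets a cutoff. The 0.9 default is an arbitrary illustrative value, not one taken from this analysis.

```python
import pandas as pd


def correlated_pairs(df, threshold=0.9):
    """List pairs of numeric columns whose absolute correlation meets the threshold."""
    corr = df.select_dtypes("number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:  # visit each unordered pair once
            r = corr.loc[first, second]
            if abs(r) >= threshold:
                pairs.append((first, second, round(float(r), 3)))
    return pairs
```

The intermediate `corr` matrix is the kind of table that can be passed to Seaborn’s `heatmap` function to draw the map.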
The map confirmed the assumed strong correlations between playoff wins and regular-season wins, losses, W-L%, and MOV. These are the most significant correlations apart from playoff games, which is irrelevant because it is not a regular-season statistic. Following this exploration of the data, I was ready to begin my linear regression and prediction model.
Making the Prediction
Now, the fun part. Using Sklearn, I trained the model.
Following this step, I fit the linear regression model. It produced a mean absolute error (MAE) between 2.25 and 2.36. MAE is the average absolute difference between the predicted and the actual values, so on average the prediction was off by between 2.25 and 2.36 playoff wins. An error of this size is consistent with the concerns about the prediction’s accuracy described next.
I want to recognize two small concerns regarding the accuracy of my predictions: the undefined playoff teams and the lack of a “strong” correlation in regular-season variables. A correlation is usually considered strong if the R² value is greater than 0.7 (Mindrila). Several are close, but none are above 0.7 (excluding playoff games). Without a strong correlation value, my confidence is a bit shaken. We may see increased correlations if data from before 1980 were used or if more statistics were used.
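The training and scoring steps can be sketched with scikit-learn as follows. This is a minimal version under assumptions: the 80/20 split and the fixed seed are illustrative choices, and `X` and `y` stand for the filtered regular season features and the playoff win totals.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, test_size=0.2, seed=0):
    """Fit a linear regression on a training split and report MAE on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # MAE: mean of |actual - predicted| over the held-out rows.
    mae = mean_absolute_error(y_test, model.predict(X_test))
    return model, mae


# Usage, assuming `features` lists the filtered column names:
# model, mae = fit_and_score(data[features], data["PLAYOFF_WINS"])
# predicted_wins = model.predict(pred_data[features])
```

Changing the seed changes which seasons land in the held-out split, so the MAE varies somewhat from run to run.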
As the bubble holds six teams that will be eliminated, my predictions will need to be readjusted after they are sent home. In the calculations, I included all 22 teams. Once the official playoff games begin and the final 16 teams are decided, the predictions may change. A new, final list of teams should be used to create more accurate predictions.
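Turning predicted win totals into standings, and limiting them to the teams that qualify, could look like the sketch below. The team names and numbers are placeholders for illustration only, not the model’s output, and the helper `rank_teams` is hypothetical.

```python
def rank_teams(predicted_wins, qualified=None):
    """Sort teams by predicted playoff wins, optionally keeping only qualified teams."""
    if qualified is not None:
        predicted_wins = {
            team: wins for team, wins in predicted_wins.items() if team in qualified
        }
    return sorted(predicted_wins.items(), key=lambda item: item[1], reverse=True)


# Placeholder values, not real predictions.
example = {"Team A": 9.4, "Team B": 11.2, "Team C": 1.3}
ranking = rank_teams(example, qualified={"Team A", "Team B"})
```

Filtering only reorders existing predictions. Rerunning the model on the final list of teams is still the more faithful way to update them.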
Your NBA Champion is…
Based on the linear regression model, the 2020 NBA Champions will be the Milwaukee Bucks! The final standings will be the Bucks, followed by the Lakers in second, and the Raptors in third. Here is the complete chart of playoff wins by all NBA bubble teams:
While this chart shows that my preferred teams, the San Antonio Spurs and Dallas Mavericks, will not win the championship this year, the model does provide a plausible prediction. Based on ESPN’s rankings, highlight reels, and other ways to judge teams, Milwaukee and the Los Angeles Lakers seem the most likely final two. Interestingly, though, Toronto places higher than several teams, specifically the Clippers. The Clippers were my original favorite for the second or third place spot, given Kawhi Leonard and Paul George on their roster. However, the Raptors performed extremely well during the regular season. Overall, the final predicted standings are plausible, and it will be exciting to reevaluate the results following the start of the NBA playoffs.
In conclusion, based on Python’s machine learning tools and a linear regression model, I predict the Milwaukee Bucks will be your 2020 NBA Champions. I will reassess the model once the final 16 teams are determined and hope to improve the accuracy of my predicted playoff wins. Please check back then for a potentially new prediction.
“Basketball Statistics and History.” Basketball, www.basketball-reference.com/.
Mindrila, Diana, and Phoebe Balentyne. Scatterplots and Correlation. University of West Georgia, www.westga.edu/academics/research/vrc/assets/docs/scatterplots_and_correlation_notes.pdf.
Muhr, Alex. “Beating the Odds.” Medium, Towards Data Science, April 28th, 2020, towardsdatascience.com/beating-the-odds-8d26b1a83f1b.
Young, Royce. “Spurs Coach Gregg Popovich Says NBA Bubble Is Safest Place to Be.” ESPN, ESPN Internet Ventures, July 11th, 2020, www.espn.com/nba/story/_/id/29448111/spurs-coach-gregg-popovich-says-nba-bubble-safest-place-be.