1. Intent and description of the metrics utilized

The intent of this notebook is to explore the role that advanced stats play in determining how far an NBA team makes it in the playoffs. We will use a machine learning model to predict how far teams make it into the playoffs, so this qualifies as a supervised classification exercise. For this purpose we use data from the website Basketball-Reference.com covering the 2004-2005 through 2022-2023 seasons. No older data is used because before the 2004-2005 season there were fewer teams in the NBA, so the data set would be unbalanced if we extended the range of reference.

1.1 The metrics

As stated before, the main purpose of this exercise is to determine the relevance of regular-season advanced stats during the NBA playoffs; hence, it is necessary to describe thoroughly the construction of these metrics. A more detailed description of these advanced stats is available in the appendix.

  • Offensive Effective Field Goal Percentage (eFG%): This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal.

  • Offensive Turnover Percentage (oTOV%): An estimate of turnovers committed per 100 plays.

  • Offensive Rebound Percentage (oRB%): An estimate of the percentage of available offensive rebounds a team grabbed.

  • Offensive Free Throws per Field Goal Attempted (oFTFG): Free throws made per field goal attempted.

  • Defensive Effective Field Goal Percentage (dEFG): Opponent’s effective field goal percentage.

  • Defensive Turnover Percentage (dTOV%): Opponent’s turnover percentage.

  • Defensive Rebound Percentage (dRB%): An estimate of the percentage of available defensive rebounds a team grabbed.

  • Defensive Free Throws per Field Goal Attempted (dFTFG): Opponent’s free throws per field goal attempted.

In general, the relevance of these stats lies in the fact that they condense a multitude of simple box-score statistics into rates covering the whole season.
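To make these definitions concrete, here is a minimal R sketch, using invented season totals (purely illustrative, not real NBA figures), that computes eFG% and oTOV% from the formulas listed in the appendix:

```r
#Illustrative season totals for a hypothetical team
fg  <- 3100   #field goals made
p3  <- 850    #three-pointers made
fga <- 6900   #field goals attempted
tov <- 1100   #turnovers
fta <- 1750   #free throws attempted

#Effective field goal percentage: threes get an extra 0.5 weight
efg <- (fg + 0.5 * p3) / fga

#Turnover percentage: turnovers per 100 plays
tov_pct <- 100 * tov / (fga + 0.44 * fta + tov)

round(c(eFG = efg, oTOV = tov_pct), 3)
```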

1.2 The Objective Variable

As stated before, the main purpose of this notebook is to find the relation between these metrics and the playoff run of each team. Therefore, we define 6 possible outcomes for a playoff run, each describing the final round a team reached:

  1. No playoffs (NA)

  2. Playoffs (POF), i.e., making it only to the first round.

  3. Second Round (2RD)

  4. Conference Finals (CF)

  5. Finals (FNL)

  6. Championship (CHA)

2. The Data

The data was collected by a rudimentary method: reading Excel files downloaded from the website discussed above. Each metric was read individually, pivoted into long format, given a row index, and then joined to the others using that index as the key.

To illustrate this, here is a piece of the code used for each metric:

#Reading dFTFG data and using pivot function
dFTFG <- read_xlsx("/Users/arturoavalos/Documents/tesis/NBA/dFTFG.xlsx") %>% dplyr::select(1:20) %>%
  pivot_longer(cols = 2:20,names_to = "season",values_to = "dFTFG")
#Adding indexes
dFTFG <- tibble::rowid_to_column(dFTFG, "index")
dFTFG <- dFTFG %>% dplyr::select(-Team,-season)

Also, the method used to join each individual metric:

#Joining all data bases
df_E_data <- playoffs %>%
  left_join(oEFG, by = "index") %>%
    left_join(oTOV,by = "index") %>%
      left_join(oRB,by = "index") %>%
        left_join(oFTFG,by = "index") %>%
  left_join(dEFG,by = "index") %>%
    left_join(dTOV,by = "index") %>%
      left_join(dRB,by = "index") %>%
        left_join(dFTFG,by = "index")

#Making sure that response variable run is ordered
df_E_data$playoff_run <-  factor(df_E_data$playoff_run, ordered = TRUE, levels = c("NA", "POF","2RD","CF","FNL","CHA"))

Now we explore the content of our data frame:

str(df_E_data)
## tibble [570 × 12] (S3: tbl_df/tbl/data.frame)
##  $ index      : int [1:570] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Team       : chr [1:570] "ATL" "ATL" "ATL" "ATL" ...
##  $ season     : chr [1:570] "2005" "2006" "2007" "2008" ...
##  $ playoff_run: Ord.factor w/ 6 levels "NA"<"POF"<"2RD"<..: 1 1 1 2 3 3 3 2 2 2 ...
##  $ oEFG       : num [1:570] 0.464 0.486 0.471 0.483 0.504 0.506 0.501 0.5 0.517 0.515 ...
##  $ oTOV       : num [1:570] 14.9 14.7 15.1 14 12.5 11.4 13.5 13.4 14.2 14.3 ...
##  $ oRB        : num [1:570] 30.7 31.4 29.2 29.7 26 28.2 23.4 23.9 22.2 21 ...
##  $ oFTFG      : num [1:570] 0.212 0.255 0.263 0.263 0.238 0.213 0.209 0.192 0.174 0.208 ...
##  $ dEFG       : num [1:570] 0.513 0.513 0.503 0.501 0.494 0.496 0.495 0.48 0.496 0.51 ...
##  $ dTOV       : num [1:570] 14 14.1 14.5 12.9 13.2 13.2 12.3 14.4 14.2 14 ...
##  $ dRB        : num [1:570] 72.1 69.5 70.9 71.7 71.6 72.7 74.6 74.4 73.6 74.4 ...
##  $ dFTFG      : num [1:570] 0.289 0.275 0.268 0.217 0.21 0.208 0.211 0.186 0.181 0.196 ...

As shown, the data frame consists of 570 observations: one for each of the 30 teams in each of the 19 seasons studied (30 × 19 = 570). It also contains 12 columns: the observation index, the team, the season, the playoff run, and the 8 advanced statistics.

To get a better sense of the composition of the objective variable we can count the occurrences of each class:

df_E_data %>%
    group_by(playoff_run) %>% count()
## # A tibble: 6 × 2
## # Groups:   playoff_run [6]
##   playoff_run     n
##   <ord>       <int>
## 1 NA            266
## 2 POF           152
## 3 2RD            76
## 4 CF             38
## 5 FNL            19
## 6 CHA            19

As is clear, the most common result is not making the playoffs. As the postseason advances, half of the remaining teams are eliminated in each round, so each successive class is half the size of the previous one, leaving the 19 champions and the 19 runners-up from the Finals.

2.1 Visualizing the data

Given our goal, it is important to visually understand the relationship between the objective variable and the advanced stats:

  • As seen in the top-left graph, the teams that advanced deeper into the playoffs have a higher median offensive effective field goal percentage during the regular season. It would therefore be wise to use this metric in the predictive model.

  • Regarding the offensive turnover percentage, there is no clear tendency when it is related to the playoff run. Although the median decreases from NA to the Conference Finals, there is a considerable spike among teams that reach the Finals and/or become champions. This stat should be analyzed further to determine its validity for the models.

  • There appears to be close to no variation in the offensive rebound percentage. Further tests should be applied to determine this stat’s relevance for the predictive models.

  • There seems to be low variation among the different stages of the playoffs for the free throws per field goal attempted. Nevertheless, there is a drastic decrease in this stat among championship teams. Further analysis is necessary.

  • For the defensive effective field goal percentage, there is a clear decrease from teams that don’t make the playoffs to those that do. Additionally, for teams that reach the second round and beyond, the median remains close to constant.

  • For the defensive turnover percentage there appears to be no clear tendency in the different sets. It is necessary to validate the relevance of this stat in the following steps.

  • For the defensive rebound percentage there appears to be a constant increase from NA all the way up to the Conference Finals, after which a slight decrease is visible.

  • Finally, for the defensive free throws per field goal attempted the picture is mixed. The numbers suggest this stat is lower for teams that made the playoffs than for those that didn’t; nevertheless, the differences become less clear the deeper teams go into the playoffs.

2.2 Validating the data

As stated in the previous section, it is necessary to validate the difference in distribution of certain stats among the different playoff runs.

The first stat to investigate is the offensive effective field goal percentage. An ANOVA (analysis of variance) test can tell us whether there are significant differences between groups, but for this test to be valid we need the data to be normally distributed. We can check this visually with a Q-Q plot: if the points lie along a straight line, the data is likely normally distributed.

qqnorm(df_E_data$oEFG)
qqline(df_E_data$oEFG)

As we can see, the data does not appear to be normally distributed. As a safety measure, we conduct a Shapiro-Wilk normality test: if the p-value from this test is greater than a chosen significance level (e.g., 0.05), we fail to reject the null hypothesis and consider the data approximately normally distributed.

shapiro.test(df_E_data$oEFG)
## 
##  Shapiro-Wilk normality test
## 
## data:  df_E_data$oEFG
## W = 0.98677, p-value = 4.899e-05

Given the p-value, we reject the null hypothesis and conclude the data is not normally distributed. Therefore, we cannot use an ANOVA test and fall back on a Kruskal-Wallis test, which does not assume normality:

kruskal.test(oEFG ~ playoff_run, data = df_E_data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  oEFG by playoff_run
## Kruskal-Wallis chi-squared = 92.223, df = 5, p-value < 2.2e-16

We reject the null hypothesis and conclude that there are significant differences in offensive effective field goal percentage between the groups.

We proceed likewise with the offensive turnover percentage, first checking if the data is normally distributed.

qqnorm(df_E_data$oTOV)
qqline(df_E_data$oTOV)

As we can see, the data appears to be normally distributed. As a safety measure, we conduct a Shapiro-Wilk normality test, where if the p-value from this test is greater than a chosen significance level (e.g., 0.05), we fail to reject the null hypothesis and consider the data to be approximately normally distributed.

shapiro.test(df_E_data$oTOV)
## 
##  Shapiro-Wilk normality test
## 
## data:  df_E_data$oTOV
## W = 0.99822, p-value = 0.8307

Since the p-value of this test is greater than every conventional significance level, we fail to reject the null hypothesis and treat the data as approximately normally distributed. Now we can proceed with the ANOVA test:

summary(aov(oTOV ~ playoff_run, data = df_E_data))
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## playoff_run   5   27.4   5.476   5.326 8.57e-05 ***
## Residuals   564  579.9   1.028                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given that the p-value of the test is significant, we reject the null hypothesis and conclude that there are differences between groups. Therefore, this stat will be useful for our predictive models.

Now we apply the same methods for the rest of the advanced stats whose variance between groups is uncertain.

Let’s consider the offensive rebound percentage and make sure it is approximately normally distributed:

qqnorm(df_E_data$oRB)
qqline(df_E_data$oRB)

It does not seem that the data is normally distributed. Let’s do a hypothesis test:

shapiro.test(df_E_data$oRB)
## 
##  Shapiro-Wilk normality test
## 
## data:  df_E_data$oRB
## W = 0.99218, p-value = 0.004358

Since the p-value is lower than any reasonable significance level, we reject the null hypothesis: the data is not normally distributed. Given this, we again fall back on a Kruskal-Wallis test:

kruskal.test(oRB ~ playoff_run, data = df_E_data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  oRB by playoff_run
## Kruskal-Wallis chi-squared = 3.0946, df = 5, p-value = 0.6854

Given the p-value, we fail to reject the null hypothesis and conclude that there is no statistically significant difference between groups in offensive rebound percentage. Therefore, there is no need to include this variable in our predictive models.

For the last of the offensive advanced stats, we verify that offensive free throws per field goal is normally distributed:

qqnorm(df_E_data$oFTFG)
qqline(df_E_data$oFTFG)

Just by looking at the graph it is clear that this data is not normally distributed; therefore, to test for differences among groups we again apply a Kruskal-Wallis test:

kruskal.test(oFTFG ~ playoff_run, data = df_E_data)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  oFTFG by playoff_run
## Kruskal-Wallis chi-squared = 12.267, df = 5, p-value = 0.03131

The p-value lies under the 0.05 significance level but above 0.01. Since we use 0.05 as the standard for practical purposes, we reject the null hypothesis and conclude that there are significant differences between groups.

Now we turn to the defensive turnover percentage. First we check whether this data is normally distributed:

shapiro.test(df_E_data$dTOV)
## 
##  Shapiro-Wilk normality test
## 
## data:  df_E_data$dTOV
## W = 0.99658, p-value = 0.267

Given the p-value, we fail to reject the null hypothesis and conclude the data is approximately normally distributed. Now we can perform the ANOVA test.

summary(aov(dTOV ~ playoff_run, data = df_E_data))
##              Df Sum Sq Mean Sq F value Pr(>F)
## playoff_run   5   10.3   2.051   1.656  0.144
## Residuals   564  698.7   1.239

This result tells us there is no statistically significant difference between groups. Therefore, there is no need to include this variable in the predictive model.

3. The Predictive Model

To predict the result of the NBA playoffs we will use a classification method: a Linear Discriminant Analysis (LDA) model.

3.1 Linear Discriminant Analysis

The LDA method is a statistical and machine learning technique used for dimensionality reduction and classification. Its main objective is to find a combination of features that best separates (or discriminates) between different classes in a data set.

In this particular case, the features to be used are the advanced stats previously discussed, and the classes are the playoff run of each team.

The first step in applying an LDA is to scale the data, since one of the key assumptions of this method is that each predictor variable has the same variance. To meet this assumption each variable can be transformed to have a mean of 0 and a standard deviation of 1.

#Save original data set and Scale each predictor 
df_E_unscaled <- df_E_data
df_E_data[5:12] <- scale(df_E_data[5:12])

We verify that each predictor has mean 0 and standard deviation of 1.

apply(df_E_data[5:12],2,mean)
##          oEFG          oTOV           oRB         oFTFG          dEFG 
## -1.826280e-15 -6.988811e-16  3.751990e-16 -1.773682e-16 -6.651982e-16 
##          dTOV           dRB         dFTFG 
## -4.744522e-16 -2.259567e-15  5.908940e-17
apply(df_E_data[5:12],2,sd)
##  oEFG  oTOV   oRB oFTFG  dEFG  dTOV   dRB dFTFG 
##     1     1     1     1     1     1     1     1

The next step is to split the data set into training and testing samples. We randomize the split, keeping 70% of the data for training and the rest for testing.

#Use 70% of dataset as training set and remaining 30% as testing set
set.seed(5)
sample <- sample(c(TRUE, FALSE), nrow(df_E_data), replace=TRUE, prob=c(0.7,0.3))
train <- df_E_data[sample, ]
test <- df_E_data[!sample, ]

Alternatively, it may be useful to split the train and test data not by a random process but by season, to try to predict the run of each team during the most recent seasons. Keeping roughly 70% of the data as the training set corresponds to the seasons from 2005 (the first in the sample) through 2018. The test set below is composed of the observations from the 2020-2023 seasons (the strict inequality in the filter leaves the 2019 season out of both sets).

#Split by season instead: train through 2018, test on the most recent seasons
train_old <- df_E_data %>% filter(season < 2019)
test_new <- df_E_data %>% filter(season > 2019)

Now that the data is split, we can fit the LDA model on the training set using the predictors. Recall that neither the offensive rebound percentage nor the defensive turnover percentage showed statistically significant differences among classes, so they won’t be used in the construction of the model. The model therefore takes shape following the expression:

\[ Run_{it} = oEFG_{it} + oFTFG_{it} + oTOV_{it} + dEFG_{it} + dRB_{it} + dFTFG_{it} + \varepsilon_{it} \] Where \(\varepsilon_{it}\) symbolizes a random element which the model does not account for.

We can inspect the prior probabilities of each class in the training data, the means by group, and the coefficients of linear discriminants (which display the linear combination of predictor variables used to form the decision rule of the LDA model). The proportion of trace displays the percentage of separation achieved by each linear discriminant function.

library(MASS)
model <- lda(playoff_run ~ oEFG + oFTFG + oTOV + dEFG + dRB + dFTFG, data = train)

Now that we have constructed a model we can proceed to make predictions.

#use LDA model to make predictions on test data
predicted <- predict(model, test)
#to be able to compare we must order the factor levels
predicted$class <- factor(predicted$class, ordered = TRUE, levels = c("NA", "POF","2RD","CF","FNL","CHA"))

The model achieves roughly 60% accuracy on the test set:

#find accuracy of model
mean(predicted$class==test$playoff_run) # 0.6011561
## [1] 0.6011561

It is useful to create a visualization of the predictions:

#Create matrix
table(test$playoff_run, predicted$class, dnn = c("Actual class", "Predicted class"))
##             Predicted class
## Actual class NA POF 2RD CF FNL CHA
##          NA  66  14   0  0   0   0
##          POF 18  34   6  0   0   0
##          2RD  0  14   2  0   1   1
##          CF   0   5   4  0   0   1
##          FNL  0   2   0  0   0   1
##          CHA  0   0   2  0   0   2
#define data to plot
results <- cbind(predicted$class, test) %>%
    mutate(accurate = ifelse(predicted$class == playoff_run, 1, 0), accurate = as.factor(accurate))

results$run <- factor(results$playoff_run, ordered = TRUE, levels = c("NA", "POF","2RD","CF","FNL","CHA"))

#create plot
ggplot(results, aes(x = playoff_run, y = predicted$class, color = accurate)) +
  geom_jitter(width = 0.25, height = 0.25) + 
  labs(title = "Predicted classes vs actual classes", x = "Actual playoff run",
       y = "Predicted playoff run", color = "Accuracy")

As we can observe in the graph, the model had a decent prediction pattern for the first stages of the playoffs, especially in predicting whether a team made the postseason or not, but it failed to predict the actual playoff run accurately after the second round. Nevertheless, the model managed to correctly predict two of the four champions in the testing set.

We can reproduce this graph using the season-based split: training on the 2005-2018 seasons and testing on the most recent ones.

model_2 <- lda(playoff_run~oEFG + oFTFG + oTOV + dEFG + dRB + dFTFG, data=train_old)
#use LDA model to make predictions on test data
predicted_2 <- predict(model_2, test_new)

#to be able to compare we must order the factor levels
predicted_2$class <- factor(predicted_2$class, ordered = TRUE, levels = c("NA", "POF","2RD","CF","FNL","CHA"))

As we can see, the accuracy of the prediction decreases slightly, by about 2.5 percentage points.

#find accuracy of model
mean(predicted_2$class==test_new$playoff_run) # 0.575
## [1] 0.575

Graphically:

#Create matrix
table(test_new$playoff_run, predicted_2$class, dnn = c("Actual class", "Predicted class"))
##             Predicted class
## Actual class NA POF 2RD CF FNL CHA
##          NA  47   9   0  0   0   0
##          POF 10  21   0  0   1   0
##          2RD  3   9   1  0   1   2
##          CF   1   7   0  0   0   0
##          FNL  1   2   1  0   0   0
##          CHA  0   3   1  0   0   0
#define data to plot
results_2 <- cbind(predicted_2$class, test_new) %>%
    mutate(accurate = ifelse(predicted_2$class == playoff_run, 1, 0), accurate = as.factor(accurate))

results_2$run <- factor(results_2$playoff_run, ordered = TRUE, levels = c("NA", "POF","2RD","CF","FNL","CHA"))

#create plot
ggplot(results_2, aes(x = playoff_run, y = predicted_2$class, color = accurate)) +
  geom_jitter(width = 0.25, height = 0.25) + 
  labs(title = "Predicted classes vs actual classes", x = "Actual playoff run",
       y = "Predicted playoff run", color = "Accuracy")

Once again we see a pattern similar to the one observed with the previous split: the model is able to classify runs in the very first stages of the playoffs, but struggles to do so later on.

4. Conclusions and limitations

A possible explanation for why the model loses precision in the later stages is that, as the playoffs develop, fewer teams reach each level, so the volume of observations is heavily skewed towards the first stages of the playoffs. Another aspect to consider is that the advanced stats may be good features for predicting whether a team makes the playoffs or not, but not that reliable for measuring winning within the playoffs. This could be addressed by changing either the model or the data used; nevertheless, that is beyond the scope of this exercise.
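One inexpensive check, not performed here, would be to refit the LDA with uniform class priors so the rare classes (FNL, CHA) are not crowded out by the frequent ones; `MASS::lda` accepts a `prior` argument for exactly this. A sketch, reusing the train/test objects defined above:

```r
library(MASS)

#Refit with uniform priors: this does not add data, it only shifts the
#decision boundaries in favor of the under-represented playoff runs
model_flat <- lda(playoff_run ~ oEFG + oFTFG + oTOV + dEFG + dRB + dFTFG,
                  data = train, prior = rep(1/6, 6))

predicted_flat <- predict(model_flat, test)
mean(predicted_flat$class == test$playoff_run)
```

Overall accuracy would likely drop, since the common classes get predicted less often, but recall on the deep playoff runs should improve.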

Appendix.

\[ eFG\% = (FG + 0.5 * 3P) / FGA \]

Where \(FG\) refers to Field Goals made, \(3P\) refers to three points made and \(FGA\) refers to total field goals attempted.

\[ oTOV\% = 100 * TOV / (FGA + 0.44 * FTA + TOV) \]

Where \(TOV\) refers to Turnovers and \(FTA\) refers to free throws attempted.

\[ oRB\% = 100 * (ORB * (Tm MP / 5)) / (MP * (Tm ORB + Opp DRB)) \]

Where \(ORB\) refers to the total offensive rebounds and \(TmMP\) refers to the team’s minutes played. \(TmORB\) refers to the team’s offensive rebounds and \(OppDRB\) means the opponent’s defensive rebounds.

The free throws per field goal attempted ratio is self-explanatory.

\[ dRB\% = 100 * (DRB * (Tm MP / 5)) / (MP * (Tm DRB + Opp ORB)) \]

Where \(DRB\) refers to the total defensive rebounds and \(TmMP\) refers to the team’s minutes played. \(TmDRB\) refers to the team’s defensive rebounds and \(OppORB\) means the opponent’s offensive rebounds.
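At the team level the minutes factor drops out (the "team as a player" plays Tm MP / 5 minutes), so both rebound percentages reduce to a team's own rebounds of one type divided by all rebounds of that type available. A small illustrative R helper (the totals below are invented, not real NBA figures):

```r
#Team-level rebound percentage: own rebounds of one type over all
#rebounds of that type available (minutes terms cancel at team level)
reb_pct <- function(own, opp_other) 100 * own / (own + opp_other)

reb_pct(own = 880,  opp_other = 2350)  #oRB%: team ORB vs opponent DRB
reb_pct(own = 2400, opp_other = 860)   #dRB%: team DRB vs opponent ORB
```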