View on GitHub

Project-1

Ice Ice Baby: Exploring the NHL Franchise API Using jsonlite and ggplot2

Christine Marie Heubusch June 12, 2020

Understanding JSON Data

What is JSON data?

JavaScript Object Notation (JSON) is a format of storing data that is “lightweight” (that is, tends to be a small file size) and integrates well with most programming languages, though it is technically a subset of JavaScript. These properties make it a favorite format for application programming interfaces (APIs) and is often used to “serialize and transfer data over a network connection”. In JSON, data can be stored in a variety of different ways, including strings, numbers, booleans (with the true/false operators), arrays, and null. While objects are wrapped with a left and right curly brace ({}), arrays are enclosed in brackets ([]). Objects contain key-value pairs, with the key being a string, a value being one of the aformed JSON data types, and a colon separating the two; each key-value pair is separated with a comma. Values may be nested in a JSON data object, which makes it both convenient and easy to read.

Packages Used for Reading JSON Data

Three R packages can be utilized for working with JSON data:

All three packages offer options for converting to/from JSON. RJSONIO was originally introduced as a faster alternative to rjson, given concerns about the latter’s slower speeds, though there are some indications that rjson may now actually be faster in certain circumstances. I have selected jsonlite, which debuted in 2014 and originally started as a branch of RJSONIO, as my package of choice. Documentation on the package is extensive, as evidenced by the package’s thorough vignette; and a wide variety of useful functions and options are within. Interestingly, only RJSONIO and jsonlite give control over vector simplification, an important consideration when parsing JSON data.

Contacting the NHL API

When I first tried converting the data from JSON, “data.” appeared at the beginning of each column name; ostensibly, this points to the object within which the array of data was stored in JSON. However, thinking this a bit cumbersome for usage in later code, per Subhankar’s recommendation, I played around with using str_remove to remove the string, and found it to be useful for keeping the column names short and more enjoyable to work with.

Creating a Function for Accessing Franchise Data

api_function_franchise <- function(x) {
  base_url <- "https://records.nhl.com/site/api" 
  x <- "franchise"
  full_url <- paste0(base_url, "/", x) 
  franchise_get <- GET(full_url) 
  franchise_txt <- content(franchise_get, "text") 
  franchise_json <- fromJSON(franchise_txt, flatten=TRUE, simplifyDataFrame=TRUE)
  franchise_df <- as.data.frame(franchise_json)
  colnames(franchise_df) <- str_remove(colnames(franchise_df), "data.") 
  return(franchise_df)
}
franchise_data <- api_function_franchise(x)
#view(franchise_data)
head(franchise_data)
##   id firstSeasonId lastSeasonId mostRecentTeamId teamCommonName teamPlaceName
## 1  1      19171918           NA                8      Canadiens      Montréal
## 2  2      19171918     19171918               41      Wanderers      Montreal
## 3  3      19171918     19341935               45         Eagles     St. Louis
## 4  4      19191920     19241925               37         Tigers      Hamilton
## 5  5      19171918           NA               10    Maple Leafs       Toronto
## 6  6      19241925           NA                6         Bruins        Boston
##   total
## 1    38
## 2    38
## 3    38
## 4    38
## 5    38
## 6    38

Creating a Function for Accessing Franchise Team Totals

api_function_totals <- function(x) {
  base_url <- "https://records.nhl.com/site/api" 
  x <- "franchise-team-totals"
  full_url <- paste0(base_url, "/", x) 
  team_totals_get <- GET(full_url) 
  team_totals_txt <- content(team_totals_get, "text") 
  team_totals_json <- fromJSON(team_totals_txt, flatten=TRUE)
  team_totals_df <- as.data.frame(team_totals_json)  
  colnames(team_totals_df) <- str_remove(colnames(team_totals_df), "data.") 
  return(team_totals_df)
}
team_totals_data <- api_function_totals(x)
#view(team_totals_data)
head(team_totals_data)
##   id activeFranchise firstSeasonId franchiseId gameTypeId gamesPlayed
## 1  1               1      19821983          23          2        2937
## 2  2               1      19821983          23          3         257
## 3  3               1      19721973          22          2        3732
## 4  4               1      19721973          22          3         272
## 5  5               1      19261927          10          2        6504
## 6  6               1      19261927          10          3         515
##   goalsAgainst goalsFor homeLosses homeOvertimeLosses homeTies homeWins
## 1         8708     8647        507                 82       96      783
## 2          634      697         53                  0       NA       74
## 3        11779    11889        674                 81      170      942
## 4          806      869         46                  1       NA       84
## 5        19863    19864       1132                 73      448     1600
## 6         1436     1400        103                  0        1      137
##   lastSeasonId losses overtimeLosses penaltyMinutes pointPctg points roadLosses
## 1           NA   1181            162          44397    0.5330   3131        674
## 2           NA    120              0           4266    0.0039      2         67
## 3           NA   1570            159          57422    0.5115   3818        896
## 4           NA    124              0           5356    0.0147      8         78
## 5           NA   2693            147          85564    0.5125   6667       1561
## 6           NA    263              0           8132    0.0000      0        160
##   roadOvertimeLosses roadTies roadWins shootoutLosses shootoutWins shutouts
## 1                 80      123      592             79           78      193
## 2                  0       NA       63              0            0       25
## 3                 78      177      714             67           82      167
## 4                  0       NA       64              0            0        9
## 5                 74      360     1256             66           78      403
## 6                  0        7      107              0            0       44
##   teamId           teamName ties triCode wins total
## 1      1  New Jersey Devils  219     NJD 1375   104
## 2      1  New Jersey Devils   NA     NJD  137   104
## 3      2 New York Islanders  347     NYI 1656   104
## 4      2 New York Islanders   NA     NYI  148   104
## 5      3   New York Rangers  808     NYR 2856   104
## 6      3   New York Rangers    8     NYR  244   104

Creating a Function for Accessing Season Records

api_function_season <- function(ID) {
  base_url <- "https://records.nhl.com/site/api" 
  full_url <- paste0(base_url, "/", "franchise-season-records?cayenneExp=franchiseId=", ID) 
  season_get <- GET(full_url) 
  season_txt <- content(season_get, "text") 
  season_json <- fromJSON(season_txt, flatten=TRUE)
  season_df <- as.data.frame(season_json)
  colnames(season_df) <- str_remove(colnames(season_df), "data.") 
  return(season_df)
}
seasonCaps <- api_function_season(24) #Testing the function with Washington Capitals data, since I'm originally from the DC area!
#view(seasonCaps)

Creating a Function for Accessing Goalie Records

api_function_goalie <- function(ID) {
  base_url <- "https://records.nhl.com/site/api" 
  full_url <- paste0(base_url, "/", "franchise-season-records?cayenneExp=franchiseId=", ID) 
  goalie_get <- GET(full_url) 
  goalie_txt <- content(goalie_get, "text") 
  goalie_json <- fromJSON(goalie_txt, flatten=TRUE)
  goalie_df <- as.data.frame(goalie_json)
  colnames(goalie_df) <- str_remove(colnames(goalie_df), "data.")
  return(goalie_df)
}
goalieCanes<- api_function_goalie(26) #Testing the function with Carolina Hurricanes data, to honor my new home. 
#view(goalieCanes)

Creating a Function for Accessing Skater Records

api_function_skater <- function(ID) {
  base_url <- "https://records.nhl.com/site/api" 
  full_url <- paste0(base_url, "/", "franchise-season-records?cayenneExp=franchiseId=", ID) 
  skater_get <- GET(full_url) 
  skater_txt <- content(skater_get, "text") 
  skater_json <- fromJSON(skater_txt, flatten=TRUE)
  skater_df <- as.data.frame(skater_json)
  colnames(skater_df) <- str_remove(colnames(skater_df), "data.")
  return(skater_df)
}
skaterPenguins <- api_function_skater(17) #Testing the function - I'll look at some Penguins data too! 
#view(skaterPenguins)

Exploratory Data Analysis

Is a Team With More Recorded Shutouts More Likely to Still be an Active Franchise?

In a shutout game, a hockey team’s defense manages to prevent the other team from scoring any points throughout the entirety of the game - surely, a sign of the franchise’s strong performance! To begin, I decided to created a categorical variable called shutoutRange, usuing the mutate function to create ranges for total number of shutouts and ifelse to assign each value to the appropriate level of the new variable. I then created a contingency table to get a quick glimpse of how many active (and inactive) franchises fell within each category. Not too surprisingly, none of the teams that had a history of over a 100 shutouts were inactive. To visualize this data, I created a barplot, filling the bars with color based upon whether or not the franchise is active.

team_totals_data <- mutate(team_totals_data, shutoutRange = 
                             ifelse(shutouts %in% 0:50, "0-50",
                                ifelse(shutouts %in% 51:100, "51-100",
                                ifelse(shutouts %in% 101:150, "101-150", 
                                ifelse(shutouts >150, "151+", "")))))
team_totals_data$shutoutRange <- factor(team_totals_data$shutoutRange, levels=c("0-50", "51-100", "101-150","151+")) #Changed these values to factors so that they could be easily ordered for the barplot below. 
shutoutTable <- table(team_totals_data$activeFranchise, team_totals_data$shutoutRange)
kable(shutoutTable, caption="Table of Active or Non-Active Franchise vs. Total Number of Shutouts")
  0-50 51-100 101-150 151+
0 15 3 0 0
1 54 4 12 16

Table of Active or Non-Active Franchise vs. Total Number of Shutouts

shutoutGraph<- ggplot(data=team_totals_data, aes(x=shutoutRange))
shutoutGraph + geom_bar(aes(fill=as.factor(activeFranchise))) + 
  labs(x="Total Number of Shutouts", title="Bar Plot of Number of Shutouts") + 
  scale_fill_discrete(name="Active Franchise?", labels=c("No","Yes")) + 
  coord_flip()

Are Teams more Likely to Win at Home than Away?

We’ve heard of home-“field” advantage - let’s see if the same can be said of the rink. I began by creating another two variables:

For this purposes of this variable, I did not include ties.

team_totals_data <- mutate(team_totals_data, roadWinLossRatio = roadWins / roadLosses)
team_totals_data <- mutate(team_totals_data, homeWinLossRatio = homeWins / homeLosses)

team_totals_data %>%
  select(teamName, roadWins, roadLosses, roadWinLossRatio, homeWins, homeLosses, homeWinLossRatio) %>%
  arrange(teamName) %>%
  head()
##            teamName roadWins roadLosses roadWinLossRatio homeWins homeLosses
## 1     Anaheim Ducks       38         39        0.9743590       51         34
## 2     Anaheim Ducks      422        463        0.9114471      551        341
## 3   Arizona Coyotes       86        128        0.6718750      104        108
## 4    Atlanta Flames      107        156        0.6858974      161        104
## 5    Atlanta Flames        0          9        0.0000000        2          6
## 6 Atlanta Thrashers      159        233        0.6824034      183        204
##   homeWinLossRatio
## 1        1.5000000
## 2        1.6158358
## 3        0.9629630
## 4        1.5480769
## 5        0.3333333
## 6        0.8970588
roadWLR <- summary(team_totals_data$roadWinLossRatio)
roadWLR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4384  0.7130  0.6497  0.8788  1.2340
homeWLR <- summary(team_totals_data$homeWinLossRatio)
homeWLR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.9069  1.3024  1.2541  1.5573  3.0000       1

Looking at the mean and median Win-Loss ratios for on the road (or “away”) and home games, we see that both the home mean and median ratios are nearly double that of away games - perhaps indicating that a team is more likely to win when it’s playing on its own ice. We can then look at boxplots of these ratios, as a visual representation of these summary statistics. The distributions appear slightly skewed, with the Road Win-Loss ratio being slightly skewed left, and the Home Win-Loss Ratio appearing skewed right, likely due to its outlier. Looking at the data, we can identify this outlier as the Toronto Arenas franchise (with a game type ID of 2); however, we also note that they only played a total of 20 games at home, winning 15 - giving them their extraordinary Win-Loss Ratio of 3.0.

penalty_plot <- ggplot(team_totals_data, aes(x=roadWinLossRatio,y=""))
penalty_plot + 
  geom_boxplot(fill="#cc0000") +
  ggtitle("Boxplot of Ratio of Road Wins to Road Losses") +
  labs(y="")

penalty_plot <- ggplot(team_totals_data, aes(x=homeWinLossRatio,y=""))
penalty_plot + 
  geom_boxplot(fill="#cc0000") +
  ggtitle("Boxplot of Ratio of Home Wins to Home Losses") +
  labs(y="")

#team_totals_data %>% select(teamName, homeWinLossRatio) %>% view()
#We can see that the outlier, with a homeWinLossRatio of 3 is the Toronto Arenas. 
roadWinsLossesHist <- ggplot(team_totals_data, aes(x=roadWinLossRatio))
roadWinsLossesHist +
  geom_histogram(aes(y=..density..), binwidth=0.25) + 
  geom_density(kernel="gaussian")

  ggtitle("Histogram of Road Wins to Road Losses Ratio")
## $title
## [1] "Histogram of Road Wins to Road Losses Ratio"
## 
## attr(,"class")
## [1] "labels"
homeWinsLossesHist <- ggplot(team_totals_data, aes(x=homeWinLossRatio))
homeWinsLossesHist +
  geom_histogram(aes(y=..density..), binwidth=0.25) + 
  geom_density(kernel="gaussian") + 
  ggtitle("Histogram of Home Wins to Home Losses Ratio")

Do Aggressive Teams Tend to Win More?

Do teams that spend more time in the penalty box tend to win more? To help explore this question, I created another new variable called penaltyGamesRatio - the total number of minutes spent in the penalty box, divided by the total number of games played. This variable is intended to display the average amount of time per game a team’s players spend in penalty - perhaps a larger number is indicative of a more “aggressive”, edgy playing style.

team_totals_data <- mutate(team_totals_data, penaltyGamesRatio = penaltyMinutes / gamesPlayed)

team_totals_data %>%
  arrange(desc(penaltyGamesRatio)) %>%
  select(teamName, penaltyGamesRatio) %>%
  head()
##               teamName penaltyGamesRatio
## 1     Quebec Nordiques          30.42500
## 2    Atlanta Thrashers          28.75000
## 3       Atlanta Flames          27.11765
## 4     Hartford Whalers          25.97959
## 5       Toronto Arenas          25.14286
## 6 Winnipeg Jets (1979)          24.93548

I then calculated an overall Win-Loss ratio, by dividing total number of wins by total number of losses. My hope here was to create a standardized measure, controlling for the amount of time that a team had been in operation, so that new teams and old teams could be compared equitably. Upon printing the first six rows of the data, we see that the Vegas Golden Knights - a team that just began in 2017 - has the highest win-to-loss ratio.

team_totals_data <- mutate(team_totals_data, winLossRatio = wins / losses)
team_totals_data %>%
  select (teamName, winLossRatio) %>%
  arrange(desc(winLossRatio)) %>%
  head()
##               teamName winLossRatio
## 1 Vegas Golden Knights     1.662500
## 2      Edmonton Oilers     1.514286
## 3   Montréal Canadiens     1.512056
## 4         Dallas Stars     1.475661
## 5 Vegas Golden Knights     1.454545
## 6  Philadelphia Flyers     1.437369

I then created a categorical variable called MoreWinsOrLosses, using ifelse. If a team won more than it lost (that is, its winLossRatio > 1), then the value of MoreWinsOrLosses was designated as “More Wins”. Otherwise, it was assigned a value of “More Losses”. (For the purposes of this project, I designated the scenario of an equal number of wins and losses as “More Losses”.)

team_totals_data <- mutate(team_totals_data, moreWinsOrLosses = ifelse(winLossRatio > 1, "More Wins", "More Losses"))
team_totals_data %>%
  select(teamName, winLossRatio, moreWinsOrLosses) %>%
  arrange(teamName) %>%
  head(10)
##              teamName winLossRatio moreWinsOrLosses
## 1       Anaheim Ducks    1.2191781        More Wins
## 2       Anaheim Ducks    1.2101990        More Wins
## 3     Arizona Coyotes    0.8050847      More Losses
## 4      Atlanta Flames    1.0307692        More Wins
## 5      Atlanta Flames    0.1333333      More Losses
## 6   Atlanta Thrashers    0.7826087      More Losses
## 7   Atlanta Thrashers    0.0000000      More Losses
## 8       Boston Bruins    1.3439464        More Wins
## 9       Boston Bruins    0.9907407      More Losses
## 10 Brooklyn Americans    0.5517241      More Losses

I also created two contingency tables using the new variable. Winning teams appear to be both more likely to still be an active franchise, as well as more likely to achieve shutout games.

shutoutTable <- table(team_totals_data$moreWinsOrLosses, team_totals_data$activeFranchise)
kable(shutoutTable, caption="Table of More Wins or Losses vs. Active or Non-Active Franchise")
  0 1
More Losses 15 44
More Wins 3 42

Table of More Wins or Losses vs. Active or Non-Active Franchise

kable(table(team_totals_data$moreWinsOrLosses, team_totals_data$shutoutRange), caption="Contingency Table of More Wins or Losses vs. Shutout Range")
  0-50 51-100 101-150 151+
More Losses 53 2 2 2
More Wins 16 5 10 14

Contingency Table of More Wins or Losses vs. Shutout Range

Looking at the resulting boxplot (below), we can see that the median amount of time spent in the penalty box by a “winning” team is greater than that of a “losing” team. Yet interestingly, the winning team range is much, much narrower than that of a losing team, and two losing teams prove to be outliers, with one team spending an average of over 30 minutes a game in the penalty box.

penalty_plot <- ggplot(team_totals_data, aes(x=penaltyGamesRatio,y=moreWinsOrLosses))
penalty_plot + 
  geom_boxplot(fill="#cc0000") +
  ggtitle("Boxplot of Average Time Spent in Penalty Box") +
  labs (x="Ratio of Total Minutes in Penalty Box to Total Number of Games Played", y="'Winning' or 'Losing' Team?")

Looking at the scatterplot, there does not appear to be a straight-line relationship between the amount of time that a team spends in the penalty box and its Win-Loss ratio. Looking at the distribution of the points, it seems that there may be a “sweet spot” - perhaps indicating that making certain illegal or more aggressive plays that could be risky, but rewarding.

ratiosScatter <- ggplot(team_totals_data, aes(x=penaltyGamesRatio, y=winLossRatio))
ratiosScatter +
  geom_point() +
  geom_smooth() + 
  ggtitle("Scatterplot of Average Time Spent in Penalty Box vs. Win-Loss Ratio") + 
  labs(x="Average Time Spent in Penalty Box", y="Win-Loss Ratio")

Do Newer Teams Tend to be More Aggressive?

For this brief analysis, I filtered the dataset twice: once for years after or equal to 2000 (season stored as 20002001), then for years before 2000. I then took summary statistics for each of the ratios of Penalty to Games. In this case, we see that the median and mean for teams before 2000 is actually higher than that of teams that began in or after 2000, perhaps indicating that newer teams are actually less likely to be in the penalty box.

post2000Penalties<- team_totals_data %>% 
  filter(firstSeasonId >= 20002000) %>% 
  select(teamName, penaltyGamesRatio, firstSeasonId) %>% 
  arrange(desc(penaltyGamesRatio))
summary(post2000Penalties$penaltyGamesRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.272   8.894  10.581  10.339  11.010  13.222
pre2000Penalties <- team_totals_data %>% 
  filter(firstSeasonId < 20002000) %>% 
  select(teamName, penaltyGamesRatio, firstSeasonId) %>% 
  arrange(desc(penaltyGamesRatio))
summary(pre2000Penalties$penaltyGamesRatio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.50   11.91   14.20   14.83   17.11   30.43

How are Weak Goalies Affecting Teams’ Performances?

The Goals Against Average indicates “the number of goals a goaltender allows per 60 minutes of playing time” - that is, per one game. The column goalsAgainst in the dataset is a cumulative measure, indicating the total number of Goals Against for the team since its inception. We can average this as a “per game” value by dividing the goalsAgainst by the total number of games played. After creating that new variable, I then plotted this value against the total Win-Loss ratio. Of course, there appears to be a weak negative correlation between the two values, with r= -0.421: the larger the number of goals that a team’s goalies have let in, the lower their win-loss ratio. I set the colors of the values on the scatterplot by whether or not a team is still an active franchise - indeed, the greater the number of goals are let in, the lower the number of wins… and seemingly, the less likely they are to still be an active franchise. The active franchises appear to have a much lower Goals Against Ratio, and a much higher Win-Loss ratio.

team_totals_data$activeFranchise <- as.logical(team_totals_data$activeFranchise)
team_totals_data <- mutate(team_totals_data, goalsAgainstRatio = goalsAgainst / gamesPlayed)
correlation <- cor(team_totals_data$goalsAgainstRatio, team_totals_data$winLossRatio)
correlation
## [1] -0.4209071
goalsAgainstLossesPlot <- ggplot(team_totals_data, aes(x=goalsAgainstRatio, y=winLossRatio, color=activeFranchise))
goalsAgainstLossesPlot + 
  geom_point() + 
  geom_smooth(method=lm) +
  ggtitle("Scatterplot of Goals Against vs. Win-Loss Ratio") +
  labs(x="Goals Against Ratio", y="Win-Loss Ratio") +
  scale_color_discrete(name="Active?", labels=c("No","Yes"))