How To: Getting Minor League Play-by-Play Data, Part I
Step one in obtaining minor league play-by-play data using R
Good morning! Today’s tutorial walks through the process of obtaining minor league play-by-play data. Full disclosure: most of this code was written by Scott before his departure, so we’ll do our best to keep the explanation clear and easy to follow. The tutorial posts are for premium subscribers only, but we’ll occasionally post them for everyone, as we did a few months ago. You can find that post here.
Important note: The tutorial posts are going to assume that the reader has at least a basic understanding of R. If you’ve never written a line of code or worked with baseball data, there are many great introductory resources out there. A great place to start is R for Data Science by Hadley Wickham & Garrett Grolemund. It covers everything you’ll need to know to begin working with data, performing analysis, and building visualizations. RStudio (now Posit) also puts out many great resources like this guide for getting started in R.
Getting Started…
Quite a few readers and folks on Twitter have asked about how to find and use the data we write about in the newsletter. Well, today’s post will help you get started so that you can do it yourself.
We’ll start by loading a few packages you’ll need in order to complete this tutorial. If you don’t have these packages installed already, you can usually get them from CRAN with install.packages("tidyverse"), swapping in the name of whichever package is missing.
library(tidyverse)
library(baseballr)
library(lubridate)
library(gt)
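If you're not sure which of these packages you already have, here is a minimal sketch for checking. The variable names (needed, missing_pkgs) are just ours for illustration, and the install line is left commented out so nothing gets installed without you deciding to.

```r
# Packages used in this tutorial
needed <- c("tidyverse", "baseballr", "lubridate", "gt")

# Compare against what is already installed in your R library
missing_pkgs <- setdiff(needed, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install whatever is missing from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All packages are already installed.")
}
```

Once everything is installed, the library() calls above should run without errors.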
After we load our packages up, we’ll start by leveraging some of the excellent functions created by Bill Petti in the baseballr package. We can use the get_game_pks_mlb function to obtain a list of all the games we want play-by-play (pbp) data for from last season. But first, let’s start simple and just get games from one day and level, then walk through the process of querying for pbp data for that day.
# level 12 corresponds to Double-A
games = get_game_pks_mlb(date = "2022-06-01", level_ids = 12)
print(games, n = 10)
We now have a tibble that is 16 by 59: 16 rows, one per game, and 59 columns. These are all the Double-A games from June 1st, 2022. Here we’ve printed just the top 10 rows in the data using print(games, n = 10).
Here is that same data in a nicer table:
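If you'd like to build a table like this yourself, here is one rough way to do it with gt. This is a sketch, not necessarily how our table was made. The helper name games_table is hypothetical, and the column names in wanted_cols are assumptions about what the games tibble contains; any_of() simply skips any that aren't there.

```r
library(dplyr)
library(gt)

# Columns we'd like to show. These names are assumptions about the games
# tibble; any_of() ignores any that are not present.
wanted_cols <- c(
  "game_pk", "officialDate",
  "teams.away.team.name", "teams.home.team.name",
  "status.detailedState"
)

# Hypothetical helper: keep a few columns, take the top n rows, render with gt
games_table <- function(games, n = 10, title = "Double-A games, June 1, 2022") {
  games %>%
    select(any_of(wanted_cols)) %>%
    head(n) %>%
    gt() %>%
    tab_header(title = title)
}
```

With the games tibble from above in your session, games_table(games) should render the first 10 games in the RStudio viewer.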
Now if we want pbp data for just one game, we can choose any game_pk from our game list and use the get_pbp_mlb function from the baseballr package to query the data for that game. Something like this…
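As a minimal sketch of that step, assuming the games tibble from above has a game_pk column: the two helpers below (pick_game_pk and get_one_game_pbp) are hypothetical names of ours, not part of baseballr. If you're on a newer version of baseballr, you may find the pbp function under the name mlb_pbp() instead.

```r
# Hypothetical helper: pull the i-th game_pk out of the games tibble
pick_game_pk <- function(games, i = 1) {
  stopifnot("game_pk" %in% names(games), i >= 1, i <= nrow(games))
  games$game_pk[[i]]
}

# Hypothetical wrapper: query play-by-play data for one game.
# Uses get_pbp_mlb() as named in this post; this needs the baseballr
# package installed and an internet connection when called.
get_one_game_pbp <- function(games, i = 1) {
  baseballr::get_pbp_mlb(game_pk = pick_game_pk(games, i))
}
```

Calling pbp <- get_one_game_pbp(games) should then return the pbp data for the first game in the list, and dplyr::glimpse(pbp) is a quick way to see what columns came back.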