Figure 1. The code of the data-generating system is written in R version 4.0.2 and is available online³. The observations originating from this simulator are defined as Ô_i = (W, A, Y) ∼ P_s, in which W = (W_1, W_2, W_3) are the confounders and A ∈ {0, 1} is an indicator of whether a substitution happened in the previous period. P_s is the simulation probability distribution from which the simulation observations Ô were sampled⁴. The subscript i indicates a specific simulation observation Ô_i ∈ Ô.

3.2.2. Observed data

We retrospectively collected the in-match position tracking data from 302 competitive professional soccer matches between 18 teams during the 2018–2019 season of the Dutch premier league, the 'Eredivisie'. The players' time, position, speed, and acceleration were detected and recorded by the SportsVU optical tracking system (SportsVU, STATS LLC, Chicago, IL, USA). Linke et al. (2018) tested the SportsVU optical tracking system and rated it as adequately reliable [29]. For our analysis, two matches with erroneous and missing data were excluded. We used only the second half of each match, as we expected substitutions to be most effective there. Additionally, the extra time at the end of the second half and the goalkeepers were excluded from the dataset. To control for the effect of substitution on the match, we identified both entire-match players and substitutes: entire-match players played the entire match, while substitutes entered the match at a later stage. The dataset was divided into periods of five minutes and contained a total of N = 5226 observations (O_n). As an illustration of the data, Figure 2 shows the increasing number of substitutes during the second half. Figure 3 visualizes the total distance of the team after a substitution in the previous period, compared with no substitution in the previous period. Each observation O_i ∈ O_n is considered mutually independent⁵.
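To make the construction of the five-minute periods concrete, the following is a minimal sketch in Python; it is not the authors' R code. It assumes a hypothetical input layout: team distance samples and substitution times, both timestamped in seconds since the start of the second half. The names `Observation`, `period_index`, and `build_observations` are illustrative. The confounders W are left out because they are not specified in this section, and the first period is assumed to have A = 0.

```python
from dataclasses import dataclass

PERIOD_SECONDS = 5 * 60  # five-minute periods, as described in the text


@dataclass
class Observation:
    match_id: str
    period: int  # index of the five-minute period within the second half
    A: int       # 1 if a substitution happened in the previous period, else 0
    Y: float     # total distance of the team in this period, in meters


def period_index(seconds_into_half: float) -> int:
    """Map a timestamp (seconds since the start of the second half) to a period index."""
    return int(seconds_into_half // PERIOD_SECONDS)


def build_observations(match_id, distance_samples, substitution_times):
    """Aggregate one match into per-period observations.

    distance_samples: iterable of (seconds_into_half, meters) pairs (hypothetical layout).
    substitution_times: iterable of seconds_into_half at which a substitution occurred.
    """
    # Sum the team distance within each five-minute period.
    totals = {}
    for t, meters in distance_samples:
        p = period_index(t)
        totals[p] = totals.get(p, 0.0) + meters

    # Periods in which at least one substitution took place.
    sub_periods = {period_index(t) for t in substitution_times}

    # A looks one period back; period 0 has no previous period, so A = 0 (assumption).
    return [
        Observation(match_id, p, int((p - 1) in sub_periods), totals[p])
        for p in sorted(totals)
    ]
```

For example, a substitution 250 seconds into the half falls in period 0, so the observation for period 1 receives A = 1, while periods 0 and 2 receive A = 0.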
Each observation O_i is defined as O_i = (W, A, Y) ∼ P_0, in which W = (W_1, W_2, W_3) are the confounders, A ∈ {0, 1} is an indicator of whether a substitution happened in the previous period, and Y is the total distance of the team in meters. P_0 is the unknown true underlying probability distribution from which O_n was sampled. In the remainder of this work, we refer to P_n as the empirical distribution of the data.

³ Available at https://github.com/dijkhuist/Entropy-TMLE-Substitutions.
⁴ The hat ( ˆ ) signifies that this is data from the simulator.
⁵ Note that the data in this case study possibly have a stronger time dimension than our causal model currently shows. In fact, Y at time t could potentially influence W_3, or even A and Y itself, at time t + 1. As our aim with this paper is to introduce TMLE and causal inference in sports, we will not go into detail on the time dimensionality of the data. For more information on time series analysis in Targeted Learning, see [30].

The observed