Decision Analysis and Trees in Python: The Case of the Oakland A’s | by Giovanni Malloy | May, 2023


Using decision trees in Python to extract insight into the A’s decision to move to Las Vegas

Photo by Rick Rodriguez on Unsplash

Just recently, the owner of the Oakland Athletics baseball team, John Fisher, announced that the team had purchased close to 50 acres of land in Las Vegas, Nevada. [1] This puts the future of Oakland’s last remaining professional sports team in jeopardy. In the last 5 years, Oakland has seen the Golden State Warriors (NBA) and the Raiders (NFL) depart for newer, shinier stadiums in other cities (although Golden State just went across the Bay Bridge to San Francisco). While the decision-making process in the Oakland A’s front office remains a mystery to me, data science and decision analysis in tandem can reveal a great deal about John Fisher’s motives to move to Las Vegas.

Decision analysis is important for all data scientists to understand because it is the bridge between the highly technical work of probability and statistical models and business decisions. Understanding how business decisions are made can help us frame our work and present our findings to non-technical audiences as we provide actionable recommendations. The Institute for Operations Research and the Management Sciences (INFORMS) even has an entire society dedicated to decision analysis.

Additionally, machine learning can help generalize the results of decision analysis by unlocking insight from probabilistic sensitivity analyses. After initially constructing a model dissecting the Oakland vs. Las Vegas scenarios using decision analysis, we will use machine learning to mine for patterns that could help reveal actionable recommendations for the A’s should the circumstances of the decision change.

What is decision analysis?

Decision analysis is the field of study dedicated to a “systematic, quantitative, and visual approach to addressing and evaluating important choices.” [2] It can be a powerful tool in a low data environment and help individuals use a mix of subject matter expertise and prior knowledge to improve the expected value of complex decisions. It is used in a wide range of fields, such as economics, management, and policy analysis.

Generally, in the world of decision analysis, we take a Bayesian perspective of the world. Bayes’ theorem is the following:

P(A|B) = P(B|A) P(A) / P(B)

Image created by the author.

Where P(A) is the probability of event A occurring, P(B) is the probability of event B occurring, P(A|B) is the probability of event A occurring given that event B occurred, and P(B|A) is the probability of event B occurring given that event A occurred. Typically, P(A) represents a prior belief about the chance of A occurring and B represents some new data. P(A|B) is an updated posterior belief about the chance of A occurring after you observed B.

For example, let’s suppose that we go to the Oakland-Alameda County Coliseum to watch a ballgame, but we haven’t been keeping track of player statistics. We start with the knowledge that outfielders get on base with probability 0.35, infielders get on base with probability 0.25, and designated hitters get on base with probability 0.4. Let A be the event that the next batter is an outfielder, B be the event that the next batter is an infielder, and C be the event that the next batter is the designated hitter. Since we know the roster of a baseball team, we already know that P(A) = 0.33, P(B) = 0.56, and P(C) = 0.11. Now, that next batter comes up to the plate and, to our delight, gets on base (event D)! From our earlier knowledge of baseball, we know P(D|A) = 0.35, P(D|B) = 0.25, and P(D|C) = 0.4. Using the law of total probability, we can calculate P(D) = P(D|A)P(A) + P(D|B)P(B) + P(D|C)P(C) ≈ 0.3. Now, we can update our beliefs about which type of player the batter was: P(A|D) = P(D|A)P(A) / P(D) ≈ 0.39, and likewise P(B|D) ≈ 0.47 and P(C|D) ≈ 0.15. After seeing the player get on base, our belief that the batter was an infielder has dropped from 0.56 to 0.47, so we now think it more likely than not that the batter was not an infielder. Now that you are in the right frame of mind, let’s continue.
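The update in the ballpark example can be reproduced in a few lines. This is a minimal sketch using only the numbers given above; the dictionary keys and the function name `bayes_update` are illustrative choices.

```python
# Prior probability of each type of batter (from the roster)
priors = {"outfielder": 0.33, "infielder": 0.56, "designated hitter": 0.11}
# Probability of getting on base given each type of batter
likelihoods = {"outfielder": 0.35, "infielder": 0.25, "designated hitter": 0.40}


def bayes_update(priors, likelihoods):
    """Return P(D) and the posterior P(type | D) for every batter type."""
    # Law of total probability: P(D) = sum over types of P(D | type) * P(type)
    evidence = sum(priors[k] * likelihoods[k] for k in priors)
    # Bayes' theorem applied to each type
    posteriors = {k: priors[k] * likelihoods[k] / evidence for k in priors}
    return evidence, posteriors


p_on_base, posteriors = bayes_update(priors, likelihoods)
print(f"P(D) = {p_on_base:.2f}")
for batter_type, prob in posteriors.items():
    print(f"P({batter_type} | D) = {prob:.2f}")
```

The posteriors sum to 1 before rounding; the rounded figures quoted in the text add up to 1.01 only because each is rounded separately.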

The key tool in decision analysis is a decision tree (not to be confused with the machine learning algorithm by the same name). [3] The decision tree features two basic components: a decision node and a chance node. [3] In this blog, I’m going to show you how to construct a decision tree, evaluate it in Python, and understand the Oakland A’s decision to move to Las Vegas.

What is the decision?

Howard and Abbas define a decision as “a choice between two or more alternatives that involves an irrevocable allocation of resources.” [3] This is a broad definition, but in our example of the Oakland A’s the decision is: should the Athletics baseball team stay in Oakland or move to Las Vegas? In this case, the decision is irrevocable because they will be building a new stadium regardless of the city chosen.

What are the uncertainties?

Uncertainties surround every decision we make. In the decision of whether to stay in Oakland or move to Las Vegas, the A’s are uncertain about new stadium costs and subsequent operating revenues: 1) how much public money they will receive to fund their new stadium, 2) how much revenue they will generate in ticket sales, and 3) how much revenue they will generate in local television deals.

The A’s are currently hoping to build a $1.5 billion stadium in Las Vegas. [1] Back in 2021, the organization had asked for $855 million in public money to help build their new stadium in Oakland despite previously agreeing with the city and county that the new stadium in Oakland would be privately funded. [1] Therefore, we can reasonably assume that the cost of building the stadium is roughly the same in both localities. The only uncertainty here is how much taxpayer money will go towards funding the stadium.

Estimated revenue from ticket sales varies tremendously between teams, from $27 million to $131 million, with a median of around $75 million. [4] Oakland was estimated to have a ticket revenue of approximately $55 million. [4]

Revenue from national television deals negotiated by MLB is distributed evenly among the teams. However, an important component of individual team television revenue comes through regional sports networks (RSNs). The teams get to keep much of the revenue from local television deals, although there is still a great deal of revenue sharing. After revenue sharing, television contract revenue from RSNs varied from $36 million to $131 million, with all but the most valuable teams making less than $60 million. [4]

Thanks to the Raiders’ (NFL) move to Las Vegas from Oakland several years ago, we know that the city of Las Vegas was willing to provide $750 million in public funds to build a brand-new football stadium. [5] We also know that both locals and tourists alike are prepared to join in and cheer on a new professional team, as the Raiders led the NFL in ticket revenue in 2021 at $119 million for the year. [6]

There are methods beyond the scope of this blog for eliciting the decision maker’s prior beliefs about the likely outcomes of these uncertainties and the probabilities of each. Additionally, I doubt John Fisher is prepared to comment for my blog. So, in the meantime, I will use the information I have pulled together from these web sources to provide several possible scenarios for each uncertainty.

Image created by author.

What is our decision time horizon?

Of course, revenues are annual figures, and the stadium should last much longer than a year. Time horizons can differ by the context of the decision and how the decision maker views the likelihood of change over the landscape. In data science terms, this is analogous to data drift, where the data used to train the model differs from current data. For now, let’s assume that these estimates will remain fairly steady over the decade and use a 10-year time horizon with a 3% annual discount rate on our annual revenues.
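To make the discounting concrete, here is a small sketch of how a constant annual revenue converts to a present value over the 10-year horizon. It assumes each year’s revenue arrives at the end of the year (an ordinary annuity); the function names are illustrative.

```python
def annuity_factor(rate, years):
    """Present value of 1 unit received at the end of each year for `years` years."""
    return (1 - (1 + rate) ** -years) / rate


def present_value(annual_amount, rate=0.03, years=10):
    """Present value of a constant annual amount."""
    return annual_amount * annuity_factor(rate, years)


# Oakland's estimated ticket revenue of $55 million per year, from the text above
closed_form = present_value(55)

# The same figure computed by discounting each year separately
year_by_year = sum(55 / (1 + 0.03) ** t for t in range(1, 11))

print(f"Annuity factor: {annuity_factor(0.03, 10):.4f}")
print(f"Closed form: {closed_form:.1f} million, year by year: {year_by_year:.1f} million")
```

With a 3% rate, ten years of revenue are worth about 8.5 times one year’s revenue rather than 10 times, which is the adjustment applied to the ticket sales and RSN figures in the model.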

What does the decision tree look like?

Now that we have defined all of the components of our decision tree, it is time to build the tree. Conceptually, here is what it looks like:

Image created by author.

The square node is the decision node, the circular nodes are the chance nodes, and the triangular nodes are the terminal nodes. Due to space limitations, the entire tree is not visible in the image, but each node has an associated probability and value, as well.

How do we build the model in Python?

In decision analysis, after establishing the construction of our decision tree, we can identify the best decision by “rolling back” the tree. In this example, we assume a rational (aka expected value) decision maker. So, we start by tabulating the value associated with the terminal state, if applicable. This will become our running total or expected value. In this case, it is not applicable, so we start with a total of $0. Then, we iteratively calculate the expected value of each set of nodes to the left of the terminal nodes and add it to the running total or expected value. In the end, we will end up with one expected value of the decision to stay in Oakland and one expected value of the decision to move to Las Vegas.
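Before turning to the data frame approach, the roll-back idea can be sketched on a tiny tree. The classes and the toy numbers below are hypothetical and are not the A’s model; the point is only that a chance node’s value is the probability-weighted average of each branch’s value plus whatever follows it, and a decision node takes the best alternative.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Terminal:
    value: float = 0.0  # payoff collected at the end of a path


@dataclass
class Chance:
    # Each branch: (probability, value earned on this branch, node that follows)
    branches: List[Tuple[float, float, "Node"]]


@dataclass
class Decision:
    # Each alternative: (label, node that follows)
    alternatives: List[Tuple[str, "Node"]]


Node = Union[Terminal, Chance, Decision]


def roll_back(node: Node) -> float:
    """Expected value of a node for an expected-value decision maker."""
    if isinstance(node, Terminal):
        return node.value
    if isinstance(node, Chance):
        # Probability-weighted average of (branch value + downstream value)
        return sum(p * (v + roll_back(child)) for p, v, child in node.branches)
    # Decision node: take the alternative with the highest expected value
    return max(roll_back(child) for _, child in node.alternatives)


def best_alternative(node: Decision) -> str:
    """Label of the alternative with the highest expected value."""
    return max(node.alternatives, key=lambda alt: roll_back(alt[1]))[0]


# Toy tree with made-up numbers
second = Chance([(0.2, 50.0, Terminal()), (0.8, 10.0, Terminal())])
risky = Chance([(0.5, 100.0, second), (0.5, 0.0, second)])
safe = Chance([(1.0, 60.0, Terminal())])
toy_tree = Decision([("risky", risky), ("safe", safe)])

print(roll_back(toy_tree), best_alternative(toy_tree))
```

In the toy tree the second chance node is worth 18, so the risky alternative is worth 0.5 × (100 + 18) + 0.5 × (0 + 18) = 68, which beats the safe alternative’s 60.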

Let’s start with a simple setup of our base case scenario. We are going to take the approach of creating a data frame of all possible combinations of decision, public money, ticket sales, and RSN revenue scenarios.

import numpy as np
import pandas as pd

# Create data frame of all possible outcomes
decision_list = ['Oakland', 'Las Vegas']

# First Node
chance_node_stadium_money_scenarios = ['Optimistic', 'Neutral', 'Pessimistic']
chance_node_stadium_money_probabilities_oakland = [0.1, 0.3, 0.6]
chance_node_stadium_money_probabilities_vegas = [0.5, 0.4, 0.1]
chance_node_stadium_money_values = [855, 500, 0]

#Second Node
chance_node_ticket_sales_scenarios = ['Optimistic', 'Neutral', 'Pessimistic']
chance_node_ticket_sales_probabilities_oakland = [0.2, 0.2, 0.6]
chance_node_ticket_sales_probabilities_vegas = [0.3, 0.4, 0.3]
chance_node_ticket_sales_values_per_year = [80, 55, 27]

# Third Node
chance_node_rsn_revenue_scenarios = ['Optimistic', 'Neutral', 'Pessimistic']
chance_node_rsn_revenue_probabilities_oakland = [0.15, 0.5, 0.35]
chance_node_rsn_revenue_probabilities_vegas = [0.1, 0.3, 0.6]
chance_node_rsn_revenue_values_per_year = [60, 45, 36]

# Convert annual values to NPV of 10 year time horizon
time_horizon = 10 # years
discount_rate = 0.03 # per year
chance_node_ticket_sales_values = [val * (1 - (1/((1 + discount_rate)**time_horizon)))/discount_rate for val in chance_node_ticket_sales_values_per_year]
chance_node_rsn_revenue_values = [val * (1 - (1/((1 + discount_rate)**time_horizon)))/discount_rate for val in chance_node_rsn_revenue_values_per_year]

# Create data frame of all possible scenarios
decision_list_list_for_df = []
chance_node_stadium_money_list_for_df = []
chance_node_stadium_money_probability_list_for_df = []
chance_node_stadium_money_value_list_for_df = []
chance_node_ticket_sales_list_for_df = []
chance_node_ticket_sales_probability_list_for_df = []
chance_node_ticket_sales_value_list_for_df = []
chance_node_rsn_revenue_list_for_df = []
chance_node_rsn_revenue_probability_list_for_df = []
chance_node_rsn_revenue_value_list_for_df = []

for i in decision_list:
    for j in range(len(chance_node_stadium_money_scenarios)):
        for k in range(len(chance_node_ticket_sales_scenarios)):
            for m in range(len(chance_node_rsn_revenue_scenarios)):
                decision_list_list_for_df.append(i)
                chance_node_stadium_money_list_for_df.append(chance_node_stadium_money_scenarios[j])
                chance_node_stadium_money_value_list_for_df.append(chance_node_stadium_money_values[j])
                chance_node_ticket_sales_list_for_df.append(chance_node_ticket_sales_scenarios[k])
                chance_node_ticket_sales_value_list_for_df.append(chance_node_ticket_sales_values[k])
                chance_node_rsn_revenue_list_for_df.append(chance_node_rsn_revenue_scenarios[m])
                chance_node_rsn_revenue_value_list_for_df.append(chance_node_rsn_revenue_values[m])

                # Probabilities depend on which city is chosen
                if i == 'Oakland':
                    chance_node_stadium_money_probability_list_for_df.append(chance_node_stadium_money_probabilities_oakland[j])
                    chance_node_ticket_sales_probability_list_for_df.append(chance_node_ticket_sales_probabilities_oakland[k])
                    chance_node_rsn_revenue_probability_list_for_df.append(chance_node_rsn_revenue_probabilities_oakland[m])
                elif i == 'Las Vegas':
                    chance_node_stadium_money_probability_list_for_df.append(chance_node_stadium_money_probabilities_vegas[j])
                    chance_node_ticket_sales_probability_list_for_df.append(chance_node_ticket_sales_probabilities_vegas[k])
                    chance_node_rsn_revenue_probability_list_for_df.append(chance_node_rsn_revenue_probabilities_vegas[m])

decision_tree_df = pd.DataFrame(list(zip(decision_list_list_for_df, chance_node_stadium_money_list_for_df,
                                         chance_node_stadium_money_probability_list_for_df,
                                         chance_node_stadium_money_value_list_for_df,
                                         chance_node_ticket_sales_list_for_df,
                                         chance_node_ticket_sales_probability_list_for_df,
                                         chance_node_ticket_sales_value_list_for_df,
                                         chance_node_rsn_revenue_list_for_df,
                                         chance_node_rsn_revenue_probability_list_for_df,
                                         chance_node_rsn_revenue_value_list_for_df)),
                                columns = ['Decision',
                                           'Stadium_Money_Result', 'Stadium_Money_Prob', 'Stadium_Money_Value',
                                           'Ticket_Sales_Result', 'Ticket_Sales_Prob', 'Ticket_Sales_Value',
                                           'RSN_Revenue_Result', 'RSN_Revenue_Prob', 'RSN_Revenue_Value'])
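As an aside, the nested loops above enumerate every path through the tree, which is a Cartesian product of the decision and the three sets of scenarios. Below is a minimal sketch of the same enumeration with the standard library’s `itertools.product`. It only builds the scenario labels (the probability and value columns are left out for brevity), and the variable names are illustrative rather than the ones used in the rest of this post.

```python
from itertools import product

decisions = ["Oakland", "Las Vegas"]
scenarios = ["Optimistic", "Neutral", "Pessimistic"]

# One row per path: decision x stadium money x ticket sales x RSN revenue
rows = [
    {
        "Decision": decision,
        "Stadium_Money_Result": stadium,
        "Ticket_Sales_Result": tickets,
        "RSN_Revenue_Result": rsn,
    }
    for decision, stadium, tickets, rsn in product(decisions, scenarios, scenarios, scenarios)
]

print(len(rows))  # number of paths through the tree
print(rows[0])
```

Passing `rows` to `pd.DataFrame` would give the label columns of the full table in the same row order as the loops.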

Now, if you print your decision tree, you’ll get a pandas dataframe of 54 rows and 10 columns. We can easily roll back the decision tree with a creative use of groupby and merge functions. Let’s start with tabulating the expected value from RSN revenue for every combination of decision, stadium money, and ticket sales:

decision_tree_df['RSN_EV'] = decision_tree_df['RSN_Revenue_Prob'] * decision_tree_df['RSN_Revenue_Value']

# Consolidate the RSN_EV values
RSN_rollback_df = decision_tree_df.groupby(['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob', 'Ticket_Sales_Result', 'Ticket_Sales_Prob'])['RSN_EV'].sum().reset_index()

# Keep the rest of the columns
decision_tree_df = decision_tree_df.groupby(['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob', 'Ticket_Sales_Result', 'Ticket_Sales_Prob'])[['Stadium_Money_Value', 'Ticket_Sales_Value']].mean().reset_index()

# merge two dataframes
decision_tree_df = pd.merge(decision_tree_df, RSN_rollback_df, on = ['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob', 'Ticket_Sales_Result', 'Ticket_Sales_Prob'])

The resulting table has shrunk, and now you can visually see the expected value of the rolled back RSN revenue nodes.

Image created by author

Repeating the process with ticket sales, we have the following code. Each ticket sales branch carries its own value plus the expected RSN revenue that follows it, so the branch probability multiplies the sum of the two:

decision_tree_df['Ticket_Sales_RSN_EV'] = decision_tree_df['Ticket_Sales_Prob'] * (decision_tree_df['Ticket_Sales_Value'] + decision_tree_df['RSN_EV'])

# Consolidate the Ticket Sales and RSN_EV values
ticket_sales_rollback_df = decision_tree_df.groupby(['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob'])['Ticket_Sales_RSN_EV'].sum().reset_index()

# Keep the rest of the columns
decision_tree_df = decision_tree_df.groupby(['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob'])['Stadium_Money_Value'].mean().reset_index()

# merge two dataframes
decision_tree_df = pd.merge(decision_tree_df, ticket_sales_rollback_df, on = ['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob'])

And output:

Image created by author.

Finally, repeating for the stadium public money contribution:

decision_tree_df['Stadium_Money_Ticket_Sales_RSN_EV'] = decision_tree_df['Stadium_Money_Prob'] * (decision_tree_df['Stadium_Money_Value'] + decision_tree_df['Ticket_Sales_RSN_EV'])

# Consolidate the Stadium Money, Ticket Sales, and RSN_EV values
decision_tree_df = decision_tree_df.groupby(['Decision'])['Stadium_Money_Ticket_Sales_RSN_EV'].sum().reset_index()

Image created by author

Here, we can see that the model calculates that over a 10-year time horizon, the expected value of staying put in Oakland is approximately $0.98 billion while the expected value of moving to Las Vegas is approximately $1.44 billion. Both figures sit below the best possible outcome of about $2.05 billion ($855 million of public money plus ten discounted years of $80 million in ticket sales and $60 million in RSN revenue), as any expected value must.
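Because the three uncertainties are independent and simply add up along each path, the rolled-back expected value for each city should equal the sum of the three individual expected values. A small standard-library sketch of that cross-check, reusing the base-case inputs from the setup above (the helper and variable names are mine, not part of the original code):

```python
# Base-case inputs copied from the setup above (values in millions of dollars)
stadium_values = [855, 500, 0]
ticket_values_per_year = [80, 55, 27]
rsn_values_per_year = [60, 45, 36]

probabilities = {
    "Oakland": {"stadium": [0.1, 0.3, 0.6], "tickets": [0.2, 0.2, 0.6], "rsn": [0.15, 0.5, 0.35]},
    "Las Vegas": {"stadium": [0.5, 0.4, 0.1], "tickets": [0.3, 0.4, 0.3], "rsn": [0.1, 0.3, 0.6]},
}

discount_rate, time_horizon = 0.03, 10
# Converts a constant annual amount into its 10-year present value
annuity = (1 - (1 + discount_rate) ** -time_horizon) / discount_rate


def expectation(probs, values):
    """Probability-weighted average of a list of values."""
    return sum(p * v for p, v in zip(probs, values))


direct_ev = {}
for city, p in probabilities.items():
    direct_ev[city] = (
        expectation(p["stadium"], stadium_values)
        + annuity * expectation(p["tickets"], ticket_values_per_year)
        + annuity * expectation(p["rsn"], rsn_values_per_year)
    )

# Largest total any single path can reach; no expected value can exceed it
best_case = max(stadium_values) + annuity * (max(ticket_values_per_year) + max(rsn_values_per_year))

for city, ev in direct_ev.items():
    print(f"{city}: {ev:,.0f} million (best single path: {best_case:,.0f} million)")
```

If the table-based roll-back and this direct sum disagree, or if either exceeds the best single path, that is a sign that some downstream value is being counted more than once.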

How can we generalize the model?

Of course, there is uncertainty in both our data and our model, and there are many different scenarios which we could test. Naturally, we may look to define some thresholds or scenarios at which the decision changes from staying in Oakland to moving to Las Vegas (or vice versa). These decision points can serve as a helpful set of “business rules” for decision makers and can help us as data scientists extract actionable recommendations from our analysis.

There are many ways to achieve this end, but in this blog, we will use machine learning meta-modeling. Meta-modeling involves developing a faster (and sometimes simpler) model of an original mathematical or simulation model which takes the same inputs and produces very similar outputs [7]. In this case, we will use probabilistic sensitivity analysis to test a large parameter space of the decision analysis decision tree and note the resulting decision of each parameter set. Then we will train a machine learning decision tree classification model using the parameter set as the features and the resulting decision as the labels for our machine learning model. The benefit of the machine learning model is that it will uncover complex relationships for us that would be difficult to decipher with multivariate sensitivity analysis alone. The hope is that we can get enough accuracy from a shallow tree to describe the scenarios in which the A’s should stay in Oakland versus move to Las Vegas.

First, we start by designing a probabilistic sensitivity analysis. For this example, we will assume that the dollar values of the chance nodes will remain the same, but the probabilities of the various outcomes will vary. Since we know that probabilities will vary between values of 0 and 1, we will assume that all scenario probabilities are equally likely and model them using a uniform distribution with minimum value 0 and maximum value 1. After sampling three times from the uniform distribution (one for each optimistic, neutral, and pessimistic scenario), we will normalize the results such that the sum of the three probabilities adds to 1.
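The sampling step can be sketched on its own with the standard library. The first function mirrors the approach described above (three uniform draws, then normalize). The second is an alternative I am adding for comparison, not the method used in this post: normalizing exponential draws gives a flat Dirichlet distribution, which spreads samples evenly over all valid probability triples, whereas normalized uniform draws tend to favor more balanced triples. Function names are illustrative.

```python
import random


def normalized_uniform(rng, n=3):
    """Draw n Uniform(0, 1) values and rescale them so they sum to 1."""
    draws = [rng.uniform(0, 1) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def flat_dirichlet(rng, n=3):
    """Alternative: normalized Exponential(1) draws, i.e. a Dirichlet(1, ..., 1) sample."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(32)  # seeded so the sketch is repeatable
sample_uniform = normalized_uniform(rng)
sample_dirichlet = flat_dirichlet(rng)

print([round(p, 3) for p in sample_uniform], round(sum(sample_uniform), 6))
print([round(p, 3) for p in sample_dirichlet], round(sum(sample_dirichlet), 6))
```

Either way, each sample is a valid set of optimistic, neutral, and pessimistic probabilities; the choice only changes which regions of the parameter space the sensitivity analysis visits most often.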

# Number of simulations
n_sim = 5000

# Track scenarios
oakland_stadium_money_probabilities_optimistic_list = []
oakland_stadium_money_probabilities_neutral_list = []
oakland_stadium_money_probabilities_pessimistic_list = []

oakland_ticket_sales_probabilities_optimistic_list = []
oakland_ticket_sales_probabilities_neutral_list = []
oakland_ticket_sales_probabilities_pessimistic_list = []

oakland_rsn_revenue_probabilities_optimistic_list = []
oakland_rsn_revenue_probabilities_neutral_list = []
oakland_rsn_revenue_probabilities_pessimistic_list = []

vegas_stadium_money_probabilities_optimistic_list = []
vegas_stadium_money_probabilities_neutral_list = []
vegas_stadium_money_probabilities_pessimistic_list = []

vegas_ticket_sales_probabilities_optimistic_list = []
vegas_ticket_sales_probabilities_neutral_list = []
vegas_ticket_sales_probabilities_pessimistic_list = []

vegas_rsn_revenue_probabilities_optimistic_list = []
vegas_rsn_revenue_probabilities_neutral_list = []
vegas_rsn_revenue_probabilities_pessimistic_list = []

oakland_EV_list = []
vegas_EV_list = []

# Decision alternatives
decision_list = ['Oakland', 'Las Vegas']

# First Node
chance_node_stadium_money_scenarios = ['Optimistic', 'Neutral', 'Pessimistic']
chance_node_stadium_money_values = [855, 500, 0]

#Second Node
chance_node_ticket_sales_scenarios = ['Optimistic', 'Neutral', 'Pessimistic']
chance_node_ticket_sales_values_per_year = [80, 55, 27]

# Third Node
chance_node_rsn_revenue_scenarios = ['Optimistic', 'Neutral', 'Pessimistic']
chance_node_rsn_revenue_values_per_year = [60, 45, 36]

# Convert annual values to NPV of 10 year time horizon
time_horizon = 10 # years
discount_rate = 0.03 # per year
chance_node_ticket_sales_values = [val * (1 - (1/((1 + discount_rate)**time_horizon)))/discount_rate for val in chance_node_ticket_sales_values_per_year]
chance_node_rsn_revenue_values = [val * (1 - (1/((1 + discount_rate)**time_horizon)))/discount_rate for val in chance_node_rsn_revenue_values_per_year]

# Run the probabilistic sensitivity analysis n_sim times
for n in range(n_sim):

    ## Set up tree
    # First node
    chance_node_stadium_money_probabilities_oakland = np.random.uniform(0,1,3)
    chance_node_stadium_money_probabilities_oakland = chance_node_stadium_money_probabilities_oakland / np.sum(chance_node_stadium_money_probabilities_oakland)

    chance_node_stadium_money_probabilities_vegas = np.random.uniform(0,1,3)
    chance_node_stadium_money_probabilities_vegas = chance_node_stadium_money_probabilities_vegas / np.sum(chance_node_stadium_money_probabilities_vegas)

    # Second Node
    chance_node_ticket_sales_probabilities_oakland = np.random.uniform(0,1,3)
    chance_node_ticket_sales_probabilities_oakland = chance_node_ticket_sales_probabilities_oakland / np.sum(chance_node_ticket_sales_probabilities_oakland)

    chance_node_ticket_sales_probabilities_vegas = np.random.uniform(0,1,3)
    chance_node_ticket_sales_probabilities_vegas = chance_node_ticket_sales_probabilities_vegas / np.sum(chance_node_ticket_sales_probabilities_vegas)

    # Third Node
    chance_node_rsn_revenue_probabilities_oakland = np.random.uniform(0,1,3)
    chance_node_rsn_revenue_probabilities_oakland = chance_node_rsn_revenue_probabilities_oakland / np.sum(chance_node_rsn_revenue_probabilities_oakland)

    chance_node_rsn_revenue_probabilities_vegas = np.random.uniform(0,1,3)
    chance_node_rsn_revenue_probabilities_vegas = chance_node_rsn_revenue_probabilities_vegas / np.sum(chance_node_rsn_revenue_probabilities_vegas)

    # Evaluate Tree
    # Create data frame of all possible scenarios
    decision_list_list_for_df = []
    chance_node_stadium_money_list_for_df = []
    chance_node_stadium_money_probability_list_for_df = []
    chance_node_stadium_money_value_list_for_df = []
    chance_node_ticket_sales_list_for_df = []
    chance_node_ticket_sales_probability_list_for_df = []
    chance_node_ticket_sales_value_list_for_df = []
    chance_node_rsn_revenue_list_for_df = []
    chance_node_rsn_revenue_probability_list_for_df = []
    chance_node_rsn_revenue_value_list_for_df = []

    for i in decision_list:
        for j in range(len(chance_node_stadium_money_scenarios)):
            for k in range(len(chance_node_ticket_sales_scenarios)):
                for m in range(len(chance_node_rsn_revenue_scenarios)):
                    decision_list_list_for_df.append(i)
                    chance_node_stadium_money_list_for_df.append(chance_node_stadium_money_scenarios[j])
                    chance_node_stadium_money_value_list_for_df.append(chance_node_stadium_money_values[j])
                    chance_node_ticket_sales_list_for_df.append(chance_node_ticket_sales_scenarios[k])
                    chance_node_ticket_sales_value_list_for_df.append(chance_node_ticket_sales_values[k])
                    chance_node_rsn_revenue_list_for_df.append(chance_node_rsn_revenue_scenarios[m])
                    chance_node_rsn_revenue_value_list_for_df.append(chance_node_rsn_revenue_values[m])

                    if i == 'Oakland':
                        chance_node_stadium_money_probability_list_for_df.append(chance_node_stadium_money_probabilities_oakland[j])
                        chance_node_ticket_sales_probability_list_for_df.append(chance_node_ticket_sales_probabilities_oakland[k])
                        chance_node_rsn_revenue_probability_list_for_df.append(chance_node_rsn_revenue_probabilities_oakland[m])
                    elif i == 'Las Vegas':
                        chance_node_stadium_money_probability_list_for_df.append(chance_node_stadium_money_probabilities_vegas[j])
                        chance_node_ticket_sales_probability_list_for_df.append(chance_node_ticket_sales_probabilities_vegas[k])
                        chance_node_rsn_revenue_probability_list_for_df.append(chance_node_rsn_revenue_probabilities_vegas[m])

    decision_tree_df = pd.DataFrame(list(zip(decision_list_list_for_df, chance_node_stadium_money_list_for_df,
                                             chance_node_stadium_money_probability_list_for_df,
                                             chance_node_stadium_money_value_list_for_df,
                                             chance_node_ticket_sales_list_for_df,
                                             chance_node_ticket_sales_probability_list_for_df,
                                             chance_node_ticket_sales_value_list_for_df,
                                             chance_node_rsn_revenue_list_for_df,
                                             chance_node_rsn_revenue_probability_list_for_df,
                                             chance_node_rsn_revenue_value_list_for_df)),
                                    columns = ['Decision',
                                               'Stadium_Money_Result', 'Stadium_Money_Prob', 'Stadium_Money_Value',
                                               'Ticket_Sales_Result', 'Ticket_Sales_Prob', 'Ticket_Sales_Value',
                                               'RSN_Revenue_Result', 'RSN_Revenue_Prob', 'RSN_Revenue_Value'])
    decision_tree_df['RSN_EV'] = decision_tree_df['RSN_Revenue_Prob'] * decision_tree_df['RSN_Revenue_Value']

    # Consolidate the RSN_EV values
    RSN_rollback_df = decision_tree_df.groupby(['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob', 'Ticket_Sales_Result', 'Ticket_Sales_Prob'])['RSN_EV'].sum().reset_index()

    # Keep the rest of the columns
    decision_tree_df = decision_tree_df.groupby(['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob', 'Ticket_Sales_Result', 'Ticket_Sales_Prob'])[['Stadium_Money_Value', 'Ticket_Sales_Value']].mean().reset_index()

    # merge two dataframes
    decision_tree_df = pd.merge(decision_tree_df, RSN_rollback_df, on = ['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob', 'Ticket_Sales_Result', 'Ticket_Sales_Prob'])

    # Each branch probability multiplies the branch value plus the downstream expected value
    decision_tree_df['Ticket_Sales_RSN_EV'] = decision_tree_df['Ticket_Sales_Prob'] * (decision_tree_df['Ticket_Sales_Value'] + decision_tree_df['RSN_EV'])

    # Consolidate the Ticket Sales and RSN_EV values
    ticket_sales_rollback_df = decision_tree_df.groupby(['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob'])['Ticket_Sales_RSN_EV'].sum().reset_index()

    # Keep the rest of the columns
    decision_tree_df = decision_tree_df.groupby(['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob'])['Stadium_Money_Value'].mean().reset_index()

    # merge two dataframes
    decision_tree_df = pd.merge(decision_tree_df, ticket_sales_rollback_df, on = ['Decision', 'Stadium_Money_Result', 'Stadium_Money_Prob'])

    decision_tree_df['Stadium_Money_Ticket_Sales_RSN_EV'] = decision_tree_df['Stadium_Money_Prob'] * (decision_tree_df['Stadium_Money_Value'] + decision_tree_df['Ticket_Sales_RSN_EV'])

    # Consolidate the Stadium Money, Ticket Sales, and RSN_EV values
    decision_tree_df = decision_tree_df.groupby(['Decision'])['Stadium_Money_Ticket_Sales_RSN_EV'].sum().reset_index()

    # Fill out lists for meta-model inputs
    oakland_stadium_money_probabilities_optimistic_list.append(chance_node_stadium_money_probabilities_oakland[0])
    oakland_stadium_money_probabilities_neutral_list.append(chance_node_stadium_money_probabilities_oakland[1])
    oakland_stadium_money_probabilities_pessimistic_list.append(chance_node_stadium_money_probabilities_oakland[2])

    oakland_ticket_sales_probabilities_optimistic_list.append(chance_node_ticket_sales_probabilities_oakland[0])
    oakland_ticket_sales_probabilities_neutral_list.append(chance_node_ticket_sales_probabilities_oakland[1])
    oakland_ticket_sales_probabilities_pessimistic_list.append(chance_node_ticket_sales_probabilities_oakland[2])

    oakland_rsn_revenue_probabilities_optimistic_list.append(chance_node_rsn_revenue_probabilities_oakland[0])
    oakland_rsn_revenue_probabilities_neutral_list.append(chance_node_rsn_revenue_probabilities_oakland[1])
    oakland_rsn_revenue_probabilities_pessimistic_list.append(chance_node_rsn_revenue_probabilities_oakland[2])

    vegas_stadium_money_probabilities_optimistic_list.append(chance_node_stadium_money_probabilities_vegas[0])
    vegas_stadium_money_probabilities_neutral_list.append(chance_node_stadium_money_probabilities_vegas[1])
    vegas_stadium_money_probabilities_pessimistic_list.append(chance_node_stadium_money_probabilities_vegas[2])

    vegas_ticket_sales_probabilities_optimistic_list.append(chance_node_ticket_sales_probabilities_vegas[0])
    vegas_ticket_sales_probabilities_neutral_list.append(chance_node_ticket_sales_probabilities_vegas[1])
    vegas_ticket_sales_probabilities_pessimistic_list.append(chance_node_ticket_sales_probabilities_vegas[2])

    vegas_rsn_revenue_probabilities_optimistic_list.append(chance_node_rsn_revenue_probabilities_vegas[0])
    vegas_rsn_revenue_probabilities_neutral_list.append(chance_node_rsn_revenue_probabilities_vegas[1])
    vegas_rsn_revenue_probabilities_pessimistic_list.append(chance_node_rsn_revenue_probabilities_vegas[2])

    # Look up each city's EV by label: groupby sorts 'Las Vegas' ahead of 'Oakland',
    # so positional indexing would swap the two cities
    ev_by_decision = decision_tree_df.set_index('Decision')['Stadium_Money_Ticket_Sales_RSN_EV']
    oakland_EV_list.append(ev_by_decision['Oakland'])
    vegas_EV_list.append(ev_by_decision['Las Vegas'])

    print(n)

Now we can put the results into a new data frame which we can use to train our machine learning model:

decision_tree_psa_data_df = pd.DataFrame(list(zip(oakland_stadium_money_probabilities_optimistic_list, 
oakland_stadium_money_probabilities_neutral_list,
oakland_stadium_money_probabilities_pessimistic_list,
oakland_ticket_sales_probabilities_optimistic_list,
oakland_ticket_sales_probabilities_neutral_list,
oakland_ticket_sales_probabilities_pessimistic_list,
oakland_rsn_revenue_probabilities_optimistic_list,
oakland_rsn_revenue_probabilities_neutral_list,
oakland_rsn_revenue_probabilities_pessimistic_list,
vegas_stadium_money_probabilities_optimistic_list,
vegas_stadium_money_probabilities_neutral_list,
vegas_stadium_money_probabilities_pessimistic_list,
vegas_ticket_sales_probabilities_optimistic_list,
vegas_ticket_sales_probabilities_neutral_list,
vegas_ticket_sales_probabilities_pessimistic_list,
vegas_rsn_revenue_probabilities_optimistic_list,
vegas_rsn_revenue_probabilities_neutral_list,
vegas_rsn_revenue_probabilities_pessimistic_list,
oakland_EV_list, vegas_EV_list)),
columns = ['oakland_stad_mon_prob_optimistic',
'oakland_stad_mon_prob_neutral',
'oakland_stad_mon_prob_pessimistic',
'oakland_ticket_sales_prob_optimistic',
'oakland_ticket_sales_prob_neutral',
'oakland_ticket_sales_prob_pessimistic',
'oakland_rsn_rev_prob_optimistic',
'oakland_rsn_rev_prob_neutral',
'oakland_rsn_rev_prob_pessimistic',
'vegas_stad_mon_prob_optimistic',
'vegas_stad_mon_prob_neutral',
'vegas_stad_mon_prob_pessimistic',
'vegas_ticket_sales_prob_optimistic',
'vegas_ticket_sales_prob_neutral',
'vegas_ticket_sales_prob_pessimistic',
'vegas_rsn_rev_prob_optimistic',
'vegas_rsn_rev_prob_neutral',
'vegas_rsn_rev_prob_pessimistic',
'oakland_EV', 'vegas_EV'])

# Add decision based on EV
decision_tree_psa_data_df['decision'] = 'Oakland'
decision_tree_psa_data_df.loc[decision_tree_psa_data_df['vegas_EV'] > decision_tree_psa_data_df['oakland_EV'],'decision'] = 'Las Vegas'

We will now train a basic machine learning decision tree using the scikit-learn package. Since the input data are probabilities between 0 and 1 and we are using a tree-based model, we won’t have to do any feature scaling or engineering. For visualization purposes for the blog, I restricted the tree to a depth of 3. A deeper tree will usually fit the simulated data more closely, at the cost of rules that are harder to read and a greater risk of overfitting.

from sklearn.model_selection import train_test_split
from sklearn import tree

#Features
X = decision_tree_psa_data_df.drop(['oakland_EV', 'vegas_EV', 'decision'], axis = 1)
#labels
y = decision_tree_psa_data_df['decision']

# split into train (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=32)

# Create decision tree model with maximum depth of 3 to keep recommendation manageable
dec_tree_model = tree.DecisionTreeClassifier(random_state=32, max_depth = 3, class_weight = 'balanced')
dec_tree_model = dec_tree_model.fit(X_train, y_train)

Our model ended up with a decent but not perfect AUC of almost 0.8. (AUC, the area under the ROC curve, summarizes how well the model ranks the two classes across all classification thresholds, based on the true and false positive rates. For more on model accuracy measures, check out my previous blog on assessing the accuracy of ESPN fantasy football predicted scores here.) This is respectable enough for us to continue with the exercise. Of course, there are a number of ways to make the decision tree classifier more accurate, including increasing the maximum depth, hyperparameter tuning, or running more simulations to increase the quantity of data.

from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, dec_tree_model.predict_proba(X_test)[:, 1])
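
Because the labels are strings, it helps to be explicit about which class the probability column refers to. scikit-learn sorts class labels, so column 1 of `predict_proba` corresponds to `classes_[1]`, which is 'Oakland' here. A minimal sketch of looking the column up by name, using synthetic stand-in data (the data and the toy labelling rule are assumptions, not this article's simulation output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the article's simulation output)
rng = np.random.default_rng(32)
X_demo = rng.uniform(0, 1, size=(1000, 3))
y_demo = np.where(X_demo[:, 0] + rng.normal(0, 0.2, 1000) > 0.5, "Las Vegas", "Oakland")

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=32)
model = DecisionTreeClassifier(max_depth=3, random_state=32).fit(X_tr, y_tr)

# classes_ is sorted, so the predict_proba columns are ordered ['Las Vegas', 'Oakland']
oakland_col = list(model.classes_).index("Oakland")
vegas_col = list(model.classes_).index("Las Vegas")
proba = model.predict_proba(X_te)

# AUC with 'Oakland' as the positive class (what a [:, 1] slice computes)
auc_oakland = roc_auc_score(y_te, proba[:, oakland_col])
# AUC with 'Las Vegas' as the positive class, made explicit with a boolean target
auc_vegas = roc_auc_score(y_te == "Las Vegas", proba[:, vegas_col])

print(f"AUC (Oakland positive):   {auc_oakland:.3f}")
print(f"AUC (Las Vegas positive): {auc_vegas:.3f}")
```

For a two-class problem the two values agree, so the choice of positive class does not change the AUC itself; it matters once you report threshold-dependent metrics such as precision or recall.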

Now that we are satisfied with performance, we can visually examine the trained decision tree. Each split in the tree represents another dimension of a set of business rules. In each box (or node) of the printed tree, the first line shows the rule the model used to split the data (terminal nodes, or leaves, have no rule). The next line is the Gini index, which describes the mix of classes in the node: 0 means only one class is present and 0.5 means the two classes are equally represented. The following lines show the number of samples in the node and how they break down by class (weighted here, because we set class_weight = 'balanced'), and the last line shows the label that the model assigns to the samples in that node. We can print out the resulting tree below:

# Plot decision tree results to see how decisions were made
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (14,14))
# feature_names is passed as a plain list, which every scikit-learn version accepts
tree.plot_tree(dec_tree_model, filled = True, feature_names = list(X.columns), fontsize = 8, class_names = ['Las Vegas', 'Oakland'])
plt.show()
Image created by the author.

From our machine learning decision tree, we can see that the classification of whether the A’s should stay put in Oakland or move to Las Vegas rested first on the probability of optimistic RSN revenue and then on the probabilities associated with Oakland ticket sales.

Las Vegas is likely the preferred destination when:

  • The probability of optimistic RSN revenue in Las Vegas is greater than 0.4 (except when the probability of optimistic RSN revenue in Oakland is greater than 0.341 AND the probability of optimistic ticket sales in Oakland is greater than 0.355)
  • OR the probability of optimistic RSN revenue in Oakland is less than or equal to 0.468 AND the probability of optimistic ticket sales in Las Vegas is greater than 0.438.
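
Reading thresholds off a plotted image is error-prone, so the same rules can also be printed as indented text with scikit-learn's `export_text`. This is a minimal sketch on synthetic stand-in data with hypothetical column names and a toy labelling rule; with the objects from this article, one would pass `dec_tree_model` and `list(X.columns)` instead.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the simulated probabilities (assumption for illustration)
rng = np.random.default_rng(32)
X_demo = pd.DataFrame(
    rng.uniform(0, 1, size=(1500, 3)),
    columns=["vegas_rsn_opt", "oakland_rsn_opt", "oakland_tix_opt"],
)
# Toy labelling rule (hypothetical): Vegas is preferred when its RSN upside dominates
y_demo = np.where(
    X_demo["vegas_rsn_opt"] > 0.5 * (X_demo["oakland_rsn_opt"] + X_demo["oakland_tix_opt"]),
    "Las Vegas",
    "Oakland",
)

model = DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=32)
model.fit(X_demo, y_demo)

# One line per node: "feature <= threshold" on branches, "class: ..." on leaves
rules_text = export_text(model, feature_names=list(X_demo.columns), decimals=3)
print(rules_text)
```

Each root-to-leaf path in the printed text is one business rule, which makes it easier to copy exact thresholds into a summary for decision makers.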

Interestingly, despite all of the media chatter about public or private funding for a new stadium, our model points to RSN revenue and ticket sales. The difference could be due to our 10-year time horizon or could be due to the organization looking for an MLB-approved excuse to leave Oakland. Either way, this approach highlights an important insight that a data science team can take to decision makers in order to inform business strategy. Methods like this can take your model from an interesting theoretical exercise to changing minds in the C-suite.

How do we validate the machine learning model?

Given that we are trying to inform a hugely important decision, it is important to make sure our model is robust to differences in the input data and to imbalanced class labels. To account for the latter, you will notice that we included class_weight = 'balanced' when creating our machine learning model. To account for the former, and for model validation, we can use cross-validation to see what the performance would be on other train/test splits:

# 10-fold cross-validation scores
cross_val_score(dec_tree_model, X, y, cv=10)

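The output is the following: array([0.724, 0.722, 0.718, 0.72 , 0.722, 0.708, 0.732, 0.726, 0.76, 0.702]). These are accuracy scores (the default metric for a classifier in cross_val_score), and they stay between roughly 0.70 and 0.76 across the 10 folds, which tells us that the model performs similarly regardless of how the data are split into training and test sets.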
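
Since these scores are accuracies rather than AUC, they are not on the same scale as the AUC computed earlier. A minimal sketch of cross-validating on AUC as well, with shuffled, stratified folds so that each fold keeps a similar class mix; the synthetic data below is an assumption standing in for X and y:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the article's simulation output)
rng = np.random.default_rng(32)
X_demo = rng.uniform(0, 1, size=(2000, 4))
noisy_score = X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.3, 2000)
y_demo = np.where(noisy_score > 0.2, "Las Vegas", "Oakland")  # mildly imbalanced labels

model = DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=32)

# Stratified folds keep the class mix of each fold close to the full data set
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=32)

acc_scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="accuracy")
auc_scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="roc_auc")

print(f"accuracy: mean={acc_scores.mean():.3f}, std={acc_scores.std():.3f}")
print(f"AUC:      mean={auc_scores.mean():.3f}, std={auc_scores.std():.3f}")
```

Reporting the mean and standard deviation across folds gives a quick sense of how much the score depends on the particular split.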

What did we learn?

With that, we have gone from a business question about the relocation of the A’s baseball team, to rolling back a decision analysis tree to unveil why the A’s might be heading to Las Vegas, to leveraging a machine learning decision tree to generalize our results into digestible business rules that management can use to decide whether or not to relocate. Hopefully, you’ll be able to use a similar approach to inform decision makers in your own organization or in your daily life.

References

[1] Sutelan, E., Athletics Las Vegas relocation timeline: Stadium stumbles, funding failures on road to A’s Oakland departure (2023), The Sporting News

[2] Kenton, W., Decision Analysis (DA): Definition, Uses, and Examples (2022), Investopedia

[3] Howard, R. and Abbas, A., Foundations of Decision Analysis (2014)

[4] Morss, E., Major League Baseball Finances: What the Numbers Tell Us (2019), Morss Global Finance

[5] Greer, J., Why did the Raiders move to Las Vegas? Explaining franchise’s 2020 shift from Oakland to Sin City (2020), The Sporting News

[6] Andre, D., Report: Raiders first in 2021 NFL ticket revenue (2022), Fox 5 Las Vegas

[7] Malloy, G. and Brandeau, M., When Is Mass Prophylaxis Cost-Effective for Epidemic Control? A Comparison of Decision Approaches (2022), Medical Decision Making


