I recently came across The Perron-Frobenius Theorem and the Ranking of Football Teams, an interesting paper in which the author describes four different methods to rank teams in uneven paired competition. He goes on to show how each of these methods depends in some way on the Perron-Frobenius theorem. The Perron-Frobenius theorem provides key insights into the structure of non-negative matrices, especially in terms of their largest eigenvalue and associated eigenvector. For irreducible non-negative matrices, the theorem guarantees the existence of a dominant eigenvalue that is real, positive, simple, and at least as large in magnitude as every other eigenvalue (strictly larger when the matrix is primitive), with a corresponding eigenvector whose entries are all strictly positive.
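As a quick numerical illustration of the theorem, the sketch below uses a small hypothetical 3-team matrix (it is not taken from the paper). It checks irreducibility with the standard test that $(I + A)^{n-1}$ has no zero entries, then confirms that the eigenvalue with the largest real part is real and that its eigenvector can be scaled to have strictly positive entries.

```python
import numpy as np

# Hypothetical results: team 0 beat teams 1 and 2, team 1 beat team 2,
# and team 2 beat team 0.
A = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)
n = A.shape[0]

# Irreducibility test: every entry of (I + A)^(n-1) must be positive.
reach = np.linalg.matrix_power(np.eye(n) + A, n - 1)
is_irreducible = bool(np.all(reach > 0))

# Dominant eigenvalue and its eigenvector.
e_vals, e_vecs = np.linalg.eig(A)
k = int(np.argmax(e_vals.real))
perron_value = e_vals[k]
perron_vector = e_vecs[:, k].real

# Eigenvectors are only defined up to scale and sign; dividing by the sum
# fixes the sign and makes the entries sum to 1.
perron_vector = perron_vector / perron_vector.sum()

print(is_irreducible, perron_value, perron_vector)
```

For this matrix the characteristic polynomial is $\lambda^3 - \lambda - 1$, so the dominant eigenvalue is its single real root.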
An uneven paired competition is one in which the outcome of competition between pairs of teams is known, but the pairings are not evenly matched, meaning the competition is not a round robin in which each team is paired with every other team an equal number of times. A good example is regular season football in-conference play for any of the major NCAA Division I conferences: For the 2023 season, the Big 12 had 14 teams, but each team had only 9 conference games.
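To make the definition concrete, here is a minimal sketch of a check for whether a set of results forms a round robin. The helper name is_round_robin and the two 3-team matrices are hypothetical; the check assumes entry (i, j) of the input counts wins of team i over team j.

```python
import numpy as np

def is_round_robin(wins: np.ndarray) -> bool:
    """Return True if every pair of distinct teams met the same number of times."""
    games = wins + wins.T                      # games[i, j] = meetings between i and j
    off_diagonal = games[~np.eye(len(games), dtype=bool)]
    return bool(np.all(off_diagonal == off_diagonal[0]))

# Every team played every other team once.
full = np.array([[0, 1, 1],
                 [0, 0, 1],
                 [0, 0, 0]])

# Teams 0 and 2 never met, so the competition is uneven.
uneven = np.array([[0, 1, 0],
                   [0, 0, 1],
                   [0, 0, 0]])

print(is_round_robin(full), is_round_robin(uneven))
```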
Here we focus on the first ranking method, which the author refers to as the “direct method”. The direct method formulates the ranking as a linear eigenvalue problem and makes direct use of the Perron-Frobenius theorem. The goal is to assign each team a score based on its interactions with other teams, such that the score reflects both the outcomes of those interactions and the strength of the opponents. We will then compare this data-driven ranking with the final regular season standings and assess how well they line up. A similar exercise is performed for the 2021 MLB regular season.
Creating the Adjacency Matrix
It is first necessary to construct the adjacency matrix in order to encode interactions between teams. Big 12 2023 regular season football results were obtained here. Within the matrix, the value in cell $(i, j)$ is 1 if team $i$ defeated team $j$, and 0 otherwise.
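A minimal sketch of this encoding, assuming results are available as (winner, loser) pairs. The three games listed are hypothetical placeholders, not the actual Big 12 results.

```python
import numpy as np

# Hypothetical results as (winner, loser) pairs.
results = [("Texas", "Baylor"), ("Oklahoma", "Texas"), ("Baylor", "UCF")]

# Enumerate teams alphabetically so rows and columns share one ordering.
teams = sorted({team for game in results for team in game})
index = {team: k for k, team in enumerate(teams)}

A = np.zeros((len(teams), len(teams)), dtype=int)
for winner, loser in results:
    A[index[winner], index[loser]] = 1   # row = winner, column = loser

print(teams)
print(A)
```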
The regular season rankings and adjacency matrix can be downloaded from GitHub (links available in the next cell):
    %load_ext watermark

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import networkx as nx
    from numpy.linalg import eig

    np.set_printoptions(suppress=True, precision=5)
    pd.options.mode.chained_assignment = None
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)

    big12_matrix_url = "https://gist.githubusercontent.com/jtrive84/b9b4ff8620f90045a0377f27ec4eb50f/raw/e6923583530edadbe9da1a1f4821e415d8a7e6f2/2023-big-12-adjacency.csv"
    big12_rankings_url = "https://gist.githubusercontent.com/jtrive84/0207b8fd18a05e096a89498290b08d4a/raw/462d2b1bef52d96ae20e077f55501bfa23951ae4/2023-big-12-rankings.csv"

    # ------------------------------------------------------------------------------
    %watermark --python --conda --hostname --machine --iversions

    Python implementation: CPython
    Python version       : 3.11.10
    IPython version      : 8.28.0

    conda environment: py311

    Compiler    : MSC v.1941 64 bit (AMD64)
    OS          : Windows
    Release     : 10
    Machine     : AMD64
    Processor   : Intel64 Family 6 Model 170 Stepping 4, GenuineIntel
    CPU cores   : 22
    Architecture: 64bit

    Hostname: JTRIZPC11

    numpy     : 2.1.0
    matplotlib: 3.9.2
    pandas    : 2.2.2
    networkx  : 3.3
Big 12 regular season rankings for 2023:
    ranks_big12 = pd.read_csv(big12_rankings_url)
    ranks_big12.head(15)

| | team | conf_wins | conf_losses | overall_wins | overall_losses | conf_win_pct | overall_win_pct |
---|---|---|---|---|---|---|---|
0 | Texas | 8 | 1 | 12 | 2 | 0.889 | 0.857 |
1 | Oklahoma State | 7 | 2 | 10 | 4 | 0.778 | 0.714 |
2 | Oklahoma | 7 | 2 | 10 | 3 | 0.778 | 0.769 |
3 | Iowa State | 6 | 3 | 7 | 6 | 0.667 | 0.538 |
4 | Kansas State | 6 | 3 | 9 | 4 | 0.667 | 0.692 |
5 | West Virginia | 6 | 3 | 9 | 4 | 0.667 | 0.692 |
6 | Texas Tech | 5 | 4 | 7 | 6 | 0.556 | 0.538 |
7 | Kansas | 5 | 4 | 9 | 4 | 0.556 | 0.692 |
8 | UCF | 3 | 6 | 6 | 7 | 0.333 | 0.462 |
9 | TCU | 3 | 6 | 5 | 7 | 0.333 | 0.417 |
10 | Houston | 2 | 7 | 4 | 8 | 0.222 | 0.333 |
11 | BYU | 2 | 7 | 5 | 7 | 0.222 | 0.417 |
12 | Baylor | 2 | 7 | 3 | 9 | 0.222 | 0.250 |
13 | Cincinnati | 1 | 8 | 3 | 9 | 0.111 | 0.250 |
The adjacency matrix considers only conference play (non-conference games excluded):
    adj_big12 = pd.read_csv(big12_matrix_url)
    adj_big12.head(15)

| | Unnamed: 0 | Baylor | BYU | Cincinnati | Houston | Iowa State | Kansas | Kansas State | Oklahoma | Oklahoma State | TCU | Texas | Texas Tech | UCF | West Virginia |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Baylor | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | BYU | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | Cincinnati | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | Houston | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | Iowa State | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
5 | Kansas | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
6 | Kansas State | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
7 | Oklahoma | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
8 | Oklahoma State | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
9 | TCU | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | Texas | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
11 | Texas Tech | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
12 | UCF | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
13 | West Virginia | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
For each row $i$ in adj_big12, a 1 in column $j$ indicates that the team at row $i$ defeated the team at column $j$.
The sum of each row in the adjacency matrix represents the number of regular season wins in conference play for a given team. Texas was 8-1 in 2023 regular season conference play, therefore the sum of the Texas row is 8. The columnar sum represents the number of losses for a given team (for Texas, this is 1).
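The row and column sums can be checked directly. In the sketch below the matrix values are copied from the adjacency table above, with teams in the same alphabetical order; nothing else is assumed.

```python
import numpy as np

teams = [
    "Baylor", "BYU", "Cincinnati", "Houston", "Iowa State", "Kansas",
    "Kansas State", "Oklahoma", "Oklahoma State", "TCU", "Texas",
    "Texas Tech", "UCF", "West Virginia",
]

# Entry (i, j) is 1 if team i defeated team j in conference play.
A = np.array([
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1],
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
])

wins = A.sum(axis=1)     # row sums: conference wins
losses = A.sum(axis=0)   # column sums: conference losses

for team, w, l in zip(teams, wins, losses):
    print(f"{team}: {w}-{l}")
```

Each team's wins and losses add up to 9, the number of conference games played.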
We can use NetworkX to visualize the relationships encoded in the adjacency matrix (each node label corresponds to the alphabetical enumeration of teams: 0=Baylor, 1=BYU, … 13=West Virginia). Edges indicate whether team $i$ and team $j$ played each other.
    import networkx as nx
    import matplotlib.pyplot as plt

    # Create adjacency matrix as Numpy array.
    # team_names = dfadj["Unnamed: 0"].values
    A = adj_big12.drop("Unnamed: 0", axis=1).values
    G = nx.from_numpy_array(A)

    fig, ax = plt.subplots(1, 1, figsize=(7.5, 5), tight_layout=True)
    ax.set_title(
        "2023 Big-12 Regular Season Football Matchups",
        color="#000000", loc="center", weight="normal", fontsize=9
    )
    nx.draw_networkx(
        G, node_color="#E02C70", node_size=350, ax=ax, with_labels=True,
        edge_color="grey", width=.25, pos=nx.spring_layout(G, seed=516)
    )
The adjacency matrix, $A$:

    A

    array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
           [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
           [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
           [1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0],
           [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
           [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0],
           [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1],
           [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1],
           [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
           [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
           [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
           [1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]])
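If we create an initial ranking vector $r_0$ of all ones, then $A r_0$ gives each team's number of conference wins, and dividing by the 9 conference games played gives its conference winning percentage: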
    r0 = np.ones(14)
    win_pcts = A @ r0 / 9
    pairs = zip(adj_big12.columns[1:], win_pcts.tolist())

    for tt in pairs:
        print(tt)

    ('Baylor', 0.2222222222222222)
    ('BYU', 0.2222222222222222)
    ('Cincinnati', 0.1111111111111111)
    ('Houston', 0.2222222222222222)
    ('Iowa State', 0.6666666666666666)
    ('Kansas', 0.5555555555555556)
    ('Kansas State', 0.6666666666666666)
    ('Oklahoma', 0.7777777777777778)
    ('Oklahoma State', 0.7777777777777778)
    ('TCU', 0.3333333333333333)
    ('Texas', 0.8888888888888888)
    ('Texas Tech', 0.5555555555555556)
    ('UCF', 0.3333333333333333)
    ('West Virginia', 0.6666666666666666)
This aligns with values in the conf_win_pct column from ranks_big12.
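Multiplying the all-ones vector by $A$ once gives win counts. Repeating the multiplication and rescaling after every step is power iteration, which for a matrix like this one settles on the eigenvector of the largest eigenvalue. The sketch below is an illustration rather than the method used in this post; the matrix values are copied from the adjacency matrix shown above, and the iteration count of 5000 is an arbitrary assumption.

```python
import numpy as np

# Big 12 adjacency matrix (teams in alphabetical order, index 10 = Texas).
A = np.array([
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1],
    [0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
], dtype=float)

r = np.ones(A.shape[0])
for _ in range(5000):
    r = A @ r
    r = r / r.sum()            # rescale so the entries sum to 1

lam = (A @ r).sum()            # eigenvalue estimate, valid because r sums to 1
residual = np.linalg.norm(A @ r - lam * r)

print(lam, residual)
print(np.argsort(r)[::-1])     # team indices from highest to lowest score
```

The first multiplication reproduces the win counts; later multiplications credit each team with the scores of the teams it defeated, which is where strength of opponents enters.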
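The direct method assigns each team a score proportional to the sum of the scores of the teams it defeated. Collecting the scores in a vector $r$, this is the eigenvalue problem $A r = \lambda r$.
If $A$ is non-negative and irreducible, the Perron-Frobenius theorem guarantees that the eigenvector associated with the largest eigenvalue can be scaled to have strictly positive entries, so it can serve as the ranking vector.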
The steps for deriving our rankings are outlined below:
1. Construct the adjacency matrix $A$, in which entry $a_{ij}$ represents the number of times team $i$ defeated team $j$.
2. Perform the eigendecomposition of $A$, factoring the matrix into its eigenvalues and eigenvectors.
3. Identify the index of the largest eigenvalue.
4. Extract the eigenvector at the index identified in step 3. If using NumPy and the maximum eigenvalue is found at index $k$, the corresponding eigenvector will be located in column $k$.
5. Take the absolute value of the eigenvector $v$. The value at index $i$ in $v$ represents the score for the team at the same index (for the Big 12 example, index 0 = Baylor, index 1 = BYU, …).
6. Sort the eigenvector scores in decreasing order; higher performing teams will have a larger value, poorer performing teams will have a smaller value.
Keep in mind that a binary encoding in a football setting, where each pair of teams may meet only once per season, overlooks information that could enhance the encoding. As it stands, a victory by 80 points for team A over team B is treated the same as a victory in triple overtime. In sports where teams face each other multiple times in a season, $a_{ij}$ can instead hold the number of times team $i$ defeated team $j$.
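One way to keep some margin-of-victory information is to replace the 0/1 entry with a share of the points scored. The sketch below uses a smoothed share, $(s_i + 1) / (s_i + s_j + 2)$, which is an assumption chosen for illustration and only one of several possible weightings; the helper score_share and the three games are hypothetical.

```python
import numpy as np

def score_share(points_for: float, points_against: float) -> float:
    """Smoothed share of the points scored by one team in a single game."""
    return (points_for + 1) / (points_for + points_against + 2)

# Hypothetical games as (team_i, team_j, score_i, score_j).
games = [
    (0, 1, 80, 0),    # blowout
    (0, 2, 45, 42),   # narrow win
    (2, 1, 21, 20),   # narrow win
]

n_teams = 3
A = np.zeros((n_teams, n_teams))
for i, j, s_i, s_j in games:
    A[i, j] += score_share(s_i, s_j)   # credit to team i
    A[j, i] += score_share(s_j, s_i)   # credit to team j

print(A.round(3))
```

With this encoding the blowout contributes far more to the winner's row than the narrow wins do, and the loser of a close game still receives partial credit.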
The next cell demonstrates how to implement the ranking procedure using Numpy.
    from numpy.linalg import eig

    # Adjacency matrix as Numpy array.
    A = adj_big12.drop("Unnamed: 0", axis=1).values.astype(float)

    # Perform eigendecomposition of A.
    e_vals, e_vecs = eig(A)

    # Identify index of largest eigenvalue. eig can return complex values
    # for a non-symmetric matrix, so compare real parts.
    e_val1_indx = np.argmax(e_vals.real)

    # Take the absolute value of the eigenvector at index e_val1_indx.
    e_vec1 = np.abs(e_vecs[:, e_val1_indx])

    # Get team indices sorted by decreasing score.
    indices = np.argsort(e_vec1)[::-1]

    # Associate ranks with teams.
    teams = adj_big12.columns[1:]
    ranked_teams = teams[indices]

    for team in ranked_teams:
        print(team)

    Texas
    Oklahoma State
    Oklahoma
    Kansas
    Iowa State
    Kansas State
    Texas Tech
    West Virginia
    UCF
    Houston
    BYU
    TCU
    Baylor
    Cincinnati
We can compare actual vs. predicted rankings to see how well the direct method performed:
    for jj, team in enumerate(ranked_teams):
        actual_rank = ranks_big12[ranks_big12.team==team].index.item()
        print(f"{team}: actual/predicted : {actual_rank}/{jj}")

    Texas: actual/predicted : 0/0
    Oklahoma State: actual/predicted : 1/1
    Oklahoma: actual/predicted : 2/2
    Kansas: actual/predicted : 7/3
    Iowa State: actual/predicted : 3/4
    Kansas State: actual/predicted : 4/5
    Texas Tech: actual/predicted : 6/6
    West Virginia: actual/predicted : 5/7
    UCF: actual/predicted : 8/8
    Houston: actual/predicted : 10/9
    BYU: actual/predicted : 11/10
    TCU: actual/predicted : 9/11
    Baylor: actual/predicted : 12/12
    Cincinnati: actual/predicted : 13/13
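There are a few discrepancies, but the ranks (zero-based, so 0 is the top team) are largely consistent. The most notable discrepancy is Kansas, with an actual rank of 7 vs. a predicted rank of 3. It is difficult to say for certain why Kansas is ranked so high, but quality of wins is a plausible explanation: according to the adjacency matrix, Kansas defeated both Oklahoma and Iowa State, and the direct method gives more credit for wins over highly rated opponents.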
For the bottom five teams, the direct method does a good job. Three of the five teams have the same in-conference winning percentage, therefore slight out-of-orderings aren’t of concern.
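One way to put a number on "largely consistent" is Spearman's rank correlation. The sketch below applies the textbook formula to the actual/predicted pairs printed above. Note that the "actual" order among teams with identical conference records comes from the row order of the rankings file, so small swaps within those groups are not meaningful.

```python
import numpy as np

# (actual, predicted) zero-based ranks copied from the comparison output.
pairs = [
    (0, 0), (1, 1), (2, 2), (7, 3), (3, 4), (4, 5), (6, 6),
    (5, 7), (8, 8), (10, 9), (11, 10), (9, 11), (12, 12), (13, 13),
]

actual, predicted = np.array(pairs).T
n = len(pairs)

# Spearman's rho for two rankings without ties.
d = actual - predicted
rho = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print(round(float(rho), 3))   # 0.938
```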
Next let’s look at a more substantial example: All games for the 2021 MLB regular season.
MLB Example
A Major League Baseball dataset with game results from 2016-2021 is available on Kaggle. The games.csv dataset has information about each contest that can be used to build an adjacency matrix. We load the dataset and inspect the first few records:
    import numpy as np
    import pandas as pd

    np.set_printoptions(suppress=True, precision=5)
    pd.options.mode.chained_assignment = None
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', None)

    df = pd.read_csv("C:/Users/jtriv/datasets/MLB/games.csv")

    print(f"df.shape: {df.shape}")

    df.head(3)

    df.shape: (13439, 43)

| | Game | away | away-record | awayaway-record | home | home-record | homehome-record | away-score | home-score | postseason info | Walks Issued – Away | Walks Issued – Home | Stolen Bases – Away | Stolen Bases – Home | Strikeouts Thrown – Away | Strikeouts Thrown – Home | Total Bases – Away | Total Bases – Home | Stadium | Date | Location | Odds | O/U | Attendance | Capacity | Duration | Umpires | WIN – Pitcher – Stats | WIN – Pitcher – Id | WIN – Pitcher – Name | WIN – Pitcher – AbbrName | WIN – Pitcher – Record | LOSS – Pitcher – Stats | LOSS – Pitcher – Id | LOSS – Pitcher – Name | LOSS – Pitcher – AbbrName | LOSS – Pitcher – Record | SAVE – Pitcher – Stats | SAVE – Pitcher – Id | SAVE – Pitcher – Name | SAVE – Pitcher – AbbrName | SAVE – Pitcher – Record | Extra Innings |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 360403123 | STL | 0-1 | 0-1 Away | PIT | Jan-00 | 1-0 Home | 1.0 | 4.0 | NaN | 5.0 | 5.0 | 0.0 | 0.0 | 5.0 | 14.0 | 5.0 | 13.0 | \n\n\n\n\n\t\t\t\t\t\t\tPNC Park\n\t\t\t\t\t\t… | 2016-04-03T17:00Z | Pittsburgh, Pennsylvania\n\t\t\t\t\t\t\t\n\t\t… | NaN | NaN | 39,500 | 38,362 | 3:02 | Home Plate Umpire – Jerry Layne, First Base Um… | 6.0 IP, 0 ER, 10 K, 5 BB | 6211.0 | Francisco Liriano | F. Liriano | (1-0) | 6.0 IP, 3 ER, 3 K, 3 BB | 5403.0 | Adam Wainwright | A. Wainwright | (0-1) | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 360403130 | TOR | Jan-00 | 1-0 Away | TB | 0-1 | 0-1 Home | 5.0 | 3.0 | NaN | 1.0 | 3.0 | 0.0 | 0.0 | 7.0 | 16.0 | 11.0 | 11.0 | \n\n\n\n\n\t\t\t\t\t\t\tTropicana Field\n\t\t\… | 2016-04-03T20:00Z | St. Petersburg, Florida\n\t\t\t\t\t\t\t\n\t\t\… | NaN | NaN | 31,042 | 31,042 | 2:51 | Home Plate Umpire – Mike Everitt, First Base U… | 8.0 IP, 3 ER, 5 K, 1 BB | 32815.0 | Marcus Stroman | M. Stroman | (1-0) | 5.0 IP, 2 ER, 12 K, 3 BB | 31003.0 | Chris Archer | C. Archer | (0-1) | 1.0 IP, 0 ER, 2 K, 0 BB | 32693.0 | Roberto Osuna | R. Osuna | -1.0 | NaN |
2 | 360403107 | NYM | 0-1 | 0-1 Away | KC | Jan-00 | 1-0 Home | 3.0 | 4.0 | NaN | 2.0 | 6.0 | 0.0 | 1.0 | 3.0 | 9.0 | 8.0 | 9.0 | \n\n\n\n\n\t\t\t\t\t\t\tKauffman Stadium\n\t\t… | 2016-04-04T00:30Z | Kansas City, Missouri\n\t\t\t\t\t\t\t\n\t\t\t\… | NaN | NaN | 40,030 | 37,903 | 3:13 | Home Plate Umpire – Gerry Davis, First Base Um… | 6.0 IP, 0 ER, 5 K, 3 BB | 6401.0 | Edinson Volquez | E. Volquez | (1-0) | 5.2 IP, 3 ER, 2 K, 2 BB | 31214.0 | Matt Harvey | M. Harvey | (0-1) | 1.0 IP, 0 ER, 2 K, 1 BB | 28957.0 | Wade Davis | W. Davis | -1.0 | NaN |
It is first necessary to filter down to 2021 regular season games. If the “postseason info” column is null, we assume the game is a regular season matchup. The “Date” column is used to extract the year.
In order to create the adjacency matrix only “away”, “home”, “away-score” and “home-score” need be retained. All other columns are removed:
    df["yyyy"] = pd.to_datetime(df["Date"]).dt.year

    df21 = (
        df[(pd.isnull(df["postseason info"])) & (df["yyyy"]==2021)]
        .dropna(subset=["away", "home", "away-score", "home-score"])
        .rename({"away-score": "away_score", "home-score": "home_score"}, axis=1)
        [["away", "home", "away_score", "home_score"]]
        .reset_index(drop=True)
    )

    print(f"df21.shape: {df21.shape}")

    df21.head(15)

    df21.shape: (2310, 4)

| | away | home | away_score | home_score |
---|---|---|---|---|
0 | TOR | NYY | 3.0 | 2.0 |
1 | CLE | DET | 2.0 | 3.0 |
2 | MIN | MIL | 5.0 | 6.0 |
3 | PIT | CHC | 5.0 | 3.0 |
4 | ATL | PHI | 2.0 | 3.0 |
5 | ARI | SD | 7.0 | 8.0 |
6 | LAD | COL | 5.0 | 8.0 |
7 | STL | CIN | 11.0 | 6.0 |
8 | TB | MIA | 1.0 | 0.0 |
9 | TEX | KC | 10.0 | 14.0 |
10 | CHW | LAA | 3.0 | 4.0 |
11 | HOU | OAK | 8.0 | 1.0 |
12 | SF | SEA | 7.0 | 8.0 |
13 | BAL | BOS | 3.0 | 0.0 |
14 | TB | MIA | 6.0 | 4.0 |
All 30 MLB teams are represented in the “home” and “away” columns. An empty DataFrame is created with columns and rows indexed using the 30 teams ordered alphabetically.
    # Create empty DataFrame with rows and columns indexed by the 30 MLB teams.
    mlb_teams = sorted(df21["home"].unique().tolist())
    dfadj = pd.DataFrame(index=mlb_teams, columns=mlb_teams)
    dfadj.loc[:,:] = 0

    dfadj

| | ARI | ATL | BAL | BOS | CHC | CHW | CIN | CLE | COL | DET | HOU | KC | LAA | LAD | MIA | MIL | MIN | NYM | NYY | OAK | PHI | PIT | SD | SEA | SF | STL | TB | TEX | TOR | WSH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ARI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ATL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
BAL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
BOS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CHC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CHW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CIN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CLE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
COL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DET | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
HOU | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
KC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
LAA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
LAD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
MIA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
MIL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
MIN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
NYM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
NYY | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
OAK | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
PHI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
PIT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
SD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
SEA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
SF | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
STL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
TB | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
TEX | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
TOR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
WSH | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We iterate over df21, updating values in dfadj according to the following rules:
- If the team in row $i$ defeated the team in column $j$, $a_{ij}$ is incremented by 1.
- If the team in row $i$ lost to the team in column $j$, $a_{ji}$ is incremented by 1.
- If the contest resulted in a tie, $a_{ij}$ and $a_{ji}$ are each incremented by $\frac{1}{2}$, but there are no ties in df21.
A dictionary dresults tracking wins and losses for each team is also created, in order to use regular season winning percentage as a proxy to compare against our direct method rankings.
    dresults = {kk: {"wins": 0, "losses": 0} for kk in dfadj.columns}

    for rr in df21.itertuples(index=False):
        ii, jj, ii_score, jj_score = rr.away, rr.home, rr.away_score, rr.home_score
        if ii_score == jj_score:
            dfadj.at[ii, jj] += .5
            dfadj.at[jj, ii] += .5
        elif ii_score > jj_score:
            dfadj.at[ii, jj] += 1
            dresults[ii]["wins"] += 1
            dresults[jj]["losses"] += 1
        else:
            dfadj.at[jj, ii] += 1
            dresults[jj]["wins"] += 1
            dresults[ii]["losses"] += 1

    dfadj

| | ARI | ATL | BAL | BOS | CHC | CHW | CIN | CLE | COL | DET | HOU | KC | LAA | LAD | MIA | MIL | MIN | NYM | NYY | OAK | PHI | PIT | SD | SEA | SF | STL | TB | TEX | TOR | WSH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ARI | 0 | 1 | 0 | 0 | 2 | 0 | 6 | 0 | 9 | 0 | 1 | 0 | 0 | 3 | 2 | 1 | 0 | 1 | 0 | 0 | 4 | 4 | 8 | 2 | 2 | 1 | 0 | 1 | 0 | 3 |
ATL | 4 | 0 | 3 | 1 | 5 | 0 | 4 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 11 | 3 | 0 | 9 | 1 | 0 | 10 | 4 | 4 | 0 | 3 | 5 | 1 | 0 | 0 | 12 |
BAL | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 2 | 0 | 2 | 3 | 4 | 2 | 0 | 2 | 0 | 2 | 1 | 8 | 3 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | 4 | 5 | 3 |
BOS | 0 | 3 | 13 | 0 | 0 | 3 | 0 | 4 | 0 | 3 | 2 | 5 | 3 | 0 | 2 | 0 | 3 | 4 | 10 | 3 | 3 | 0 | 0 | 4 | 0 | 0 | 8 | 2 | 8 | 3 |
CHC | 4 | 2 | 0 | 0 | 0 | 1 | 8 | 1 | 2 | 2 | 0 | 0 | 0 | 2 | 1 | 4 | 2 | 4 | 0 | 0 | 2 | 14 | 5 | 0 | 1 | 9 | 0 | 0 | 0 | 4 |
CHW | 0 | 0 | 5 | 2 | 5 | 0 | 3 | 9 | 0 | 9 | 2 | 8 | 2 | 0 | 0 | 1 | 12 | 0 | 0 | 4 | 0 | 3 | 0 | 3 | 0 | 2 | 3 | 5 | 4 | 0 |
CIN | 1 | 3 | 0 | 0 | 11 | 1 | 0 | 3 | 5 | 1 | 0 | 2 | 0 | 3 | 5 | 9 | 2 | 3 | 0 | 0 | 3 | 12 | 1 | 0 | 1 | 9 | 0 | 0 | 0 | 5 |
CLE | 0 | 0 | 5 | 2 | 3 | 7 | 2 | 0 | 0 | 12 | 1 | 13 | 5 | 0 | 0 | 0 | 7 | 0 | 3 | 2 | 0 | 1 | 0 | 3 | 0 | 2 | 1 | 4 | 1 | 0 |
COL | 10 | 4 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 1 | 6 | 4 | 2 | 0 | 1 | 0 | 1 | 5 | 4 | 10 | 2 | 3 | 3 | 0 | 4 | 0 | 4 |
DET | 0 | 0 | 5 | 3 | 1 | 7 | 2 | 5 | 0 | 0 | 4 | 8 | 1 | 0 | 0 | 3 | 6 | 0 | 3 | 1 | 0 | 1 | 0 | 5 | 0 | 3 | 4 | 6 | 3 | 0 |
HOU | 2 | 0 | 3 | 5 | 0 | 5 | 0 | 6 | 2 | 1 | 0 | 3 | 13 | 2 | 0 | 0 | 3 | 0 | 2 | 11 | 0 | 0 | 2 | 11 | 1 | 0 | 4 | 14 | 4 | 0 |
KC | 0 | 0 | 3 | 2 | 3 | 9 | 1 | 3 | 0 | 11 | 4 | 0 | 2 | 0 | 0 | 4 | 10 | 0 | 2 | 2 | 0 | 3 | 0 | 4 | 0 | 1 | 2 | 2 | 2 | 0 |
LAA | 3 | 0 | 4 | 3 | 0 | 5 | 0 | 1 | 2 | 6 | 6 | 4 | 0 | 3 | 0 | 0 | 4 | 0 | 4 | 4 | 0 | 0 | 2 | 8 | 1 | 0 | 1 | 11 | 3 | 0 |
LAD | 16 | 4 | 0 | 0 | 3 | 0 | 3 | 0 | 13 | 0 | 2 | 0 | 3 | 0 | 3 | 4 | 0 | 6 | 0 | 2 | 4 | 6 | 12 | 3 | 9 | 4 | 0 | 2 | 0 | 7 |
MIA | 5 | 8 | 2 | 0 | 5 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 4 | 0 | 3 | 0 | 9 | 0 | 0 | 9 | 2 | 3 | 0 | 3 | 0 | 1 | 0 | 0 | 8 |
MIL | 6 | 3 | 0 | 0 | 13 | 2 | 10 | 3 | 5 | 1 | 0 | 0 | 0 | 3 | 3 | 0 | 2 | 3 | 0 | 0 | 2 | 13 | 5 | 0 | 4 | 8 | 0 | 0 | 0 | 3 |
MIN | 0 | 0 | 4 | 2 | 2 | 5 | 2 | 10 | 0 | 10 | 4 | 9 | 1 | 0 | 0 | 4 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 1 | 3 | 4 | 3 | 0 |
NYM | 5 | 7 | 3 | 0 | 3 | 0 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 9 | 1 | 0 | 0 | 3 | 0 | 6 | 2 | 4 | 0 | 1 | 1 | 0 | 0 | 2 | 8 |
NYY | 0 | 3 | 11 | 7 | 0 | 5 | 0 | 4 | 0 | 3 | 4 | 4 | 2 | 0 | 3 | 0 | 5 | 1 | 0 | 4 | 2 | 0 | 0 | 5 | 0 | 0 | 8 | 6 | 7 | 2 |
OAK | 4 | 0 | 3 | 3 | 0 | 3 | 0 | 4 | 2 | 6 | 8 | 5 | 15 | 1 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 | 2 | 4 | 2 | 0 | 4 | 10 | 2 | 0 |
PHI | 3 | 9 | 2 | 3 | 5 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 9 | 5 | 0 | 9 | 2 | 0 | 0 | 4 | 4 | 0 | 2 | 4 | 0 | 0 | 1 | 12 |
PIT | 2 | 3 | 0 | 0 | 5 | 1 | 6 | 2 | 0 | 3 | 0 | 1 | 0 | 0 | 5 | 4 | 2 | 3 | 0 | 0 | 3 | 0 | 3 | 0 | 4 | 7 | 0 | 0 | 0 | 2 |
SD | 11 | 0 | 0 | 0 | 1 | 0 | 6 | 0 | 7 | 0 | 4 | 0 | 2 | 7 | 4 | 2 | 0 | 3 | 0 | 2 | 2 | 4 | 0 | 3 | 8 | 3 | 0 | 3 | 0 | 5 |
SEA | 4 | 0 | 1 | 3 | 0 | 4 | 0 | 4 | 2 | 1 | 8 | 3 | 11 | 1 | 0 | 0 | 4 | 0 | 2 | 15 | 0 | 0 | 0 | 0 | 2 | 0 | 6 | 13 | 4 | 0 |
SF | 17 | 3 | 0 | 0 | 6 | 0 | 6 | 0 | 14 | 0 | 2 | 0 | 3 | 10 | 4 | 3 | 0 | 5 | 0 | 4 | 4 | 3 | 11 | 1 | 0 | 2 | 0 | 3 | 0 | 4 |
STL | 6 | 0 | 0 | 0 | 8 | 1 | 8 | 2 | 4 | 1 | 0 | 5 | 0 | 3 | 6 | 11 | 2 | 4 | 0 | 0 | 3 | 12 | 3 | 0 | 4 | 0 | 0 | 0 | 0 | 2 |
TB | 0 | 2 | 18 | 11 | 0 | 3 | 0 | 4 | 0 | 3 | 2 | 4 | 6 | 0 | 5 | 0 | 3 | 3 | 11 | 3 | 4 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 11 | 1 |
TEX | 3 | 0 | 3 | 4 | 0 | 1 | 0 | 2 | 2 | 1 | 5 | 4 | 8 | 1 | 0 | 0 | 3 | 0 | 1 | 9 | 0 | 0 | 0 | 6 | 1 | 0 | 4 | 0 | 2 | 0 |
TOR | 0 | 6 | 12 | 7 | 0 | 3 | 0 | 4 | 0 | 3 | 2 | 3 | 2 | 0 | 4 | 0 | 4 | 1 | 10 | 5 | 2 | 0 | 0 | 2 | 0 | 0 | 8 | 3 | 0 | 1 |
WSH | 4 | 5 | 3 | 0 | 3 | 0 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 11 | 1 | 0 | 6 | 1 | 0 | 5 | 4 | 3 | 0 | 1 | 4 | 3 | 0 | 3 | 0 |
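The same win counts can be produced without an explicit loop. The sketch below is an alternative under stated assumptions, not the approach used in this post: it runs on a small hypothetical frame with the same four columns as df21, assumes there are no ties, and uses pd.crosstab to count winner/loser pairs.

```python
import numpy as np
import pandas as pd

# Hypothetical games with the same column layout as df21.
games = pd.DataFrame({
    "away": ["A", "B", "C", "A"],
    "home": ["B", "C", "A", "B"],
    "away_score": [3, 1, 5, 2],
    "home_score": [2, 4, 0, 6],
})

# Determine winner and loser of every game (ties are assumed absent).
away_won = games["away_score"] > games["home_score"]
winner = pd.Series(np.where(away_won, games["away"], games["home"]), name="winner")
loser = pd.Series(np.where(away_won, games["home"], games["away"]), name="loser")

# Count wins of each row team over each column team, then make the
# table square so teams without a win or a loss still appear.
teams = sorted(set(games["away"]) | set(games["home"]))
adj = (
    pd.crosstab(winner, loser)
    .reindex(index=teams, columns=teams, fill_value=0)
)

print(adj)
```

Row sums of adj give wins and column sums give losses, which mirrors what dresults tracks in the loop version.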
Convert dresults into a DataFrame and add win_pct column:
    dfresults = (
        pd.DataFrame.from_dict(dresults, orient="index")
        .reset_index(drop=False, names="team")
    )

    # Compute winning percentage.
    dfresults["win_pct"] = dfresults["wins"] / (dfresults["wins"] + dfresults["losses"])

    # Sort values by win_pct.
    dfresults = (
        dfresults.sort_values("win_pct", ascending=False)
        .reset_index(drop=True)
    )

    # Add win_pct rank column.
    dfresults["rank0"] = dfresults.index + 1

    dfresults.head(30)

| | team | wins | losses | win_pct | rank0 |
---|---|---|---|---|---|
0 | SF | 105 | 53 | 0.664557 | 1 |
1 | LAD | 106 | 54 | 0.662500 | 2 |
2 | TB | 98 | 62 | 0.612500 | 3 |
3 | HOU | 94 | 66 | 0.587500 | 4 |
4 | MIL | 89 | 65 | 0.577922 | 5 |
5 | BOS | 86 | 64 | 0.573333 | 6 |
6 | NYY | 86 | 67 | 0.562092 | 7 |
7 | ATL | 84 | 66 | 0.560000 | 8 |
8 | TOR | 82 | 65 | 0.557823 | 9 |
9 | CHW | 82 | 66 | 0.554054 | 10 |
10 | SEA | 88 | 71 | 0.553459 | 11 |
11 | STL | 85 | 69 | 0.551948 | 12 |
12 | OAK | 84 | 76 | 0.525000 | 13 |
13 | PHI | 80 | 74 | 0.519481 | 14 |
14 | CLE | 74 | 73 | 0.503401 | 15 |
15 | CIN | 80 | 79 | 0.503145 | 16 |
16 | SD | 77 | 82 | 0.484277 | 17 |
17 | LAA | 75 | 82 | 0.477707 | 18 |
18 | DET | 71 | 79 | 0.473333 | 19 |
19 | COL | 70 | 79 | 0.469799 | 20 |
20 | MIN | 69 | 79 | 0.466216 | 21 |
21 | KC | 70 | 85 | 0.451613 | 22 |
22 | NYM | 61 | 76 | 0.445255 | 23 |
23 | CHC | 68 | 86 | 0.441558 | 24 |
24 | MIA | 66 | 93 | 0.415094 | 25 |
25 | WSH | 62 | 89 | 0.410596 | 26 |
26 | TEX | 60 | 100 | 0.375000 | 27 |
27 | PIT | 56 | 97 | 0.366013 | 28 |
28 | BAL | 51 | 103 | 0.331169 | 29 |
29 | ARI | 51 | 110 | 0.316770 | 30 |
rank0 is the baseline against which our results will be compared. Let’s visualize the regular season matchup network:
    import networkx as nx
    import matplotlib.pyplot as plt

    # Create adjacency matrix as Numpy array.
    A = dfadj.values.astype(float)
    G = nx.from_numpy_array(A)

    fig, ax = plt.subplots(1, 1, figsize=(8.5, 6), tight_layout=True)
    ax.set_title(
        "2021 MLB Regular Season Matchups",
        color="#000000", loc="center", weight="normal", fontsize=9
    )
    nx.draw_networkx(
        G, node_color="#32cd32", node_size=200, ax=ax, with_labels=True,
        edge_color="blue", width=.15, pos=nx.spring_layout(G, seed=516)
    )
Next we perform the same steps carried out for the Big 12 analysis:
    from numpy.linalg import eig

    # Adjacency matrix as Numpy array.
    A = dfadj.values.astype(float)

    # Perform eigendecomposition of A.
    e_vals, e_vecs = eig(A)

    # Identify index of largest eigenvalue. eig can return complex values
    # for a non-symmetric matrix, so compare real parts.
    e_val1_indx = np.argmax(e_vals.real)

    # Take the absolute value of the eigenvector at index e_val1_indx.
    e_vec1 = np.abs(e_vecs[:, e_val1_indx])

    # Get team indices sorted by decreasing score.
    indices = np.argsort(e_vec1)[::-1]

    # Associate ranks with teams.
    teams = dfadj.columns
    ranked_teams = teams[indices]

    for team in ranked_teams:
        print(team)

    HOU
    TB
    LAD
    SF
    SEA
    NYY
    OAK
    BOS
    TOR
    CHW
    MIL
    LAA
    STL
    SD
    DET
    CLE
    ATL
    KC
    CIN
    MIN
    PHI
    COL
    TEX
    CHC
    MIA
    WSH
    BAL
    NYM
    PIT
    ARI
Again comparing actual vs. predicted ranks, using regular season winning percentage as a proxy for actual rank:
    for jj, team in enumerate(ranked_teams, start=1):
        actual_rank = dfresults[dfresults.team==team]["rank0"].item()
        print(f"{team}: actual/predicted : {actual_rank}/{jj}")

    HOU: actual/predicted : 4/1
    TB: actual/predicted : 3/2
    LAD: actual/predicted : 2/3
    SF: actual/predicted : 1/4
    SEA: actual/predicted : 11/5
    NYY: actual/predicted : 7/6
    OAK: actual/predicted : 13/7
    BOS: actual/predicted : 6/8
    TOR: actual/predicted : 9/9
    CHW: actual/predicted : 10/10
    MIL: actual/predicted : 5/11
    LAA: actual/predicted : 18/12
    STL: actual/predicted : 12/13
    SD: actual/predicted : 17/14
    DET: actual/predicted : 19/15
    CLE: actual/predicted : 15/16
    ATL: actual/predicted : 8/17
    KC: actual/predicted : 22/18
    CIN: actual/predicted : 16/19
    MIN: actual/predicted : 21/20
    PHI: actual/predicted : 14/21
    COL: actual/predicted : 20/22
    TEX: actual/predicted : 27/23
    CHC: actual/predicted : 24/24
    MIA: actual/predicted : 25/25
    WSH: actual/predicted : 26/26
    BAL: actual/predicted : 29/27
    NYM: actual/predicted : 23/28
    PIT: actual/predicted : 28/29
    ARI: actual/predicted : 30/30
The Houston Astros are considered the best team based on the direct method, which is encouraging as they ultimately reached the 2021 World Series. One of the biggest discrepancies is with the Atlanta Braves, who were 8th in terms of regular season winning percentage but 17th in terms of the direct method. They went on to win the 2021 World Series. Nonetheless, the modeled rankings are reasonable, and the direct method may be able to provide further insight into how teams rank beyond winning percentage.
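To see where the two rankings disagree most, the pairs can be sorted by the absolute gap between actual and predicted rank. The values below are copied from the comparison output above; the variable names are illustrative only.

```python
# (team, actual_rank, predicted_rank) copied from the comparison output.
ranks = [
    ("HOU", 4, 1), ("TB", 3, 2), ("LAD", 2, 3), ("SF", 1, 4), ("SEA", 11, 5),
    ("NYY", 7, 6), ("OAK", 13, 7), ("BOS", 6, 8), ("TOR", 9, 9), ("CHW", 10, 10),
    ("MIL", 5, 11), ("LAA", 18, 12), ("STL", 12, 13), ("SD", 17, 14), ("DET", 19, 15),
    ("CLE", 15, 16), ("ATL", 8, 17), ("KC", 22, 18), ("CIN", 16, 19), ("MIN", 21, 20),
    ("PHI", 14, 21), ("COL", 20, 22), ("TEX", 27, 23), ("CHC", 24, 24), ("MIA", 25, 25),
    ("WSH", 26, 26), ("BAL", 29, 27), ("NYM", 23, 28), ("PIT", 28, 29), ("ARI", 30, 30),
]

# Sort by the size of the disagreement, largest first.
by_gap = sorted(ranks, key=lambda t: abs(t[1] - t[2]), reverse=True)

for team, actual, predicted in by_gap[:6]:
    print(f"{team}: actual {actual}, predicted {predicted}, gap {abs(actual - predicted)}")
```

Atlanta has the largest gap (9 places), followed by Philadelphia (7); Seattle, Oakland, Milwaukee and the Angels each differ by 6.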