Mon, 04 Apr 2011
The Red Sox Aren't Doomed
The baseball season has started, and the Red Sox are off to an 0-3 start. A lot of Red Sox fans will be feeling down about this, so I'm going to cheer them up the only way I know how: statistics.
In this post, I'm going to look at how well teams' records through the first 3 games of their seasons predict their records over the full season. I'm going to examine this by looking at the correlation between the number of wins in the first three games of each team's season and their final win percentage for that season. I've used every team-season since 1962, the year in which the NL adopted the 162-game schedule that the AL started using a year earlier. This includes strike years, as they didn't occur to me until after I'd finished making all the graphs.
Let's start out by looking at the distribution of wins in the first three games:
The numbers break down as follows:
| 0 Wins | 1 Win | 2 Wins | 3 Wins | Total |
|---|---|---|---|---|
| 148 | 473 | 475 | 152 | 1248 |
Unsurprisingly, there are a lot more 2-1 and 1-2 records, with a large tail-off for 3-0 and 0-3. It should be noted, however, that the 2011 Red Sox are part of the most exclusive club as far as records in the first 3 games go: only 148 of the 1248 team-seasons (about 12%) started 0-3.
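As a rough sanity check on those counts, here is a minimal sketch of what a pure coin-flip model would predict. The assumption that every game is an independent 50/50 toss is mine, purely for illustration, and `expected_counts` is a made-up helper name.

```python
from math import comb

# Observed counts from the table above (team-seasons, 1962-2009).
observed = {0: 148, 1: 473, 2: 475, 3: 152}
total = sum(observed.values())

def expected_counts(total, games=3, p=0.5):
    """Expected team-seasons with k wins if each game is an independent coin flip."""
    return {k: total * comb(games, k) * p ** k * (1 - p) ** (games - k)
            for k in range(games + 1)}

expected = expected_counts(total)
for k in sorted(observed):
    print("%d wins: observed %4d, expected %6.1f" % (k, observed[k], expected[k]))
```

Under that assumption the expected split is 156 / 468 / 468 / 156, so the observed counts are all within a few percent of plain coin-flipping.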
Now let's look at the distributions of full-season records against the number of wins in the first 3 games:
Looks pretty flat, doesn't it? Let's add in a linear trend line to see just how flat it really is:
That confirms that it is actually pretty flat. There is a tiny amount of correlation between the first 3 games and the full-season record, but only at r=0.17, with r squared at a minuscule 0.03.
For those not in the know, r (the correlation coefficient) of two variables ranges between -1 and 1, where 1 means that there is a perfect positive linear relationship (i.e. when one variable increases, the other always increases in proportion) and -1 means that there is a perfect negative one (i.e. when one variable increases, the other always decreases in proportion). 0 means that there is no linear relationship between the values.
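For the curious, r can be computed by hand in a few lines. This is a minimal sketch of the standard Pearson formula, not the code used to produce the figure above, and the toy data is made up purely to show the ends of the scale.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy data: wins in the first three games against a win percentage.
xs = [0, 1, 2, 3]
print(pearson_r(xs, [0.40, 0.45, 0.50, 0.55]))  # straight line going up
print(pearson_r(xs, [0.55, 0.50, 0.45, 0.40]))  # straight line going down
print(pearson_r(xs, [0.50, 0.40, 0.40, 0.50]))  # symmetric, no linear trend
```

The three calls land at (or within rounding error of) 1, -1 and 0 respectively.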
r squared tells us how much of the variation in full-season records is accounted for by the trend line. It is pretty much arrived at by comparing what the trend line (the black line in our graph) predicts with the actual data (all the blue points). If the black line represented our data well, r squared would be near 1. As it does not represent our data well, it is very near 0.
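One common way to arrive at r squared is to fit the trend line by least squares and then compare the leftover scatter around the line with the total scatter in the data. The sketch below does exactly that; the data points are invented for illustration and `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """Share of the variation in ys accounted for by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Scatter left over around the line, against total scatter around the mean.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented (wins in first three, final win percentage) pairs.
xs = [0, 0, 1, 1, 2, 2, 3, 3]
ys = [0.45, 0.55, 0.42, 0.58, 0.47, 0.56, 0.44, 0.60]
print("r squared = %.3f" % r_squared(xs, ys))
```

For a straight-line fit like this one, the result is the same number you get by squaring r.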
So, to summarise, we have looked at how good a predictor the first 3 games of each team's season from 1962 to 2009 were for their overall record in that season, and have concluded that there is a relationship between the two, but that it is far too weak to be of much use as a predictor.
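To put a rough number on "weak": with a straight-line fit, the predicted full-season record sits r times as many standard deviations from average as the start did. The sketch below applies that to the counts in the table and the r of 0.17 quoted above. It stays in standard-deviation units because the spread of full-season win percentages is not quoted in this post; `predicted_z` is a hypothetical helper name.

```python
from math import sqrt

# Counts from the table above and the correlation reported in the post.
counts = {0: 148, 1: 473, 2: 475, 3: 152}
r = 0.17

total = sum(counts.values())
mean_wins = sum(k * c for k, c in counts.items()) / total
sd_wins = sqrt(sum(c * (k - mean_wins) ** 2 for k, c in counts.items()) / total)

def predicted_z(first_three_wins):
    """Predicted full-season record, in standard deviations from average."""
    z_start = (first_three_wins - mean_wins) / sd_wins
    return r * z_start  # regression towards the mean: shrink by a factor of r

for k in sorted(counts):
    print("%d-%d start: predicted %+.2f SD from the average record"
          % (k, 3 - k, predicted_z(k)))
```

For an 0-3 start that comes out at roughly 0.3 standard deviations below average, which is well within the range of an ordinary season.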
Red Sox fans, take heart!
Sources and Scripts
All of these conclusions were drawn using game logs freely available from Retrosheet.
The following shell script was used to generate summaries of games in each year:
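for n in $(seq 1871 2009)
do
    # cut always emits fields in file order, so each output line is:
    # away team, away game no., home team, home game no., away score, home score
    cut -d, -f4,6,7,9,10,11 < "GL$n.TXT" > "summary$n.txt"
done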
The following Python script was used to process the summarised game logs into win-loss records for the first 3 games and the whole of each season:
#!/usr/bin/env python
from csv import DictReader

first_three_wins = {}
first_three_losses = {}
wins = {}
losses = {}

for n in range(1962, 2010):
    with open('summary%d.txt' % (n,)) as f:
        d = DictReader(f,
                       fieldnames=['away', 'away_no', 'home', 'home_no',
                                   'away_score', 'home_score'])
        for r in d:
            # Key each team-season by year plus team code.
            home_name = "%d%s" % (n, r['home'])
            away_name = "%d%s" % (n, r['away'])
            for name in (home_name, away_name):
                first_three_wins.setdefault(name, 0)
                first_three_losses.setdefault(name, 0)
                wins.setdefault(name, 0)
                losses.setdefault(name, 0)

            # Compare the scores as integers: compared as strings, "9" would
            # beat "10". Anything that is not an away win is counted as a
            # home win.
            away_win = int(r['away_score']) > int(r['home_score'])

            # Tally the first three games of each team's season.
            if int(r['home_no']) <= 3:
                if away_win:
                    first_three_losses[home_name] += 1
                else:
                    first_three_wins[home_name] += 1
            if int(r['away_no']) <= 3:
                if away_win:
                    first_three_wins[away_name] += 1
                else:
                    first_three_losses[away_name] += 1

            # Tally the full-season record.
            if away_win:
                losses[home_name] += 1
                wins[away_name] += 1
            else:
                wins[home_name] += 1
                losses[away_name] += 1

# One line per team-season: name, wins in first three games, win percentage.
for team in wins:
    w = float(wins[team])
    l = float(losses[team])
    w3 = float(first_three_wins[team])
    print("%s,%f,%f" % (team, w3, w / (w + l)))
Insofar as it matters for such a short snippet, this script should be considered to be in the public domain.
Posted: Mon, 04 Apr 2011 23:03