I’ve been meaning for a while to write a post about the Lahman Database. If you’re not already familiar with this database, I encourage you to take a look because it’s a great baseball statistics resource. The current version of the database (5.6) contains MLB pitching, hitting, and fielding statistics from 1871 through 2008. An annual update is usually released not long after the conclusion of the World Series.
To give a brief history of the database, Sean Lahman started this project in 1992 in an effort to make baseball statistics freely available to the general public. Now, a team of researchers works tirelessly to maintain the database and release the annual updates.
Sean Forman extended the Lahman Database for easy use on the web as an online encyclopedia at “baseball-reference.com.” Since 2001, Sean Lahman and Sean Forman have led a group of researchers who volunteered to maintain and update the database, known as the Baseball Databank.
I give this background information for two reasons. First, I’d like to give Sean Lahman, Sean Forman, and their team of researchers proper credit for their extraordinary work. Second, it helps to understand the various websites where you will find references to the Lahman Database.
http://www.baseball1.com/ – This is Sean Lahman’s website. You can download the most recent version of the database from this site.
http://www.baseball-databank.org/ – This is Sean Forman’s website. You can also download the most recent version of the database from this site.
http://www.baseball-reference.com/ – This is Baseball Reference, perhaps the most complete online baseball encyclopedia available. This site runs on the Lahman Database.
For most standard baseball research, the information presented on Baseball Reference will be more than adequate. However, if you’re interested in running more specific queries on this historic set of data, you will need to download a copy of the database, and I will guide you through that process.
1. First, I navigated to: http://www.baseball1.com/content/view/57/82/. I clicked “Download Version 5.6 (1871-2008)”, and then I clicked “Download SQL Version”.
It’s worth noting that Microsoft Access and CSV versions of the database are also available. If these files are sufficient for your purposes, you’ll likely find them easier to use. Just download the files, and open them in the appropriate program (Microsoft Access for the mdb files, and your favorite spreadsheet program for the csv files).
2. I created a blank database on my MySQL server called bball_stats to house the Lahman Database. Procedures on how to create a new database will vary depending on your database setup and access privileges.
3. The next thing that you will need to do is import the SQL file. The file is quite large (36 MB uncompressed, 7.9 MB zipped). I found it easiest to upload the SQL file using a MySQL GUI program called HeidiSQL. The script uploaded in a matter of minutes, and the tables and data were ready for research!
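If you’d rather script the import than use a GUI, the following Python sketch shows one way it might be done. It assumes the mysql command-line client is installed and on your PATH, and that the bball_stats database from step 2 already exists. The file name lahman56.sql and the user name are placeholders of mine, not names from the download, so substitute your own.

```python
import subprocess


def build_import_command(user, database, host="localhost"):
    """Return the argument list for the mysql command-line client.

    Passing --password with no value makes the client prompt for the
    password, so it never has to be written into the script.
    """
    return ["mysql", f"--host={host}", f"--user={user}", "--password", database]


def import_sql_file(sql_path, user, database, host="localhost"):
    """Feed a .sql script to the mysql client on standard input."""
    command = build_import_command(user, database, host)
    with open(sql_path, "rb") as sql_file:
        # check=True raises CalledProcessError if mysql reports a failure.
        subprocess.run(command, stdin=sql_file, check=True)


# Example call (placeholder file and user names):
# import_sql_file("lahman56.sql", user="your_user", database="bball_stats")
```

This is the scripted equivalent of redirecting the file into the mysql client from a shell prompt. I haven’t timed it against HeidiSQL, so treat it as an alternative rather than a recommendation.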
4. To verify that all of the data had loaded properly, I checked the row count of each table. Here are the expected row counts for Version 5.6 of the database:
TABLE => ROWS
Master => 17264
Teams => 2595
TeamsFranchises => 120
TeamsHalf => 52
Batting => 91457
Pitching => 39016
Fielding => 154843
FieldingOF => 12028
Salaries => 19819
Managers => 3167
ManagersHalf => 93
Allstar => 4321
AllstarFull => 4522
AwardsPlayers => 2558
AwardsSharePlayers => 6182
AwardsManagers => 53
AwardsShareManagers => 318
HallOfFame => 3477
HOFold => 286
BattingPost => 10422
FieldingPost => 8981
PitchingPost => 4006
SeriesPost => 250
Schools => 732
SchoolsPlayers => 5904
xref_stats => 16631
Appearances => 40139
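Checking 27 tables by hand gets tedious, so here is a rough Python sketch of how the comparison could be automated. The expected counts are copied from the list above. The function accepts any Python DB-API cursor, so in practice you would pass it a cursor from whichever MySQL driver you use. The short demo at the bottom uses an in-memory SQLite table purely as a stand-in so that the snippet runs on its own, and the 50 rows in that demo are dummy data, not real records.

```python
import sqlite3

# Expected row counts for version 5.6, copied from the list above.
EXPECTED_ROWS = {
    "Master": 17264, "Teams": 2595, "TeamsFranchises": 120, "TeamsHalf": 52,
    "Batting": 91457, "Pitching": 39016, "Fielding": 154843, "FieldingOF": 12028,
    "Salaries": 19819, "Managers": 3167, "ManagersHalf": 93,
    "Allstar": 4321, "AllstarFull": 4522,
    "AwardsPlayers": 2558, "AwardsSharePlayers": 6182,
    "AwardsManagers": 53, "AwardsShareManagers": 318,
    "HallOfFame": 3477, "HOFold": 286,
    "BattingPost": 10422, "FieldingPost": 8981, "PitchingPost": 4006,
    "SeriesPost": 250, "Schools": 732, "SchoolsPlayers": 5904,
    "xref_stats": 16631, "Appearances": 40139,
}


def find_mismatches(cursor, expected):
    """Return {table: (expected, actual)} for every table whose count differs."""
    mismatches = {}
    for table, expected_count in expected.items():
        # Table names come from the trusted dictionary above, not user input.
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        actual_count = cursor.fetchone()[0]
        if actual_count != expected_count:
            mismatches[table] = (expected_count, actual_count)
    return mismatches


# Stand-in demo: an in-memory SQLite table holding 50 dummy rows.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE TeamsHalf (placeholder INTEGER)")
demo.executemany("INSERT INTO TeamsHalf VALUES (?)", [(0,)] * 50)
print(find_mismatches(demo.cursor(), {"TeamsHalf": 52}))
# prints {'TeamsHalf': (52, 50)}
```

Against the real database you would call find_mismatches(cursor, EXPECTED_ROWS), and an empty dictionary would mean every table matched.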
Finally, while we’re on the subject of these tables, it’s worth describing the data that each one contains.
The database consists of the following main tables:
Master – Player names, DOB, and biographical info
Batting – batting statistics
Pitching – pitching statistics
Fielding – fielding statistics
It is supplemented by these tables:
Allstar – All-Star appearances
HallOfFame – Hall of Fame voting data
Managers – managerial statistics
Teams – yearly stats and standings
BattingPost – post-season batting statistics
PitchingPost – post-season pitching statistics
TeamsFranchises – franchise information
FieldingOF – outfield position data
FieldingPost – post-season fielding data
ManagersHalf – split season data for managers
TeamsHalf – split season data for teams
Salaries – player salary data
SeriesPost – post-season series information
AwardsManagers – awards won by managers
AwardsPlayers – awards won by players
AwardsShareManagers – award voting for manager awards
AwardsSharePlayers – award voting for player awards
AllstarFull – Expanded All-Star info
Appearances – Detailed games played info
Schools – college info
SchoolsPlayers – players’ college info
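To give a feel for the kind of custom query the main tables make possible, here is a small Python sketch that lists the highest single-season home run totals. It assumes the Batting table has playerID, yearID, and HR columns and may hold more than one row per player per season (for example, when a player changes teams), which is why the totals are summed. Check the column names against your own copy before relying on it. The function name is my own, and it works with any DB-API cursor.

```python
def top_home_run_seasons(cursor, limit=10):
    """Return (playerID, yearID, total_hr) rows, highest totals first."""
    cursor.execute(
        "SELECT playerID, yearID, SUM(HR) AS total_hr "
        "FROM Batting "
        "GROUP BY playerID, yearID "
        "ORDER BY total_hr DESC "
        # int() guards against anything other than a whole number here.
        f"LIMIT {int(limit)}"
    )
    return cursor.fetchall()


# Example call, given a cursor on the bball_stats database:
# for player_id, year, home_runs in top_home_run_seasons(cursor, limit=5):
#     print(player_id, year, home_runs)
```

Joining the result to the Master table on playerID would turn the IDs into player names, assuming Master carries the same key.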
Later, I’ll write more about how I’ve used the Lahman Database for baseball research. In the meantime, I encourage you to try the database for yourself. Look for the 2009 update to arrive by the end of the year!