Major League Baseball Record Keeping
This essay is adapted from one that appeared in many of the eight editions of Total Baseball from 1989 to 2004. My decision to publish is precipitated by a current dustup over Baseball-Reference.com’s decision to uncouple the records of the Baltimore Orioles of 1901-02 from those of the successor franchise, today’s New York Yankees (http://www.sports-reference.com/blog/2014/07/1901-02-orioles-removed-from-yankees-history/). It is delightful to see so much passion exhibited in the Comments section over a decision with little practical consequence, except that the Yanks’ 10,000th win will now come sometime next year instead of next week. No individual or team seasonal records have been excised. Anyway, the entire matter of team records when there have been predecessor franchises becomes very murky and has been observed anecdotally by the clubs themselves.
The nascence of the Yankees is an interesting question and unsurprisingly arouses more passion than that of the Pittsburgh Pirates (see: http://goo.gl/QbIhZD). I gave a pretty full rendering of the riotous events of 1902 in this space: http://goo.gl/hL1fXP. In any event, I offered my opinion on the Oriole-Yankee issue as a rather well-informed fan, not as Major League Baseball’s official historian. To my knowledge MLB takes no stance regarding how clubs choose to observe their records. I have excavated the essay below (with cuts and modifications) to demonstrate that baseball’s record book has always been a minefield, which may offer comfort to those currently incensed. [*]
Major League Baseball Record Keeping
John Thorn, Pete Palmer, and Joseph M. Wayman
Despite the death of editors Hy Turkin in 1957 and S.C. Thompson ten years later, their Official Encyclopedia of Baseball (first issued in 1951) remained the dominant book of baseball statistics, although many fans were frustrated with the fragmentary records it presented. As Frank V. Phelps wrote in the 1987 edition of The National Pastime, “Gaps and obvious errors in official averages, the lack of many early records, difficulty in securing the records of players who appeared in only a few games, and frustrating discrepancies among existing guides and registers had long since created a desire for an ultimate, complete, correct set of major league records. But it wasn’t until the mid-1960s that the development of sophisticated computers which could absorb, retain, order, and output huge amounts of data finally made a project feasible.”
Beginning in 1967, a battalion of researchers commanded by David S. Neft foraged through the official records and newspaper box scores to provide freshly compiled figures for those who had no ERAs, RBIs, slugging averages, saves, and all manner of wonderful things. The material which finally appeared in the tome was entered into a data bank, and the book was the first to be typeset entirely by computer, now a common practice. Published in 1969, The Baseball Encyclopedia was a milestone in computer technology, but as indispensable as the computer were the old-fashioned scrapbooks and files of Lee Allen and John Tattersall. The result was a mammoth ledger book of the major leagues more thorough than any that had appeared before.
The ICI group not only found new data to correct old inaccuracies but also applied new yardsticks to men who had gone to their graves never having heard of an RBI or a save. The ICI research that went into The Baseball Encyclopedia of 1969 created new stars, launching several previously underappreciated heroes of old into the Hall of Fame. Sam Thompson, Addie Joss, Roger Connor, Amos Rusie—their phenomenal level of play was hidden simply because statisticians back then were not recording the particular numbers which would show them off to best advantage. If sabermetrics consists of finding things in the existing data that were not seen before, or in collecting that data which makes possible the application of new statistics to old performances, the first edition of The Baseball Encyclopedia was a monument in the course of sabermetrics.
However, its subsequent editions declined from that standard, dropping valuable data, altering figures for star players in a misguided homage to tradition, and making a shambles of individual/team balance in the totals. As Phelps wrote of the second edition, supervised by the Macmillan Company after the ICI group broke up:
“Players’ batting statistics were changed without compensating for changes in the records of other players on the same teams or in the corresponding team and league totals. Later editions included even more unbalanced adjustments . . .
“Quite apart from the problem of record-balancing, the numerous changes in players’ totals and averages has caused serious misapprehensions and confusions for fans, writers, and researchers. The records of Fred Clarke and Cy Young differ in all six editions [to 1987] even without counting Clarke’s astronomical 1899 BA [in the third edition, Clarke was credited with a batting average of .986 that boosted his lifetime mark by 15 points]. The figures for Burkett, Chesbro, Duffy, Hornsby, Walter Johnson, Radbourn, Speaker, and Waddell differ in five of the six books. The same is so in four of six for at least twenty-three other Hall of Famers, and many more less gifted players.”
The seventh edition was issued in 1988 and, like the five that preceded it, was less accurate than the classic first issue. The eighth edition, published in 1990, corrected many of the errors in the seventh but retained many once-contested errors that historians had long since expunged from the record, while changing other statistics in a manner at variance with Major League Baseball’s standards and with a rationale that remains unclear. For the ninth edition, Major League Baseball distanced itself from both the product and its database.
Even when The Baseball Encyclopedia had been launched back in 1969, the ICI findings raised the hackles of traditionalists, prompting the formation by Major League Baseball of a Special Baseball Records Committee. Its members ruled upon such matters as whether, for the historical record, bases on balls should be counted as hits (as they were in 1887), outs (as they were in 1876), or neither (as has been the practice in all other years); or whether “sudden-death” home runs—thirty-seven game-winning blows with men on base that they identified as having occurred in the bottom half of the ninth or an extra inning—would be credited as homers or, in the practice before 1920, would count for only as many bases as were needed to push across the winning run. In the latter controversy, committee members first decided to count the disputed blows as homers, but then, when complaints arose that Babe Ruth’s famous total of 714 would change to 715, they reversed themselves. They also decided that the National Association of 1871-1875 was not a major league, while the Federal League, Union Association, and Players League were; and they ruled on several other issues, all of which were published in the Appendix to The Baseball Encyclopedia.
Because earlier editions of Total Baseball enjoyed neither the privilege nor the responsibility of official Major League Baseball status, the editors committed themselves to the process of history—its research, reporting, and interpretation—more than to its product. History is not static and unchanging. Our course then as now seemed unassailable: publish the best-documented data and remain humbly amenable to subsequent revision in the light of new evidence. (This is not very different from the placard in the Baseball Hall of Fame, which states that although later studies have called into question the accuracy of information on the plaques, the facts as engraved were believed to be accurate at the time.)
However, it must be acknowledged that we paid little mind to the consequences of our findings and reasoned judgments, such as the stripping of a batting championship from Ty Cobb in 1910 or Bobby Avila in 1954. For the fourth edition of Total Baseball, the first to receive official MLB status, our challenge was to devise a more historically sensitive framework that would permit us to incorporate the best modern research while continuing to honor the judgments of the past. For example, Total Baseball abided by the Special Baseball Records Committee’s decision on game-ending homers—not to preserve Ruth’s total, but because there were many more such homers before 1920 than the thirty-seven the committee identified, and the disputes surrounding some of them are now beyond settling.
Like Turkin/Thompson and all previous record books, and in accordance with the view of most historians, we rejected the committee’s position that the National Association was not a major league. We committed to the creation of a full statistical record of that trailblazing circuit, and hoped one day to integrate the NA and NL records of such players as Al Spalding, Cap Anson, George Wright, and all the others who played in the professional league between 1871 and 1875.
We also differed from the committee’s ruling on awarding pitchers wins and losses in the years before 1920. Not finding any official scoring rule or practice for that time, they chose to apply 1950 guidelines to decisions awarded in 1876-1920. This well-intentioned decision produced substantial alterations in the records of such hurlers as Cy Young, Christy Mathewson, Grover Alexander, and others. In the ensuing years, the notable research of Frank Williams (reported in “All the Record Books Are Wrong,” The National Pastime, 1982) revealed that there was indeed a pattern and a rationale for the way decisions were awarded in those days; the data in Total Baseball and today all other sources conforms with his meticulously substantiated findings.
More involved, and perhaps of most direct interest to fans and media, are the subjects of (a) statistical discrepancies between the record presented in Total Baseball or Baseball-Reference.com and the figures published in other reference works, or memorialized on Hall of Fame plaques, and (b) the implications of corrected data for the awarding of batting championships.
There are seven major sources for baseball statistical research. By far the most significant one is the official Major League Baseball records kept by the leagues, published in the baseball guides, and maintained on microfilm at the Baseball Hall of Fame in Cooperstown and in the league offices. These records cover the years since 1903 for the National League and 1905 for the American League. Any source data before these years were lost.
The second major source is the computer printouts prepared by ICI for The Baseball Encyclopedia in 1969. These cover the NL for 1891-1902, the AL for 1901-05, the Federal League for 1914-15, plus all the nineteenth-century leagues (1882-91 American Association, 1884 Union Association, and the 1890 Players League). These records, obtained from newspaper box scores, were turned over to the Hall of Fame and made public by agreement with historian Lee Allen, who permitted ICI to use his voluminous player demographic files.
The third source is John Tattersall’s newspaper box-score research for the NL of 1876-90. Since Tattersall had done such careful work, day-by-day computer printouts were never generated for this period. Any day-by-day records created by John have been lost, but what has survived is a batting and fielding summary and a pitching summary for each club each year listing many categories. This collection, now owned by SABR, also includes the home run log, which lists every home run ever hit from 1871 to date, the date, teams, game location, batter, pitcher, inning, men on base, and other notes.
The fourth source for baseball statistics is a box score collection accumulated by Michael Stagno, covering the National Association of 1871-1875, which was also purchased by SABR. Preliminary basic data was calculated by Stagno. Bob Richardson of Boston and Bob Tiemann of St. Louis worked to get complete totals in most categories from this data.
The fifth source is additional data compiled by the ICI researchers for the 1969 edition, covering categories that were not kept officially during the years since 1903 for the NL and 1905 for the AL. Examples of this data are runs batted in before 1920, extra base hits in the AL for 1905 and 1906, double plays by fielders before 1920 NL and 1923 AL, pitching data except wins and losses for the AL for 1905-07, earned runs for pitchers before 1912 NL and 1913 AL, complete games before 1913 NL and 1926 AL, games started before 1926 AL and 1938 NL, and saves before 1969. Any day-by-day records from this source have been lost, but the season totals have survived.
The sixth source is newspaper box score research to pick up additional categories not covered by the first five sources. Examples of this are hit by pitch for batters before 1917 NL and 1920 AL (by Alex Haas, Tattersall, Palmer, and many others), triple plays by fielders before 1928 AL and 1930 NL (mainly by Jim Smith), home runs allowed by pitchers before 1950 AL and 1952 NL (again by Tattersall). Frank Williams carefully researched AL pitching records for 1901-1919, when the league records were particularly sloppy. Total Baseball gathered day-by-day sheets for most of this data, with the rest residing in the Tattersall collection.
The seventh and newest is the astoundingly voluminous play-by-play restoration at Retrosheet.org.
The data reported by the ICI group in the first edition of The Baseball Encyclopedia upset many people in baseball, for their numbers were different from those traditionally accepted; in subsequent editions, many of the prominent players’ statistics were fudged back to their traditional values. Yet 1969 had hardly been the first time corrections had been made to official data. In 1929 Grover Cleveland Alexander won his 373rd game, breaking Christy Mathewson’s National League record, then thought to be 372. He never won another game. A number of years later, Joseph Reichler found a game in which, by the rules of that time, Matty should have gotten the win, this game taking place on May 21, 1902. The official record was changed and Matty pulled into a tie with Alex. The problem was that no one checked all of Mathewson’s other games to see how many times he received a win under the old rules that wouldn’t have been credited that way today. When ICI did its original research in 1968, it found Matty had only 367 wins by today’s rules, while Alexander had 374. (Further research, notably by Frank Williams, has restored Alexander and Mathewson to a tie at 373 wins.)
In another celebrated example of record-book flip-flops, when the American League was formed in 1901, Nap Lajoie was credited with a .422 average, with 220 hits in 543 at bats. After a number of years, someone noticed that if you take these at bats and hits, the average comes out only to .405, so his average was changed. (Turkin/Thompson gave Nap a mark of .409 in its first edition.) Later in the 1950s, John Tattersall had his doubts and decided to go through his newspaper collection of box scores. He found 229 hits for Lajoie, not 220; the error had been in the figure for hits, not in the figure for batting average. Thus his average was restored to .422, which happened to be the highest in American League history. Then ICI research in this area came up with a .426 mark (232 for 544, based on newspaper accounts), which was published in the first edition, then trimmed back to .422 in subsequent editions. Because the day-by-day source data for the American League of 1901 has been lost, one must make an informed choice between .422 and .426.
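For readers who want to check the arithmetic behind Lajoie's three competing 1901 averages, here is a minimal sketch in Python. The function name and labels are illustrative only. One assumption is labeled in the code: the essay gives Tattersall's corrected hit total (229) but does not restate the at bats, so the sketch takes them to be the original 543.

```python
def batting_average(hits: int, at_bats: int) -> str:
    """Format hits / at_bats the way record books print it, e.g. '.405'."""
    return f"{hits / at_bats:.3f}".lstrip("0")


# (hits, at bats) for each version of Lajoie's 1901 season named in the essay.
LAJOIE_1901 = {
    "original sheet, 220 for 543": (220, 543),
    # Assumption: Tattersall's recount changed only the hits, not the at bats.
    "Tattersall recount, 229 for 543": (229, 543),
    "ICI newspaper research, 232 for 544": (232, 544),
}

for label, (hits, at_bats) in LAJOIE_1901.items():
    print(f"{label}: {batting_average(hits, at_bats)}")
```

Under that assumption the three lines come out to .405, .422, and .426, matching the figures quoted above: the published .422 was consistent with 229 hits all along, and the nine missing hits alone account for the drop to .405.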
In 1910 there was a very close batting race between Cobb and Lajoie. At the end of the season, most people thought Nap had won, based on his getting seven hits in a doubleheader on the final day of the season. There was talk that the opposing Browns had let him get a number of bunts by playing back, so that the hated Cobb would lose. However, the AL office went over their figures and gave Cobb the title, .385 to .384. Nearly eighty years later, Palmer discovered a critical error: a game in which Cobb had two hits in three at bats had been entered twice. This was found because Sam Crawford had 14 games on his official sheet for the homestand yet the Tigers had only played 13. It turned out that Detroit played a doubleheader on September 24, but the second game inadvertently was inserted in the official sheets as being played on September 25. Later, this second game of the 24th, which appeared to have been missing, was put in the scoresheets again. The League Office discovered this mistake soon after its official announcement that Cobb had won the batting title, because the double entry was corrected for all the other Detroit players. However, Ban Johnson had made a big deal out of how carefully his people had checked the figures in order to settle the controversy, so the AL kept quiet about the gaffe, leaving Cobb the winner.
Appeals to Commissioner Kuhn in 1981 to set the matter straight officially were to no avail, because that would not only have changed the outcome of the 1910 batting race, it would also have altered Cobb’s lifetime hit total, then being pursued to massive media attention by Pete Rose. Kuhn’s statement read, in part, “The passage of seventy years, in our judgment . . . constitutes a certain statute of limitation as to recognizing any changes in the records with confidence of the accuracy of such changes. . . . Since a variety of questions have been raised through the years about the accuracy of the statistics of that period, the only way to make changes with confidence would be for a complete and thorough review of all team and individual statistics. That is not practical.” It may not have been practical, but we embarked upon such a course, and brought it to conclusion.
Asked at the time how we would have resolved the dispute over the 1910 batting race, we responded in this way: remove Cobb’s two redundant hits and alter his batting average accordingly, effectively dropping it beneath Lajoie’s, and correct his lifetime hit total as well; however, retain Cobb’s batting championship, for two reasons—one, because Lajoie’s flurry of bunt hits was highly suspect, and two, because Cobb was awarded the title in his day, and awards should be permanent, not contingent. Furthermore, a reasonable case can be made that Ban Johnson, if he had believed that Lajoie’s tainted hits would have been sufficient to produce a batting championship, would have nullified them; after all, he did banish from baseball the Browns’ manager who had instructed his rookie third baseman to play exceptionally deep.
It is this singular event in baseball history that supplied a model for how Total Baseball and Major League Baseball developed a policy for incorporating new research finds into the historical record without revoking long-held personal championships. Player records may be changed upon the evidence of historical error, but league awards and titles are forever.
Here is what happened in the now celebrated Honus Wagner case in the 1990 Baseball Encyclopedia, over which Major League Baseball and the Macmillan publishing firm became estranged. The Macmillan editor noticed that previous edition figures for Wagner did not agree with the data presented in the first edition in 1969. He assumed that the data had been corrupted over the years, and thus returned the 1897-1900 data to the original figures, costing Wagner 12 hits. However, the editor did not restore the 1901-02 data, which would have resulted in Wagner losing three more hits. The outcome was that Wagner had a total of 3415 hits in the 1969 edition, 3430 in the 1988 edition (the traditional figure) and 3418 in the 1990 edition (also in 1993). One of the problems with the Macmillan newspaper research was that it did not count protested games in the player data. Although the games were thrown out of the standings, the player stats did count in the league compilations at the time, which should be the criterion for inclusion. (Protested games were included in the official records through 1909, then omitted 1910-1919, and were made once again part of the official records in 1920. When our review of these protested games prior to 1909 is completed, the individual stats will be added to our figures.)
Wagner was involved in three of these protested games. There were about twenty-five of them altogether in the nineteenth century. However, the newspaper research did show up additional differences in player stats beyond those from the protested games.
When checking the plaques for the Hall of Fame players, we found about forty players with differences from the Total Baseball data. Most were nineteenth century players with small differences due to discrepancies between the old guide figures and the later newspaper research. Some had to do with rule changes from the 1969 Special Baseball Records Committee, such as not counting walks as hits in 1887. For the twentieth century, there were a number of differences due to official errors, mostly in the area of pitcher won-lost marks in the 1901-1919 American League period. There were only a few outright errors on the plaques (Anson, Clarkson, Hamilton, McCarthy, McGinnity, and Nichols). Exact differences can be found by comparing The Sporting News Daguerreotypes (which often agrees with the plaques) with Total Baseball.
[*] I focus here on the major league period and do not recount the rise of stats from 1845 to the twentieth century—a favorite topic but one which I have addressed of late. See, in this space, “Stats and History,” as taken from The Hidden Game:
Tomorrow, a guide to variances between the plaques and the official record of Major League Baseball.