It seems like it might be for the reasons I have seen people give over the last week or so: no post-season exposure, a somewhat short career (he did not reach 10,000 PAs), a lack of milestones like 3,000 hits or 500 HRs, and a lack of MVP awards.
Last year and earlier this year I posted some regression-generated equations that tried to explain the percentage of the Hall of Fame vote players got in their first year of eligibility (and also their highest percentage). The model I came up with was based on some trial and error. That seemed unavoidable, since it is hard to have priors on what exactly the voters are thinking. The model looked at all players that became eligible for the first time from 1980-2009.
The model uses the following data to explain vote percentage:
Reaching 10,000 PAs
World Series performance
Gold Gloves and All-Star games got capped at certain levels which were then squared. The idea was that those things have an exponential effect which tapers off. There were also interaction terms for World Series performance, Gold Gloves and All-Star games. The idea there was that getting lots of Gold Gloves and playing in lots of All-Star games has more than an additive effect. (After I discuss what the model predicted for Santo, I cover technical details like the regression results and variable descriptions.)
Santo's first-year percentage was 3.9%. Normally, he would no longer be eligible in the writers' voting. But he and some other players were reinstated in 1985. He got 13.4%. The model predicted that he would get 17.65%. The standard error was .08. So even if we give him one standard error (8 percentage points) more than he actually got, that only bumps him up to 21.4%. Still a pretty low total for a first year (Billy Williams got 23.4% in his first year in 1982 and steadily increased until he got 85.7% in 1987).
Santo's highest percentage was 43%. The model predicted it would be 30%. So he actually did better than that. The standard error was .117. So he was predicted to be about 4 standard errors below the 75% needed for induction. And his actual highest percentage was still about 3 standard errors below 75%. Billy Williams's highest predicted percentage was 29.6% while it was actually 85.7%. That differential of 56.1 percentage points is the highest positive differential. Why Williams is in and Santo isn't is an interesting question.
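The "standard errors below 75%" arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation described in the text; the function name `std_errors_below` is my own label, not something from the original analysis.

```python
# Distance from the 75% induction threshold, measured in standard errors.
THRESHOLD = 0.75   # share of the vote needed for induction
STD_ERROR = 0.117  # standard error of the highest-percentage model


def std_errors_below(pct, threshold=THRESHOLD, se=STD_ERROR):
    """Return how many standard errors `pct` sits below the threshold."""
    return (threshold - pct) / se


predicted_gap = std_errors_below(0.30)  # model's prediction for Santo
actual_gap = std_errors_below(0.43)     # Santo's actual highest percentage

print(f"predicted: {predicted_gap:.2f} standard errors below 75%")
print(f"actual:    {actual_gap:.2f} standard errors below 75%")
```

The two gaps come out a little under 4 and a little under 3, which is where the rounded figures in the paragraph come from.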
Here is the equation where the player's first-year vote percentage was the dependent variable:
PCT = -.010 + .00086(WSAS/1000) + .048(GGAS/1000) + .070(MVP) + .404(3000 HIT) + .280(500 HR) + .002(ASSQ10) - .00089(GGSQ7) + .071(500SB) - .006(WSIMPSQ50/1000) + .100(10000PA)
The adjusted r-squared was .898. The standard error was .08.
Here is the equation where the player's highest vote percentage was the dependent variable:
PCT = -.014 + .00037(WSAS/1000) + .025(GGAS/1000) + .067(MVP) + .257(3000 HIT) + .201(500 HR) + .0048(ASSQ10) - .0013(GGSQ7) + .071(500SB) - .00167(WSIMPSQ50/1000) + .137(10000PA)
The adjusted r-squared was .861. The standard error was .117.
MVP is the number of MVP awards won. 3000 HIT is a dummy variable (1 if a player reached 3,000 hits, 0 otherwise). 500 HR is also a dummy variable, as are 500SB and 10000PA (if you made it to 10,000 career plate appearances, you get a 1, 0 otherwise). I used all the voting data from 1990-2009.
What is ASSQ10? It is the square of the number of All-Star games played in. But AS games played is maxed out at 10. The assumption here is that being an All-Star has a positive exponential effect, but only up to a point, after which additional games no longer help (I have a graph below to help explain this). GGSQ7 is the same thing for Gold Gloves, maxed out at 7.
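A minimal sketch of the cap-then-square idea, in Python. The helper name `capped_square` is hypothetical; the caps of 10 and 7 are the ones given in the text.

```python
def capped_square(count, cap):
    """Cap `count` at `cap`, then square the result."""
    return min(count, cap) ** 2


# ASSQ10: All-Star games capped at 10, then squared.
for games in (2, 5, 10, 15):
    print(games, capped_square(games, cap=10))

# GGSQ7: Gold Gloves capped at 7, then squared.
for gloves in (3, 7, 9):
    print(gloves, capped_square(gloves, cap=7))
```

Once a player passes the cap, the variable stops growing, so a 15th All-Star game counts the same as the 10th.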
WSIMPSQ50 involves World Series play. First, WSIMP is World Series PAs times OPS. The idea here is that the more you play in the World Series, the more votes you would get, but multiplying by OPS also captures how well you played (or at least how well you hit). This gets maxed out at 50 and is squared, for the same reason as All-Star games (yes, Reggie Jackson is first here and way ahead of everyone else at 141, with Dave Justice and Lonnie Smith tied for 2nd at 101).
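Here is a short Python sketch of how WSIMPSQ50 could be built from the description above. The function names are my own, and the 20 PA / .800 OPS player is made up purely for illustration.

```python
def wsimp(ws_pa, ws_ops):
    """World Series impact: plate appearances times OPS."""
    return ws_pa * ws_ops


def wsimpsq50(wsimp_value, cap=50):
    """Cap WSIMP at 50, then square it."""
    return min(wsimp_value, cap) ** 2


# A hypothetical player with 20 World Series PAs and an .800 OPS.
modest = wsimpsq50(wsimp(20, 0.800))

# Any WSIMP above the cap gives the same value, so 141 and 101
# are treated identically by this variable.
top = wsimpsq50(141)
second = wsimpsq50(101)

print(modest, top, second)
```

Because of the cap, the big gap between first place and second place in raw WSIMP does not show up in the regression variable.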
The last two variables are interaction variables. GGAS is the Gold Glove variable multiplied by the All-Star variable, and WSAS is the World Series variable times the All-Star game variable. It looks strange that the coefficient values on GGSQ7 and WSIMPSQ50 are negative. But you might notice that they are positive on the interaction variables. I think this is like when a regression uses both X and X-squared because the phenomenon is non-linear (an inverted parabola, for example). The coefficient on X ends up being positive while the X-squared coefficient is negative. The reason I put in these interaction variables was to see if players who were strong in both got an extra boost, as if there was some synergy going on. It seems like they did get an extra boost.
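To see how a negative GGSQ7 coefficient and a positive GGAS coefficient can work together, here is a rough Python sketch using the coefficients from the highest-percentage equation. It assumes that GGAS is the product of the two capped-and-squared variables (GGSQ7 times ASSQ10), which is my reading of the description rather than something spelled out above.

```python
B_GGAS = 0.025     # coefficient on GGAS/1000 (highest-percentage equation)
B_GGSQ7 = -0.0013  # coefficient on GGSQ7 (highest-percentage equation)


def gold_glove_effect(gold_gloves, as_games):
    """Combined contribution of the two Gold Glove terms to PCT.

    Assumes GGAS = GGSQ7 * ASSQ10 (an assumption, see the lead-in).
    """
    ggsq7 = min(gold_gloves, 7) ** 2
    assq10 = min(as_games, 10) ** 2
    ggas = ggsq7 * assq10
    return B_GGAS * ggas / 1000 + B_GGSQ7 * ggsq7


# Five Gold Gloves combined with different numbers of All-Star games.
for as_games in (0, 5, 7, 8, 10):
    print(as_games, round(gold_glove_effect(5, as_games), 4))
```

Under that assumption, the Gold Glove terms net out slightly negative for players with few All-Star games and turn positive at about 8 All-Star games, which fits the idea that the boost comes from being strong in both.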
Since the dependent variable can only go from 0 to 100 while these variables can take very large values, their coefficients would be very low. So I divided three variables (WSAS, GGAS and WSIMPSQ50) by 1000 (my stat package was showing coefficient values of .00000 before I did this).
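Putting the pieces together, the highest-percentage equation could be evaluated with something like the Python sketch below. The function and argument names are hypothetical, and it assumes the interaction variables are products of the capped-and-squared variables (WSAS = WSIMPSQ50 times ASSQ10, GGAS = GGSQ7 times ASSQ10). The example player is invented, not a real career line.

```python
def capped_square(count, cap):
    """Cap `count` at `cap`, then square the result."""
    return min(count, cap) ** 2


def predict_highest_pct(as_games, gold_gloves, wsimp, mvps,
                        hit_3000, hr_500, sb_500, pa_10000):
    """Evaluate the highest-percentage equation for one player.

    The four milestone arguments are 0/1 dummies. The construction of
    the interaction terms below is an assumption, not a documented rule.
    """
    assq10 = capped_square(as_games, 10)
    ggsq7 = capped_square(gold_gloves, 7)
    wsimpsq50 = capped_square(wsimp, 50)
    wsas = wsimpsq50 * assq10  # assumed form of the interaction
    ggas = ggsq7 * assq10      # assumed form of the interaction
    return (-0.014
            + 0.00037 * wsas / 1000
            + 0.025 * ggas / 1000
            + 0.067 * mvps
            + 0.257 * hit_3000
            + 0.201 * hr_500
            + 0.0048 * assq10
            - 0.0013 * ggsq7
            + 0.071 * sb_500
            - 0.00167 * wsimpsq50 / 1000
            + 0.137 * pa_10000)


# Hypothetical player: 8 All-Star games, 5 Gold Gloves, no World Series
# play, no MVPs and none of the milestones.
example = predict_highest_pct(as_games=8, gold_gloves=5, wsimp=0.0, mvps=0,
                              hit_3000=0, hr_500=0, sb_500=0, pa_10000=0)
print(f"{example:.3f}")
```

The result is a proportion, so multiply by 100 to compare it with the vote percentages quoted earlier.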