Statting Lineup
This is my first post in a long while and the first since graduating from The University of Alabama with a Bachelor's degree in Commerce and Business Administration. In my final semester at Alabama, I took a course entitled "Introduction to Statistical Learning and Data Mining" where we learned about various predictive models for both regression and classification problems. For our final project, we had to find or develop our own dataset and use several different models to make predictions about that dataset. Naturally, I chose to create a predictive model that would determine whether a player was Hall of Fame caliber or not (a classification problem). It is this model that I will be sharing and using to predict the Hall of Fame future of the most notable players on this year's ballot. Upon receiving a perfect score on my final project and encouragement from my professor, I entered the report on my model in the Undergraduate Statistics Class Project (USCLAP) competition at the intermediate level. I will be sure to share how it fared in the competition once the results are in! (**UPDATE** - I ended up finishing 2nd in the competition, and was the only top finisher that did not work in a group. You can check out the winners here: https://www.causeweb.org/usproc/usclap/2021/fall/winners). The report that I submitted for the competition can be seen below. It is 13 pages total, but only about 3 pages of actual reading, with the rest of the pages being visuals.
While viewing this report will save you some reading time, it only serves as a general summary of the work I did to develop my model, so it may not satisfy those of you who are interested in the full picture. Below you can find a longer version of the report that goes into deeper detail about the model, especially details surrounding baseball and history. This version is 34 pages total, with about 10 pages of actual reading.
The general idea is that I compiled the data of all current Hall of Fame position players (non-pitchers) who retired after 1920, which marked the end of the dead-ball era. No players or statistics from the Negro Leagues were used. The data consisted of each player's standard batting and fielding statistics, the various awards they received and accomplishments they achieved, and how many seasons they led the league in a particular offensive category (such as hits or batting average). The exact same data was compiled for all of the non-Hall of Fame players in the dataset (players that were removed from the BBWAA ballot). These players were selected using career WAR, how they fared at their position all-time in terms of stats and awards, and my own judgement. *PLAYERS THAT ARE NOT IN THE HALL OF FAME FOR NON-STATISTICAL REASONS, SUCH AS FOR GAMBLING AND USING STEROIDS, WERE NOT INCLUDED IN THE DATASET*. Hence, no Pete Rose, no Mark McGwire, etc. The full dataset consisted of 124 Hall of Famers and 130 non-Hall of Famers, and can be seen below.
If you are interested in the specifics of how I developed my model, I encourage you to look at one of the report versions above. If you want less reading and only care about the model's eventual predictions, as well as a crash course on predictive modeling, feel free to read on. *If you truly only care about the model's predicted results, feel free to skip ahead to the point with 3 asterisks.*
After trimming down the dataset by eliminating some players and some data columns (aka predictors), the models were ready to be trained. Essentially, all the remaining players in the dataset are put into 2 groups, the training set and the testing set. Here, the training set consisted of 146 players and the testing set consisted of 46 players. The idea is that the model examines the players in the training set and looks at the relationships between the predictors and a player's Hall of Fame status. It trains itself to be able to look at a player's career accomplishments and determine if they should be a Hall of Famer. From there, it looks at the career accomplishments of the players in the testing set and makes a prediction for their Hall of Fame status. Since the Hall of Fame status of every player in the dataset is known, we can compare the predicted Hall of Fame status of each player in the testing set with their actual Hall of Fame status. We measure how often the model is right and wrong in this regard to determine its accuracy.
My initial model version correctly predicted 43 of the 46 players in the testing set. It correctly predicted 28 of the 29 non-Hall of Famers, with "The Cobra" Dave Parker being the lone player getting the hypothetical promotion. It correctly predicted 15 of the 17 Hall of Famers, with Alan Trammell and Lou Brock being the two snubbed players. While predicting Dave Parker as a Hall of Famer and Alan Trammell as not a Hall of Famer is understandable, failing to predict Lou Brock as a Hall of Famer is a more egregious error.
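The accuracy figure follows directly from the test-set counts above. As a rough sketch (written in Python for illustration, not taken from my R code; the function and variable names are made up for this example), the counts can be laid out as a confusion matrix, treating "Hall of Famer" as the positive class:

```python
# Test-set counts reported in the post, treating "Hall of Famer" as the
# positive class. Names here are illustrative, not from the original R file.

def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # share of all players predicted correctly
        "sensitivity": tp / (tp + fn),     # share of actual Hall of Famers caught
        "specificity": tn / (tn + fp),     # share of non-Hall of Famers caught
    }

metrics = classification_metrics(
    tp=15,  # Hall of Famers predicted as Hall of Famers
    fn=2,   # Hall of Famers missed (Trammell, Brock)
    tn=28,  # non-Hall of Famers predicted as non-Hall of Famers
    fp=1,   # non-Hall of Famer predicted as a Hall of Famer (Parker)
)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
# accuracy: 93.48%
# sensitivity: 88.24%
# specificity: 96.55%
```

The split between sensitivity and specificity shows that the initial model was a bit better at recognizing non-Hall of Famers than at recognizing Hall of Famers.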
Nonetheless, the model is able to correctly assess 93.48% of the players it sees, likely much better than most BBWAA voters fare. Unfortunately, the recent Golden Days Era Committee election results hurt the accuracy of the model somewhat. In the initial run Gil Hodges, Minnie Minoso, and Tony Oliva were recorded as non-Hall of Famers. By changing these players to Hall of Famers in the dataset, the model develops a slightly different idea of what makes a Hall of Fame player. I could have gone through and refined and optimized all of the model's parameters, but that would have taken more time than I frankly have right now in the midst of final wedding preparations. Keeping the model the same and changing the dataset by those 3 players lowered the model's predictive accuracy to 91.30% (42 of 46), which is still pretty good. The model predicted Hodges as not a Hall of Famer (as it did the first time around), but with the recent election results this prediction is now inaccurate. Furthermore, the model also failed to predict Orlando Cepeda as a Hall of Famer. While the Era Committee results did make the model worse, it remains a strong predictor of whether a player will be in the Hall of Fame.
We can thus use the model on the players on the 2022 BBWAA election ballot to see which players it thinks are Hall of Fame worthy. The official 2022 BBWAA ballot has 30 players on it, but again the model does not deal with players that are pitchers (Roger Clemens, Curt Schilling, etc.) or that have ties to steroids (notably Barry Bonds, Alex Rodriguez, Sammy Sosa, Manny Ramirez, and Gary Sheffield). Furthermore, some of the players on the ballot are quite obviously not Hall of Fame worthy (sorry Justin Morneau), so it didn't make sense to waste time running them through the model. In the end 12 position players on the ballot were run through the model, and you can view all of their dataset values in the spreadsheet below.
***SKIP HERE IF YOU ONLY CARE ABOUT THE PREDICTED RESULTS*** The model was slightly harsher on the players than I anticipated. In my opinion 6 of these players deserve to be Hall of Famers (perhaps more on that in a later post), but the model only predicted 2 as Hall of Famers. If you read the reports above you know that the actual final model really consists of 4 different models whose results it averages to determine the final prediction. David Ortiz was the one universal constant, predicted as a Hall of Famer by the final model and all 4 of the sub-models. Some of you may be questioning his inclusion since he does have a rumored tie to steroids, but it is my opinion that the evidence of steroid use by Ortiz is much thinner than that of his counterparts that I chose to exclude. The other predicted Hall of Famer by the final model was Todd Helton, who was predicted by 3 of the sub-models. Surprisingly, the closest non-Hall of Famer was Jimmy Rollins, who was not predicted as a Hall of Famer by the final model but was predicted as a Hall of Famer by 2 of the sub-models. Both Omar Vizquel and Bobby Abreu were not predicted as Hall of Famers by the final model but were each predicted as a Hall of Famer by 1 of the sub-models, albeit by different ones. In summary, the predictive model - which correctly determines a player's Hall of Fame fate 91.3% of the time - concluded that David Ortiz and Todd Helton are worthy of inclusion into the Hall of Fame. If you are interested in the weeds behind developing the final model, take a look below at the R file I wrote to tune and run the sub-models, as well as to develop predictions using the sub-models.
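To make the "4 sub-models averaged into one" idea concrete, here is a small Python sketch (this is not the R file, just an illustration). It only uses the sub-model vote counts listed above. Two things in it are assumptions: it averages yes/no votes, whereas the real final model may well average predicted probabilities, and it requires the average to be strictly above 0.5, a cutoff picked here because it reproduces the reported outcome for Jimmy Rollins (2 of 4 votes, not selected).

```python
# How many of the 4 sub-models predicted each player as a Hall of Famer,
# as reported in the post. Players not mentioned there are left out.
NUM_SUB_MODELS = 4
votes_for = {
    "David Ortiz": 4,
    "Todd Helton": 3,
    "Jimmy Rollins": 2,
    "Omar Vizquel": 1,
    "Bobby Abreu": 1,
}

def ensemble_prediction(votes, n_models=NUM_SUB_MODELS, threshold=0.5):
    """Average the sub-model votes and compare against a cutoff.

    ASSUMPTION: hard 0/1 votes and a strict '>' cutoff of 0.5. The actual
    model may average probabilities instead; this is only a sketch.
    """
    return votes / n_models > threshold

predicted = [player for player, votes in votes_for.items()
             if ensemble_prediction(votes)]
print(predicted)  # ['David Ortiz', 'Todd Helton']
```

Under these assumptions a 2-2 split counts as a "no", which is why Rollins ends up as the closest miss.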
You can also take a look at the slides of the presentation I gave to my class below.
Thank you so much for taking the time to read this post, and I hope you found it interesting. Feel free to contact me or leave a comment with any questions you may have about the model, whether they be statistics or baseball related. I know it's been a while since I last posted, but I have plenty of exciting ideas and material planned to share with you all over the coming months. Let me know your thoughts on the mentioned players' Hall of Fame worthiness in the polls below!
1 Comment
Doug Smiddy
1/6/2022 12:48:10 pm
I am a homer, hence Scott Rolen. Jeff Kent is a shoo-in, best hitting 2nd baseman of his generation, and Papi is Papi.