MLB HIT PREDICTOR

⚾ Data science project to predict if a Major League Baseball player will get a hit on any given day (2020) ⚾

View the Project on GitHub eglouberman/MLB-hit-predictor

Building a machine learning classifier to predict which MLB players will achieve a base-hit on any given day.

About

Our project explores one of the most iconic outcomes in sports: the major league base hit. We were inspired to approach this problem by a betting game called “Beat the Streak”. In the app, you can pick up to two players each day who you think will get a hit, and if you get 57 in a row correct, you can win 5.6 million dollars! While getting rich from this project would surely be a nice (and unlikely) benefit, we decided to give this problem a go using the most powerful tools at our disposal: statistical reasoning and data science. Previous attempts at this problem have been quite successful, but not perfect. Alceo and Henriques (2017) utilized batter performance, team performance, weather, and ballpark characteristics to achieve an 85 percent correct-pick ratio using machine learning classifiers. We were inspired by their research to take a crack at the problem ourselves. The prospect of engineering further features to account for “streakiness” and of obtaining even more data motivated us to develop our own model(s) and possibly achieve even greater accuracy.

Our goal was to build an end-to-end data science project, from building a centralized MySQL database on AWS RDS, to data cleaning and feature engineering in Python (and a bit of R), to data modeling in Python. We spent about five weeks on initial research, five weeks on scraping and data exploration, and another ten weeks on modeling and building this website.

Data Collection and Preparation

We collected baseball statistics from 2014 through 2019, totaling over 190,000 samples. The data came from many different sources and was gathered using several methods, including APIs and web scraping. Click below on each type of data to learn why it matters and how it was obtained. The database was hosted on a MySQL server using Amazon Web Services RDS. We chose RDS for its ease of setup, free storage space, and convenient accessibility among group members.

Exploratory Analysis & Modeling

Results & Conclusion

Our best model produced exciting results. A top-100 precision score of 82% was obtained using Logistic Regression (similar scores were obtained using an MLP) in a generalized model (not player-specific). This is almost 20 percent more accurate than luck, which was really cool to see. In player-specific models, the average score was about 10 percent lower than in generalized models (around 60 percent), but we did achieve great accuracy in several instances with particular players. Check out our results in more detail below!
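For readers unfamiliar with the metric, here is a minimal sketch of how a top-k precision score could be computed. It assumes the score means "rank every player-day by predicted hit probability, keep the k highest, and report the fraction that actually got a hit". The exact definition used in the project is not spelled out here, so treat the function name `top_k_precision` and the toy data as illustrative assumptions rather than the project's actual code.

```python
from typing import Sequence


def top_k_precision(scores: Sequence[float], got_hit: Sequence[int], k: int = 100) -> float:
    """Fraction of the k highest-scored player-days that actually recorded a hit.

    Hypothetical helper: `scores` are model probabilities and `got_hit`
    holds 1 (hit) or 0 (no hit) for the same player-days, in the same order.
    """
    if len(scores) != len(got_hit):
        raise ValueError("scores and got_hit must have the same length")
    if len(scores) == 0 or k <= 0:
        raise ValueError("need at least one sample and a positive k")

    k = min(k, len(scores))  # cannot pick more player-days than exist
    # Indices of all samples, sorted from most to least confident prediction.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    return sum(got_hit[i] for i in top) / k


# Toy example with made-up numbers (not project data).
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_hits = [1, 1, 0, 1, 0]
print(top_k_precision(example_scores, example_hits, k=3))
```

A metric of this shape fits the “Beat the Streak” setting, where only the few most confident picks each day matter, not accuracy across every player.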

Who We Are

We are a group of curious and passionate college students from UCLA. We organized this project through the Data Science Union at UCLA. Learn more about us below!

Elon Glouberman, Project Lead
Andrew Liu, Project Associate
Nathaniel Barrett, Project Associate