UCLAIS Challenge 1: Premier League Match Result Prediction
This challenge is a multi-class classification problem to predict the results of past Premier League matches dating from August 2003 until November 2022.
Football is one of the most popular sports globally, with the Premier League in England seeing the highest viewership of any sports league worldwide. The wealth of data available on player and team performance also makes football ripe for a machine learning challenge on DOXA.
Your aim is to build a classifier to predict historical match results (specifically whether each match was a home team win, an away team win or a draw) for games taking place from August 2003 up until November 2022, when this competition was launched.
This is an excellent opportunity to experiment with a range of classical and deep machine learning techniques, such as support vector machines, random forests and neural networks using scikit-learn. If you want to find out more, we have prepared a Jupyter notebook to help you get started:
https://github.com/UCLAIS/ml-tutorials-season-4/blob/main/doxa-challenges/challenge-1/getting-started.ipynb
We have aggregated English Premier League match results dating from 2003 up to the present day, which were sourced from football-data.co.uk. This dataset contains match information, such as the number of shots, shots on target, corners, yellow cards, red cards and fouls for the home and away teams.
RangeIndex: 6630 entries, 0 to 6629
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   date                  6630 non-null   datetime64[ns]
 1   home_team             6630 non-null   object
 2   away_team             6630 non-null   object
 3   full_time_home_goals  6630 non-null   int64
 4   full_time_away_goals  6630 non-null   int64
 5   full_time_result      6630 non-null   object
 6   half_time_home_goals  6630 non-null   int64
 7   half_time_away_goals  6630 non-null   int64
 8   half_time_result      6630 non-null   object
 9   referee               6630 non-null   object
 10  home_shots            6630 non-null   int64
 11  away_shots            6630 non-null   int64
 12  home_shots_on_target  6630 non-null   int64
 13  away_shots_on_target  6630 non-null   int64
 14  home_fouls            6630 non-null   int64
 15  away_fouls            6630 non-null   int64
 16  home_corners          6630 non-null   int64
 17  away_corners          6630 non-null   int64
 18  home_yellow_cards     6630 non-null   int64
 19  away_yellow_cards     6630 non-null   int64
 20  home_red_cards        6630 non-null   int64
 21  away_red_cards        6630 non-null   int64
dtypes: datetime64[ns](1), int64(16), object(5)
memory usage: 1.1+ MB
Your challenge is to tackle the multi-class classification problem of predicting whether each match in our test set was a home team win (H), an away team win (A) or a draw (D) based on the data available. Submissions to this challenge will be evaluated and ranked according to their micro-averaged F1 score.
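To get a feel for the evaluation metric, here is a minimal sketch of computing a micro-averaged F1 score with scikit-learn. The labels below are made up for illustration and are not taken from the competition data. Note that when every match receives exactly one predicted label, as here, the micro-averaged F1 score works out to the same value as plain accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only (not from the competition data).
y_true = ["H", "A", "D", "H", "H", "A", "D", "H"]
y_pred = ["H", "A", "H", "H", "D", "A", "D", "A"]

# Micro-averaging pools true/false positives across all three classes
# before computing precision, recall and F1.
micro_f1 = f1_score(y_true, y_pred, average="micro")
accuracy = accuracy_score(y_true, y_pred)

print(f"micro-averaged F1: {micro_f1:.3f}")
print(f"accuracy:          {accuracy:.3f}")
```

In practice this means you can track accuracy on a held-out validation set as a stand-in for the leaderboard metric.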
This is an educational competition, so you are also welcome to seek out additional data to help you build your model, such as shot- or player-level data. We encourage you to be creative! Just be careful to avoid data leakage, such as by training on full-time goal data when predicting match results. If you do take an interesting approach, we would like to hear about it on the DOXA Community Discord server.
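One possible way to reduce the risk of leakage is to build features only from matches that took place before the one being predicted. The sketch below is an illustration under stated assumptions rather than a recommended pipeline: the match rows are invented, the feature name `home_shots_prev_avg` is hypothetical, and the three-match window is an arbitrary choice. Only the column names `date`, `home_team`, `away_team`, `home_shots` and `full_time_result` come from the dataset.

```python
import pandas as pd

# Invented rows for illustration; only the column names match the dataset.
matches = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-08-06", "2022-08-13", "2022-08-20", "2022-08-27"]
    ),
    "home_team": ["Arsenal", "Arsenal", "Arsenal", "Arsenal"],
    "away_team": ["Leeds", "Everton", "Fulham", "Spurs"],
    "home_shots": [10, 14, 18, 12],
    "full_time_result": ["H", "D", "H", "A"],
})

# Rolling features only make sense on chronologically ordered data.
matches = matches.sort_values("date").reset_index(drop=True)

# shift(1) excludes the current match from its own average, so the
# feature depends only on the team's earlier home matches.
# The window of 3 is an arbitrary choice for this sketch.
matches["home_shots_prev_avg"] = (
    matches.groupby("home_team")["home_shots"]
    .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)

print(matches[["date", "home_team", "home_shots", "home_shots_prev_avg"]])
```

The first match for each team has no history, so its feature is missing and needs to be imputed or dropped before training.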
More information is available in the getting started notebook.
It is a good idea to get started by visualising and analysing the dataset. This might give you an indication as to what data preprocessing techniques and model types would be suitable for this problem. You can then start experimenting with some relatively simple models and build more complex ones from there.
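As a rough starting point, the sketch below checks the class balance of the target and fits a majority-class baseline, which gives a floor that any real model should beat. The rows are invented for illustration; in practice you would use the training data provided in the getting started notebook. The choice of `DummyClassifier` is an assumption of this sketch, not something prescribed by the challenge.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Invented rows for illustration; only the column names match the dataset.
train = pd.DataFrame({
    "home_shots": [15, 12, 20, 8, 9, 11, 17, 10],
    "away_shots": [7, 10, 5, 14, 16, 11, 6, 10],
    "full_time_result": ["H", "H", "H", "A", "A", "D", "H", "D"],
})

# Step 1: look at how often each outcome occurs.
class_balance = train["full_time_result"].value_counts(normalize=True)
print(class_balance)

# Step 2: a baseline that always predicts the most frequent outcome.
# DummyClassifier ignores the feature values entirely.
X = train[["home_shots", "away_shots"]]
y = train["full_time_result"]
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

baseline_accuracy = baseline.score(X, y)
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Once you have this floor, you can swap the baseline for a simple model such as logistic regression and compare the two on a held-out validation set.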
Here are a few questions you might want to consider: