UCLAIS Challenge 1: Premier League Match Result Prediction

This challenge is a multi-class classification problem to predict the results of past Premier League matches dating from August 2003 until November 2022.

Overview

Football is one of the most popular sports globally, with the Premier League in England seeing the highest viewership of any sports league worldwide. The wealth of data available on player and team performance also makes it ripe for a machine learning challenge on DOXA.

Your aim is to build a classifier to predict historical match results – specifically whether each match was a home team win, an away team win or a draw – for games taking place from August 2003 up until November 2022 (when this competition was launched).

Getting Started

This is an excellent opportunity to experiment with a range of classical and deep machine learning techniques, such as support vector machines, random forests and neural networks using scikit-learn. If you want to find out more, we have prepared a Jupyter notebook to help you get started:

https://github.com/UCLAIS/ml-tutorials-season-4/blob/main/doxa-challenges/challenge-1/getting-started.ipynb

The Challenge

We have aggregated English Premier League match results from 2003 up to the present day, sourced from football-data.co.uk. The dataset contains match information, such as the number of shots, shots on target, corners, yellow cards, red cards and fouls for the home and away teams.

    RangeIndex: 6630 entries, 0 to 6629
    Data columns (total 22 columns):
     #   Column                Non-Null Count  Dtype
    ---  ------                --------------  -----
     0   date                  6630 non-null   datetime64[ns]
     1   home_team             6630 non-null   object
     2   away_team             6630 non-null   object
     3   full_time_home_goals  6630 non-null   int64
     4   full_time_away_goals  6630 non-null   int64
     5   full_time_result      6630 non-null   object
     6   half_time_home_goals  6630 non-null   int64
     7   half_time_away_goals  6630 non-null   int64
     8   half_time_result      6630 non-null   object
     9   referee               6630 non-null   object
     10  home_shots            6630 non-null   int64
     11  away_shots            6630 non-null   int64
     12  home_shots_on_target  6630 non-null   int64
     13  away_shots_on_target  6630 non-null   int64
     14  home_fouls            6630 non-null   int64
     15  away_fouls            6630 non-null   int64
     16  home_corners          6630 non-null   int64
     17  away_corners          6630 non-null   int64
     18  home_yellow_cards     6630 non-null   int64
     19  away_yellow_cards     6630 non-null   int64
     20  home_red_cards        6630 non-null   int64
     21  away_red_cards        6630 non-null   int64
    dtypes: datetime64[ns](1), int64(16), object(5)
    memory usage: 1.1+ MB

Your challenge is to tackle the multi-class classification problem of predicting whether each match in our test set was a home team win (H), an away team win (A) or a draw (D) based on the data available. Submissions to this challenge will be evaluated and ranked according to their micro-averaged F₁ score.
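To make the evaluation metric concrete, here is a minimal sketch of computing a micro-averaged F₁ score with scikit-learn. The labels are made up for illustration and are not taken from the challenge data. For a single-label multi-class problem like this one, the micro-averaged F₁ score equals plain accuracy, which the snippet also checks.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for six matches (illustration only):
# H = home team win, A = away team win, D = draw.
y_true = ["H", "A", "D", "H", "H", "A"]
y_pred = ["H", "A", "H", "H", "D", "A"]

# Micro-averaging pools true positives, false positives and false
# negatives across all three classes before computing F1.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# With exactly one label per match, this is the share of correct predictions.
accuracy = accuracy_score(y_true, y_pred)

print(f"micro-averaged F1: {micro_f1:.3f}")
print(f"accuracy:          {accuracy:.3f}")
```

In practice, this means you can use accuracy on a held-out validation split as a stand-in for the leaderboard metric.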

This is an educational competition, so you are also welcome to seek out additional data to help you build your model, such as shot- or player-level data. We encourage you to be creative! Just be careful to avoid data leakage, such as by training on full-time goal data when predicting match results. If you do take an interesting approach, we would like to hear about it on the DOXA Community Discord server.
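As a rough illustration of the leakage warning, the sketch below separates the target from the features and drops the full-time goal columns before training. The three rows are invented, and the list of leaky columns is an assumption that covers only the full-time goal columns named in the warning. Whether other columns should also be excluded depends on what the test set provides.

```python
import pandas as pd

# Three made-up matches using a subset of the dataset's column names.
matches = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Everton"],
    "away_team": ["Fulham", "Leeds", "Wolves"],
    "full_time_home_goals": [2, 0, 1],
    "full_time_away_goals": [1, 3, 1],
    "full_time_result": ["H", "A", "D"],
    "home_shots": [15, 9, 11],
    "away_shots": [7, 14, 10],
})

TARGET = "full_time_result"
# Assumption: only the full-time goal columns are treated as leaky here.
LEAKY_COLUMNS = ["full_time_home_goals", "full_time_away_goals"]

y = matches[TARGET]
X = matches.drop(columns=[TARGET, *LEAKY_COLUMNS])

# Why these columns leak: the result is fully determined by the goal difference.
goal_diff = matches["full_time_home_goals"] - matches["full_time_away_goals"]
implied = goal_diff.apply(lambda d: "H" if d > 0 else ("A" if d < 0 else "D"))

print(list(X.columns))
print((implied == y).all())
```

A model trained with the goal columns left in would look perfect on the training data while learning nothing that transfers to matches where the goals are hidden.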

More information is available in the getting started notebook.

General Advice

It is a good idea to get started by visualising and analysing the dataset. This might give you an indication as to what data preprocessing techniques and model types would be suitable for this problem. You can then start experimenting with some relatively simple models and build more complex ones from there.
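One simple first analysis is to look at how the three outcomes are distributed and what score a trivial "always predict the most common result" model would achieve. The sketch below uses ten invented results, so the proportions shown are not the real class frequencies in the dataset.

```python
import pandas as pd

# Ten made-up results (illustration only, not the real distribution).
results = pd.Series(
    ["H", "H", "A", "D", "H", "A", "H", "D", "A", "H"],
    name="full_time_result",
)

# Share of matches ending in each outcome.
class_share = results.value_counts(normalize=True)

# A trivial baseline: always predict the most frequent outcome.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(class_share)
print(f"Always predicting {majority_class!r} scores {baseline_accuracy:.2f}")
```

Any model you build should beat this baseline comfortably. If it does not, that is a hint to revisit the features or preprocessing.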

Here are a few questions you might want to consider:

Are there any missing data values? If so, how should you deal with them? (e.g. by dropping rows containing them or through some sort of multivariate imputation)
What features can you engineer from the data?
How do you want to encode the categorical data? (e.g. would a one-hot encoding be sensible?)
Do you need to scale or standardise the numerical features?
Should you try a dimensionality reduction technique? (e.g. PCA)
If you are using more traditional machine learning models (e.g. logistic regression, support vector machines, etc.), what strategy do you want to use to build a multi-class model for this problem? (e.g. one-versus-rest (OvR))
If you are using a neural network, what loss function might be the most suitable for this problem? What optimiser do you want to use? What might be a good architecture to start with?
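Several of the questions above can be explored together in a single scikit-learn pipeline. The following is a minimal sketch rather than a recommended solution: it one-hot encodes the team names, imputes and standardises two numerical columns, and fits a logistic regression wrapped in a one-versus-rest strategy. The rows, the choice of feature columns and the hyperparameters are all assumptions made for illustration. Dimensionality reduction and neural networks are left out to keep it short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows using a subset of the dataset's column names.
train = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Everton", "Arsenal", "Chelsea", "Everton"],
    "away_team": ["Chelsea", "Everton", "Arsenal", "Everton", "Arsenal", "Chelsea"],
    "home_shots": [15, 9, 11, 18, float("nan"), 8],  # one missing value on purpose
    "away_shots": [7, 14, 10, 6, 12, 13],
    "full_time_result": ["H", "A", "D", "H", "D", "A"],
})

categorical = ["home_team", "away_team"]
numerical = ["home_shots", "away_shots"]

preprocess = ColumnTransformer([
    # One-hot encode teams; teams unseen during training become all-zero rows.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Fill missing numbers with the column median, then standardise.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numerical),
])

model = Pipeline([
    ("preprocess", preprocess),
    # One-versus-rest: one binary logistic regression per outcome (H, A, D).
    ("classify", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

model.fit(train[categorical + numerical], train["full_time_result"])

# Hypothetical unseen matches, including a team not in the training rows.
new_matches = pd.DataFrame({
    "home_team": ["Arsenal", "Fulham"],
    "away_team": ["Everton", "Chelsea"],
    "home_shots": [16, 10],
    "away_shots": [8, 11],
})
predictions = model.predict(new_matches)
print(predictions)
```

Keeping the preprocessing inside the pipeline means the imputation and scaling statistics are learned from the training data only, which matters when you cross-validate. Swapping the final step for a different estimator, such as a random forest or a support vector machine, only requires changing the `classify` entry.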