Data sets for "Prediction with expert advice for the Brier game" by Vladimir Vovk and Fedor Zhdanov

There are two versions of the data sets: for the conference (ICML 2008) and journal (JMLR 2009) versions of the paper.

Data sets for the conference version of the paper

The paper itself can be downloaded by clicking here. The reference is: Vladimir Vovk and Fedor Zhdanov. Prediction with expert advice for the Brier game. In: Proceedings of the Twenty-Fifth International Conference on Machine Learning (edited by Andrew McCallum and Sam Roweis), pages 1104-1111. New York: ACM Press, 2008.

The football data set

Click here to open or download an extended data set (see below).

This data set contains predictions for and the results of 6473 football matches in various English football league competitions, namely: the Premier League, the Football League Championship, Football League One, Football League Two, and the Football Conference. The data, provided by Football-Data, cover three seasons, 2005/2006, 2006/2007, and 2007/2008. The matches are sorted first by date, then by league, and then by the name of the home team. The predictions are made by eight bookmakers, namely Bet365, Bet&Win, Gamebookers, Interwetten, Ladbrokes, Sportingbet, Stan James, and VC Bet.

The format of the file football1.txt is as follows. There are 6473 rows, each row corresponding to a match, and 28 columns. The first four columns are: the 1st column is the date of the match represented as a serial number (see below); the 2nd column is the indicator of the event "home win" (i.e., it is 1 if the home team won, and it is 0 otherwise); the 3rd column is the indicator of the event "draw"; the 4th column is the indicator of the event "away win". The other 24 columns are the predictions of the 8 experts, three columns per expert: the 5th column shows the probability of the event "home win" according to the first expert; the 6th column shows the probability of the event "draw" according to the first expert; the 7th column shows the probability of the event "away win" according to the first expert; etc.
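The column layout above can be sketched as a small Python parser. This is only an illustration, not code used for the paper: the names FootballRow and parse_football_row are hypothetical, and the field separator is an assumption (whitespace and/or commas), since the description does not state how the columns of football1.txt are delimited.

```python
from typing import List, NamedTuple, Tuple

N_EXPERTS = 8    # bookmakers acting as experts
N_OUTCOMES = 3   # home win, draw, away win


class FootballRow(NamedTuple):
    serial_date: int                     # column 1: Excel serial number
    outcome: Tuple[int, ...]             # columns 2-4: indicators (home, draw, away)
    predictions: List[Tuple[float, ...]]  # columns 5-28: one triple per expert


def parse_football_row(line: str) -> FootballRow:
    """Split one row of the 28-column layout into date, outcome, and predictions."""
    # Assumption: fields are separated by whitespace and/or commas.
    fields = line.replace(",", " ").split()
    expected = 1 + N_OUTCOMES + N_EXPERTS * N_OUTCOMES  # 28 columns
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")
    serial_date = int(float(fields[0]))
    outcome = tuple(int(float(x)) for x in fields[1:1 + N_OUTCOMES])
    values = [float(x) for x in fields[1 + N_OUTCOMES:]]
    # Group the remaining 24 values into (home, draw, away) triples, one per expert.
    predictions = [
        tuple(values[i:i + N_OUTCOMES]) for i in range(0, len(values), N_OUTCOMES)
    ]
    return FootballRow(serial_date, outcome, predictions)
```

The same function would apply to any file with this 28-column layout; whether the last 24 values are probabilities or betting odds depends on the file being read.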

The date of each match is shown as a serial number, according to the Microsoft Excel default format for Windows (the default for the Macintosh version is different: see Microsoft Excel help). January 1, 1900 is serial number 1, and January 1, 2008 is serial number 39448, because it is 39,447 days after January 1, 1900 by Excel's count (which includes a nonexistent February 29, 1900). Accordingly, the first entry 38570 in the first line of football1.txt means August 6, 2005. Use the Microsoft Excel "Format Cells" command to see one of the traditional representations of a serial number.
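For readers without Excel, the conversion can be sketched in Python. The function name excel_serial_to_date is hypothetical. The sketch relies on a known quirk of Excel's 1900 date system for Windows: it counts a nonexistent February 29, 1900, so for every date from March 1, 1900 onward the serial number equals the number of days since December 30, 1899.

```python
from datetime import date, timedelta

# Effective origin of Excel's 1900 date system for serial numbers >= 61
# (March 1, 1900 and later), after absorbing the fictitious leap day.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel (Windows, 1900 system) serial number to a calendar date."""
    if serial < 61:
        # Earlier serials are shifted by the fictitious February 29, 1900.
        raise ValueError("serial numbers below 61 are not handled by this sketch")
    return EXCEL_EPOCH + timedelta(days=serial)


print(excel_serial_to_date(38570))  # 2005-08-06
print(excel_serial_to_date(39448))  # 2008-01-01
```

All dates in these data sets fall in 2004 or later, so the restriction to serial numbers of at least 61 does not matter here.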

The 2007/2008 season ended in May shortly after the ICML 2008 submission deadline, and so the data set used in our paper in the ICML 2008 Proceedings covered only part of that season, with 6416 matches in total. See our arXiv technical report for the analysis of the full data set.

The tennis data set

Click here to open or download.

This data set contains predictions for and the results of 10,087 matches in a large number of tennis tournaments in 2004, 2005, 2006, and 2007. The tournaments include, e.g., Australian Open, French Open, Wimbledon, and US Open; the data are provided by Tennis-Data. The matches are sorted by date, then by tournament, and then by the winner's name. The data contain information about the winner of each match and the probabilities assigned by four bookmakers to his/her win and to the opponent's win. The bookmakers are: Bet365, Centrebet, Expekt, and Pinnacle Sports.

The format of the file tennis1.txt is as follows. Each of the 10,087 rows corresponds to a match. There are 11 columns: the 1st column is the date of the match given as a serial number (see above); the 2nd column is the indicator of the event "the first player won" (it is always 1 since the first player is defined to be the winner of the match); the 3rd column is the indicator of the event "the second player won" (it is always 0 since the second player is defined to be the loser); the 4th column shows the probability of the event "the first player wins" according to the first expert; the 5th column shows the probability of the event "the second player wins" according to the first expert; etc.
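The 11-column layout, together with the two constant indicator columns, suggests a simple consistency check when reading the file. The sketch below is illustrative only: parse_tennis_row is a hypothetical name, and the field separator (whitespace and/or commas) is an assumption not stated in the description.

```python
from typing import List, Tuple

N_BOOKMAKERS = 4


def parse_tennis_row(line: str) -> Tuple[int, List[Tuple[float, float]]]:
    """Return (serial date, [(winner value, loser value) per bookmaker])."""
    # Assumption: fields are separated by whitespace and/or commas.
    fields = line.replace(",", " ").split()
    expected = 3 + 2 * N_BOOKMAKERS  # 11 columns
    if len(fields) != expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")
    serial_date = int(float(fields[0]))
    # Columns 2 and 3 are constant by construction: the first player is the winner.
    if float(fields[1]) != 1 or float(fields[2]) != 0:
        raise ValueError("indicator columns must be 1 and 0")
    values = [float(x) for x in fields[3:]]
    # Pair up the remaining 8 values: (first player, second player) per bookmaker.
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return serial_date, pairs
```

Because the winner always occupies the first position, any learning procedure applied to these rows should not be allowed to exploit the column order.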

Data sets for the journal version of the paper

The paper can be downloaded by clicking here. The reference is: Vladimir Vovk and Fedor Zhdanov. Prediction with expert advice for the Brier game. Journal of Machine Learning Research, 10:2413-2440, 2009.

There have been two changes to the data sets. First, the football data set was extended by adding another season, 2008/2009. Second, the formula for converting betting odds into probabilities used in the ICML 2008 paper (formula (3) in the conference version and formula (26) in the journal version of the paper) was replaced by a better formula (formula (3) in the journal version) suggested by Victor Khutsishvili. The data sets now contain the quoted betting odds rather than the probabilities derived from them.
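Since the journal data sets contain raw odds, users have to convert them to probabilities themselves. This description does not reproduce the two formulas, so the sketch below is an assumption rather than the authors' code: it takes the earlier formula to be normalization of the inverse odds, and the newer one to be p_i = a_i^(-gamma) with gamma > 0 chosen so that the probabilities sum to 1. It also assumes decimal odds greater than 1. The function names are hypothetical; please check the formulas in the paper before relying on this.

```python
from typing import List, Sequence


def normalized_inverse_odds(odds: Sequence[float]) -> List[float]:
    """Assumed earlier formula: p_i = (1/a_i) / sum_j (1/a_j)."""
    inverse = [1.0 / a for a in odds]
    total = sum(inverse)
    return [x / total for x in inverse]


def power_probabilities(odds: Sequence[float], tol: float = 1e-12) -> List[float]:
    """Assumed newer formula: p_i = a_i ** (-gamma), with sum_i p_i = 1."""
    if len(odds) < 2 or any(a <= 1.0 for a in odds):
        raise ValueError("need at least two decimal odds, all greater than 1")

    def total(gamma: float) -> float:
        return sum(a ** -gamma for a in odds)

    # total(0) = number of outcomes > 1, and total is strictly decreasing in
    # gamma, so a unique root exists; bracket it and then bisect.
    lo, hi = 0.0, 1.0
    while total(hi) > 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if total(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    gamma = (lo + hi) / 2.0
    return [a ** -gamma for a in odds]
```

For odds with no bookmaker margin (inverse odds already summing to 1), both functions return the inverse odds themselves; they differ only in how the margin is removed.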

The football data set

Click here to open or download.

This data set contains betting odds quoted by eight major bookmakers for, and the actual results of, 8999 football matches in various English football league competitions, namely: the Premier League, the Football League Championship, Football League One, Football League Two, and the Football Conference. The data, provided by Football-Data, cover four seasons, 2005/2006, 2006/2007, 2007/2008, and 2008/2009. The matches are sorted first by date, then by league, and then by the name of the home team. The eight bookmakers are Bet365, Bet&Win, Gamebookers, Interwetten, Ladbrokes, Sportingbet, Stan James, and VC Bet.

The format of the file football2.txt is as follows. There are 8999 rows, each row corresponding to a match, and 28 columns. The first four columns are: the 1st column is the date of the match represented as a serial number (for details, see the description of the data set football1.txt above); the 2nd column is the indicator of the event "home win" (i.e., it is 1 if the home team won, and it is 0 otherwise); the 3rd column is the indicator of the event "draw"; the 4th column is the indicator of the event "away win". The other 24 columns are the betting odds quoted by the 8 bookmakers, three columns per bookmaker: the 5th column shows the betting odds for the event "home win" quoted by the first bookmaker; the 6th column shows the betting odds for the event "draw" quoted by the first bookmaker; the 7th column shows the betting odds for the event "away win" quoted by the first bookmaker; etc.

The tennis data set

Click here to open or download.

This data set contains the betting odds quoted by four major bookmakers for, and the actual results of, 10,087 matches in a large number of tennis tournaments in 2004, 2005, 2006, and 2007. The tournaments include, e.g., Australian Open, French Open, Wimbledon, and US Open; the data are provided by Tennis-Data. The matches are sorted by date, then by tournament, and then by the winner's name. The data contain information about the winner of each match and the betting odds of the four bookmakers for his/her win and for the opponent's win. The bookmakers are: Bet365, Centrebet, Expekt, and Pinnacle Sports.

The format of the file tennis2.txt is as follows. Each of the 10,087 rows corresponds to a match. There are 11 columns: the 1st column is the date of the match given as a serial number (see above); the 2nd column is the indicator of the event "the first player won" (it is always 1 since the first player is defined to be the winner of the match); the 3rd column is the indicator of the event "the second player won" (it is always 0 since the second player is defined to be the loser); the 4th column shows the betting odds quoted on the event "the first player wins" by the first bookmaker; the 5th column shows the betting odds quoted on the event "the second player wins" by the first bookmaker; etc.