World Cup - Host Advantage and Curse of the Champion

The 2018 World Cup is well underway, and it made me go digging for actual game data from past tournaments.

What can we learn from match history spanning the Cup's 88 years?


library(ggplot2)
library(plotly)

1.
The Data

We can start with a long-format data set of each country's result in every World Cup (including years in which countries did not qualify or participate).

This data set is reshaped from a table found in the Wikipedia entry for National team appearance in the FIFA World Cup, and also available in this PDF provided by FIFA.

The cleaned and reshaped data set is available in rds format here:

all_results.rds

If we look at the contents of the set, you can see that there's one observation per team-year:

all_results <- readRDS(all_results.rds)
head(all_results, 10)

1.1.
FIFA final rankings

Although teams don't all play each other in the World Cup, FIFA still ranks the performance of each time to produce an integer variable to distinguish performance. Luckily for data analyses like this one, FIFA has even ranked teams from past World Cups, so we have a complete data set.

Ties do occur, but for the most part, the more continuous nature of the ranking variable produces much more interesting visualizations than using the round of exit, since so many teams exit the group stage together, despite the potential for a wide difference in performance during the three group matches.

To confirm we're looking at the right results, let's look at the range of final rankings by year of each World Cup - we should see a data point per team:

plot_ly(
  data = all_results,
  x = ~year,
  y = ~rank, 
  text = ~team,
  type = 'scatter', 
  mode = 'markers',
	hoverinfo = 'x+y+text') %>%
layout(
  title = 'Years and ranks',
  xaxis = list(title = 'Year'),
  yaxis = list(title = 'Rank'))

This looks right - the first few World Cups had between 14 and 16 teams, then a standard 16 from 1952 to 1978, then 24 from 1982 to 1994, and 32 teams since the 1998 World Cup hosted by France.

You can hover over individual points to see the year, rank, and team (remember that lower rankings are better).

Gaps in the data occur because of ties in the rankings (for example, if there's a two-team tie for 9th, the next ranking would be 11th).


2.
Host results

The file below contains a simple data.frame with host nations and the year in which they hosted (or co-hosted) the World Cup.

host_years.rds
host_years <- readRDS(host_years.rds)
host_years
# Merge host_years onto all_results
host_years$team[host_years$team == 'West Germany'] <- 'Germany'
host_years$host <- 'Host'

all_results <- merge(
  all_results,
  host_years,
	by = c('year', 'team'),
	all.x = TRUE)

head(all_results)
# No merge means `host` should be FALSE

host_na_index <- is.na(all_results$host)
print(table(host_na_index))
all_results$host[host_na_index] <- 'Not host'

# Check results
stopifnot(table(all_results$host)['Host'] == nrow(host_years))

There's something interesting when you look at the host rankings - hosts do really well compared to non-hosts. If we create the same plot above, but color the hosts brighter, you can see that 13 / 21 host countries have finished in the top 5, and 18/12 in the top 9:

# Plot to highlight host rankings
plot_ly(
  data = all_results,
  x = ~year,
  y = ~rank, 
  color = ~host,
  text = ~team,
  colors = c('red', 'lightgray'),
  type = 'scatter', 
  mode = 'markers',
	hoverinfo = 'x+y+text') %>%
layout(title = 'Years and ranks',
  xaxis = list(title = 'Year'),
  yaxis = list(title = 'Rank'))

This makes some intuitive sense - nations most willing to host the World Cup seem likely to have good teams, and bids to host are likely to be accepted for countries who can not only put on a good performance on the pitch, but also have a fan base ready to attend in good numbers.

So to narrow in, let's restrict to countries that have ever hosted a World Cup, a set that includes all past champions.

champions <- unique(all_results$team[all_results$rank %in% 1])
all(champions %in% host_years$team)

To simplify things, let's look at host countries' performance:

  • in the World Cup before hosting
  • during the World Cup as hosts
  • in the World Cup directly after hosting

The first step is to filter the data to rows directly before, during, or after a hosted World Cup. To do this, we can merge the host_years data onto the all_results data, and keep only the matches, then repeat the merge twice, adding and subtracting four years to the year variable to merge with the results before and after the host year.

# Look at performance just before hosting, as host, then right after hosting
host_years$host <- paste(host_years$year, host_years$team)
host_years$time <- 'Hosted WC'

all_results$host <- NULL
      
host_results <- merge(
	host_years,
  all_results)
      
host_years_pre <- host_years_post <- host_years
      
host_years_pre$time <- 'WC before hosting'
host_years_post$time <- 'WC after hosting'
      
host_years_pre$year <- host_years_pre$year - 4
host_years_post$year <- host_years_post$year + 4
      
host_results_pre <- merge(
  host_years_pre,
  all_results)

host_results_post <- merge(
  host_years_post,
  all_results)

host_results_host_years <- rbind(
  host_results_pre,
  host_results,
  host_results_post)

host_results_host_years <- host_results_host_years[c('host', 'time', 'rank')]

host_results_host_years$time <- factor(
  host_results_host_years$time,
  levels = c('WC before hosting', 'Hosted WC', 'WC after hosting'))

sort_order <- order(host_results_host_years$host)

host_results_host_years[sort_order, ]

Now we can plot a three-point time series for each host, drawing a line from their rank in the World Cup prior to, during, and after their host year.

ggplotly(ggplot(
  data = host_results_host_years,
  mapping = aes(
    x = time,
    y = rank,
    group = host,
  	color = host)) +
  geom_line() +
	labs(
  	x = 'Time point',
    y = 'Rank') +
	theme_bw())

This is kind of messy, but I think there's a pattern of downward V shapes, indicating a better final rank in the year of ohosting.

So even after restrict to teams that have hosted a World Cup, and only to tournaments directly before, during, and directly after the host year, the effect of hosting seems pretty significant, at least visually.

Some examples:

  • Chile didn't qualify in 1954 or 1958, then promptly finished 3rd as hosts in 1962. Next time out in 1966, they fell to a tie for 13th place.
  • The United States finished 23rd in 1990, then rose to a respectable 14th at home in 1994, but followed it with a disastrous 32nd (last) in France in 1998.
  • South Korea had finished 20th or worse (and were trending down) in all four World Cups preceding 2002, when they co-hosted and placed 4th. In 2006, it was back to mediocrity and a 17th-place result.
  • Even among traditional powers who had never finished outside the top 10 (prior to 2018 :)), the trend holds: West Germany (whose results are now grouped with Germany) finished 3rd in 1970, won the whole thing as hosts in 1974, and then dropped back to 6th in 1978.

Separating all the results by team leads to a scattered visualization - if we summarize the ranks at each time point, we can look at a grouped boxplot to see the typical final ranks again before hosting, while hosting, and directly after hosting.

ggplotly(ggplot(
  data = host_results_host_years,
  mapping = aes(
    x = time,
    y = rank)) +
  geom_boxplot(
  	fill = c('lightgray', 'red', 'lightgray')) +
	labs(
  	x = 'Time point',
    y = 'Rank') +
	theme_bw())

Really interesting, right? The median finishing rank in the tournament before hosting is 8.5, and after hosting it's 8. But as hosts, the median rank is 3, and a full quarter of hosts have won the championship.

So why would hosting be worth a full five ranking spots, on average?


3.
Curse of the Champion

There's a darker side to this - while everyone loves the story of a host rising to win it all in front of a home crowd, why has it become so common for defending champions (whether they won as hosts or not) to perform much worse in the tournament following?

3.1.
Visualization

Only two countries have won back-to-back World Cups (Italy in 1934 and 1938, and Brazil in 1958 and 1962).

If we visualize it like the line chart by team above, except restricting the sample to champions and their performance as defending champions, we can see that there's a pretty wide range of results:

# Restrict to champions
champions <- all_results[all_results$rank %in% 1, ]

# Look at performance as champions, then after championship
champions$champion <- paste(champions$year, champions$team)
champions$time <- 'Won WC'

champions_post <- champions[c('team', 'year', 'champion')]
      
champions_post$time <- 'Next WC'
champions_post$year <- champions_post$year + 4
      
champions_post <- merge(
  champions_post,
  all_results)

champions_post <- rbind(
  champions,
  champions_post)

champions_post <- champions_post[c('champion', 'year', 'team', 'time', 'rank')]

champions_post$time <- factor(
  champions_post$time,
  levels = c('Won WC', 'Next WC'))

sort_order <- order(champions_post$champion)

champions_post[sort_order, ]

ggplotly(ggplot(
  data = champions_post,
  mapping = aes(
    x = time,
    y = rank,
    group = champion,
    color = champion)) +
  geom_line() +
	labs(
  	x = 'Time point',
    y = 'Rank') +
	theme_bw())

It gets more interesting if we actually show them in time order, though:

# Changing x-variable from win/post-win to actual year
ggplotly(ggplot(
  data = champions_post,
  mapping = aes(
    x = year,
    y = rank,
    group = champion,
    color = champion)) +
  geom_line() +
	labs(
  	x = 'Time point',
    y = 'Rank') +
	theme_bw())

The three worst defense efforts of all World Cup champions have occurred in the past four tournaments (France finished 28th in 2002, Italy 26th in 2010, and Spain 23rd in 2014).

As I worked on this article, Germany lost in the group stage of the 2018 tournament (final ranking to be determined). So now the four worst title defenses have occurred in the past five tournaments.

So what's the issue for all the returning champions? There could be any number of explanations, and unfortunately many would require more detailed data than I have.

But one thought is that modern-day marketing of individual players might lead to teams feeling pressure to keep aging stars on the roster longer than they should, and playing too many minutes over younger, up-and-coming talent.

3.2.
Roster turnover

With that in mind, I pulled the roster for each country in every World Cup year, getting player name, position, and date of birth:

players_cleaned.rds
players <- readRDS(players_cleaned.rds)

Then I compared the roster for champions in the year they won and four years later, calculating the number of players dropped/kept from the championship roster, as well as the mean/median roster age from the championshion year and four years later:

roster_changes <- mapply(
  FUN = function(year, team, players) {
    
    if (team == 'Germany') team_match <- c('Germany', 'Germany FR')
    else team_match <- team
    
    players_champions <- players[players$team %in% team_match & players$year == year, ]
    players_post <- players[players$team %in% team_match & players$year == year + 4, ]
      
    mean_age_diff <- mean(players_post$age, na.rm = TRUE) -
                     mean(players_champions$age, na.rm = TRUE)
    
    median_age_diff <- median(players_post$age, na.rm = TRUE) -
                       median(players_champions$age, na.rm = TRUE)
        
    data.frame(
      year = year,
      team = team,
      players_dropped = length(setdiff(players_champions$name, players_post$name)),
      players_kept = length(intersect(players_champions$name, players_post$name)),
      mean_age_diff = mean_age_diff,
      median_age_diff = median_age_diff,
      stringsAsFactors = FALSE)
  },
  year = champions$year,
  team = champions$team,
  MoreArgs = list(players = players),
  SIMPLIFY = FALSE)

roster_changes <- do.call(rbind, roster_changes)

roster_changes$year <- roster_changes$year + 4

roster_changes <- merge(
  champions_post[as.character(champions_post$time) == 'Next WC', ],
  roster_changes)
plot_ly(
  data = roster_changes,
  x = ~players_dropped,
  y = ~rank, 
  text = ~champion,
  type = 'scatter', 
  mode = 'markers',
  marker = list(
    size = 10),
	hoverinfo = 'x+y+text') %>%
layout(title = '',
  xaxis = list(
    range = c(0, max(roster_changes$players_dropped)),
    title = 'Players dropped after championship'),
  yaxis = list(
    range = c(0, 32),
    title = 'Defending WC rank'))

There's almost a pattern here that aligns with my hypothesis - two of the worst defending finishes were by teams that dropped the fewest players (8) from their championship rosters, and there's a general negative correlation for the rest...except that 1958 Brazil only kept 8 players and won the tournament.

The correlation coefficient here overall, including Brazil, is -0.49:

cor(roster_changes$players_dropped, 
    roster_changes$rank, 
    use = 'pairwise.complete.obs')

There are some weaknesses to this approach with rosters. I only know which players were on the roster for each country in each year. I don't know which games each player participated in, or whether they played the majority of the game or entered as a substitute late in the game, which would be extremely valuable.

4.
Just for fun

To round this out, what are some notable cases of World Cup performance, good or bad?

For example, what teams have participated in the most World Cups without a final appearance, or a semifinal appearance?

4.1.
Worst of the best

# Calculate total appearances and best rank by team, merge together
team_appearances <- aggregate(
  x = setNames(all_results['rank'], 'appearances'),
  by = all_results['team'],
  FUN = function(x) sum(!is.na(x)))
    
team_best_rank <- aggregate(
	x = setNames(all_results['rank'], 'best_rank'),
  by = all_results['team'],
  FUN = min,
  na.rm = TRUE)

team_appearances_best_rank <- merge(
	team_appearances,
  team_best_rank,
  by = 'team')
plot_ly(
  data = team_appearances_best_rank,
  x = ~appearances,
  y = ~best_rank, 
  text = ~team,
  type = 'scatter', 
  mode = 'markers',
  marker = list(
    size = 10),
	hoverinfo = 'x+y+text') %>%
layout(title = 'Appearances and best ranks',
  xaxis = list(title = 'Appearances'),
  yaxis = list(title = 'Best rank'))

There's a clear pattern in this plot: for almost all countries, more appearances leads to a better all-time best rank, and no team with fewer than 12 appearances has won a championship.

But there's also an obvious outlier - only five teams have appeared in at least 14 World Cups, and all of them have won a championship...except Mexico, whose best finish is 6th place in the quarterfinal round, without a top-10 finish since 1986.

In fact, among all teams participating in 9 or more World Cups, Mexico is tied for the worst all-time best rank, even with 15 total opportunities to improve that statistic.


4.2.
Rank consistency

It's not as easy to look at worst rankings, since the worst possible ranking changed over time (14 to 16 to 24 to 32), making them non-comparable over time periods.

But despite that issue, if we look at teams that have at least 4 appearances, and sort them by the minimum overall range of ranking, we should get the most consistent teams in World Cup history:

team_appearances_4 <- team_appearances[team_appearances$appearances >= 4, ]

all_results_4 <- merge(team_appearances_4,
      								 all_results)

rank_min <- aggregate(
  x = setNames(all_results_4['rank'], 'rank_min'),
  by = all_results_4['team'],
  FUN = min,
  na.rm = TRUE)

rank_max <- aggregate(
  x = setNames(all_results_4['rank'], 'rank_max'),
  by = all_results_4['team'],
  FUN = max,
  na.rm = TRUE)

rank_min_max <- merge(
  rank_min,
  rank_max)

rank_min_max$rank_range <- rank_min_max$rank_max -
													 rank_min_max$rank_min

sort_order <- rev(order(rank_min_max$rank_range))

rank_min_max <- rank_min_max[sort_order, ]

ggplotly(ggplot(
	data = rank_min_max,
  mapping = aes(
    x = rank_max,
  	y = factor(team, levels = rank_min_max$team),
    xmin = rank_min,
    xmax = rank_max,
    group = team)) +
  geom_errorbarh() +
  labs(
    x = 'Rank range',
    y = 'Team') +
  theme_bw())

So if it's any consolation to Mexico (it's not), they have been the third-most consistent finisher in World Cup history - unfortunately, they've been consistently between 6th place and 16th place, while watching other countries finish both above and below them across multiple tournaments.

4.3.
World Cup resume clustering

Lastly, what if we order teams' set of ranks over time, and then run a simple clustering algorithm to find the most similar teams?

head(all_results)

# Missing values will break stats::kmeans, but they're an important element of clustering in this case, since frequent qualification distinguishes good teams from great.
# Replace non-qualification with a rank of 40 for now - any higher and it might influence the algorithm to cluster solely on number of appearances.
all_results$rank[is.na(all_results$rank)] <- 40

sort_order <- do.call(order, all_results[c('team', 'rank')])

all_results <- all_results[sort_order, ]

all_results$order <- ave(
	x = all_results$rank,
  all_results$team,
  FUN = seq_along)

all_results_wide <- reshape(
	data = all_results,
  v.names = 'rank',
  timevar = 'order',
  idvar = 'team',
  drop = c('year', 'round'),
	direction = 'wide')

I'll set a seed here so the clustering will be reproducible - otherwise stats::kmeans will choose a random set of rows as the starting cluster centers.

rank_vars <- grep('^rank', colnames(all_results_wide), value = TRUE)

set.seed(2018)

clustering <- stats::kmeans(all_results_wide[rank_vars],
                     centers = 5)

all_results_wide$cluster <- clustering$cluster

# Cluster sizes
table(all_results_wide$cluster)

cluster_long <- reshape(
	data = all_results_wide,
  varying = list(rank = rank_vars),
  v.names = 'rank',
  timevar = 'order',
  idvar = 'team',
  direction = 'long')

We can't print enough data here to give a good representation of the results, so let's draw a plot of the ordered ranks, colored by cluster:

ggplot(
	data = cluster_long[cluster_long$rank < 40, ], # don't plot non-qualifiers
  mapping = aes(
  	x = order,
    y = rank,
    group = cluster,
    color = factor(cluster),
    shape = factor(cluster))) +
geom_point(
	position = position_dodge(
  	width = 0.25)) +
labs(
	x = 'Order',
	y = 'Rank') +
theme_bw()

But to really separate them into groups, we can facet by cluster:

ggplot(
	data = cluster_long[cluster_long$rank < 40, ], # don't plot non-qualifiers
  mapping = aes(
  	x = order,
    y = rank,
    color = factor(cluster))) +
geom_point() +
facet_grid(cluster ~ .) +
labs(
  x = 'Order',
  y = 'Rank') +
guides(color = FALSE) +
theme_bw()

Interesting separation, right? Clearly these are groups of teams with different World Cup histories. If we step through each of the five clusters, looking at team names and also the facets in the plot, maybe we can come up with a short description for each one:

all_results_wide$team[all_results_wide$cluster == 1]

Cluster 1: All-time greats, nearly always qualifying, lots of top-5 and top-10 finishes.

all_results_wide$team[all_results_wide$cluster == 2]

Cluster 2: at most 3 appearances, but mostly 1, and mostly no top-10 finishes.

all_results_wide$team[all_results_wide$cluster == 3]

Cluster 3: Some great finishes, but no championships, and no more than 10 appearances.

all_results_wide$team[all_results_wide$cluster == 4]

Cluster 4: Roughly the same range of ranks, but in half the appearances of cluster 3.

all_results_wide$team[all_results_wide$cluster == 5]

Cluster 5: So close, lots of appearances and even some championships (England, France, Uruguary, Spain) but without the staying power to make it to cluster 1.

© 2018 Nextjournal GmbH