As a late followup to all the press, summarizing results of 2014 Winter Olympics, I decided to apply data envelopment analysis to find most efficient teams on 2014 Winter Olympics. I’ll use Benchmarking package to estimate efficiencies and ggplot2 + ggthemes to visualize it.

Data envelopment analysis is a methodology to distinguish between efficient and inefficient units, which could be anything from industries to firms to branches to hedge funds to algotrading robots to platoons to cockroaches to galaxies etc etc. Initially born within dull and stern macro economic realms, DEA gradually evolved into KPI-like methodology applied to approximately anything, usually studied and researched under operation research umbrella. Anything which could be described as a black box, transforming inputs into outputs, could one day appear a subject to DEA in some peer-reviewed paper. Currently this is one of the most cited and published topic in business and economics field. Everyone likes that it requires minimal assumptions, is able to cope with different units of measure, and is non-parametric. No free lunches here: as a price for universality you have to collect ALL units before DEA, otherwise you risk to significantly overestimate efficiency of the ones you’ve managed to collect.

Putting math magic aside (it’s described perfectly lapidary in Pessanha, Marinho, Laurencel, Amaral presentation Implementing DEA models in the R program and their respective paper), let’s proceed to pictures. But first lets calculate DEA scores. DEA score ranges from 1 to 0, where 1 is for efficient decision making units (DMUs), and all other integers – for all various levels of inefficiency. Paraphrasing Tolstoy (see Prokudin-Gorskiy photo below), efficient DMUs are all alike; every inefficient is inefficient in its own way.

In R, you could use Benchmarking package to do all DEA things (or write your own function, like Pessanha et al.) Syntax is simple: you supply dea() function with inputs dataframe, outputs dataframe, and several options (we’ll discuss one of them – return to scale – in greater details later). So here comes my data preparation routine:

library(Benchmarking) library(mosaic)

olympics <- read.csv("dea/olympics.csv") olympics <- olympics[1:26,] olympics[,2:ncol(olympics)] <- olympics[,2:ncol(olympics)] + 0.000000000000000001 rownames(olympics) <- as.character(olympics$Country) olympics <- olympics[,2:ncol(olympics)] olympics <- as.data.frame(olympics)

# let's create inputs/outputs: inputs <- subset(olympics, select = Athletes) outputs <- subset(olympics, select = TotMedals) DMUs <-rownames(olympics) dea.dtf <- data.frame(inputs, outputs, DMUs)

To replicate it you have to download my dataset first. Now everything is ready for the first launch. For a starter, we’ll have very simple DEA model. All DMUs – teams – are black boxes, transforming athletes to medals.

dea <- dea(inputs, outputs, RTS='vrs', ORIENTATION = 1)

Although I called dea() with RTS (return to scale) parameter, which equals ‘vrs’, i.e. variable return to scale, I now have in *dea* variable enough data to plot both VRS and CRS DEA frontier (and IRS, and DRS, whatever it may be).

So what is data envelopment analysis in details? How it determines who is efficient, who is not? Suppose we’ve plotted all our DMUs on a 2-D plane. Xs, inputs, are athletes, Ys, or outputs, are medals. What we’ll have is as follows:

colnames(dea.dtf) ## create ggplot with no layers: dea.plot <- ggplot() + ## set X and Y axis limits 20% more than the max Y and X available coord_cartesian(xlim=c(0, round(max(dea.dtf[,1])*1.2, digits=-1)), ylim=c(0, round(max(dea.dtf[,2])*1.2, digits=-1))-1 ## no layers yet - plot produces errors: plot(dea.plot) ## add layers: dea.plot <- dea.plot + geom_point(data=dea.dtf, mapping = aes(x=Athletes, y=TotMedals), shape=1) ## now - simple plot: plot(dea.plot)

That’s it. We have our points. Lets add some labels so that it does not plot over each other. I disliked position_jitter(), as it allows no control over points where I do not need jittering at all. It makes my plot messy and untidy. Definitely not The Economist style. So what I did was as follows:

## add labels with controlled offset: ## calculate offsets for [nearly] overlapping points dea.dtf.t <- dea.dtf dea.dtf.t[,1:2] <- apply(dea.dtf[,1:2],2, round, -2) +1 dea.dtf$offset <- ddply(dea.dtf.t, .(Athletes, TotMedals), transform, offset = (1:length(TotMedals)-1)/26)$offset dea.dtf$offset <- ddply(dea.dtf.t, .(Athletes, TotMedals), transform, offset = (1:length(Athletes)-1)/26)$offset + dea.dtf$offset rm(dea.dtf.t) # to keep environment tidy dea.plot <- dea.plot + geom_text(data = dea.dtf, mapping = aes(x=Athletes, y=TotMedals+offset, label=DMUs), hjust=-0.3, vjust=1, , label = DMUs, alpha=0.6, # adjust transparency size=3.8) # adjust text size) plot(dea.plot)

Finally, let’s add some ‘Economist’ *ambre*:

# add "The Economist" style: dea.plot <- dea.plot + theme_economist() + scale_colour_economist() + ggtitle("Simple DEA model of Olympics'14 team efficiency") plot(dea.plot)

## add references grid.text("Source: Wikipedia, IOC, author's calculations", x = unit(.02, "npc"), y = unit(.01, "npc"), just = c("left", "bottom"), gp = gpar(fontface = "italic", fontsize = 12, col = "black"))

We already could see some potential leaders: Croatia, Belarus, Netherlands, Norway, and – yes – Russia – are among them. Those are the teams that earned most medals among teams of their size. The rest formed oblong peloton headed by US and pointing north-east. Outsiders – Finland, Kazakhstan, and Switzerland. Now who is leader? This is how CRS – constant return to scale – DEA model answers the question:

#add frontiers Y <- sort(outputs[,1][dea$eff == 1]) X <- sort(inputs[,1][dea$eff == 1]) # add frontier under CRS: dea.plot <- dea.plot + geom_abline(intercept = 0, slope = max(Y/X), linetype=2, col='black') plot(dea.plot)

Under CRS we simply draw a line, connecting zero and leftmost-topmost point in the plot. The closer DMU to the line – the more efficient it is. You can check it yourself by running dea() with RTS=’crs’ and then studying $eff property of the object returned by dea(). So, say, if Belarus team wants to become efficient it should either fire half of the team (and move to the left on the plot), or earn another 10 medals (and move higher). Or it could do both; say, firing only quarter of the team would make others work more intensively, and they earn 5 medals, which, combined with more compact team size, would put Belarus on efficiency frontier (now imagine the challenge of Norway team under CRS model).

But why Netherlands is the only efficient DMU? What makes it so perfect? Can we guess it is perfect without DEA? Sure, we can. With the help of the wisdom of one of the best Russian writers, to which we already had resorted today. As Leo Tolstoy (pictured above) said on the other occasion, “A man is like a fraction whose numerator is what he is and whose denominator is what he thinks of himself. The larger the denominator, the smaller the fraction.” Seems like Lev Nikolaevich somehow knew about DEA exactly 100 years before it was in fact introduced by Charnes, Coopers, et al. in 1978 (“Anna Karenina”, which lent me a quote in the beginning of the post, was published in 1878). If we would simply divide outputs to inputs, and take DMU which would maximize the ratio, say

rownames(inputs)[outputs/inputs == max(outputs/inputs)]

we’ll have the same result. Netherlands team is simply returning the maximum medals per athlet. Is it fair? Not really. First, objection for smaller teams: there is increasing return to scale, as big teams attract more sponsor’s and government’s attention, there is economies of scale (say, more medics with diverse specialties), and, finally, more in-team support, not mentioning statistical issues (two athletes in one sport = double chances). Second, objection for big teams: return to scale is decreasing, as big teams are less manageable; have more chances to mix the best athletes in nation with athletes of a lower quality; have to compete in sports like hockey, when 12+ players earn only one medal (and small teams could be in sports like short track, where one athlete could bring 12+ medals). So we’ll use both considerations in variable-return-to-scale framework, mixing increasing (unfavorable to big teams) and decreasing (favorable to big teams) approach:

#add frontier under VRS: dea.plot <- dea.plot + geom_segment( aes(x=X[1], y=0, xend=X[1], yend=Y[1]), linetype=2, col='black') dea.plot <- dea.plot + geom_segment( aes(x=X[1], y=Y[1], xend=X[2], yend=Y[2]), linetype=2, col='black') dea.plot <- dea.plot + geom_segment( aes(x=X[2], y=Y[2], xend=X[3], yend=Y[3]), linetype=2, col='black') dea.plot <- dea.plot + geom_segment( aes(x=X[3], y=Y[3], xend=Inf, yend=Y[3]), linetype=2, col='black') plot(dea.plot)

Finally, Croatia and Russia teams join the club of best teams. Belarus team become slightly better (due ti IRS – a part of VRS), and big Norway team significantly improved its position (due to DRS – the other part of VRS).

As you see, my dataset contains much more columns than just Athletes and Total Medals, but I’m not going to make War and Peace out of this blog post; I’ll save it for future.