# Matching, Optimal Transport and Statistical Tests

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

To explain the “optimal transport” problem, we usually start with Gaspard Monge’s “Mémoire sur la théorie des déblais et des remblais“, where the the problem of transporting a given distribution of matter (a pile of sand for instance) into another (an excavation for instance). This problem is usually formulated using distributions, and we seek the “optimal” transport from one distribution to the other one. The formulation, in the context of distributions has been formulated in the 40’s by Leonid Kantorovich, e.g. from the distribution on the left to the distribution on the right.

Consider now the context of finite sets of points. We want to transport mass from points $\{A_1,\cdots,A_4\}$ to points $\{B_1,\cdots,B_4\}$. It is a complicated combinatorial problem. For 4 points, there are only 24 possible transfer to consider, but it exceeds 20 billions with 15 points (on each side). For instance, the following one is usually seen as inefficient

while the following is usually seen as much better

Of course, it depends on the cost of the transport, which depends on the distance between the origin and the destination. That cost is usually either linear or quadratic.

There are many application of optimal transport in economics, see eg Alfred’s book Optimal Transport Methods in Economics. And there are also applications in statistics, that what I’ve seen while I was discussing with Pierre while I was in Boston, in June. For instance if we want to test whether some sample were drawn from the same distribution,

set.seed(13)
npoints
mu1
mu2
Sigma1
Sigma2
Sigma2[2,1]
Sigma1
Sigma2
library(mnormt)
X1
X2
plot(X1[,1], X1[,2], ,col="blue")
points(X2[,1], X2[,2], col = "red")

Here we use a parametric model to generate our sample (as always), and we might think of a parametric test (testing whether mean and variance parameters of the two distributions are equal).

or we might prefer a nonparametric test. The idea Pierre mentioned was based on optimal transport. Consider some quadratic loss

ground_p
p
w1
w2
C
library(transport)
a

then it is possible to match points in the two samples

nonzero
from_indices
to_indices
for (i in from_indices){
segments(X1[from_indices[i],1], X1[from_indices[i],2], X2[to_indices[i], 1], X2[to_indices[i],2])
}

Here we can observe two things. The total cost can be seen as rather large

> cost=function(a,X1,X2){
nonzero
naa=a[nonzero,]
d=function(i) (X1[naa\$from[i],1]-X2[naa\$to[i],1])^2+(X1[naa\$from[i],2]-X2[naa\$to[i],2])^2
sum(Vectorize(d)(1:npoints))
}
> cost(a,X1,X2)
[1] 9.372472

and the angle of the transport direction is alway in the same direction (more or less)

> angle=function(a,X1,X2){
nonzero
naa=a[nonzero,]
d=function(i) (X1[naa\$from[i],2]-X2[naa\$to[i],2])/(X1[naa\$from[i],1]-X2[naa\$to[i],1])
atan(Vectorize(d)(1:npoints))
}
> mean(angle(a,X1,X2))
[1] -0.3266797

> library(plotrix)
> ag=(angle(a,X1,X2)/pi)*180
> ag[ag
> dag=hist(ag,breaks=seq(0,361,by=1)-.5)
> polar.plot(dag\$counts,seq(0,360,by=1),main=”Test Polar Plot”,lwd=3,line.col=4)

(actually, the following plot has been obtain by generating a thousand of sample of size 25)

In order to have a decent test, we need to see what happens under the null assumption (when drawing samples from the same distribution), see

Here is the optimal matching

Here is the distribution of the total cost, when drawing a thousand samples,

VC=rep(NA,1000)
VA=rep(NA,1000*npoints)
for(s in 1:1000){
X1a
X1b
ground_p
p
w1
w2
C
ab
VC[s]=cout(ab,X1a,X1b)
VA[s*npoints-(0:(npoints-1))]=angle(ab,X1a,X1b)
}
plot(density(VC)

So our cost of 9 obtained initially was not that high. Observe that when drawing from the same distribution, there is now no pattern in the optimal transport

ag=(VA/pi)*180
ag[ag
dag=hist(ag,breaks=seq(0,361,by=1)-.5)
polar.plot(dag\$counts,seq(0,360,by=1),main="Test Polar Plot",lwd=3,line.col=4)

Nice isn’t it? I guess I will spend some time next year working on those transport algorithm, since we have great R packages, and hundreds of applications in economics…