$$ \newcommand{\arginf}{\mathrm{arginf}} \newcommand{\argmin}{\mathrm{argmin}} \newcommand{\argmax}{\mathrm{argmax}} \newcommand{\asconv}[1]{\stackrel{#1-a.s.}{\rightarrow}} \newcommand{\Aset}{\mathsf{A}} \newcommand{\b}[1]{{\mathbf{#1}}} \newcommand{\ball}[1]{\mathsf{B}(#1)} \newcommand{\bbQ}{{\mathbb Q}} \newcommand{\bproof}{\textbf{Proof :}\quad} \newcommand{\bmuf}[2]{b_{#1,#2}} \newcommand{\card}{\mathrm{card}} \newcommand{\chunk}[3]{{#1}_{#2:#3}} \newcommand{\condtrans}[3]{p_{#1}(#2|#3)} \newcommand{\convprob}[1]{\stackrel{#1-\text{prob}}{\rightarrow}} \newcommand{\Cov}{\mathbb{C}\mathrm{ov}} \newcommand{\cro}[1]{\langle #1 \rangle} \newcommand{\CPE}[2]{\PE\lr{#1| #2}} \renewcommand{\det}{\mathrm{det}} \newcommand{\dimlabel}{\mathsf{m}} \newcommand{\dimU}{\mathsf{q}} \newcommand{\dimX}{\mathsf{d}} \newcommand{\dimY}{\mathsf{p}} \newcommand{\dlim}{\Rightarrow} \newcommand{\e}[1]{{\left\lfloor #1 \right\rfloor}} \newcommand{\eproof}{\quad \Box} \newcommand{\eremark}{</WRAP>} \newcommand{\eqdef}{:=} \newcommand{\eqlaw}{\stackrel{\mathcal{L}}{=}} \newcommand{\eqsp}{\;} \newcommand{\Eset}{ {\mathsf E}} \newcommand{\esssup}{\mathrm{essup}} \newcommand{\fr}[1]{{\left\langle #1 \right\rangle}} \newcommand{\falph}{f} \renewcommand{\geq}{\geqslant} \newcommand{\hchi}{\hat \chi} \newcommand{\Hset}{\mathsf{H}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\img}{\text{Im}} \newcommand{\indi}[1]{\mathbf{1}_{#1}} \newcommand{\indiacc}[1]{\mathbf{1}_{\{#1\}}} \newcommand{\indin}[1]{\mathbf{1}\{#1\}} \newcommand{\itemm}{\quad \quad \blacktriangleright \;} \newcommand{\jointtrans}[3]{p_{#1}(#2,#3)} \newcommand{\ker}{\text{Ker}} \newcommand{\klbck}[2]{\mathrm{K}\lr{#1||#2}} \newcommand{\law}{\mathcal{L}} \newcommand{\labelinit}{\pi} \newcommand{\labelkernel}{Q} \renewcommand{\leq}{\leqslant} \newcommand{\lone}{\mathsf{L}_1} \newcommand{\lp}[1]{\mathsf{L}_{{#1}}} \newcommand{\lrav}[1]{\left|#1 \right|} \newcommand{\lr}[1]{\left(#1 \right)} \newcommand{\lrb}[1]{\left[#1 \right]} 
\newcommand{\lrc}[1]{\left\{#1 \right\}} \newcommand{\lrcb}[1]{\left\{#1 \right\}} \newcommand{\ltwo}[1]{\PE^{1/2}\lrb{\lrcb{#1}^2}} \newcommand{\Ltwo}{\mathrm{L}^2} \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\mcbb}{\mathcal B} \newcommand{\mcD}{\mathcal{D}} \newcommand{\mcf}{\mathcal{F}} \newcommand{\mcl}{\mathcal{L}} \newcommand{\meas}[1]{\mathrm{M}_{#1}} \newcommand{\norm}[1]{\left\|#1\right\|} \newcommand{\normmat}[1]{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert #1 \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}} \newcommand{\nset}{\mathbb N} \newcommand{\N}{\mathcal{N}} \newcommand{\one}{\mathsf{1}} \newcommand{\PE}{\mathbb E} \newcommand{\pminfty}{_{-\infty}^\infty} \newcommand{\PP}{\mathbb P} \newcommand{\projorth}[1]{\mathsf{P}^\perp_{#1}} \newcommand{\Psif}{\Psi_f} \newcommand{\pscal}[2]{\langle #1,#2\rangle} \newcommand{\pscal}[2]{\langle #1,#2\rangle} \newcommand{\psconv}{\stackrel{\PP-a.s.}{\rightarrow}} \newcommand{\qset}{\mathbb Q} \newcommand{\revcondtrans}[3]{q_{#1}(#2|#3)} \newcommand{\rmd}{\mathrm d} \newcommand{\rme}{\mathrm e} \newcommand{\rmi}{\mathrm i} \newcommand{\Rset}{\mathbb{R}} \newcommand{\rset}{\mathbb{R}} \newcommand{\rti}{\sigma} \newcommand{\section}[1]{==== #1 ====} \newcommand{\seq}[2]{\lrc{#1\eqsp: \eqsp #2}} \newcommand{\set}[2]{\lrc{#1\eqsp: \eqsp #2}} \newcommand{\sg}{\mathrm{sgn}} \newcommand{\supnorm}[1]{\left\|#1\right\|_{\infty}} \newcommand{\thv}{{\theta_\star}} \newcommand{\tmu}{ {\tilde{\mu}}} \newcommand{\Tset}{ {\mathsf{T}}} \newcommand{\Tsigma}{ {\mathcal{T}}} \newcommand{\ttheta}{{\tilde \theta}} \newcommand{\tv}[1]{\left\|#1\right\|_{\mathrm{TV}}} \newcommand{\unif}{\mathrm{Unif}} \newcommand{\weaklim}[1]{\stackrel{\mathcal{L}_{#1}}{\rightsquigarrow}} \newcommand{\Xset}{{\mathsf X}} \newcommand{\Xsigma}{\mathcal X} \newcommand{\Yset}{{\mathsf Y}} \newcommand{\Ysigma}{\mathcal Y} \newcommand{\Var}{\mathbb{V}\mathrm{ar}} \newcommand{\zset}{\mathbb{Z}} \newcommand{\Zset}{\mathsf{Z}} $$
Let $f$ be a convex function to be minimized over a constraint set $\mcD$ that we now define. Let $(h_i)_{1 \leq i \leq n}$ be convex differentiable functions on $\Xset = \mathbb{R}^p$, representing inequality constraints, and let $(g_j)_{1 \leq j \leq m}$ be affine functions representing equality constraints: $$ g_j(x) = a_j^T x - b_j, \quad j=1,\dots,m. $$ Define the feasible set $$ \mcD = \bigcap_{i=1}^n \{h_i \leq 0\} \cap \bigcap_{j=1}^m \{g_j = 0\} \neq \emptyset. $$
Since each $h_i$ is convex and each $g_j$ is affine, the set $\mcD$ is convex. We aim at minimizing the convex function $f$ over the convex set $\mcD$.
For $(x,\lambda,\mu)\in \Xset\times (\mathbb{R}^+)^n \times \mathbb{R}^m$, we define the Lagrangian function $$ \mcl(x,\lambda,\mu) = f(x) + \sum_{i=1}^n \lambda_i h_i(x) + \sum_{j=1}^m \mu_j g_j(x). $$
For all $(x,\lambda,\mu) \in \Xset \times (\mathbb{R}^+)^n \times \mathbb{R}^m$, $$ \inf_{x \in \Xset} \mcl(x,\lambda,\mu) \leq \mcl(x,\lambda,\mu). $$
Taking the supremum over $\lambda \ge 0, \mu \in \mathbb{R}^m$ (where the notation $\lambda \geq 0$ means that all the components of $\lambda$ are non-negative) yields $$ \sup_{\lambda \geq 0,\ \mu} \inf_{x \in \Xset} \mcl(x,\lambda,\mu) \leq \sup_{\lambda \geq 0,\ \mu} \mcl(x,\lambda,\mu) = \infty \mathbf{1}_{x \notin \mcD} + f(x) \mathbf{1}_{x \in \mcD}. $$
Taking the infimum over $x \in \Xset$ leads to the weak duality relation: \begin{equation} \label{eq:weak} \sup_{\lambda \geq 0,\ \mu} \inf_{x \in \Xset} \mcl(x,\lambda,\mu) \leq \inf_{x \in \Xset} \sup_{\lambda \geq 0,\ \mu} \mcl(x,\lambda,\mu) = \inf_{x \in \mcD} f(x). \end{equation}
Observe that in the lhs (left-hand side), the infimum is taken over $x \in \Xset$, hence no constraint is imposed. In contrast, the rhs (right-hand side) corresponds to the constrained problem.
The rhs is called the primal problem, while the lhs is referred to as the dual problem. Since $x \mapsto \mcl(x,\lambda,\mu)$ is convex, the dual problem $\sup_{\lambda\geq 0,\mu}\inf_{x \in \Xset} \mcl(x,\lambda,\mu)$ is equivalent to $$ \sup \{\mcl(x,\lambda,\mu)\,:\ \lambda \geq 0,\ \mu \in \rset^m,\ \nabla_x \mcl(x,\lambda,\mu)=0\}. $$
This formulation is often useful when solving the optimization problem.
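As a sanity check, the duality chain above can be evaluated numerically on a toy problem of our own choosing (a one-dimensional quadratic with a single inequality constraint, not an example from the text); the stationarity trick of the last display gives the inner infimum in closed form.

```python
import numpy as np

# Toy problem (our illustration): minimize f(x) = x^2 subject to
# h(x) = 1 - x <= 0, i.e. x >= 1.  Primal optimum: f(1) = 1.

def lagrangian(x, lam):
    # L(x, lambda) = f(x) + lambda * h(x)
    return x ** 2 + lam * (1.0 - x)

def dual(lam):
    # inf_x L(x, lam): stationarity d/dx L = 2x - lam = 0 gives x = lam / 2,
    # hence g(lam) = lam - lam^2 / 4.
    return lam - lam ** 2 / 4.0

lams = np.linspace(0.0, 10.0, 1001)                   # grid over lambda >= 0
dual_value = max(dual(l) for l in lams)               # sup_{lam>=0} inf_x L
primal_value = min(x ** 2 for x in np.linspace(1.0, 5.0, 1001))  # inf_{x in D} f

print(dual_value, primal_value)
```

Here the dual optimum ($\lambda^*=2$, value $1$) matches the primal optimum $f(1)=1$, so the toy problem also exhibits the strong duality discussed next.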
To obtain equality in \eqref{eq:weak} (known as the strong duality relation), additional assumptions are required, such as the existence of a Slater point. Before discussing this, we introduce a useful intermediate result.
Lemma
Assume there exists $(x^*, \lambda^*, \mu^*) \in \mcD \times (\mathbb{R}^+)^n \times \mathbb{R}^m$ satisfying the KKT conditions: the stationarity condition $$ \nabla f(x^*) + \sum_{i=1}^n \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^m \mu_j^* \nabla g_j(x^*) = 0, $$ and the complementary slackness condition $\lambda_i^* h_i(x^*) = 0$ for all $i \in [1:n]$.
Then $$ f(x^*) = \inf_{x \in \mcD} f(x) = \mcl(x^*, \lambda^*, \mu^*), $$ and strong duality holds.
Proof
Let $x \in \mcD$. By convexity of $f$, we have $$ f(x) - f(x^*) \geq \nabla f(x^*)^T (x - x^*). $$
Using the definition of the Lagrangian and the stationarity condition, we obtain $$ \nabla f(x^*) + \sum_{i=1}^n \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^m \mu_j^* \nabla g_j(x^*) = 0, $$ which implies \begin{equation} \label{eq:grad} f(x) - f(x^*) \geq \nabla f(x^*)^T (x - x^*) = - \sum_{i=1}^n \lambda_i^* \nabla h_i(x^*)^T (x - x^*) - \sum_{j=1}^m \mu_j^* \nabla g_j(x^*)^T (x - x^*). \end{equation}
Inequality constraints: $h_i$: by convexity, for any $x \in \mcD$, $$ - \nabla h_i(x^*)^T (x - x^*) \ge h_i(x^*) - h_i(x) \ge h_i(x^*), $$ hence $$ - \sum_{i=1}^n \lambda_i^* \nabla h_i(x^*)^T (x - x^*) \ge \sum_{i=1}^n \lambda_i^* h_i(x^*) = 0. $$
Equality constraints: $g_j$: since $g_j$ is affine and $x \in \mcD$, we have $g_j(x) = g_j(x^*) = 0$, hence $$ \nabla g_j(x^*)^T (x - x^*) = g_j(x) - g_j(x^*)= 0. $$
Combining these relations in \eqref{eq:grad}, we obtain that for any $x \in \mcD$, $$ f(x) - f(x^*) \ge \nabla f(x^*)^T (x - x^*) \ge 0, $$ which proves that $f(x^*) = \inf_{x \in \mcD} f(x)= \mcl(x^*, \lambda^*, \mu^*)$ (the second equality uses $\lambda_i^* h_i(x^*)=0$ and $g_j(x^*)=0$). Since $x \mapsto \mcl(x,\lambda^*,\mu^*)$ is convex and stationary at $x^*$, $$ \mcl(x^*, \lambda^*, \mu^*) = \inf_{x\in \Xset} \mcl(x, \lambda^*, \mu^*) \leq \sup_{\lambda \geq 0,\mu} \inf_{x\in \Xset} \mcl(x, \lambda, \mu), $$ which is the converse inequality in \eqref{eq:weak}, so strong duality holds.
Definition
We say that $(x^*, \lambda^*, \mu^*) \in \Xset \times (\mathbb{R}^+)^n \times \mathbb{R}^m$ is a saddle point of the Lagrange function $\mcl$ if for every $(x, \lambda, \mu) \in \Xset \times (\mathbb{R}^+)^n \times \mathbb{R}^m$, $$ \mcl(x^*, \lambda, \mu) \leq \mcl(x, \lambda^*, \mu^*). $$ Equivalently (take $x=x^*$, then $(\lambda,\mu)=(\lambda^*,\mu^*)$), $\mcl(x^*, \lambda, \mu) \leq \mcl(x^*, \lambda^*, \mu^*) \leq \mcl(x, \lambda^*, \mu^*)$.
Proposition
If $(x^*, \lambda^*, \mu^*) \in \Xset \times (\mathbb{R}^+)^n \times \mathbb{R}^m$ is a saddle point for $\mcl$, then strong duality holds, and the KKT conditions are satisfied at $(x^*, \lambda^*, \mu^*)$.
Proof
By the saddle point property, $$ \sup_{\lambda \geq 0, \mu} \mcl(x^*, \lambda, \mu) \leq \inf_{x \in \Xset} \mcl(x, \lambda^*, \mu^*). $$ Hence, $$ \inf_{x \in \Xset} \sup_{\lambda \geq 0, \mu} \mcl(x, \lambda, \mu) \leq \sup_{\lambda \geq 0, \mu} \mcl(x^*, \lambda, \mu) \leq \inf_{x \in \Xset} \mcl(x, \lambda^*, \mu^*) \leq \sup_{\lambda \geq 0, \mu} \inf_{x \in \Xset} \mcl(x, \lambda, \mu). $$
This chain of inequalities implies strong duality (see \eqref{eq:weak} for the reverse inequality).
Moreover, the upper bound in the saddle point property, which holds for all $\lambda \geq 0, \mu$, implies that $\sup_{\lambda \geq 0, \mu} \mcl(x^*, \lambda, \mu) <\infty$ and hence $h_i(x^*) \leq 0$ and $g_j(x^*) = 0$.
Finally, choosing $\lambda = 0$ and $\mu = 0$, we obtain $$ f(x^*) = \mcl(x^*, 0, 0) \le \mcl(x, \lambda^*, \mu^*) \le f(x), \quad \forall x \in \mcD, $$ which shows that $x^*$ minimizes $f$ over $\mcD$. Taking $x = x^*$ in the same inequality yields $\sum_{i=1}^n \lambda_i^* h_i(x^*) \geq 0$; since each term is non-positive, $\lambda_i^* h_i(x^*) = 0$ for all $i$.
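To make the saddle-point property concrete, here is a small numerical check on an illustrative problem of our own (minimize $x^2$ under $1-x\leq 0$, whose saddle point is $x^*=1$, $\lambda^*=2$); this is a sketch, not an example from the text.

```python
import numpy as np

# Illustrative check of the saddle-point property for f(x) = x^2,
# h(x) = 1 - x (so x >= 1).  Candidate saddle point: x* = 1, lambda* = 2.

def L(x, lam):
    return x ** 2 + lam * (1.0 - x)

x_star, lam_star = 1.0, 2.0
xs = np.linspace(-3.0, 3.0, 601)       # unconstrained x grid
lams = np.linspace(0.0, 10.0, 501)     # lambda >= 0 grid

# Saddle point: L(x*, lam) <= L(x*, lam*) <= L(x, lam*) for all x, lam >= 0.
left_ok = all(L(x_star, lam) <= L(x_star, lam_star) + 1e-12 for lam in lams)
right_ok = all(L(x_star, lam_star) <= L(x, lam_star) + 1e-12 for x in xs)
print(left_ok, right_ok)
```

Note that $\mcl(1,\lambda)=1$ for every $\lambda$, while $\mcl(x,2)=(x-1)^2+1\geq 1$, so both inequalities hold with equality on the left.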
We now assume the existence of a Slater point, that is, there exists $\tilde{x} \in \mcD$ such that for all $i \in \{1, \ldots, n\}$, $h_i(\tilde{x}) < 0$.
Lemma (Farkas)
$$ \{x \in \mcD : f(x) < 0\} = \emptyset \iff \exists \lambda^* \ge 0,\ \mu^* \in \mathbb{R}^m \text{ s.t. } f(x) + \sum_{i=1}^n \lambda_i^* h_i(x) + \sum_{j=1}^m \mu_j^* g_j(x) \ge 0 \quad \forall x \in \Xset. $$
Proof
Recall that $$ g_j(x) = a_j^T x - b_j, \quad j=1,\dots,m. $$ Without loss of generality, we assume that $(a_j)_{1\leq j \leq m}$ are \textbf{linearly independent}. We define the set $U$ as $$ U = \{ u = (u_0, u_1, \dots, u_n, u_{n+1}, \dots, u_{n+m}) \in \rset^{n+m+1}: \exists x \in \Xset, f(x) < u_0, h_i(x) \le u_i, g_j(x) = u_{n+j} \}. $$
First note that the condition $\{x \in \mcD : f(x) < 0\} = \emptyset$ is equivalent to $0 \notin U$. The converse implication is immediate: if $(\lambda^*,\mu^*)$ as in the statement exist, then for every $x \in \mcD$, $f(x) \geq -\sum_{i=1}^n \lambda_i^* h_i(x) - \sum_{j=1}^m \mu_j^* g_j(x) \geq 0$, so the set on the left-hand side is empty. It remains to prove the direct implication. Since $U$ is convex and $0 \notin U$, a separation argument yields a nonzero vector $\phi \in \mathbb{R}^{n+m+1}$ such that $$ \phi^T u \ge 0, \quad \forall u \in U. $$
Take an arbitrary $i \in [0:n]$. If $u\in U$, then $u+t e_i \in U$ for every $t>0$, where $e_i=(\indiacc{k=i})_{k \in [0:n+m]} \in \rset^{n+m+1}$. The previous inequality then gives $\phi^T (u+t e_i)=\phi^T u + t\phi_i \geq 0$ for all $t>0$, hence $\phi_i\geq 0$ for every $i \in [0:n]$. Moreover, since $\phi^T u\geq 0$ for all $u \in U$, a simple limiting argument yields, for all $x\in \Xset$, $$ \phi_0 f(x)+\sum_{i=1}^n \phi_i h_i(x) +\sum_{j=1}^{m} \phi_{n+j} g_j(x)\geq 0. $$ Since we already know that $\phi_i\geq 0$ for all $i \in [0:n]$, it only remains to prove that $\phi_0 \neq 0$; we then set $\lambda^*_i=\frac{\phi_i}{\phi_0}$ for $i\in [1:n]$ and $\mu_j^*=\frac{\phi_{n+j}}{\phi_0}$ for $j \in [1:m]$. We argue by contradiction. If $\phi_0=0$, then applying the previous inequality at a Slater point $x=\tilde x \in \mcD$ yields $\sum_{i=1}^n \phi_i h_i(\tilde x) \geq 0$; since $h_i(\tilde x)<0$ for all $i \in [1:n]$ and $\phi_i \geq 0$, this forces $\phi_i=0$ for all $i \in [1:n]$. The inequality then becomes, for any $x\in \Xset$, $$ \sum_{j=1}^{m} \phi_{n+j} g_j(x)= \lr{\sum_{j=1}^{m} \phi_{n+j} a_j}^T x - \sum_{j=1}^{m} \phi_{n+j} b_j\geq 0, $$ which implies that $\sum_{j=1}^{m} \phi_{n+j} a_j=0$ (a nonzero linear form is unbounded below); since $(a_j)_{1\leq j \leq m}$ are linearly independent, we deduce that $\phi_{n+j}=0$ for every $j\in [1:m]$. All the components of $\phi$ are then zero, contradicting $\phi \neq 0$. This completes the proof.
Theorem (KKT)
If $x^* \in \mcD$ minimizes $f$ over $\mcD$ and a Slater point exists, then there exist $\lambda^* \ge 0$ and $\mu^* \in \mathbb{R}^m$ such that $(x^*,\lambda^*,\mu^*)$ is a saddle point of $\mcl$ and hence the KKT conditions hold:
$$ \nabla_x f(x^*) + \sum_{i=1}^n \lambda_i^* \nabla h_i(x^*) + \sum_{j=1}^m \mu_j^* \nabla g_j(x^*) = 0 $$
$$ h_i(x^*) \le 0,\ \lambda_i^* \ge 0,\ \lambda_i^* h_i(x^*) = 0 \quad \forall i $$
$$ g_j(x^*) = 0 \quad \forall j $$
and strong duality holds.
Proof
Assume that $f(x^*)=\inf_{x\in \mcD} f(x)$ for some $x^* \in \mcD$. Then $\{x \in \mcD : f(x) - f(x^*) < 0\} = \emptyset$. Applying the Farkas lemma to $f - f(x^*)$, there exist $\lambda^* \geq 0$ and $\mu^* \in \mathbb{R}^m$ such that for all $x\in \Xset$, $$ f(x)-f(x^*)+\sum_{i=1}^n \lambda^*_i h_i(x) + \sum_{j=1}^m \mu_j^* g_j(x)\geq 0. $$
This implies that $(x^*,\lambda^*,\mu^*)$ is a saddle point. Indeed, for all $\lambda \geq 0$, $\mu \in \rset^m$ and $x \in \Xset$, using $h_i(x^*) \leq 0$ and $g_j(x^*) = 0$, $$ f(x^*)+\sum_{i=1}^n \lambda_i h_i(x^*) + \sum_{j=1}^m \mu_j g_j(x^*)\leq f(x^*) \leq f(x)+\sum_{i=1}^n \lambda^*_i h_i(x) + \sum_{j=1}^m \mu_j^* g_j(x). $$
Therefore, $(x^*,\lambda^*,\mu^*)$ is a saddle point, which implies strong duality (as shown previously). This concludes the proof.
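As an application of the theorem, the KKT system can be verified numerically on a classical example of our own choosing: projecting a point onto the probability simplex. The sorting-based projection below and all names are our assumptions, not material from the text.

```python
import numpy as np

# Illustrative example: minimize f(x) = ||x - c||^2 subject to
# g(x) = sum(x) - 1 = 0 and h_i(x) = -x_i <= 0 (probability simplex).

def project_simplex(c):
    # Standard sorting-based projection: x_i = max(c_i - tau, 0), with tau
    # chosen so that the coordinates sum to one.
    u = np.sort(c)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.max(np.where(u - css / (np.arange(len(c)) + 1.0) > 0)[0])
    tau = css[rho] / (rho + 1.0)
    return np.maximum(c - tau, 0.0), tau

c = np.array([0.9, 0.5, -0.2, 0.1])
x, tau = project_simplex(c)

# Multipliers: stationarity reads 2(x_i - c_i) - lam_i + mu = 0, with
# lam_i >= 0 for h_i and mu free for g; on active coordinates lam_i = 0,
# which forces mu = 2 * tau.
mu = 2.0 * tau
lam = 2.0 * (x - c) + mu

print(x, lam, mu)
```

For this $c$ the solution is $x=(0.7,0.3,0,0)$ with $\lambda=(0,0,0.8,0.2)$: the multipliers vanish exactly on the coordinates where the inequality constraint is inactive, as complementary slackness requires.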
Proposition
Let $X_1,\ldots, X_n$ be $n$ points in $\mathbb{R}$ ordered in non-decreasing order, and let $Y_1,\ldots, Y_n$ be $n$ other points in $\mathbb{R}$, also ordered in non-decreasing order. Then
$$ \mathcal{W}_p \left(\frac{1}{n}\sum_{i=1}^n \delta_{X_i}, \frac{1}{n}\sum_{i=1}^n \delta_{Y_i}\right) = \left( \frac{1}{n} \sum_{i=1}^n |X_i - Y_i|^p \right)^{1/p}. $$
Proof
In order to prove the proposition, we will show that $$ \mathrm{argmin}\left\{ \sum_{i,j=1}^n p_{ij} |X_i- Y_j|^p \,: \forall i,j,\ p_{ij} \geq 0,\ \forall i, \sum_{j} p_{ij}=\frac{1}{n},\ \forall j, \sum_{i} p_{ij}=\frac{1}{n} \right\}= \left[ \frac{1}{n} \mathsf{1}(i=j) \right]_{1\leq i,j \leq n}. $$
The function $p \mapsto \sum_{i,j} p_{ij}|X_i-Y_j|^p$ is linear, hence convex. Moreover, the equality constraints are affine: for all $i\in [1:n]$, $\sum_{j=1}^n p_{ij}=\frac{1}{n}$ and for all $j\in [1:n]$, $\sum_{i=1}^n p_{ij}=\frac{1}{n}$, together with the inequality constraints $-p_{ij} \leq 0$ for all $i,j \in [1:n]$.
Therefore, we can apply the KKT theorem. We seek $p\in \rset^{n\times n}$, $\alpha,\beta \in \rset^n$, and $\gamma \in (\rset^+)^{n \times n}$ such that, defining $$ \mathcal{L}(p,\alpha,\beta,\gamma)=\sum_{i,j=1}^n p_{ij}|X_i-Y_j|^p - \sum_{i,j=1}^n \gamma_{ij} p_{ij}+\sum_{i=1}^n \alpha_i \lr{\sum_{j=1}^n p_{ij}- \frac 1 n} +\sum_{j=1}^n \beta_j \lr{\sum_{i=1}^n p_{ij}- \frac 1 n}, $$ we have $\nabla_p \mathcal{L}(p,\alpha,\beta,\gamma)=0$, together with the complementary slackness conditions $\gamma_{ij} p_{ij}=0$ for all $i,j \in [1:n]$ and the marginal constraints $\sum_{j=1}^np_{ij}=\sum_{i=1}^n p_{ij}=1/n$.
We already know that $p_{ij}=\frac 1 n \indi{i=j}$ is a solution. It therefore remains to find $\alpha,\beta,\gamma$ such that the KKT conditions are satisfied for this choice of $p$. These conditions read: for all $i,j \in [1:n]$, $$ |X_i-Y_j|^p - \gamma_{ij} + \alpha_i + \beta_j = 0, \qquad \gamma_{ij} \geq 0, \qquad \gamma_{ij} p_{ij} = 0. $$
The complementarity condition reduces to $\gamma_{ii}=0$ for all $i$, since $p_{ij}=\frac 1 n \indi{i=j}$. Hence, the KKT conditions are equivalent to: for all $i,j \in [1:n]$, $\gamma_{ij}=\alpha_i+\beta_j + |X_i-Y_j|^p \geq 0$ and $\gamma_{ii}=\alpha_i+\beta_i + |X_i-Y_i|^p=0$.
This is in turn equivalent to the existence of $\beta \in \rset^n$ such that, for all $i,j \in [1:n]$, $$ \beta_j - \beta_i \geq |X_i-Y_i|^p - |X_i - Y_j|^p, $$ in which case we set $\alpha_i = -\beta_i - |X_i-Y_i|^p$.
Set $\beta_1=0$ and, for all $j\in [2:n]$, define $\beta_j=\sum_{\ell=1}^{j-1} \lrcb{|X_\ell-Y_\ell|^p - |X_\ell - Y_{\ell+1}|^p}$. Then, for $i \leq j$, $$ \beta_j - \beta_i=\sum_{\ell=i}^{j-1} \lrcb{|X_\ell-Y_\ell|^p - |X_\ell - Y_{\ell+1}|^p}. $$
Since $u \mapsto |u|^p$ is convex, we have $|a+c|^p-|a|^p \geq |b+c|^p-|b|^p$ for all $a \geq b$ and $c\geq 0$. We set $a=X_\ell -Y_{\ell+1}$, $b=X_i-Y_{\ell+1}$, and $c=Y_{\ell+1}-Y_\ell$. For $\ell \geq i$, we have $a \geq b$ and $c\geq 0$. Hence, for any $\ell \geq i$, the inequality can be rewritten as $$ |X_\ell-Y_\ell|^p - |X_\ell - Y_{\ell+1}|^p \geq |X_i-Y_\ell|^p - |X_i - Y_{\ell+1}|^p. $$
Finally, plugging into the previous identity yields, for $i \leq j$, $$ \beta_j-\beta_i \geq \sum_{\ell=i}^{j-1} \lrcb{|X_i-Y_\ell|^p - |X_i - Y_{\ell+1}|^p} = |X_i-Y_i|^p - |X_i - Y_j|^p. $$ The case $j < i$ follows from the same convexity inequality, now applied with $a=X_i-Y_{\ell+1}$ and $b=X_\ell-Y_{\ell+1}$ for $\ell < i$: it gives $|X_\ell-Y_\ell|^p - |X_\ell - Y_{\ell+1}|^p \leq |X_i-Y_\ell|^p - |X_i - Y_{\ell+1}|^p$, and summing over $\ell \in [j:i-1]$ yields $\beta_j - \beta_i \geq |X_i-Y_i|^p - |X_i-Y_j|^p$ as well.
This proves the KKT conditions, and the proof is complete.
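The dual certificate built in this proof can be checked numerically; the sketch below (our own illustration) constructs $\beta$ exactly as above, sets $\alpha_i=-\beta_i-|X_i-Y_i|^p$, and verifies that the resulting $\gamma$ is non-negative with a zero diagonal.

```python
import numpy as np

# Numerical check (illustrative) of the certificate (alpha, beta, gamma)
# constructed in the proof, for random sorted samples.
rng = np.random.default_rng(0)
n, p = 6, 2
X = np.sort(rng.normal(size=n))
Y = np.sort(rng.normal(size=n))

C = np.abs(X[:, None] - Y[None, :]) ** p          # cost matrix |X_i - Y_j|^p

# beta_1 = 0 and beta_j = sum_{l < j} (|X_l - Y_l|^p - |X_l - Y_{l+1}|^p)
beta = np.concatenate(([0.0], np.cumsum(np.diag(C)[:-1] - np.diag(C, 1))))
alpha = -beta - np.diag(C)                        # forces gamma_ii = 0
gamma = alpha[:, None] + beta[None, :] + C        # gamma_ij = alpha_i + beta_j + C_ij

print(gamma.min())
```

A non-negative minimum of `gamma` confirms dual feasibility, and the zero diagonal is exactly complementary slackness for the identity coupling.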
The proposition shows that, for any $p \geq 1$, $$ \mathcal{W}_p(\mu,\nu) = \left( \int_0^1 \left|F_\mu^{-1}(u)- F_\nu^{-1}(u)\right|^p \, \rmd u \right)^{1/p}, $$ (since $F_\mu^{-1}=X_i$ and $F_\nu^{-1}=Y_i$ on $\left((i-1)/n,\, i/n\right]$),
where $$ \mu=\frac{1}{n} \sum_{i=1}^n \delta_{X_i}, \qquad \nu=\frac{1}{n} \sum_{i=1}^n \delta_{Y_i}. $$
We only assume that the sequences $(X_i)$ and $(Y_i)$ are ordered in non-decreasing order; in particular, repetitions are allowed.
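For small $n$, the optimality of the sorted (identity) matching can be brute-forced: by Birkhoff's theorem, the linear transport problem over doubly stochastic matrices attains its minimum at a permutation matrix, so it suffices to enumerate permutations. This check is our own illustration.

```python
import itertools
import numpy as np

# Brute-force check (illustrative): compare the sorted matching with the
# minimum over all permutations of the Y sample.
rng = np.random.default_rng(1)
n, p = 5, 2
X = np.sort(rng.normal(size=n))
Y = np.sort(rng.normal(size=n))

sorted_cost = np.mean(np.abs(X - Y) ** p) ** (1.0 / p)
brute_cost = min(
    np.mean(np.abs(X - Y[list(sigma)]) ** p) ** (1.0 / p)
    for sigma in itertools.permutations(range(n))
)
print(sorted_cost, brute_cost)
```

The two costs agree: among all $n!$ matchings, pairing the order statistics is optimal, as the proposition states.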
We then extend this result to arbitrary discrete probability measures. Let $$ \mu=\sum_{i=1}^n \lambda_i \delta_{X_i}, \qquad \nu=\sum_{j=1}^m \gamma_j \delta_{Y_j}, $$
where the coefficients $(\lambda_i)$ and $(\gamma_j)$ are non-negative rational numbers summing to $1$. By the previous result (write all the weights with a common denominator $N$ and replicate each atom accordingly, so that both measures become uniform over $N$ ordered points), we still have $$ \mathcal{W}_p(\mu,\nu) = \left( \int_0^1 \left|F_\mu^{-1}(u)- F_\nu^{-1}(u)\right|^p \, \rmd u \right)^{1/p}. $$
By a density argument, this identity extends to coefficients $(\lambda_i)$ and $(\gamma_j)$ with non-negative real values summing to $1$. Finally, by approximation, we obtain that for any probability measures $\mu$ and $\nu$ on $(\rset,\mathcal{B}(\rset))$, $$ \mathcal{W}_p(\mu,\nu) = \left( \int_0^1 \left|F_\mu^{-1}(u)- F_\nu^{-1}(u)\right|^p \, \rmd u \right)^{1/p}. $$
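The final identity can be put to work directly. The sketch below (our own code, with rational weights chosen to enable the comparison) integrates the piecewise-constant quantile functions exactly over a merged breakpoint grid and checks the result against the atom-replication argument used above.

```python
import numpy as np

# Illustrative sketch: W_p between two discrete measures on R via quantile
# functions.  Atoms must be listed in non-decreasing order.

def wasserstein_quantile(xs, wx, ys, wy, p):
    cx, cy = np.cumsum(wx), np.cumsum(wy)
    # Both quantile functions are piecewise constant, so the integral over
    # [0, 1] is exact on the grid of all cumulative-weight breakpoints.
    grid = np.unique(np.concatenate(([0.0], cx, cy)))
    total = 0.0
    for a, b in zip(grid[:-1], grid[1:]):
        u = 0.5 * (a + b)                                   # any point of (a, b)
        qx = xs[min(np.searchsorted(cx, u), len(xs) - 1)]   # F_mu^{-1}(u)
        qy = ys[min(np.searchsorted(cy, u), len(ys) - 1)]   # F_nu^{-1}(u)
        total += (b - a) * abs(qx - qy) ** p
    return total ** (1.0 / p)

# mu = (1/4) d_0 + (3/4) d_1 and nu = (1/2) d_{-1} + (1/2) d_2.
xs, wx = np.array([0.0, 1.0]), np.array([0.25, 0.75])
ys, wy = np.array([-1.0, 2.0]), np.array([0.5, 0.5])
w_quantile = wasserstein_quantile(xs, wx, ys, wy, p=2)

# Same value via replication over the common denominator N = 4.
X = np.repeat(xs, [1, 3])
Y = np.repeat(ys, [2, 2])
w_replicated = np.mean(np.abs(X - Y) ** 2) ** 0.5

print(w_quantile, w_replicated)
```

Both computations return $\sqrt{7/4}$, illustrating that the quantile formula and the replicated uniform matching agree, as the density argument asserts.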