<p>UCSD Machine Learning Group: research updates from the UCSD community, with a focus on machine learning, data science, and applied algorithms.</p>
<h2>Explainable 2-means Clustering: Five Lines Proof</h2>
<p><em>2020-10-16</em></p>
<p><strong>TL;DR:</strong> we will show <em>why</em> only one feature is enough to define a good $2$-means clustering. And we will do it using only 5 inequalities (!)
In a <a href="explain_k_means.html">previous post</a>, we explained what an explainable clustering is.</p>
<h3 id="explainable-clustering">Explainable clustering</h3>
<p>In a <a href="explain_k_means.html">previous post</a>, we discussed why explainability is important, defined it as a small decision tree, and suggested an algorithm to find such a clustering. But why is the resulting clustering any good? We measure &ldquo;good&rdquo; by the <a href="https://en.wikipedia.org/wiki/K-means_clustering">$k$-means cost</a>. The cost of a clustering $C$ is defined as the sum of squared Euclidean distances of each point $x$ to its center $c(x)$. Formally,
\begin{equation}
cost(C)=\sum_x \|x-c(x)\|^2,
\end{equation} where the sum is over all points $x$ in the dataset.</p>
<p>In this post, we focus on the $2$-means problem, where there are only two clusters. We want to show that for every dataset there is <strong>one</strong> feature $i$ and <strong>one</strong> threshold $\theta$ such that the following simple clustering $C^{i,\theta}=(C^{i,\theta}_1,C^{i,\theta}_2)$ has a low cost:
\begin{equation}
\text{if } x_i\leq\theta \text{ then } x\in C^{i,\theta}_1 \text{ else } x\in C^{i,\theta}_2.
\end{equation}
We call such a clustering a <em>threshold cut</em>. There might be many threshold cuts that are good, bad, or somewhere in between. We want to show that there is at least one that is good (i.e., low cost). In the <a href="https://arxiv.org/abs/2002.12538">paper,</a> we prove that there is always a threshold cut, $C^{i,\theta}$, that is almost as good as the optimal clustering:
\begin{equation}
cost(C^{i,\theta})\leq4\cdot cost(opt),
\end{equation}
where $cost(opt)$ is the cost of the optimal $2$-means clustering. This means that there is a simple explainable clustering $C^{i,\theta}$ that is only $4$ times worse than the optimal one. This guarantee is independent of the dimension and the number of points. Sounds crazy, right? Let&rsquo;s see how we can prove it!</p>
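To make the definition concrete, here is a minimal sketch in Python (the function name and the toy data are ours, not from the paper) that computes the cost of a threshold cut, using each side's mean as its center:

```python
import numpy as np

def threshold_cut_cost(X, i, theta):
    """k-means cost of the threshold cut C^{i,theta}: split on feature i at
    threshold theta, then charge each side the squared distances to its mean."""
    mask = X[:, i] <= theta
    cost = 0.0
    for side in (X[mask], X[~mask]):
        if len(side):
            cost += ((side - side.mean(axis=0)) ** 2).sum()
    return cost

# Two well-separated pairs of points; the cut x_0 <= 1 recovers them.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
cost = threshold_cut_cost(X, i=0, theta=1.0)   # 0.025 + 0.01 = 0.035
```

Trying a few values of $i$ and $\theta$ on a small dataset is a quick way to see that different threshold cuts can have wildly different costs.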
<h3 id="the-minimal-mistakes-threshold-cut">The minimal-mistakes threshold cut</h3>
<p>We want to compare two clusterings: the optimal clustering and the best threshold cut. The best threshold cut is hard to analyze, so we introduce an intermediate clustering: <em>the minimal-mistakes threshold cut</em>, $\widehat{C}$. Even though this clustering will not be the best threshold cut, it will be good enough. In the paper we prove that $cost(\widehat{C})$ is at most $4cost(opt)$. For simplicity, in this post, we will show a slightly worse bound of $11cost(opt)$ instead of $4cost(opt)$.</p>
<!--Let's define what the minimal-mistakes cut is. -->
<p>We define the number of mistakes of a threshold cut $C^{i,\theta}$ as the number of points $x$ that are not in the same cluster as their optimal center $c(x)$ under $C^{i,\theta}$, i.e., the number of points $x$ such that<br />
\begin{equation}
sign(\theta-x_i) \neq sign(\theta-c(x)_i).
\end{equation}
The <em>minimal-mistakes clustering</em> is the threshold cut that has the minimal number of mistakes. Take a look at the next figure for an example.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_2_means/mistakes_example.png" width="30%" style="margin: 0 auto" />
<figcaption>
Two optimal clusters are in red and blue. Centers are the stars. Split (in yellow) with one mistake. This is a minimal-mistakes threshold cut, as any threshold cut has at least $1$ mistake.
</figcaption>
</figure>
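Counting the mistakes of a given threshold cut takes one vectorized comparison. Below is a small sketch (the variable names are ours) matching the definition above:

```python
import numpy as np

def mistakes(X, centers, labels, i, theta):
    """Number of points sent to a different side of the cut x_i <= theta
    than their optimal center; labels[j] indexes point j's center in `centers`."""
    point_side = X[:, i] <= theta
    center_side = centers[labels, i] <= theta
    return int((point_side != center_side).sum())

# Two clusters on a line, with centers at 0.5 and 4.5.
X = np.array([[0.0, 0], [1, 0], [4, 0], [5, 0]])
centers = np.array([[0.5, 0.0], [4.5, 0.0]])
labels = np.array([0, 0, 1, 1])
```

For instance, the cut $x_0\leq 2$ has zero mistakes here, while $x_0\leq 0.5$ has one: the point at $1$ is separated from its center at $0.5$.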
<h3 id="playing-with-cost-warm-up">Playing with cost: warm-up</h3>
<p>Before we present the proof, let’s familiarize ourselves with the $k$-means cost and explore several of its properties. It will be helpful later on!</p>
<h4 id="changing-centers">Changing centers</h4>
<p>If we change the centers of a clustering from their means (which are their optimal centers) to different centers $c=(c_1, c_2)$, then the cost can only increase. Putting this into math, denote by $cost(C,c)$ the cost of clustering $C=(C_1,C_2)$ when $c_1$ is the center of cluster $C_1$ and $c_2$ is the center of cluster $C_2$, then</p>
<p>\begin{align}
cost(C) &= \sum_{x\in C_1} \|x-mean(C_1)\|^2 + \sum_{x\in C_2} \|x-mean(C_2)\|^2 \newline &\leq \sum_{x\in C_1} \|x-c_1\|^2 + \sum_{x\in C_2} \|x-c_2\|^2 = cost(C,c).
\end{align}
What if we further want to change the centers from some arbitrary centers $(c_1, c_2)$ to other arbitrary centers $(m_1, m_2)$? How does the cost change? Can we bound it? To our rescue comes the (almost) triangle inequality that states that for any two vectors $x,y$:
\begin{equation}
\|x+y\|^2 \leq 2\|x\|^2+2\|y\|^2.
\end{equation}
This implies that the cost of changing the centers from $c=(c_1, c_2)$ to $m=(m_1, m_2)$ is bounded by
\begin{equation}
cost(C,c)\leq 2cost(C,m)+2|C_1|\|c_1-m_1\|^2+2|C_2|\|c_2-m_2\|^2.
\end{equation}</p>
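As a quick numerical sanity check of this bound, here is a sketch on random data (the single-cluster case; the two-cluster bound is just the sum of two such terms):

```python
import numpy as np

# Check the changing-centers bound on one cluster with random data:
# cost(C, c) <= 2*cost(C, m) + 2*|C|*||c - m||^2, which follows from the
# (almost) triangle inequality applied to x - c = (x - m) + (m - c).
rng = np.random.default_rng(0)
C = rng.normal(size=(50, 3))                   # one cluster of 50 points in R^3
c, m = rng.normal(size=3), rng.normal(size=3)  # two arbitrary centers
cost_c = ((C - c) ** 2).sum()
cost_m = ((C - m) ** 2).sum()
assert cost_c <= 2 * cost_m + 2 * len(C) * ((c - m) ** 2).sum()
```

The assertion holds for any data, not just this random draw, since it is a pointwise consequence of the inequality above.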
<h4 id="decomposing-the-cost">Decomposing the cost</h4>
<p>The cost can be easily decomposed with respect to the data points and the features. Let&rsquo;s start with the data points. For any partition of the points in $C$ into $S_1$ and $S_2$, the cost can be rewritten as
\begin{equation}
cost(C,c)=cost(C \cap S_1,c)+cost(C \cap S_2,c).
\end{equation}
The cost can also be decomposed with respect to the features, because we are using the squared Euclidean distance. To be more specific, the cost incurred by the $i$-th feature is $cost_i(C,c)=\sum_{x}(x_i-c(x)_i)^2,$ and the total cost is equal to
\begin{equation}
cost(C,c)=\sum_i cost_i(C,c).
\end{equation}
If the last equation is unclear, just recall the definition of the cost ($c(x)$ is the center of a point $x$):
\begin{equation}
cost(C,c)=\sum_{x}\|x-c(x)\|^2=\sum_i\sum_{x}(x_i-c(x)_i)^2=\sum_icost_i(C,c).
\end{equation}</p>
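This per-feature decomposition is easy to verify numerically; here is a small sketch:

```python
import numpy as np

# cost(C, c) decomposes exactly into per-feature costs, because the squared
# Euclidean norm is a sum of squared coordinates.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))                 # 20 points with 4 features
c = rng.normal(size=4)                       # one arbitrary center for all points
total = ((X - c) ** 2).sum()                 # cost(C, c)
per_feature = ((X - c) ** 2).sum(axis=0)     # cost_i(C, c), one entry per feature
assert np.isclose(total, per_feature.sum())
```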
<h3 id="the-5-line-proof">The 5-line proof</h3>
<p>Now we are ready to prove that $\widehat{C}$ is only a constant factor worse than the optimal $2$-means clustering:
\begin{equation}
cost(\widehat{C})\leq 11\cdot cost(opt).
\end{equation}</p>
<p>To prove that the minimal-mistakes threshold cut $\widehat{C}$ gives a low-cost clustering, we will do something that might look strange at first: we analyze the quality of $\widehat{C}$ with the optimal centers of the <em>optimal</em> clustering, not the optimal centers for $\widehat{C}$ itself. This step can only increase the cost, so why do it? Because it eases the analysis, and if there are not many mistakes, then the centers do not change much (as in the previous figure), so the increase is small. Here comes the first step: change the centers of $\widehat{C}$ to the optimal centers $c^*=(mean(C^*_1),mean(C^*_2))$. Recall from the warm-up that this can only increase the cost:
\begin{equation}
cost(\widehat{C})\leq cost(\widehat{C},c^{*}) \quad (1)
\end{equation}
Next we use one of the decomposition properties of the cost. We partition the dataset into the set of points that are correctly labeled, $X^{cor}$, and those that are not, $X^{wro}$.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_2_means/mistakes_example_wrong.png" width="30%" style="margin: 0 auto" />
<figcaption>
The same dataset and split as before. The point with the grey circle is in the wrong cluster and is the only member of $X^{wro}$. All other points have the same cluster assignment as in the optimal clustering and are in $X^{cor}$.
</figcaption>
</figure>
<p>Thus, we can rewrite the last term as
\begin{equation}
cost(\widehat{C},c^{*})=cost(\widehat{C}\cap X^{cor},c^{*})+cost(\widehat{C}\cap X^{wro},c^{*}) \quad (2)
\end{equation}</p>
<p>Let’s look at this sum. The first term contains all the points that have their correct center in $c^*$ (which is either $mean(C^*_1)$ or $mean(C^*_2)$). Hence, the first term in (2) is easy to bound: it’s at most $cost(opt)$. So from now on, we focus on the second term.</p>
<p>In the second term, all points are in $X^{wro}$, which means they were assigned to the incorrect optimal center. So let’s change the centers once more, so that $X^{wro}$ will have the correct centers. The correct centers of $X^{wro}$ are the same centers $c^*$, but the order is reversed, i.e., all points assigned to center $mean(C^*_1)$ are now assigned to $mean(C^*_2)$ and vice versa. Using the “changing centers” property of the cost we discussed earlier, we have <!--, the second term in (2) is at most--></p>
<p>\begin{equation}
cost(\widehat{C},c^{*}) \leq 3cost(opt)+2|X^{wro}|\cdot\|c^{*}_1-c^{*}_2\|^2 \quad (3)
\end{equation}</p>
<p>Now we’ve reached the main step in the proof. We show that the second term in (3) is bounded by $8cost(opt)$. We first decompose $cost(opt)$ using the features. Then, all we need to show is that:</p>
<p>\begin{equation}
cost_i(opt)\geq\left(\frac{|c^{*}_{1,i}-c^{*}_{2,i}|}{2}\right)^2|X^{wro}| \quad (4)
\end{equation}</p>
<p>The trick is, for each feature, to focus on the threshold cut defined by the middle point between the two optimal centers. Since $\widehat{C}$ is the minimal-mistakes clustering, we know that every threshold cut has at least $|X^{wro}|$ mistakes. Each mistake contributes at least the square of half the distance between the two centers to $cost_i(opt)$.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_2_means/IMM_blog_pic_4.png" width="30%" style="margin: 0 auto" />
<figcaption>
Proving step (4): projecting onto feature $i$. Points in blue belong to the first cluster, and in red to the second. We focus on the cut at the midpoint between the two optimal centers.
</figcaption>
</figure>
<p>This figure shows how to prove step (4). We see that there is $1$ mistake, which is the minimum possible. This means that even the optimal clustering must pay at least the square of half the distance between the centers for each of these mistakes. This gives a lower bound on $cost_i(opt)$ for this feature. Summing over all the features shows that the second term of (3) is at most $8cost(opt)$, which is what we wanted. <!--Since the whole expression in (3) is at most $10cost(opt)$, and we lose another $cost(opt)$ from the first term of (2), we can put these together to get-->
<!--Summing everything together we achieve our goal:-->
Putting everything together, we get exactly what we wanted to prove in this post:
\begin{equation}
cost(\widehat{C})\leq 11\cdot cost(opt) \quad (5)
\end{equation}
<!--That's it!--></p>
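<p>To see all five inequalities at a glance, here is the whole chain in one place (the last step combines (4), summed over the features, with inequality (3)):
\begin{align}
cost(\widehat{C}) &amp;\leq cost(\widehat{C},c^{*}) \newline
&amp;= cost(\widehat{C}\cap X^{cor},c^{*})+cost(\widehat{C}\cap X^{wro},c^{*}) \newline
&amp;\leq cost(opt) + 2cost(opt)+2|X^{wro}|\cdot\|c^{*}_1-c^{*}_2\|^2 \newline
&amp;\leq 3cost(opt)+8cost(opt)=11\cdot cost(opt).
\end{align}</p>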
<h3 id="epilogue-improvements">Epilogue: improvements</h3>
<p>The bound that we got, $11$, is not the best possible. With more tricks we can get a bound of $4$. One of them is using Hall’s theorem. Similar ideas provide a $2$-approximation to the optimal $2$-medians clustering as well.
To complement our upper bounds, we also prove lower bounds showing that any threshold cut must incur almost a $3$-approximation for $2$-means and almost a $2$-approximation for $2$-medians. You can read all about it in our <a href="https://proceedings.icml.cc/paper/2020/file/8e489b4966fe8f703b5be647f1cbae63-Paper.pdf">paper</a>.</p>
<p><a href='https://sites.google.com/view/michal-moshkovitz'>Michal Moshkovitz</a>, <a href='mailto:navefrost@mail.tau.ac.il'>Nave Frost</a>, <a href='https://sites.google.com/site/cyrusrashtchian/'>Cyrus Rashtchian</a></p>
<h2>Explainable k-means Clustering</h2>
<p><em>2020-10-16</em></p>
<p><strong>TL;DR:</strong>
Explainable AI has gained a lot of interest in the last few years, but effective methods for unsupervised learning are scarce. And the rare methods that do exist do not have provable guarantees. We present a new algorithm for explainable clustering that is provably good for $k$-means clustering — the Iterative Mistake Minimization (IMM) algorithm. Specifically, we want to build a clustering defined by a small decision tree. Overall, this post summarizes our new paper: <a href="https://arxiv.org/pdf/2002.12538.pdf">Explainable $k$-Means and $k$-Medians clustering</a>.</p>
<h3 id="explainability-why">Explainability: why?</h3>
<p>Machine learning models are mostly &ldquo;black boxes&rdquo;: they give good results, but their reasoning is unclear. These days, machine learning is entering fields like healthcare (e.g., for a better understanding of <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6543980/#:~:text=In%20the%20medical%20field%2C%20clustering,in%20labeled%20and%20unlabeled%20datasets.&text=The%20aim%20is%20to%20provide,AD%20based%20on%20their%20similarity.">Alzheimer&rsquo;s Disease</a> and <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118453#sec013">Breast Cancer</a>), transportation, and law. In these fields, quality is not the only objective. No matter how well a computer makes its predictions, we cannot blindly follow its suggestions. Can you imagine medicating or performing surgery on a patient just because a computer said so? Instead, it would be much better to provide insight into which parts of the data the algorithm used to make its prediction.</p>
<h3 id="tree-based-explainable-clustering">Tree-based explainable clustering</h3>
<!--Despite the popularity of explainability, there is limited work in unsupervised learning. To remedy it, -->
<p>We study a prominent problem in unsupervised learning, $k$-means clustering. We are given a dataset, and the goal is to partition it into $k$ clusters such that the <a href="https://en.wikipedia.org/wiki/K-means_clustering">$k$-means cost</a> is minimal. The cost of a clustering $C=(C^1,\ldots,C^k)$ is the sum of squared distances of all points from their optimal centers, $mean(C^i)$:</p>
<p>\[cost(C)=\sum_{i=1}^k\sum_{x\in C^i} \lVert x-mean(C^i)\rVert ^2.\]</p>
<p>For any cluster, $C^i$, one possible explanation of this cluster is $mean(C^i)$. In a low-cost clustering, the center is close to its points, and they are close to each other. For example, see the next figure.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_k_means/intro_IMM_blog_pic_1.png" width="40%" style="margin: 0 auto" />
<figcaption>
Near optimal 5-means clustering
</figcaption>
</figure>
<p>Unfortunately, this explanation is not as useful as it could be. The centers themselves may depend on all the data points and all the features in a complicated way. We instead aim to develop a clustering method that is explainable by design. To explain why a point is in a cluster, we will only need to look at a small number of features, evaluating a threshold for each feature one by one. This lets us extract information about which features send a point to one cluster rather than another, and it means we can derive an explanation that does not depend on the centers.</p>
<p>More formally, at each step we test if $x_i\leq \theta$ or not, for some feature $i$ and threshold $\theta$. We call this test a <strong>split</strong>. According to the test’s result, we decide on the next step. In the end, the algorithm returns the cluster identity. This procedure is exactly a decision tree where the leaves correspond to clusters.</p>
<p>Importantly, for the tree to be explainable it should be <strong>small</strong>. The smallest decision tree has $k$ leaves since each cluster must appear in at least one leaf. We call a clustering defined by a decision tree with $k$ leaves a <strong>tree-based explainable clustering</strong>. See the next tree for an illustration.</p>
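As a tiny sketch of how such a tree assigns a cluster, here is a hypothetical node structure (ours, purely for illustration) and the traversal it implies:

```python
class Node:
    def __init__(self, feature=None, theta=None, left=None, right=None, label=None):
        self.feature, self.theta = feature, theta
        self.left, self.right, self.label = left, right, label

def assign(node, x):
    """Follow threshold splits until a leaf; the leaf stores the cluster identity."""
    while node.label is None:                 # inner node: test x_i <= theta
        node = node.left if x[node.feature] <= node.theta else node.right
    return node.label

# A depth-2 tree with 3 leaves (3 clusters): split on x_0 first, then on x_1.
tree = Node(feature=0, theta=1.0,
            left=Node(label=0),
            right=Node(feature=1, theta=2.0, left=Node(label=1), right=Node(label=2)))
cluster = assign(tree, [3.0, 5.0])   # x_0 > 1, x_1 > 2  ->  cluster 2
```

The explanation for a point is simply the short list of threshold tests along its root-to-leaf path.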
<p align="center">
<tr>
<td> <img src="/assets/2020-10-16-explain_k_means/intro_IMM_blog_pic_2.png" width="40%" style="margin: 0 auto" /> </td>
<td> <img src="/assets/2020-10-16-explain_k_means/intro_IMM_blog_pic_3.png" width="40%" style="margin: 0 auto" /> </td>
</tr>
</p>
<!--
{:refdef: style="text-align: center;"}
<figure class="image">
<img src="/assets/2020-06-06/intro_IMM_blog_pic_2.png" width="40%" style="margin: 0 auto">
<figcaption>
Decision tree
</figcaption>
</figure>
{:refdef}
{:refdef: style="text-align: center;"}
<figure class="image">
<img src="/assets/2020-06-06/intro_IMM_blog_pic_3.png" width="40%" style="margin: 0 auto">
<figcaption>
Geometric representation of the decision tree
</figcaption>
</figure>
{:refdef}
-->
<p>On the left, we see a decision tree that defines a clustering with $5$ clusters. On the right, we see the geometric representation of this decision tree: the tree imposes an axis-aligned partition into $5$ clusters. The clustering looks close to the optimal clustering we started with, which is great. But can we do it for all datasets? How?</p>
<p>Several algorithms try to find a tree-based explainable clustering, like <a href="https://link.springer.com/chapter/10.1007/11362197_5">CLTree</a> and <a href="https://www.researchgate.net/profile/Ricardo_Fraiman/publication/47744381_Clustering_using_Unsupervised_Binary_Trees_CUBT/links/09e41508aeaf39a453000000/Clustering-using-Unsupervised-Binary-Trees-CUBT.pdf">CUBT</a>, but we are the first to give formal guarantees. We first need to define the quality of an algorithm. Unsupervised learning problems are commonly <a href="http://cseweb.ucsd.edu/~dasgupta/papers/kmeans.pdf">NP-hard</a>, and clustering is no exception, so it is common to settle for an approximate solution. A bit more formally, an algorithm that returns a tree-based clustering $T$ is an <em>$a$-approximation</em> if $cost(T)\leq a\cdot cost(opt),$ where $opt$ is the clustering that minimizes the $k$-means cost.</p>
<h3 id="general-scheme">General scheme</h3>
<p>Many supervised learning algorithms learn a decision tree; can we use one of them here? Yes, after we transform the problem into a supervised learning problem! How, you might ask? We can use any clustering algorithm that returns a good, but not explainable, clustering. This forms the labeling. Then we can use a supervised algorithm that learns a decision tree. Let&rsquo;s summarize these three steps:</p>
<ol>
<li>Find a clustering using some clustering algorithm</li>
<li>Label each example according to its cluster</li>
<li>Call a supervised algorithm that learns a decision tree</li>
</ol>
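The three steps above can be sketched in a few lines, assuming scikit-learn is available; its CART tree learner stands in for step 3 here (the next sections discuss which step-3 learner actually gives guarantees):

```python
# Sketch of the three-step scheme on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # steps 1-2
tree = DecisionTreeClassifier(max_leaf_nodes=3).fit(X, labels)           # step 3
# max_leaf_nodes=k keeps the tree small: one leaf per cluster.
```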
<p>Which algorithm can we use in step 3? Maybe the popular ID3 algorithm?</p>
<h3 id="can-we-use-the-id3-algorithm">Can we use the ID3 algorithm?</h3>
<p>Short answer: no.</p>
<p>One might hope that in step 3, in the previous scheme, the known <a href="https://link.springer.com/content/pdf/10.1007/BF00116251.pdf">ID3</a> algorithm can be used (or one of its variants like <a href="https://link.springer.com/article/10.1007/BF00993309">C4.5</a>). We will show that this does not work. There are datasets where ID3 will perform poorly. Here is an example:</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_k_means/intro_IMM_blog_pic_4.png" width="40%" style="margin: 0 auto" />
<figcaption>
ID3 performs poorly on this dataset
</figcaption>
</figure>
<p>The dataset is composed of three clusters, as you can see in the figure above. Two large clusters (0 and 1 in the figure) have centers $(-2, 0)$ and $(2, 0)$, respectively, plus small noise. The third cluster (2 in the figure) is composed of only two points that are very, very (very) far away from clusters 0 and 1. Given this data, ID3 will prefer to maximize the information gain and split between clusters 0 and 1. Recall that the final tree has only three leaves. This means that in the final tree, one point of cluster 2 must end up with cluster 0 or cluster 1, so the cost is enormous.
To solve this problem, we design a new algorithm called <a href="https://proceedings.icml.cc/paper/2020/file/8e489b4966fe8f703b5be647f1cbae63-Paper.pdf"><em>Iterative Mistake Minimization (IMM)</em></a>.</p>
<h3 id="imm-algorithm-for-explainable-clustering">IMM algorithm for explainable clustering</h3>
<p>We learned that the ID3 algorithm cannot be used in step 3 of the general scheme. But before we give up on the scheme, can we use a different decision-tree algorithm? Well, since we wrote this post, you probably know the answer: there is such an algorithm, the IMM algorithm.</p>
<p>We build the tree greedily from top to bottom. At each step, we take the split (i.e., feature and threshold) that minimizes a new quantity called the number of <strong>mistakes</strong>. A point $x$ is a mistake for node $u$ if $x$ and its center $c(x)$ reach $u$ and are then separated by $u$&rsquo;s split. See the next figure for an example of a split with one mistake.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_k_means/mistakes_example.png" width="40%" style="margin: 0 auto" />
<figcaption>
Split (in yellow) with one mistake. Two optimal clusters are in red and blue. Centers are the stars.
</figcaption>
</figure>
<!--For another example of the mistakes concept, let's go back to the previous dataset where ID3 failed. Focus on the first split again. The ID3 split has one mistake since one of the points in cluster $2$ will be separated from its center. On the other hand, the horizontal split has $0$ mistakes: the two large clusters will go with their centers to one side of the tree, and the small cluster will go with its center to the other side of the tree. -->
<p>To summarize, the high-level description of the IMM algorithm:
<!--<center>
<span style="font-family:Papyrus; font-size:2em;align-self: center;">As long as there is more than one center
<br> find the split with minimal number of mistakes</span>
</center>
--></p>
<center>
<span style="font-size:larger;">
As long as there is more than one center
<br /> find the split with minimal number of mistakes
</span>
</center>
<p> </p>
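A minimal, unoptimized sketch of this loop is below. It is ours, not the paper's implementation: candidate thresholds come only from the center coordinates, degenerate cases (e.g., a cluster losing all its points) are ignored, and points separated from their centers simply drop out of the recursion.

```python
import numpy as np

def count_mistakes(X, C, i, theta):
    """Points sent to a different side of the cut x_i <= theta than their center."""
    return int(((X[:, i] <= theta) != (C[:, i] <= theta)).sum())

def imm(X, C, labels):
    """X: points at this node; C[j]: the reference (optimal) center of point j;
    labels[j]: that center's cluster id. Returns a nested-dict decision tree."""
    if len(set(labels)) == 1:
        return {"cluster": int(labels[0])}
    # Thresholds between distinct center values, so every cut separates centers.
    m, i, theta = min(
        (count_mistakes(X, C, i, t), i, t)
        for i in range(X.shape[1])
        for t in np.unique(C[:, i])[:-1])
    go_left = X[:, i] <= theta
    keep = go_left == (C[:, i] <= theta)   # mistaken points leave the recursion
    L, R = go_left & keep, ~go_left & keep
    return {"feature": i, "theta": float(theta),
            "left": imm(X[L], C[L], labels[L]),
            "right": imm(X[R], C[R], labels[R])}

# Three reference clusters around (0, 0.5), (5, 0.5), and (0, 5.5).
X = np.array([[0.0, 0], [0, 1], [5, 0], [5, 1], [0, 5], [0, 6]])
centers = np.array([[0.0, 0.5], [5, 0.5], [0, 5.5]])
labels = np.array([0, 0, 1, 1, 2, 2])
tree = imm(X, centers[labels], labels)
```

On this toy data the first split is $x_0\leq 0$ with zero mistakes, and the recursion stops once each center sits in its own leaf.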
<!--What if there are no mistakes.
The main definition that we need is a mistake:
Creare a different figure that explains a mistake with small number of points
-->
<!--
<center>
<span style="font-family:Papyrus; font-size:2em;align-self: center;">If a point and its center diverge,
<br> then it counts as a mistake</span>
</center>
<div class="definition"> [mistake at node $u$].
If a point and its center end up at different leafs, then it counts as a mistake.
</div>
... Explain what is a split early on ...
-->
<!---
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">IMM</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">centers</span><span class="p">):</span>
<span class="n">node</span> <span class="o">=</span> <span class="n">new</span> <span class="n">Node</span><span class="p">()</span>
<span class="k">if</span> <span class="o">|</span><span class="n">centers</span><span class="o">|</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">i</span><span class="p">,</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">find_split</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">centers</span><span class="p">)</span>
<span class="n">node</span><span class="p">.</span><span class="n">condition</span> <span class="o">=</span> <span class="s">'x_i <= theta'</span>
<span class="n">points_left_mask</span> <span class="o">=</span> <span class="n">points</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span> <span class="o"><=</span> <span class="n">theta</span>
<span class="n">centers_left_mask</span> <span class="o">=</span> <span class="n">centers</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span> <span class="o"><=</span> <span class="n">theta</span>
<span class="n">node</span><span class="p">.</span><span class="n">left</span> <span class="o">=</span> <span class="n">IMM</span><span class="p">(</span><span class="n">points</span><span class="p">[</span><span class="n">points_left_mask</span><span class="p">],</span> <span class="n">centers</span><span class="p">[</span><span class="n">centers_left_mask</span><span class="p">])</span>
<span class="n">node</span><span class="p">.</span><span class="n">right</span> <span class="o">=</span> <span class="n">IMM</span><span class="p">(</span><span class="n">points</span><span class="p">[</span><span class="o">~</span><span class="n">points_left_mask</span><span class="p">],</span> <span class="n">centers</span><span class="p">[</span><span class="o">~</span><span class="n">centers_left_mask</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">node</span><span class="p">.</span><span class="n">label</span> <span class="o">=</span> <span class="n">centers</span>
<span class="k">return</span> <span class="n">node</span>
<span class="k">def</span> <span class="nf">find_split</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">centers</span><span class="p">):</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">d</span><span class="p">):</span>
<span class="n">l</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">centers</span><span class="p">[:,</span><span class="n">i</span><span class="p">])</span>
<span class="n">r</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">centers</span><span class="p">[:,</span><span class="n">i</span><span class="p">])</span>
<span class="n">i</span><span class="p">,</span><span class="n">theta</span> <span class="o">=</span> <span class="n">argmin_</span><span class="p">{</span><span class="n">i</span><span class="p">,</span><span class="n">l</span> <span class="o"><=</span> <span class="n">theta</span> <span class="o"><</span> <span class="n">r</span><span class="p">}</span> <span class="n">mistakes</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span>
<span class="k">return</span> <span class="n">i</span><span class="p">,</span><span class="n">theta</span></code></pre></figure>
-->
<p>Here is an illustration of the IMM algorithm. We use $k$-means++ with $k=5$ to find a clustering for our dataset. Each point is colored with its cluster label. At each node in the tree, we choose a split with a minimal number of mistakes. We stop where each of the $k=5$ centers is in its own leaf. This defines the explainable clustering on the left.</p>
<center>
<img src="/assets/2020-10-16-explain_k_means/imm_example_slow.gif" width="600" height="320" />
</center>
<p>The algorithm is guaranteed to perform well. For any dataset. See the next theorem.</p>
<div class="theorem">
IMM is an $O(k^2)$-approximation to the optimal $k$-means clustering.
</div>
<p>This theorem shows that we can always find a small tree, with $k$ leaves, such that the tree-based clustering is only $O(k^2)$ times worse in terms of the cost. IMM efficiently finds this explainable clustering. Importantly, this approximation factor is independent of the dimension and the number of points. A proof for the case $k=2$ appears in a <a href="explain_2_means.html">follow-up post</a>, and you can read the proof for general $k$ in the paper. Intuitively, we discovered that the number of mistakes is a good indicator of the $k$-means cost, and so minimizing the number of mistakes is an effective way to find a low-cost clustering. <!-- Surprisingly, we can also use a tree with $k$ leaves, which means that IMM produces an explainable clustering.--></p>
<h4 id="running-time">Running Time</h4>
<p>What is the running time of the IMM algorithm? With an efficient implementation, using dynamic programming, the running time is $O(kdn\log(n)).$ Why? For each of the $k-1$ inner nodes and each of the $d$ features, we can find the split that minimizes the number of mistakes for this node and feature, in time $O(n\log(n)).$</p>
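One way the per-feature $O(n\log n)$ step could look, as a sketch (the helper name is ours, not from the paper): sort the $2n$ interval endpoints once, then a single sweep with a running sum yields the mistake count of every candidate cut.

```python
from collections import defaultdict

def mistakes_per_cut(x, c):
    """One-feature sweep: x[j] is a point's value, c[j] its center's value.
    Point j is a mistake for the cut 'x_i <= theta' exactly when
    min(x[j], c[j]) <= theta < max(x[j], c[j]), so after an O(n log n) sort
    a running sum gives every candidate cut's mistake count."""
    delta = defaultdict(int)
    for a, b in zip(x, c):
        delta[min(a, b)] += 1   # j becomes a mistake once theta reaches min(a, b)
        delta[max(a, b)] -= 1   # and stops being one once theta reaches max(a, b)
    counts, running = {}, 0
    for v in sorted(delta):
        running += delta[v]
        counts[v] = running     # mistakes of the cut "x_i <= v"
    return counts

# Two centers at 0.5 and 4.5; their points at {0, 1} and {4, 5}.
cuts = mistakes_per_cut([0, 1, 4, 5], [0.5, 0.5, 4.5, 4.5])
```

Here `cuts[1] == 0`: at $\theta=1$ every point sits on the same side as its center, while cutting at $0$ or $4$ separates one point from its center.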
<p>For $2$-means, one can do better than running IMM: go over all possible $(n-1)d$ cuts and find the best one. The running time is $O(nd^2+nd\log(n))$.</p>
<h3 id="results-summary">Results Summary</h3>
<p>In each cell of the following table, we write the approximation factor. We want this value to be small for the upper bounds and large for the lower bounds. For $2$-medians, the upper and lower bounds are pretty tight, about $2$. But there is a large gap for $k$-means and $k$-medians: the lower bound is $\log(k)$, while the upper bound is $\mathsf{poly}(k)$.</p>
<center>
<table style="text-align: center">
<thead>
<tr>
<th></th>
<th colspan="2" style="text-align: center">$k$-medians</th>
<th colspan="2" style="text-align: center">$k$-means</th>
</tr>
<tr>
<th></th>
<th> $k=2$ </th>
<th> $k>2$ </th>
<th> $k=2$ </th>
<th> $k>2$ </th>
</tr>
</thead>
<tbody>
<tr>
<td> <strong>Lower</strong> </td>
<td> $2-\frac1d$ </td>
<td> $\Omega(\log k)$ </td>
<td> $3\left(1-\frac1d\right)^2$ </td>
<td> $\Omega(\log k)$ </td>
</tr>
<tr>
<td> <strong>Upper</strong> </td>
<td> $2$ </td>
<td> $O(k)$ </td>
<td> $4$ </td>
<td> $O(k^2)$ </td>
</tr>
</tbody>
</table>
</center>
<h3 id="whats-next">What’s next</h3>
<ol>
<li>IMM exhibits excellent results in practice on many datasets; see <a href="https://arxiv.org/abs/2006.02399">this paper</a>. Its running time is comparable to KMeans implemented in sklearn. Our implementation of the IMM algorithm is available <a href="https://github.com/navefr/ExKMC">here</a>. Try it yourself.</li>
<li>We plan to have several posts on explainable clusterings, here is the <a href="explain_2_means.html">second</a> in the series, stay tuned for more!</li>
<li>In a follow-up work, we explore the tradeoff between explainability and accuracy. If we allow a slightly larger tree, can we get a lower cost? We introduce the <a href="https://arxiv.org/abs/2006.02399">ExKMC</a>, “Expanding Explainable $k$-Means Clustering”, algorithm that builds on IMM.</li>
<li>Found cool applications of IMM? Let us know!</li>
</ol><a href='https://sites.google.com/view/michal-moshkovitz'>Michal Moshkovitz</a>, <a href='mailto:navefrost@mail.tau.ac.il'>Nave Frost</a>, <a href='https://sites.google.com/site/cyrusrashtchian/'>Cyrus Rashtchian</a>Popular algorithms for learning decision trees can be arbitrarily bad for clustering. We present a new algorithm for explainable clustering that has provable guarantees --- the Iterative Mistake Minimization (IMM) algorithm. This algorithm exhibits good results in practice. It's running time is comparable to KMeans implemented in sklearn. So our method gives you explanations basically for free. Our code is available on github.Towards Physics-informed Deep Learning for Turbulent Flow Prediction2020-08-23T00:00:00+00:002020-08-23T00:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/08/23/TF-Net<h3 id="prediction-visualization">Prediction Visualization</h3>
<p>We propose a novel hybrid model for turbulence prediction, $\texttt{TF-Net}$, that unifies a popular <a href="https://en.wikipedia.org/wiki/Computational_fluid_dynamics">Computational fluid dynamics (CFD)</a> technique, RANS-LES coupling, with a custom-designed U-net. The following two videos show the ground truth and the predicted U (left) and V (right) velocity fields from $\texttt{TF-Net}$ and the three best baselines. The predictions by $\texttt{TF-Net}$ are the closest to the target in both the shape and the frequency of the motions; the baselines generate smooth predictions and miss the details of small-scale motion.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/U_prediction.gif" width="49%" style="margin: 0 auto" />
<img src="/assets/2020-08-23-TF-Net/V_prediction.gif" width="49%" style="margin: 0 auto" />
</div>
<p><br /></p>
<h3 id="introduction">Introduction</h3>
<p>Modeling the spatiotemporal dynamics over a wide range of space and time scales is a fundamental task in science, especially atmospheric science, marine science, and aerodynamics. <a href="https://en.wikipedia.org/wiki/Computational_fluid_dynamics">Computational fluid dynamics (CFD)</a> is at the heart of climate modeling and has direct implications for understanding and predicting climate change. Recently, deep learning has demonstrated great success in the <a href="https://www.nature.com/articles/s41586-019-0912-1">automation, acceleration, and streamlining of highly compute-intensive workflows for science</a>. We hope deep learning can accelerate turbulence simulation, since current CFD methods are purely physics-based and computationally intensive, requiring significant computational resources and expertise.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/imgs.png" width="90%" style="margin: 0 auto" />
</div>
<p><br />
But purely data-driven methods are mainly statistical, incorporate no underlying physical knowledge, and have yet to prove successful at capturing and accurately predicting complex physical systems. Incorporating physics knowledge into deep learning models can improve not only prediction accuracy but, more importantly, physical consistency. Thus, developing deep learning methods that can incorporate physical laws in a systematic manner is a key element in advancing AI for the physical sciences.</p>
<p><a href="https://uknowledge.uky.edu/me_textbooks/2/">Computational techniques</a> are at the core of present-day turbulence investigations; computational fluid dynamics is the branch of fluid mechanics that uses numerical methods to analyze and predict fluid flows. In physics, the following <a href="https://en.wikipedia.org/wiki/Navier%E2%80%93Stokes_equations">Navier–Stokes equations</a> describe the motion of a viscous fluid.</p>
\[\nabla \cdot \pmb{w} = 0 \qquad\qquad\qquad\qquad\qquad\qquad\qquad \text{Continuity Equation}\]
\[\frac{\partial \pmb{w}}{\partial t} + (\pmb{w} \cdot \nabla) \pmb{w} = -\frac{1}{\rho_0} \nabla p + \nu \nabla^2 \pmb{w} + f \quad\text{Momentum Equation}\]
\[\frac{\partial T}{\partial t} + (\pmb{w} \cdot \nabla) T = \kappa \nabla^2 T \qquad\qquad\qquad\quad \text{Temperature Equation}\]
<p>where $\pmb{w}(t)$ is the vector velocity field of the flow, which is what we want to predict; $p$ and $T$ are the pressure and temperature respectively, $\kappa$ is the coefficient of heat conductivity, $\rho_0$ is the initial density, $\alpha$ is the coefficient of thermal expansion, $\nu$ is the kinematic viscosity, and $f$ is the body force due to gravity.</p>
<p><br /></p>
<h3 id="turbulent-flow-net">Turbulent-Flow Net</h3>
<p>For turbulent flows, the range of length scales and the complexity of the phenomena involved make <a href="https://en.wikipedia.org/wiki/Direct_numerical_simulation">Direct Numerical Simulation (DNS)</a> approaches prohibitively expensive. Great emphasis has therefore been placed on alternative approaches, including Large-Eddy Simulation (LES), Reynolds-averaged Navier–Stokes (RANS), and <a href="https://link.springer.com/article/10.1007/s10494-017-9828-8">Hybrid RANS-LES Coupling</a>, which combines RANS and LES to take advantage of both methods. These methods decompose the fluid flow into different scales in order to directly simulate large scales while modeling small ones.</p>
<p>Hybrid RANS-LES Coupling decomposes the flow velocity into three scales: mean flow, resolved fluctuations and unresolved fluctuations. It applies the spatial filtering operator $S$ and the temporal average operator $T$ sequentially.</p>
\[\pmb{w^*}(\pmb{x},t) = S \ast\pmb{w} = \sum_{\pmb{\xi}} S(\pmb{x}|\pmb{\xi})\pmb{w}(\pmb{\xi},t)\]
\[\pmb{\bar{w}}(\pmb{x},t) = T \ast \pmb{w^*} = \frac{1}{n}\sum_{s = t-n}^tT(s) \pmb{w^*} (\pmb{x}, s)\]
<p>then the resolved fluctuation $\pmb{\tilde{w}}$ and the unresolved fluctuation $\pmb{w'}$ are defined by:</p>
\[\pmb{\tilde{w}} = \pmb{w^*} - \pmb{\bar{w}}, \quad \pmb{w'} = \pmb{w} - \pmb{w^{*}}\]
<p>Finally we can have the three-level decomposition of the velocity field.</p>
<p>\begin{equation}
\pmb{w} = \pmb{\bar{w}} + \pmb{\tilde{w}} + \pmb{w'}
\end{equation}</p>
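<p>To make the decomposition concrete, here is a minimal numpy sketch with fixed box filters standing in for $S$ and $T$ (in $\texttt{TF-Net}$ both filters are trainable neural networks; the box filters and function name below are our illustrative assumptions):</p>

```python
import numpy as np

def three_scale_decomposition(w, spatial_k=5, temporal_n=4):
    """Decompose a velocity sequence w[t, x] into mean flow (w_bar),
    resolved fluctuations (w_tilde), and unresolved fluctuations (w_prime).
    Simple box filters stand in for the trainable filters S and T."""
    # Spatial filtering S * w: moving average along the spatial axis
    kernel = np.ones(spatial_k) / spatial_k
    w_star = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, w)
    # Temporal averaging T * w*: mean over a sliding window of past frames
    w_bar = np.empty_like(w_star)
    for t in range(w.shape[0]):
        w_bar[t] = w_star[max(0, t - temporal_n + 1):t + 1].mean(axis=0)
    w_tilde = w_star - w_bar   # resolved fluctuations
    w_prime = w - w_star       # unresolved fluctuations
    return w_bar, w_tilde, w_prime
```

<p>Because $\pmb{\tilde{w}}$ and $\pmb{w'}$ are defined as residuals, the reconstruction $\pmb{w} = \pmb{\bar{w}} + \pmb{\tilde{w}} + \pmb{w'}$ holds exactly no matter which filters are used.</p>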
<p>The figure below shows this three-level decomposition in wavenumber space. $k$ is the wavenumber, the spatial frequency in the Fourier domain. $E(k)$ is the energy spectrum describing how much kinetic energy is contained in eddies with wavenumber $k$. Small $k$ corresponds to large eddies that contain most of the energy. The slope of the spectrum is negative and indicates the transfer of energy from large scales of motion to the small scales.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/decompose.png" width="40%" style="margin: 0 auto" />
</div>
<p><br />
Inspired by the hybrid RANS-LES Coupling, we propose a hybrid deep learning framework, $\texttt{TF-Net}$, based on the multilevel spectral decomposition. Specifically, we decompose the velocity field into three scales using the spatial filter $S$ and the temporal filter $T$. Unlike traditional CFD, both filters in $\texttt{TF-Net}$ are trainable neural networks. The motivation for this design is to explicitly guide the DL model to learn the non-linear dynamics of both large and small eddies. We design three identical convolutional encoders to encode the three scale components separately and use a shared convolutional decoder to learn the interactions among these three components and generate the final prediction. The figure below shows the overall architecture of our hybrid model $\texttt{TF-Net}$.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/model.png" width="98%" style="margin: 0 auto" />
</div>
<p><br /></p>
<p>Since the turbulent flow under investigation has zero divergence, we include $\Vert\nabla \cdot \pmb{w}\Vert^2$ as a regularizer to constrain the predictions, leading to a constrained TF-Net, $\texttt{Con TF-Net}$.</p>
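<p>As a sketch of what this regularizer computes (the function name and the central finite-difference discretization below are illustrative choices on our part, not the exact training code):</p>

```python
import numpy as np

def divergence_penalty(u, v, dx=1.0, dy=1.0):
    """||div w||^2 penalty for a 2D velocity field with components
    u (x-velocity) and v (y-velocity), stored as arrays indexed [y, x].
    Divergence is estimated with central finite differences."""
    du_dx = np.gradient(u, dx, axis=1)  # d(u)/dx along the x axis
    dv_dy = np.gradient(v, dy, axis=0)  # d(v)/dy along the y axis
    div = du_dx + dv_dy
    return np.mean(div ** 2)
```

<p>A divergence-free field incurs zero penalty, while any net expansion or compression of the flow is penalized quadratically.</p>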
<p><br />
<br /></p>
<h3 id="results">Results</h3>
<p>We compare our model with four purely data-driven deep learning models, including <a href="https://arxiv.org/abs/1512.03385">$\texttt{ResNet}$</a>, <a href="https://arxiv.org/abs/1506.04214">$\texttt{ConvLSTM}$</a>, <a href="https://arxiv.org/abs/1505.04597">$\texttt{U-net}$</a> and <a href="https://arxiv.org/abs/1406.2661">$\texttt{GAN}$</a>, and two hybrid physics-informed models, <a href="https://arxiv.org/abs/1801.06637">$\texttt{DHPM}$</a> and <a href="https://arxiv.org/abs/1711.07970">$\texttt{SST}$</a>. All models are trained to make one-step-ahead predictions given the historic frames, and we apply them autoregressively to generate multi-step forecasts.</p>
<p>$\textbf{Accuracy}$ The following figure shows the growth of RMSE with prediction horizon, up to 60 time steps ahead. We can see that $\texttt{TF-Net}$ consistently outperforms all baselines, and that constraining it with the divergence-free regularizer further improves performance.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/rmse_horizon.png" width="55%" style="margin: 0 auto" />
</div>
<p><br /></p>
<p>$\textbf{Physical Consistency}$ The left figure below shows the average absolute divergence over all pixels at each prediction step, and the right figure shows the energy spectrum curves. $\texttt{TF-Net}$'s predictions are much closer to the target even without the additional divergence-free constraint, which suggests that $\texttt{TF-Net}$ generates predictions that are physically consistent with the ground truth.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/divergence.png" width="48%" style="margin: 0 auto" />
<img src="/assets/2020-08-23-TF-Net/spec_ci_square.png" width="48%" style="margin: 0 auto" />
</div>
<p><br /></p>
<p>$\textbf{Efficiency}$ This figure shows the average time each model takes to produce one 64 × 448 2D velocity field on a single V100 GPU. We can see that $\texttt{TF-Net}$, $\texttt{U-net}$ and $\texttt{GAN}$ are faster than the numerical Lattice Boltzmann method, and $\texttt{TF-Net}$'s speed advantage grows on higher-resolution data.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/avg_time.png" width="60%" style="margin: 0 auto" />
</div>
<p><br />
$\textbf{Ablation Study}$ We also perform an ablation study to understand each component of $\texttt{TF-Net}$ and to investigate whether the model has actually learned flows at different scales. During inference, we apply each small U-net in $\texttt{TF-Net}$ to the entire input domain with the other two encoders zeroed out. The video below shows $\texttt{TF-Net}$'s predictions alongside the outputs of each small $\texttt{U-net}$. We observe that each small $\texttt{U-net}$ captures the flow at a different scale, which demonstrates that $\texttt{TF-Net}$ can learn multi-scale behaviors.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/Ablation_Study.gif" width="70%" style="margin: 0 auto" />
</div>
<p><br />
<br /></p>
<h3 id="conclusion-and-future-work">Conclusion and Future Work</h3>
<p>We presented a novel hybrid deep learning model, $\texttt{TF-Net}$, that unifies representation learning and turbulence simulation techniques. $\texttt{TF-Net}$ exploits the multi-scale behavior of turbulent flows to design trainable scale-separation operators that model different ranges of scales individually. We provide exhaustive comparisons of $\texttt{TF-Net}$ and baselines and observe significant improvement in both the prediction error and desired physical quantities, including divergence, turbulence kinetic energy and energy spectrum. Future work includes extending these techniques to very high-resolution, 3D turbulent flows and incorporating additional physical variables, such as pressure and temperature, and additional physical constraints, such as conservation of momentum, to improve the accuracy and faithfulness of deep learning models.</p>
<h3 id="more-details">More Details</h3>
<h4 id="see-our-paper-or-our-repository">See <a href="https://arxiv.org/abs/1911.08655">our paper</a> or our <a href="https://github.com/Rose-STL-Lab/Turbulent-Flow-Net">repository</a>.</h4><a href='mailto:ruw020@ucsd.edu'>Rui Wang</a>, <a href='mailto:kkashinath@lbl.gov'>Karthik Kashinath</a>, <a href='mailto:mmustafa@lbl.gov'>Mustafa Mustafa</a>, <a href='mailto:aalbert@lbl.gov'>Adrian Albert</a> and <a href='mailto:roseyu@eng.ucsd.edu'>Rose Yu</a>While deep learning has shown tremendous success in a wide range of domains, it remains a grand challenge to incorporate physical principles in a systematic manner into the design, training, and inference of such models. In this paper, we aim to predict turbulent flow by learning its highly nonlinear dynamics from spatiotemporal velocity fields of large-scale fluid flow simulations of relevance to turbulence modeling and climate modeling. We adopt a hybrid approach by marrying two well-established turbulent flow simulation techniques with deep learning. Specifically, we introduce trainable spectral filters in a coupled model of Reynolds-averaged Navier-Stokes (RANS) and Large Eddy Simulation (LES), followed by a specialized U-net for prediction. Our approach, which we call Turbulent-Flow Net (TF-Net), is grounded in a principled physics model, yet offers the flexibility of learned representations. We compare our model, TF-Net, with state-of-the-art baselines and observe significant reductions in error for predictions 60 frames ahead.
Most importantly, our method predicts physical fields that obey desirable physical characteristics, such as conservation of mass, whilst faithfully emulating the turbulent kinetic energy field and spectrum, which are critical for accurate prediction of turbulent flows.How to Detect Data-Copying in Generative Models2020-08-03T19:00:00+00:002020-08-03T19:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/08/03/how-to-detect-data-copying-in-generative-models<p>In our <a href="https://arxiv.org/abs/2004.05675">AISTATS 2020 paper</a>, professors <a href="https://cseweb.ucsd.edu/~kamalika/">Kamalika Chaudhuri</a>, <a href="https://cseweb.ucsd.edu/~dasgupta/">Sanjoy Dasgupta</a>, and I propose some new definitions and test statistics for conceptualizing and measuring overfitting by generative models.</p>
<p>Overfitting is a basic stumbling block of any learning process. Take learning to cook for example. In quarantine, I’ve attempted ~60 new recipes and can recreate ~45 of them consistently. The recipes are my training set and the fraction I can recreate is a sort of training error. While this training error is not exactly impressive, if you ask me to riff on these recipes and improvise, the result (i.e. dinner) will be dramatically worse.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/supervised_overfitting_2.png" width="75%" /></p>
<p>It is well understood that our models tend to do the same – deftly regurgitating their training data, yet struggling to generalize to unseen examples similar to the training data. Learning theory has nicely formalized this in the supervised setting. Our classification and regression models start to overfit when we observe a gap between training and (held-out) test prediction error, as in the above figure for the overly complex models.</p>
<p>This notion of overfitting relies on being able to measure prediction error or perhaps log likelihood of the labels, which is rarely a barrier in the supervised setting; supervised models generally output low dimensional, simple predictions. Such is not the case in the generative setting where we ask models to output original, high dimensional, complex entities like images or natural language. Here, we certainly lack any notion of prediction error and likelihoods are intractable for many of today’s generative models like VAEs and GANs: VAEs only provide a lower bound of the data likelihood, and GANs only leave us with their samples. This prevents us from simply measuring the gap between train and test accuracy/likelihood and calling it a day as we do with supervised models.</p>
<p>Instead, we evaluate generative models by comparing their generated samples with those of the true distribution, as in the following figure. Here, a two-sample test only uses a training sample and a generated sample. A three-sample test uses an additional held out test sample from the true distribution.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/unsupervised_setting_2.png" width="75%" /></p>
<p>This practice is well established by existing two-sample generative model tests like the <a href="https://arxiv.org/abs/1706.08500">Frechet Inception Distance</a>, <a href="https://arxiv.org/abs/1611.04488">Kernel MMD</a>, and <a href="https://arxiv.org/abs/1806.00035">Precision & Recall test</a>. But in the absence of ground-truth labels, what exactly are we testing for? We argue that unlike supervised models, generative models exhibit two varieties of overfitting: <strong>over-representation</strong> and <strong>data-copying</strong>.</p>
<h3 id="data-copying-vs-over-representation">Data-copying vs. Over-representation</h3>
<p>Most generative model tests like those listed above check for over-representation: the tendency of a model to over-emphasize certain regions of the instance space by assigning more probability mass there than it should. Consider a data distribution $P$ over an instance space $\mathcal{X}$ of cat cartoons. Region $\mathcal{C} \subset \mathcal{X}$ specifically contains cartoons of cats with bats. Using training set $T \sim P$, we train a generative model $Q$ from which we draw a sample $Q_m \sim Q$.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/overrepresentation.png" width="95%" /></p>
<p>Evidently, the model $Q$ really likes region $\mathcal{C}$, generating an undue share of cats with bats. More formally, we say $Q$ is over-representing some region $\mathcal{C}$ when</p>
<p>\[ \Pr_{x \sim Q}[x \in \mathcal{C}] \gg \Pr_{x \sim P}[x \in \mathcal{C}] \]</p>
<p>This can be measured with a simple two-sample hypothesis test, as was done in Richardson & Weiss’s <a href="https://arxiv.org/abs/1805.12462">2018 paper</a> demonstrating the efficacy of Gaussian mixture models in high dimension.</p>
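<p>One concrete instantiation of such a test (our illustrative choice here; any two-sample proportion test would serve) is a two-proportion $z$-test on membership in $\mathcal{C}$:</p>

```python
import numpy as np
from scipy.stats import norm

def over_representation_z(in_C_true, in_C_gen):
    """Two-proportion z-test: does Q place more mass in region C than P?
    Inputs are boolean arrays marking which samples fall in C.
    Large positive z (small p) suggests Q over-represents C."""
    n, m = len(in_C_true), len(in_C_gen)
    p_true, p_gen = in_C_true.mean(), in_C_gen.mean()
    # Pooled proportion under the null that both samples share one rate
    p_pool = (in_C_true.sum() + in_C_gen.sum()) / (n + m)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / m))
    z = (p_gen - p_true) / se
    return z, 1 - norm.cdf(z)  # one-sided p-value for over-representation
```

<p>In practice one would run this over many candidate regions (or bins) of the instance space rather than a single hand-picked $\mathcal{C}$.</p>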
<p>Data-copying, on the other hand, occurs when $Q$ produces samples that are <em>closer to training set $T$</em> than they should be. To test for this, we equip ourselves with a held-out test sample $P_n \sim P$ in addition to some distance metric $d(x,T)$ that measures proximity to the training set of any $x \in \mathcal{X}$. We then say that $Q$ is data-copying training set $T$ when examples $x \sim Q$ are on average closer to $T$ than are $x \sim P$.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/data_copying_1_.png" width="95%" /></p>
<p>We define proximity to training set $d(x,T)$ to be the distance between $x$ and its nearest neighbor in $T$ according to some metric $d_\mathcal{X}:\mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$. Specifically</p>
<p>\[ d(x,T) = \min_{t \in T}d_\mathcal{X}(x,t) \]</p>
<p>At first glance, the generated samples in the above figure look perfectly fine, representing the different regions nicely. But taken alongside its training and test sets, we see that the model has effectively copied the cat with bat in the lower right corner (for visualization, we let Euclidean distance $d_\mathcal{X}$ be a proxy for similarity).</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/data_copying_2.png" width="95%" /></p>
<p>More formally, $Q$ is data-copying $T$ in some region $\mathcal{C} \subset \mathcal{X}$ when</p>
<p>\[ \Pr_{x \sim Q, z \sim P}[d(x,T) < d(z,T) \mid x,z \in \mathcal{C}] \gg \frac{1}{2}\]</p>
<p>The key takeaway here is that data-copying and over-representation are <em>orthogonal failure modes</em> of generative models. A model that exhibits over-representation may or may not data-copy and vice versa. As such, it is critical that we test for both failure modes when designing and training models.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/orthogonal_concepts_2_.png" width="70%" /></p>
<p>Returning to my failed culinary ambitions, I tend to both data-copy recipes I’ve tried <em>and</em> over-represent certain types of cuisine. If you look at the ‘true distribution’ of recipes online, you will find that there is a tremendous diversity of cooking styles and cuisines. However, put in the unfortunate circumstance of having me cook for you, I will most likely produce some slight variation of a recipe I’ve recently tried. And, even though I have attempted a number of Indian, Mexican, Italian, and French dishes, I tend to over-represent bland pastas and salads when left to my own devices. To cook truly original food, one must both be creative enough to go beyond the recipes they’ve seen <em>and</em> versatile enough to make a variety of cuisines. So, be sure to test for both data-copying and over-representation, and do not ask me to cook for you.</p>
<h3 id="a-three-sample-test-for-data-copying">A Three-Sample Test for Data-Copying</h3>
<p>Adding another test to one’s modeling pipeline is tedious. The good news is that data-copying can be tested with a single snappy three-sample hypothesis test. It is non-parametric, and it concentrates nicely with increasing numbers of both test and generated samples.</p>
<p>As described in the previous section, we use a training sample $T \sim P$, a held-out test sample $P_n \sim P$, and a generated sample $Q_m \sim Q$. We additionally need some distance metric $d_\mathcal{X}(x,z)$. In practice, we choose $d_\mathcal{X}(x,z)$ to be the Euclidean distance between $x$ and $z$ after being embedded by $\phi$ into some lower-dimensional perceptual space: $d_\mathcal{X}(x,z) = \| \phi(x) - \phi(z) \|_2$. The use of such embeddings is common practice in testing generative models as exhibited by several existing over-representation tests like <a href="https://arxiv.org/abs/1706.08500">Frechet Inception Distance</a> and <a href="https://arxiv.org/abs/1806.00035">Precision & Recall</a>.</p>
<p>Following intuition, it is tempting to check for data-copying by simply differencing the expected distance to training set:</p>
<div>
$$
\mathbb{E}_{x \sim Q} [d(x,T)] - \mathbb{E}_{x \sim P} [d(x,T)] \approx \frac{1}{m} \sum_{x_i \in Q_m} d(x_i, T) - \frac{1}{n} \sum_{x_i \in P_n}d(x_i, T) \ll 0
$$
</div>
<p>where, to reiterate, $d(x,T)$ is the distance $d_\mathcal{X}$ between $x$ and its nearest neighbor in $T$. This statistic — an expected distance — is a little too finicky: the variance is far out of our control, influenced by both the choice of distance metric and by outliers in both $P_n$ and $Q_m$. So, instead of probing for how <em>much</em> closer $Q$ is to $T$ than $P$ is, we probe for how <em>often</em> $Q$ is closer to $T$ than $P$ is:</p>
<div>
$$
\mathbb{E}_{x \sim Q, z \sim P} [\mathbb{1}_{d(x,T) > d(z,T)}] \approx \frac{1}{nm} \sum_{x_i \in Q_m, z_j \in P_n} \mathbb{1} \big( d(x_i, T) > d(z_j, T) \big) \ll \frac{1}{2}
$$
</div>
<p>This statistic — a probability — is closer to what we want to measure, and is more stable. It tells us how much more likely samples in $Q_m$ are to fall near samples in $T$ relative to the held out samples in $P_n$. If it is much less than a half, then significant data-copying is occurring. This statistic is much more robust to outliers and is lower variance. Additionally, by measuring a probability instead of an expected distance, the value of this statistic is interpretable. Regardless of the data domain or distance metric, less than half is overfit, half is good, and over half is underfit (in the sense that the generated samples are further from the training set than they should be). We are also able to show that this indicator statistic has nice concentration properties agnostic to the chosen distance metric.</p>
<p>It turns out that the above test is an instantiation of the <a href="https://en.wikipedia.org/wiki/Mann-Whitney_U_test">Mann-Whitney hypothesis test</a>, proposed in 1947, for which there are computationally efficient implementations in packages like <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html">SciPy</a>. By $Z$-scoring the Mann-Whitney statistic, we normalize its mean to zero and variance to one. We call this statistic $Z_U$. As such, a generative model $Q$ with $Z_U \ll 0$ is heavily data-copying and a score $Z_U \gg 0$ is underfitting. Near 0 is ideal.</p>
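<p>A minimal sketch of $Z_U$ using SciPy's rank utilities (plain Euclidean distance stands in here for the embedded perceptual metric, and <code>zu_statistic</code> is a hypothetical helper name, not code from the paper):</p>

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import rankdata

def zu_statistic(Pn, Qm, T):
    """Z-scored Mann-Whitney U on d(., T): Z_U << 0 flags data-copying,
    Z_U >> 0 flags underfitting, near 0 is ideal."""
    tree = cKDTree(T)
    d_Q = tree.query(Qm)[0]  # distance of each generated sample to its training NN
    d_P = tree.query(Pn)[0]  # same for held-out test samples
    m, n = len(d_Q), len(d_P)
    ranks = rankdata(np.concatenate([d_Q, d_P]))
    # U counts pairs with d_Q > d_P (ties count 1/2); under H0 its mean is mn/2
    U = ranks[:m].sum() - m * (m + 1) / 2
    mean, std = m * n / 2, np.sqrt(m * n * (m + n + 1) / 12)
    return (U - mean) / std
```

<p>On synthetic data, a model that returns slightly perturbed training points drives $Z_U$ far below zero, while a model sampling the true distribution keeps it near zero.</p>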
<h3 id="handling-heterogeneity">Handling Heterogeneity</h3>
<p>An operative phrase that you may have noticed in the above definition of data-copying is “on average”. Is the generative model closer to the training data than it should be <em>on average</em>? This, unfortunately, is prone to false negatives. If $Z_U \ll 0$, then $Q$ is certainly data-copying in some region $\mathcal{C} \subset \mathcal{X}$. However, if $Z_U \geq 0$, it may still be excessively data-copying in one region and significantly underfitting in another, leading to a test score near 0.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/bins_1.png" width="33%" /></p>
<p>For example, let the $\times$’s denote training samples and the red dots denote generated samples. Even without observing a held-out test sample, it is clear that $Q$ is data-copying in the pink region and underfitting in the green region. $Z_U$ will fall near 0, suggesting the model is performing well despite this highly undesirable behavior.</p>
<p>To prevent this misreading, we employ an algorithmic tool seen frequently in non-parametric testing: binning. Break the instance space into a partition $\Pi$ consisting of $k$ ‘bins’ or ‘cells’ $\pi \in \Pi$ and collect $Z_U^\pi$ in each cell $\pi$.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/bins_2.png" width="33%" /></p>
<p>The statistic maintains its concentration properties within each cell. The more test and generated samples we have ($n$ and $m$), the more bins we can construct, and the more precisely we can pinpoint a model’s data-copying behavior. The ‘goodness’ of a model’s fit is an inherently multidimensional entity, and it is informative to explore the range of $Z_U^\pi$ values seen in all cells $\pi \in \Pi$. Our experiments indicate that VAEs and GANs both tend to data-copy in some cells and underfit in others. However, to boil all this down into a single statistic for model comparisons, we simply take an average of the $Z_U^\pi$ values weighted by the number of test samples in the cell:</p>
<div>
$$
C_T = \sum_{\pi \in \Pi} \frac{\#\{P_n \in \pi\}}{n} Z_U^\pi
$$
</div>
<p>(In practice, we restrict ourselves to cells with a sufficient number of generated samples; see the <a href="https://arxiv.org/abs/2004.05675">paper</a>.) Intuitively, this statistic tells us whether the model tends to data-copy in the regions most heavily emphasized by the true distribution. It does not tell us whether or not the model $Q$ data-copies <em>somewhere</em>.</p>
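<p>A self-contained sketch of $C_T$ (for simplicity, the first $k$ training points stand in for the $k$-means centroids used in our experiments, and <code>ct_statistic</code>, <code>k</code>, and <code>min_gen</code> are illustrative choices, not the paper's exact code):</p>

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import rankdata

def ct_statistic(Pn, Qm, T, k=3, min_gen=20):
    """C_T: average of per-cell Z_U scores, weighted by test mass per cell.
    Cells are Voronoi regions of k crude centroids (first k training points,
    a stand-in for a proper k-means fit)."""
    cell_tree = cKDTree(T[:k])
    cells_P = cell_tree.query(Pn)[1]   # cell assignment of each test sample
    cells_Q = cell_tree.query(Qm)[1]   # cell assignment of each generated sample
    nn = cKDTree(T)
    d_P, d_Q = nn.query(Pn)[0], nn.query(Qm)[0]
    ct, n = 0.0, len(Pn)
    for c in range(k):
        dq, dp = d_Q[cells_Q == c], d_P[cells_P == c]
        m_c, n_c = len(dq), len(dp)
        if m_c < min_gen or n_c == 0:  # skip cells with too few generated samples
            continue
        ranks = rankdata(np.concatenate([dq, dp]))
        U = ranks[:m_c].sum() - m_c * (m_c + 1) / 2
        z = (U - m_c * n_c / 2) / np.sqrt(m_c * n_c * (m_c + n_c + 1) / 12)
        ct += (n_c / n) * z
    return ct
```

<p>As with the global statistic, $C_T \ll 0$ flags data-copying and $C_T \gg 0$ flags underfitting, but the per-cell $Z_U^\pi$ values computed along the way are themselves worth inspecting.</p>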
<h3 id="experiments-data-copying-in-the-wild">Experiments: data-copying in the wild</h3>
<p>Observing data-copying in VAEs and GANs indicates that the $C_T$ statistic above serves as an instructive tool for model selection. For a more methodical interrogation of the $C_T$ statistic and comparison with baseline tests, be sure to check out the <a href="https://arxiv.org/abs/2004.05675">paper</a>.</p>
<p>To test how VAE complexity relates to data-copying, we train 20 VAEs on MNIST with increasing width as indicated by the latent dimension. For each model $Q$, we draw a sample of generated images $Q_m$, and compare with a held out test set $P_n$ to measure $C_T$. Our distance metric is given by the 64d latent space of an autoencoder we trained with the VGG perceptual loss of <a href="https://arxiv.org/abs/1801.03924">Zhang et al.</a>. The purpose of this alternative latent space is to provide an embedding that both provides a perceptual distance between images and is independent of the VAE embeddings. For partitioning, we simply take the Voronoi cells induced by the $k$ centroids found by $k$-means run on the embedded training dataset.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/VAE_overfitting.png" width="49%" />
<img src="/assets/2020-08-03-data-copying/VAE_gen_gap.png" width="46%" /></p>
<h5 style="text-align: center;">The data-copying $C_T$ statistic (left) captures overfitting in overly complex VAEs. The train/test gap in ELBO (right), meanwhile, does not.</h5>
<p>Recall that $C_T \ll 0$ indicates data-copying and $C_T \gg 0$ indicates underfitting. We see (above, left) that overly complex models (towards the left of the plot) tend to copy their training set, and simple models (towards the right of the plot) tend to underfit, just as we might expect. Furthermore, $C_T = 0$ approximately coincides with the maximum ELBO, the VAE’s likelihood lower bound. For comparison, take the generalization gap of the VAEs’ ELBO on the training and test sets (above, right). The gap remains large for both overly complex models ($d > 50$) and simple models ($d < 50$). With the ELBO being a lower bound to the likelihood, it is difficult to interpret precisely why this happens. Regardless, it is clear that the ELBO gap is a comparatively imprecise measure of overfitting.</p>
<p>While the VAEs exhibit increasing data-copying with model complexity <em>on average</em>, most of them have cells that are over- and underfit. Poking into the individual cells $\pi \in \Pi$, we can take a look at the difference between a $Z_U^\pi \ll 0$ cell and a $Z_U^\pi \gg 0$ cell:</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/VAE_cells.png" width="90%" /></p>
<h5 style="text-align: center;"> A VAE's datacopied (left) vs. underfit (right) cells of the MNIST instance space.</h5>
<p>The two strips exhibit two regions of the same VAE. The bottom row of each shows individual generated samples from the cell, and the top row shows their training nearest neighbors. We immediately see that the data-copied region (left, $Z_U^\pi = -8.54$) practically produces blurry replicas of its training nearest neighbors, while the underfit region (right, $Z_U^\pi = +3.3$) doesn’t appear to produce samples that look like any training image.</p>
<p>Extending these tests to a more complex and practical domain, we check the ImageNet-trained <a href="https://arxiv.org/abs/1809.11096">BigGAN</a> model for data-copying. Being a conditional GAN that can output images of any single ImageNet12 class, we condition on three separate classes and treat them as three separate models: Coffee, Soap Bubble, and Schooner. Here, it is not so simple to re-train GANs of varying degrees of complexity as we did before with VAEs. Instead, we modulate the model’s ‘truncation threshold’: a level beyond which all inputs are resampled. A larger truncation threshold allows for higher variance latent input, and thus higher variance outputs.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/GAN_overfitting.png" width="60%" /></p>
<h5 style="text-align: center;"> BigGan, an ImageNet12 conditional GAN, appears to significantly data-copy for all but its highest truncation levels, which are said to trade off between variety and fidelity. </h5>
<p>Low truncation thresholds restrict the model to producing samples near the mode – those it is most confident in. However it appears that in all image classes, this also leads to significant data copying. Not only are the samples less diverse, but they hang closer to the training set than they should. This contrasts with the BigGAN authors’ suggestion that truncation level trades off between ‘variety and fidelity’. It appears that it might trade off between ‘copying and not copying’ the training set.</p>
<p>Again, even the least copying models with maximized truncation (=2) exhibit data-copying in <em>some</em> cells $\pi \in \Pi$:</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/GAN_cells.png" width="95%" /></p>
<h5 style="text-align: center;"> Examples from BigGan's data-copied (left) and underfit (right) cells of the 'coffee' (top) and 'soap bubble' (bottom) classes.</h5>
<p>The left two strips show data-copied cells of the coffee and bubble instance spaces (low $Z_U^\pi$), and the right two strips show underfit cells (high $Z_U^\pi$). The bottom row of each strip shows a subset of generated images from that cell, and the top row training images from the cell. To show the diversity of the cell, these are not necessarily the generated samples’ training nearest neighbors as they were in the MNIST example.</p>
<p>We see that the data-copied cells on the left tend to confidently produce samples of one variety, that linger too closely to some specific examples it caught in the training set. In the coffee case, it is the teacup/saucer combination. In the bubble case, it is the single large suspended bubble with blurred background. Meanwhile, the slightly underfit cells on the right arguably perform better in a ‘generative’ sense. The samples, albeit slightly distorted, are more original. According to the inception space distance metric, they hug less closely to the training set.</p>
<h3 id="data-copying-is-a-real-failure-mode-of-generative-models">Data-copying is a real failure mode of generative models</h3>
<p>The moral of these experiments is that data-copying indeed occurs in contemporary generative models. This failure mode has significant consequences for user privacy and for model generalization. With that said, it is a failure mode not identified by most prominent generative model tests in the literature today.</p>
<ul>
<li>
<p>Data-copying is <em>orthogonal to</em> over-representation; both should be tested when designing and training generative models.</p>
</li>
<li>
<p>Data-copying is straightforward to test efficiently when equipped with a decent distance metric.</p>
</li>
<li>
<p>Having identified this failure mode, it would be interesting to see modeling techniques that actively try to minimize data-copying in training.</p>
</li>
</ul>
<p>So be sure to start probing your models for data-copying, and don’t be afraid to venture off-recipe every once in a while!</p>
<h3 id="more-details">More Details</h3>
<p>Check out <a href="https://arxiv.org/abs/2004.05675">our AISTATS paper on arxiv</a>, and <a href="https://github.com/casey-meehan/data-copying">our data-copying test code on GitHub</a>.</p><a href='mailto:cmeehan@eng.ucsd.edu'>Casey Meehan</a>What does it mean for a generative model to overfit? We formalize the notion of 'data-copying', when a generative model produces only slight variations of the training set and fails to express the diversity of the true distribution. To catch this form of overfitting, we propose a three-sample hypothesis test that is entirely model agnostic. Our experiments indicate that several standard tests condone data-copying, and contemporary generative models like VAEs and GANs can commit data-copying.The Power of Comparison: Reliable Active Learning2020-07-27T17:00:00+00:002020-07-27T17:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/07/27/rel-comp<p>With the surge of widely available massive online datasets, we have become <em>very</em> good at building algorithms which distinguish between the following:</p>
<p style="text-align: center;"><img src="/assets/2020-07-27-rel-comp/cat-dog.png" width="90%" /></p>
<p>But what about solving classification problems which may have large unlabeled datasets, but whose labeling requires expert advice? What about situations like examining an MRI scan or recognizing a pedestrian, where a mistake in classification could be the difference between life and death? In these situations, it would be great if we could build a classification algorithm with the following properties:</p>
<ol>
<li>The algorithm requires <strong>very few labeled data points</strong> to train.</li>
<li>The algorithm <strong>never makes a mistake</strong>.</li>
</ol>
<p>This raises an obvious question: is building an efficient algorithm with such strong guarantees even possible? It turns out the answer is <strong>yes</strong>—just not in the standard learning model. In <a href="https://arxiv.org/pdf/1907.03816.pdf">recent joint work</a> with <a href="https://cseweb.ucsd.edu/~dakane/">Daniel Kane</a> and <a href="https://cseweb.ucsd.edu/~slovett/home.html">Shachar Lovett</a>, we show that while it is impossible to have such guarantees using only the <em>labels</em> of data points, achieving the goal becomes easy if you give the algorithm <strong>a little more power</strong>.</p>
<h3 id="comparison-queries">Comparison Queries</h3>
<p>Our work explores the additional power of algorithms which are allowed to <strong>compare data</strong>. In slightly more detail, imagine points in $\mathbb{R}^d$ are labeled by a linear classifier: that is $\text{sign}(f)$ for some affine linear function $f(x) = \langle x, w \rangle + b$.</p>
<p style="text-align: center;"><img src="/assets/2020-07-27-rel-comp/linear-classifier.png" width="80%" /></p>
<p>A comparison between two data points $x,y \in \mathbb{R}^d$ asks which point is <em>closer</em> to the decision boundary (e.g. the purple line in Figure 2). More formally, a <strong>comparison query</strong> asks:
\[
f(x) - f(y) \overset{?}{\geq} 0.
\]
On the other hand, a standard <strong>label query</strong> on $x \in \mathbb{R}^d$ only asks which <em>side</em> of the decision boundary $x$ lies on, i.e.
\[
f(x) \overset{?}{\geq} 0.
\]
Comparison queries are natural from a human perspective—think how often throughout your day you compare objects, ideas, or alternatives. In fact, it has even been shown that in many practical circumstances, we may be <a href="https://link.springer.com/chapter/10.1007/978-3-642-14125-6_4#:~:text=The%20learning%20by%20pairwise%20comparison,preference%20modeling%20and%20decision%20making.&text=We%20explain%20how%20to%20approach,within%20the%20framework%20of%20LPC.">better at accurately comparing objects</a> than we are at labeling them! Since we are allowing our algorithm access to an expert (possibly human) oracle, it makes sense to allow the algorithm to ask the expert to compare data.</p>
<h3 id="the-algorithm">The Algorithm</h3>
<p>How can we use comparisons to learn with <strong>few queries</strong> and <strong>no mistakes</strong>? It turns out that a remarkably simple algorithm suffices! Imagine you are given a finite sample $S \subset \mathbb{R}^d$, and would like to find the label of every point in $S$ without making any errors. Consider the following basic procedure:</p>
<ol>
<li>Draw a small subsample $S’ \subset S$</li>
<li>Send $S’$ to the oracle to learn both labels and comparisons</li>
<li>Remove points from $S$ whose labels are learned in Step 2, and repeat.</li>
</ol>
<p>How exactly does Step 2 “learn labels”? Formally, this is done through a linear program whose constraints are given by the oracle responses on $S’$. Informally, this has a nice geometric interpretation. Let’s consider first the two dimensional case, originally studied by <a href="https://arxiv.org/abs/1704.03564">Kane, Lovett, Moran, and Zhang</a> (KLMZ). Figure 3 shows how comparisons allow us to infer the labels of points in $S$ by building cones (one red, one blue) based on the query results on $S’$. In essence, comparison queries allow us to find the points in $S’$ closest to the decision boundary (Figure 3(c)), which we call <strong>minimal</strong>. By drawing a cone stemming from a minimal point to others of the same label (Figure 3(d)), we can infer that every point <em>inside</em> the cone must share the same label as well (Figure 3(e)).</p>
<p style="text-align: center;"><img src="/assets/2020-07-27-rel-comp/Infer.png" width="90%" /></p>
<p>Why does this process satisfy our guarantees? Let’s first discuss why we never mislabel a point, which follows from the fact that our cones stem from minima. Because our classifier is linear, this guarantees that the edges of our cones do not cross the decision boundary (i.e. change labels). Thus, the label of any point inside such a cone must be the same as its base point! Notice that this only remains true so long as the base point of our cone is minimal, which explains why comparison queries, the mechanism through which we find minima, are crucial to the algorithm.</p>
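To make the procedure concrete, here is a small numpy-only simulation of the two-dimensional case. This is a sketch under our own simplifications, not the paper’s implementation: a hidden affine $f$ plays the oracle, and cone membership is checked with small $2\times 2$ linear systems in place of the linear program described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden affine classifier f(x) = <w, x> + b; the learner only sees oracle
# answers: labels sign(f(x)) and comparisons between same-label points.
w, b = np.array([1.0, -0.5]), 0.2
f = lambda x: x @ w + b

def in_cone(x, apex, others, tol=1e-9):
    """2-D cone membership: x lies in the cone stemming from `apex` iff
    x - apex is a nonnegative combination of a pair of directions apex -> p."""
    d = x - apex
    if np.allclose(d, 0):
        return True
    dirs = [p - apex for p in others]
    for i in range(len(dirs)):
        for j in range(i + 1, len(dirs)):
            M = np.column_stack([dirs[i], dirs[j]])
            if abs(np.linalg.det(M)) < 1e-12:
                continue
            lam = np.linalg.solve(M, d)
            if lam.min() >= -tol:
                return True
    return False

S = rng.uniform(-1, 1, size=(400, 2))
S = S[np.abs(S @ w + b) > 0.05]   # keep points bounded away from the boundary

unlabeled = set(range(len(S)))
inferred = {}
while unlabeled:
    sub = rng.choice(sorted(unlabeled), size=min(8, len(unlabeled)), replace=False)
    for i in sub:                 # label queries on the subsample
        inferred[int(i)] = int(np.sign(f(S[i])))
        unlabeled.discard(int(i))
    for sign in (+1, -1):
        cls = [int(i) for i in sub if inferred[int(i)] == sign]
        if len(cls) < 3:
            continue
        # Comparisons locate the minimal point of the class (closest to the
        # boundary); |f| is used here only to simulate those oracle answers.
        apex = min(cls, key=lambda i: abs(f(S[i])))
        others = [S[i] for i in cls if i != apex]
        for i in list(unlabeled):  # points inside the cone are inferred for free
            if in_cone(S[i], S[apex], others):
                inferred[i] = sign
                unlabeled.discard(i)
```

Because each cone stems from a minimal point of its class, every inferred label is guaranteed correct, mirroring the argument above.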
<p>The second guarantee, ensuring that we make few queries overall, is a bit more subtle, and requires the combinatorial theory of <em>inference dimension</em>.</p>
<h3 id="inference-and-average-inference-dimension">Inference and Average Inference Dimension</h3>
<p>Inference dimension is a complexity parameter introduced by KLMZ to measure how large a subsample $S’$ must be in order to learn a constant fraction of $S$.</p>
<div class="definition">
Given a set $X\subseteq \mathbb{R}^d$ and a family of classifiers $H$, the inference dimension of the pair $(X,H)$ is the smallest $k$ such that any sample $S'$ of size $k$ contains a point $x$ whose label may be inferred by queries on $S' \setminus \{x\}$. In other words, $x$ lies in a cone stemming from some minimal point to other points of the same label (as seen above in Figure 3, or below in Figure 4)
</div>
<p>Let’s take a look at an example: linear classifiers in two dimensions. Figure 4 shows that this class has inference dimension at most 7. Why? A sample of size 7 always contains at least 4 points with the same label, and the label of one of these points can always be inferred from labels and comparisons on the rest!</p>
<p style="text-align: center;"><img src="/assets/2020-07-27-rel-comp/inf-dim.png" width="90%" /></p>
<p>KLMZ show that by picking the size of $S’$ to be just a constant times larger than the inference dimension, the resulting cones will usually cover a constant fraction of our distribution. In other words, in two dimensions, every round of our algorithm infers a constant fraction of the remaining points, which means we only need $O(\log(|S|))$ rounds before our algorithm has labeled everything!</p>
<p>Unfortunately, in 3+ dimensions, linear classifiers become harder to deal with—indeed KLMZ show that their inference dimension is infinite. In <a href="https://arxiv.org/pdf/1907.03816.pdf">our recent work</a>, we circumvent this issue by applying a standard assumption from the data science and learning literature: we assume that our sample $S$ is drawn from some restricted range of natural distributions. The core idea of our analysis is then based on a simple lemma, which informally states that even if $(X,H)$ has infinite inference dimension, samples from $X$ may still have small inference dimension with high probability!</p>
<div class="lemma">
If the probability that a sample $S$ of size $k$ contains no point which may be inferred from the rest is at most $g(k)$, then a finite sample of size $n$ has inference dimension at most $k$ with probability:
</div>
<p>\[
\Pr[\text{Inference dimension of} \ (S,H) \leq k] \geq 1-{ n \choose k}g(k).
\]</p>
<p>The main technical difficulty then becomes showing that $g(k)$, which we term <em>average inference dimension</em>, is indeed small over natural distributions. We confirm that this is the case for the class of s-concave distributions $(s \geq -\frac{1}{2d+3})$, a wide ranging generalization of Gaussians that includes fatter-tailed distributions like the Pareto and t-distribution.</p>
<div class="theorem">
If $S \subseteq \mathbb{R}^d$, $|S|=k$, is drawn from an s-concave distribution, the probability that $S$ contains no point which may be inferred from the rest is at most:
</div>
<p>\[
g(k) \leq 2^{-\tilde{\Omega}\left(\frac{k^2}{d}\right)}.
\]</p>
<p>Plugging this into our observation, we see that as long as $S$ is reasonably large, it will have inference dimension $\tilde{O}(d\log(|S|))$ with high probability! This allows us to efficiently learn the labels of $S$ through the algorithm we discussed before, so long as the distribution is s-concave. As a corollary, we get the following result:</p>
<div class="theorem">
Using comparisons, the process described in Figure 3 learns the labels of a sample $S \subset \mathbb{R}^d$ with respect to any linear classifier in only
</div>
<p>\[
\tilde{O}(d\log(|S|)^2)
\]
<em>expected queries, as long as $S$ is drawn from an s-concave distribution.</em></p>
<p>While we have focused in the above on learning finite samples, it turns out satisfying similar guarantees over all of $\mathbb{R}^d$ (under natural distributions) is also possible via the same argument. In this case, rather than trying to learn the label of every point, we allow our algorithm to respond “I don’t know” on an $\varepsilon$ fraction of samples. This type of algorithm goes by many names in the literature, perhaps the catchiest of which is a <a href="http://icml2008.cs.helsinki.fi/papers/627.pdf">“Knows What It Knows”</a> (KWIK) learner. The above then (with a bit of work) more or less translates into a KWIK-learner<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup> that uses only $d\log(1/\varepsilon)^2$ calls to the oracle.</p>
<h3 id="lower-bound">Lower Bound</h3>
<p>Why doesn’t this algorithm work with only labels? It’s long been known that even in two dimensions, <a href="https://cseweb.ucsd.edu/~dasgupta/papers/sample.pdf">achieving these learning guarantees is impossible</a> for certain adversarial distributions such as $S^1$. Let’s take a look at how our results match up on a less adversarial example with a long history in learning theory: the d-dimensional unit ball.</p>
<div class="theorem">
Using comparisons, the process described in Figure 3 KWIK-learns linear classifiers over the d-dimensional unit ball in only
</div>
<p>\[
\tilde{O}(d\log^2(1/\varepsilon))
\]
oracle calls. On the other hand, using only labels takes at least
\[
\left (\frac{1}{\varepsilon}\right)^{\Omega(d)}
\]
oracle calls<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>This simple example shows <strong>the exponential power of comparisons</strong>: moving KWIK-learning from intractable to highly efficient. As a final note, implementing the algorithm amounts to running a series of small linear programs, and as a result is computationally efficient as well, only taking about $\text{poly}\left(\frac{1}{\varepsilon},d\right)$ time.</p>
<p>It remains to be seen whether this type of efficient, comparison-based KWIK-learning will be useful in practice. Comparison queries have <a href="https://arxiv.org/abs/1206.4674">already been shown</a> to provide practical improvements over labels for similar learning problems, and have been used to great effect in other areas such as <a href="https://dl.acm.org/doi/10.1007/11823865_4">recommender systems</a>, <a href="https://papers.nips.cc/paper/4381-randomized-algorithms-for-comparison-based-search">search</a>, and <a href="https://arxiv.org/abs/1606.08842">ranking</a> as well. Since we have recently extended our results to more realistic noisy scenarios in <a href="https://arxiv.org/abs/2001.05497">joint work</a> with <a href="https://gomahajan.github.io/">Gaurav Mahajan</a>, we are optimistic that our techniques will remain as powerful in practice as they are in theory.</p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>It’s worth noting that we have abused notation here. Formally, our algorithms do not fall into the KWIK-model, but are built in a similar learning-theoretic framework called <a href="https://people.csail.mit.edu/rivest/pubs/RS88b.pdf">RPU-learning</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>This follows from a standard cap packing argument—we can divide up the ball into many disjoint caps, and note that any KWIK-learner must query a point in at least half of them to be successful. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div><a href='http://cseweb.ucsd.edu/~nmhopkin/'>Max Hopkins</a>In the world of big data, large but costly to label datasets dominate many fields. Active learning, a semi-supervised alternative to the standard PAC-learning model, was introduced to explore whether adaptive labeling could learn concepts with exponentially fewer labeled samples. Unfortunately, it is well known that in standard models, active learning provides little improvement over passive learning for the foundational classes such as linear separators. We discuss how empowering the learner to compare points resolves not only this issue, but also allows us to build efficient algorithms which make no errors at all!Adversarial Robustness for Non-Parametric Classifiers2020-07-20T17:00:00+00:002020-07-20T17:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/07/20/adversarial-pruning<p>In a previous <a href="/jekyll/update/2020/05/04/adversarial-robustness-through-local-lipschitzness.html">post</a>,
we discussed the relationship between accuracy and robustness
for separated data.
A classifier trained on $r$-separated data can be both accurate and robust with radius $r$.
What if the data are not $r$-separated?
In our <a href="https://arxiv.org/abs/1906.03310">recent paper</a>, we look at how to deal with this case.</p>
<p>Many datasets with natural data like images or audio are $r$-separated [<a href="https://arxiv.org/abs/2003.02460">1</a>].
In contrast, datasets with artificially-extracted features are often not.
Non-parametric methods like nearest neighbors,
random forests, etc. perform well on these kinds of datasets.
In this post, we focus on the discussion of non-parametric methods on non-$r$-separated datasets.</p>
<p>We first present a defense algorithm – adversarial pruning –
that can increase the robustness of many non-parametric methods.
Then we dive into how adversarial pruning deals with non-$r$-separated data.
Finally, we present a generic attack algorithm that works well across many non-parametric methods
and use it to evaluate adversarial pruning.</p>
<h3 id="defense">Defense</h3>
<p>Let us start by visualizing the
decision boundaries of a $1$-nearest neighbor ($1$-NN) and a random forest (RF) classifier on a toy dataset.</p>
<p style="text-align: center;"><img src="/assets/2020-07-20-adversarial-pruning/moon_1nn.png" width="28%" />
<img src="/assets/2020-07-20-adversarial-pruning/moon_rf.png" width="28%" /></p>
<p>We see that the decision boundaries are highly non-smooth, and lie close to many data points,
resulting in a non-robust classifier.
This is caused by the fact that many differently-labeled examples are near each
other.
Next, let us consider a modified dataset in which the red and blue examples are more separated.</p>
<p style="text-align: center;"><img src="/assets/2020-07-20-adversarial-pruning/moon_1nn_ap30.png" width="28%" />
<img src="/assets/2020-07-20-adversarial-pruning/moon_rf_ap30.png" width="28%" /></p>
<p>Notice that the boundaries become smoother as examples move
further away from the boundaries.
This makes the classifier more robust as the predicted label stays the same
if data are perturbed a little.</p>
<h4 id="adversarial-pruning">Adversarial Pruning</h4>
<p>From these figures, we can see that these non-parametric methods are
more robust when data are better separated.
Given a dataset, to make it more separated, we need to remove examples.
To preserve information in the dataset, we do not want to remove too many examples.
We design our defense algorithm to minimally remove examples from the dataset
so that differently-labeled examples are well-separated from each other.
After this modification, we can train a non-parametric classifier on it.
We call this defense algorithm <em>adversarial pruning (AP)</em>.</p>
<p>More formally, given a robustness radius $r$ and a training set $\mathcal{S}$, AP computes
a maximum subset $\mathcal{S}^{AP} \subseteq \mathcal{S}$ such that differently-labeled
examples in $\mathcal{S}^{AP}$ have distance at least $2r$.
We show that known graph algorithms can be used to efficiently compute $\mathcal{S}^{AP}$.
We build a graph $G=(V, E)$ as follows.
Each training example is a vertex in the graph,
and we connect each pair of differently-labeled examples (vertices) $\mathbf{x}$ and $\mathbf{x}'$ with an edge whenever $\|\mathbf{x} - \mathbf{x}'\| \leq 2r$.
Computing $\mathcal{S}^{AP}$ then reduces to removing as few vertices as possible so that no edges remain,
which is exactly the <a href="https://mathworld.wolfram.com/VertexCover.html">minimum vertex cover</a> problem.
For binary classification, the graph $G$ is bipartite and
standard algorithms like the <a href="https://en.wikipedia.org/wiki/Hopcroft%E2%80%93Karp_algorithm">Hopcroft–Karp algorithm</a>
solve it efficiently.
With multi-class classification, minimum vertex cover is NP-hard in general, and
<a href="https://networkx.github.io/documentation/stable/_modules/networkx/algorithms/approximation/vertex_cover.html">approximation algorithms</a>
must be used.</p>
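As a sketch of how this reduction plays out in code, here is a numpy-only implementation for binary labels. The function name is ours, and a real implementation would use an off-the-shelf Hopcroft–Karp routine (e.g. from networkx); this version finds a maximum matching with Kuhn’s augmenting paths and extracts the minimum vertex cover via König’s theorem.

```python
import numpy as np

def adversarial_prune(X, y, r):
    """Adversarial pruning (AP) for binary labels y in {0, 1}: remove a minimum
    vertex cover of the conflict graph (edges between differently-labeled points
    within distance 2r), so every surviving cross-label pair is > 2r apart."""
    left = [i for i in range(len(X)) if y[i] == 0]
    right = [i for i in range(len(X)) if y[i] == 1]
    adj = {i: [j for j in right if np.linalg.norm(X[i] - X[j]) <= 2 * r]
           for i in left}

    match = {}                      # right vertex -> matched left vertex
    def augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False
    for u in left:                  # Kuhn's algorithm: maximum matching
        augment(u, set())

    # Koenig's theorem: vertices alternating-reachable from unmatched left ones
    matched_left = set(match.values())
    Z = {u for u in left if u not in matched_left}
    frontier = list(Z)
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v not in Z:
                Z.add(v)
                w = match.get(v)    # follow the matching edge back to the left
                if w is not None and w not in Z:
                    Z.add(w)
                    frontier.append(w)
    cover = [u for u in left if u not in Z] + [v for v in right if v in Z]
    return sorted(set(range(len(X))) - set(cover))
```

On the toy line dataset $X = [0, 0.5, 3, 3.4, 10]$ with labels $[0, 1, 0, 1, 0]$ and $r = 0.5$, this removes examples 0 and 2, leaving a $2r$-separated set.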
<h4 id="theoretical-justification">Theoretical Justification</h4>
<p>Adversarial pruning also has a nice theoretical interpretation:
it can be viewed as a finite-sample approximation to the optimally robust and accurate classifier.
To understand this, first, let us try to understand what the goal of robust classification is.
We assume the data is sampled from a distribution $\mu$ on $\mathcal{X} \times [C]$, where $\mathcal{X}$ is the feature
space and $C$ is the number of classes.
Normally, the ultimate limit of accurate classification is the Bayes optimal classifier which maximizes the accuracy on the underlying data distribution.
But the Bayes optimal may not be very robust.</p>
<p>Let us look at the figure below. The blue curve is the decision boundary of the Bayes optimal classifier.
We see that this blue curve is close to the data distribution and thus not the most robust.
An alternative decision boundary is the black curve, which is further away from the distribution while still being accurate.</p>
<figure class="image" style="text-align: center;">
<span>
<img src="/assets/2020-07-20-adversarial-pruning/r-opt.png" width="60%" style="margin: 0 auto" />
</span>
</figure>
<p>We define the astuteness of a classifier as its accuracy on examples where it is robust with
radius $r$.
The objective of a robust classifier is to maximize the
astuteness under $\mu$, which is the probability that the classifier is both $r$-robust and accurate for a new sample $(\mathbf{x}, y)$ [<a href="https://arxiv.org/abs/1706.06083">1</a>, <a href="https://arxiv.org/abs/1706.03922">2</a>].</p>
<div class="definition" style="overflow-x: auto;">
Let $\mathbb{B}(\mathbf{x}, r)$ be the ball with radius $r$ around $\mathbf{x}$ and
$S_j(f,r) := \{\mathbf{x} \in \mathcal{X} \mid f(\mathbf{x}') = j \ \forall \mathbf{x}' \in \mathbb{B}(\mathbf{x}, r)\}$.
For distribution $\mu$ on $\mathcal{X} \times [C]$, the astuteness is defined as
$$
ast_\mu(f,r) = \sum_{j=1}^{C} \int_{\mathbf{x} \in S_j(f,r)} Pr(y = j \mid \mathbf{x}) d \mu.
$$
</div>
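In practice, astuteness can be estimated on a test set: record whether each point is correctly classified, and use an attack to estimate each point’s distance to its nearest adversarial example. A minimal sketch (the function name is ours; note that an attack only upper-bounds the true robust radius, so the estimate is optimistic unless the attack is exact):

```python
import numpy as np

def empirical_astuteness(correct, adv_dist, r):
    """Fraction of test points that are both correctly classified and whose
    nearest adversarial example found by the attack is farther than r.
    Since an attack distance only upper-bounds the true robust radius, this
    is an optimistic estimate unless the attack is exact."""
    correct = np.asarray(correct, dtype=bool)
    adv_dist = np.asarray(adv_dist, dtype=float)
    return float(np.mean(correct & (adv_dist > r)))
```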
<p>Next, we present the $r$-optimal classifier that achieves optimal astuteness.
Comparing it with the classic Bayes optimal classifier, which
achieves optimal accuracy, shows that the $r$-optimal classifier is a <em>robust analogue of the Bayes optimal classifier</em>.</p>
<div style="width: 100%; overflow-x: auto;">
<table style="">
<tr>
<th>$r$-optimal classifier (black curve)</th>
<th>Bayes optimal classifier (blue curve)</th>
</tr>
<tr>
<td>Optimal astuteness</td>
<td>Optimal accuracy</td>
</tr>
<tr>
<td>
\begin{split}
\max_{S_1,\ldots, S_c} & \sum_{j=1}^{c} \int_{\mathbf{x} \in S_j} Pr(y = j \mid \mathbf{x}) d\mu \\
\mbox{ s.t. } \quad & d(S_j, S_{j'}) \geq 2r \quad \forall j \neq j' \\
& d(S_j, S_{j'}) := \min_{u \in S_j, v \in S_{j'}} \| u-v\|_p
\end{split}
</td>
<td>
\begin{split}
\max_{S_1,\ldots, S_c} & \sum_{j=1}^{c} \int_{\mathbf{x} \in S_j} Pr(y = j \mid \mathbf{x}) d\mu \\
\end{split}
</td>
</tr>
</table>
</div>
<p>We observe that
AP can be interpreted as a finite-sample approximation to the $r$-optimal classifier.
If the $S_j$ are sets of examples, then
the solution to the $r$-optimal objective consists of maximum subsets of the
training data in which differently-labeled examples are $2r$ apart.
As long as the training set $\mathcal{S}$ is representative of $\mu$, these subsets ($S_j$) approximate
the optimal subsets ($S^*_j$).
Hence, we posit that non-parametric methods trained
on $\mathcal{S}^{AP}$ should approximate the $r$-optimal classifier.</p>
<p>For more about the $r$-optimal classifier,
please refer to this <a href="https://arxiv.org/abs/2003.06121">paper</a>.</p>
<h4 id="adversarial-pruning-generates-r-separated-datasets">Adversarial pruning generates $r$-separated datasets</h4>
<p>What AP does is remove the minimum number of examples so that the dataset
becomes $r$-separated.
In our previous
<a href="/jekyll/update/2020/05/04/adversarial-robustness-through-local-lipschitzness.html">post</a>,
we show that there is no intrinsic trade-off between robustness and accuracy when the
dataset is $r$-separated.
This means that there exists a classifier that achieves
perfect robustness and accuracy.
However, the solution may make mistakes on the examples removed by AP, so we can
think of the removed examples as the trade-off between robustness and accuracy.</p>
<h3 id="evaluating-ap-an-attack-method">Evaluating AP: An Attack Method</h3>
<p>In this section, we provide an attack algorithm to evaluate the robustness
of non-parametric methods.
For parametric classifiers such as neural networks, generic gradient-based attacks exist.
Our goal is to develop an analogous general attack method, which applies to and
works well for multiple non-parametric classifiers.</p>
<p>The attack algorithm is called region-based attack (RBA).
Given an example $\mathbf{x}$, RBA finds the closest point to $\mathbf{x}$ with a different prediction;
in other words, RBA is an optimal attack.
In addition, RBA applies to many non-parametric methods, while
many prior attacks for non-parametric methods
[<a href="https://arxiv.org/abs/1605.07277">1</a>, <a href="https://arxiv.org/abs/1509.07892">2</a>] are classifier-specific:
<a href="https://arxiv.org/abs/1605.07277">1</a> only applies to $1$-NN, and
<a href="https://arxiv.org/abs/1509.07892">2</a> only applies to tree-based classifiers.</p>
<p style="text-align: center;"><img src="/assets/2020-07-20-adversarial-pruning/moon_1nn_voronoi.png" width="30%" />
<img src="/assets/2020-07-20-adversarial-pruning/moon_dt_regions.png" width="30%" />
<img src="/assets/2020-07-20-adversarial-pruning/region_pert_2.png" width="28%" /></p>
<p>To understand how RBA works, let us look at the figures above, which show
the decision boundaries of $1$-NN and a decision tree on a toy dataset.
We see that the feature space is divided into many regions, where
examples in the same region receive the same prediction
(so we can assign a label to each region).
These regions are convex for nearest neighbor and tree-based classifiers.</p>
<p>Suppose the example we want to attack is $\mathbf{x}$ and $y$ is its label.
RBA works as follows.
If we can find the region $P_i$ closest to $\mathbf{x}$ whose label is not $y$,
then the point in $P_i$ closest to $\mathbf{x}$ is the optimal adversarial example.
RBA finds this region by iterating over every region labeled differently from $y$.
More formally, given the regions and their corresponding labels $(P_i, y_i)$, RBA solves
the following optimization problem:</p>
<div style="overflow-x: auto;">
\[
\underset{i : f(\mathbf{x}) \neq y_i }{\textcolor{red}{min}} \
\underset{\mathbf{x}_{adv} \in P_i}{\textcolor{ForestGreen}{min}} \|\mathbf{x} - \mathbf{x}_{adv}\|_p
\]
</div>
<p>The $\textcolor{red}{\text{outer $min$}}$ can be solved by iterating through all regions.
The $\textcolor{ForestGreen}{\text{inner $min$}}$ can be solved with standard linear programming when $p=1$ or $p=\infty$, and with quadratic programming when $p=2$.
When this optimization problem is solved exactly, we call the attack RBA-Exact.</p>
<p>Interestingly, concurrent works [<a href="https://arxiv.org/abs/1810.07481">1</a>,
<a href="https://arxiv.org/abs/1809.03008">2</a>, <a href="https://arxiv.org/abs/1711.07356">3</a>,
<a href="https://arxiv.org/abs/1903.08778">4</a>] have also shown that the decision regions of
ReLU networks are also decomposable into convex regions and developed attacks based on this property.</p>
<p><strong>Speeding up RBA.</strong>
Different non-parametric methods divide the feature space into different numbers of regions.
When attacking $k$-NN, there are $O(\binom{N}{k})$ regions, where $N$ is the number of training
examples.
When attacking an RF, the number of regions grows exponentially with the number of trees.
Solving RBA-Exact is computationally infeasible when the number of regions is this large.</p>
<p>We develop an approximate version of RBA (RBA-Approx.) to speed up the process and make our algorithm applicable
to real datasets.
We relax the $\textcolor{red}{\text{outer $min$}}$ by iterating over only a fixed number of regions based on
the following two criteria.
First, a region has to have at least one training example in it to be considered.
Second, if $\mathbf{x}_i$ is the training example in the region $P_i$, then the
regions with smaller $\|\mathbf{x}_i - \mathbf{x}\|_p$ are considered first
until we exceed the number of regions we want to search.
We found that empirically using these two criteria to search $50$ regions can find
adversarial examples very close to the target example.</p>
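To illustrate RBA-Approx. for $1$-NN with $p=2$ (where the regions are Voronoi cells), here is a numpy-only sketch. The function names are ours, and Dykstra’s alternating-projection method stands in for the quadratic program that would solve the $\textcolor{ForestGreen}{\text{inner }min}$ exactly:

```python
import numpy as np

def voronoi_halfspaces(X, i):
    """Halfspaces {z : a @ z <= b} whose intersection is the Voronoi cell of X[i]."""
    A, b = [], []
    for j in range(len(X)):
        if j != i:
            A.append(X[j] - X[i])
            b.append((X[j] @ X[j] - X[i] @ X[i]) / 2.0)
    return np.array(A), np.array(b)

def project_onto_cell(x, A, b, iters=1000):
    """Euclidean projection of x onto the cell via Dykstra's alternating
    projections (a substitute for the exact quadratic program when p = 2)."""
    z = np.array(x, dtype=float)
    corr = np.zeros((len(A), len(x)))        # Dykstra correction terms
    for _ in range(iters):
        for k in range(len(A)):
            y = z + corr[k]
            viol = max(0.0, (A[k] @ y - b[k]) / (A[k] @ A[k]))
            z = y - viol * A[k]              # project onto halfspace k
            corr[k] = y - z
    return z

def rba_approx_1nn(x, X, y, n_regions=50):
    """RBA-Approx. for 1-NN: search only the Voronoi cells of the n_regions
    differently-labeled training points closest to x."""
    x = np.asarray(x, dtype=float)
    pred = y[int(np.argmin(np.linalg.norm(X - x, axis=1)))]   # 1-NN prediction
    cand = sorted((i for i in range(len(X)) if y[i] != pred),
                  key=lambda i: np.linalg.norm(X[i] - x))
    best, best_dist = None, np.inf
    for i in cand[:n_regions]:
        z = project_onto_cell(x, *voronoi_halfspaces(X, i))
        if np.linalg.norm(z - x) < best_dist:
            best, best_dist = z, np.linalg.norm(z - x)
    return best, best_dist
```

On tiny examples the projections converge to the true closest points in each candidate cell, so the returned distance matches the exact inner min.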
<h3 id="empirical-results">Empirical Results</h3>
<p>We empirically evaluate the performance of our attack (RBA) and defense (AP) algorithms.</p>
<p><strong>Evaluation criteria for attacks.</strong>
We use the distance between an input $\mathbf{x}$ and its generated adversarial example
$\mathbf{x}_{adv}$ to evaluate the performance of the attack algorithm.
We call this criterion <em>empirical robustness (ER)</em>.
The lower the ER, the better the attack algorithm.
We calculate the average ER over correctly predicted test examples.</p>
<p><strong>Evaluation criteria for defenses.</strong>
To evaluate the performance of a defense algorithm,
we use the ratio of the distance between an input $\mathbf{x}$ and its closest
adversarial example being found before and after the defense algorithm is applied.
We call this criterion <em>defense score ($\text{defscore}$)</em>.
More formally,</p>
<div style="overflow-x: auto;">
$$
\text{defscore}(\mathbf{x}) =
\frac{\text{defended dist. from } \mathbf{x} \text{ to } \mathbf{x}_{adv}}{\text{undefended dist. from } \mathbf{x} \text{ to } \mathbf{x}_{adv}}
= \frac{\text{ER w/ defense}}{\text{ER w/o defense}}.
$$
</div>
<p>We calculate the average defscore over the correctly predicted test examples.
A larger defscore means that the attack algorithm needs a larger perturbation to change the prediction,
and thus that the defense algorithm is more effective.
In particular, if the defscore is larger than one, the defense is indeed making
the classifier more robust.</p>
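Both criteria are straightforward to compute once the attack distances are in hand; a minimal sketch (function names are ours):

```python
import numpy as np

def empirical_robustness(x, x_adv, p=2):
    """ER of one example: distance to the adversarial example the attack found."""
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(x_adv, float),
                                ord=p))

def average_defscore(er_with_defense, er_without_defense):
    """Average per-example ratio of ER after vs. before applying the defense,
    taken over the correctly predicted test examples."""
    ratios = np.asarray(er_with_defense, float) / np.asarray(er_without_defense, float)
    return float(np.mean(ratios))
```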
<p>We consider the following non-parametric classifiers:
$1$-nearest neighbor ($1$-NN), $3$-nearest neighbor ($3$-NN), and random forest (RF).</p>
<p><strong>Attacks.</strong>
To evaluate RBA, we compare with other attack algorithms for non-parametric methods.
<a href="https://arxiv.org/abs/1605.07277">Direct attack</a> is designed to attack nearest neighbor classifiers.
<a href="https://arxiv.org/abs/1807.04457">Black box attack (BBox)</a> is another algorithm that applies to many
non-parametric methods.
However, as a black-box attack, it does not use the
internal structure of the classifier.
To our knowledge, BBox is the state-of-the-art algorithm for attacking non-parametric methods.</p>
<div style="width: 100%; overflow-x: auto;">
<table style="font-size: 80%; ">
<tr>
<th colspan="1"></th>
<th colspan="4">$1$-NN</th>
<th colspan="3">$3$-NN</th>
<th colspan="2">RF</th>
</tr>
<tr>
<th>Dataset</th>
<th>Direct</th> <th>BBox</th> <th>RBA Exact</th> <th>RBA Approx.</th>
<th>Direct</th> <th>BBox</th> <th>RBA Approx.</th>
<th>BBox</th> <th>RBA Approx.</th>
</tr>
<tr>
<td>cancer</td>
<td>.223</td> <td>.364</td> <td style="font-weight: bold">.137</td> <td style="font-weight: bold">.137</td>
<td>.329</td> <td>.376</td> <td style="font-weight: bold">.204</td>
<td>.451</td> <td style="font-weight: bold">.383</td>
</tr>
<tr>
<td>covtype</td>
<td>.130</td> <td>.130</td> <td style="font-weight: bold">.066</td> <td>.067</td>
<td>.200</td> <td>.259</td> <td style="font-weight: bold">.108</td>
<td>.233</td> <td style="font-weight: bold">.214</td>
</tr>
<tr>
<td>diabetes</td>
<td>.074</td> <td>.112</td> <td style="font-weight: bold">.035</td> <td style="font-weight: bold">.035</td>
<td>.130</td> <td>.143</td> <td style="font-weight: bold">.078</td>
<td style="font-weight: bold">.181</td> <td>.184</td>
</tr>
<tr>
<td>halfmoon</td>
<td>.070</td> <td>.070</td> <td style="font-weight: bold">.058</td> <td style="font-weight: bold">.058</td>
<td>.105</td> <td>.132</td> <td style="font-weight: bold">.096</td>
<td>.182</td> <td style="font-weight: bold">.149</td>
</tr>
</table>
</div>
<p>From the results, we see that the RBA algorithm performs well across many non-parametric methods
and datasets (for results on more datasets and classifiers, please refer to our
<a href="https://arxiv.org/abs/1906.03310">paper</a>).
For $1$-NN, RBA-Exact performed the best, as expected, since it is optimal.
For $3$-NN and RF, RBA-Approx. also performed the best among the baselines.</p>
<p><strong>Defenses.</strong>
As baselines, we consider <a href="https://arxiv.org/abs/1706.03922">WJC</a> for defending $1$-NN and
<a href="https://arxiv.org/abs/1902.10660">robust splitting (RS)</a> for tree-based classifiers.
Another baseline is <a href="https://arxiv.org/abs/1706.06083">adversarial training (AT)</a>,
which has had much success with parametric classifiers.
To compute the defscore, we use RBA-Exact to attack $1$-NN and RBA-Approx. to attack $3$-NN and RF.</p>
<div style="width: 100%; overflow-x: auto;">
<table style="width:100%; font-size: 80%;">
<tr>
<th colspan="1"></th>
<th colspan="3">$1$-NN</th>
<th colspan="2">$3$-NN</th>
<th colspan="3">RF</th>
</tr>
<tr>
<th>Dataset</th>
<th>AT</th> <th>WJC</th> <th>AP</th>
<th>AT</th> <th>AP</th>
<th>AT</th> <th>RS</th> <th>AP</th>
</tr>
<tr>
<td>cancer</td>
<td>0.82</td> <td>1.05</td> <td style="font-weight: bold">1.41</td>
<td>1.06</td> <td style="font-weight: bold">1.39</td>
<td>0.87</td> <td style="font-weight: bold">1.54</td> <td>1.26</td>
</tr>
<tr>
<td>covtype</td>
<td>0.61</td> <td style="font-weight: bold">4.38</td> <td style="font-weight: bold">4.38</td>
<td>0.88</td> <td style="font-weight: bold">3.31</td>
<td>1.02</td> <td>1.01</td> <td style="font-weight: bold">2.13</td>
</tr>
<tr>
<td>diabetes</td>
<td>0.83</td> <td style="font-weight: bold">4.69</td> <td style="font-weight: bold">4.69</td>
<td>0.87</td> <td style="font-weight: bold">2.97</td>
<td>1.19</td> <td>1.25</td> <td style="font-weight: bold">2.22</td>
</tr>
<tr>
<td>halfmoon</td>
<td>1.05</td> <td>2.00</td> <td style="font-weight: bold">2.78</td>
<td>0.93</td> <td style="font-weight: bold">1.92</td>
<td>1.04</td> <td>1.01</td> <td style="font-weight: bold">1.82</td>
</tr>
</table>
</div>
<p>From the table, we see that AP performs well across different classifiers.
AP always achieves a defscore above $1.0$, which means the classifier becomes more robust after the defense.
This shows that AP is applicable to many non-parametric classifiers, as opposed to
WJC and RS, which are classifier-specific defenses.
AT performs poorly for non-parametric classifiers (this is in line with previous
<a href="https://arxiv.org/abs/1902.10660">findings</a>).
These results demonstrate that AP can serve as a good baseline defense for newly designed non-parametric
classifiers.</p>
<h3 id="conclusion">Conclusion</h3>
<p>In this blog post, we considered adversarial examples for non-parametric
classifiers and presented a generic defense and a generic attack.
The defense algorithm – adversarial pruning – bridges the gap between
$r$-separated and non-$r$-separated data by removing the minimum number of examples
needed to make the data well-separated.
Adversarial pruning can be interpreted as a finite-sample approximation to the
$r$-optimal classifier, which is the most robust classifier under attack radius $r$.
The attack algorithm – region-based attack – finds the closest adversarial example
and hence achieves the optimal attack.
On the experimental side, we showed that both algorithms perform well across
multiple non-parametric classifiers.
They can serve as good baselines for evaluating the robustness of newly designed
non-parametric classifiers.</p>
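<p>The adversarial pruning step recapped above can be sketched in a few lines of numpy. The paper solves the minimum-removal problem exactly; the snippet below substitutes a simple greedy heuristic (remove whichever point participates in the most cross-label conflicts), so it is an illustration of the idea rather than the paper's algorithm. All names and the toy data are our own.</p>

```python
import numpy as np

def adversarial_pruning(X, y, r):
    """Greedy sketch: repeatedly remove the point involved in the most
    conflicts (differently-labeled pairs closer than 2r in l_inf) until
    the remaining data is 2r-separated."""
    keep = np.ones(len(X), dtype=bool)
    # Pairwise l_inf distances and the cross-label conflict matrix.
    dists = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)
    conflict = (dists < 2 * r) & (y[:, None] != y[None, :])
    while True:
        counts = (conflict & keep[None, :] & keep[:, None]).sum(axis=1)
        if counts.max() == 0:
            break
        keep[np.argmax(counts)] = False
    return keep

# Toy 1-D data: the first two points conflict at radius r = 0.1.
X = np.array([[0.0], [0.05], [1.0], [2.0]])
y = np.array([0, 1, 1, 0])
keep = adversarial_pruning(X, y, r=0.1)
```

After pruning, any classifier trained on <code>X[keep]</code> sees data that is well-separated at the chosen radius.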
<h3 id="more-details">More Details</h3>
<p>See <a href="https://arxiv.org/abs/1906.03310">our paper on arxiv</a> or <a href="https://github.com/yangarbiter/adversarial-nonparametrics">our repository</a>.</p><a href='http://yyyang.me/'>Yao-Yuan Yang</a>Adversarial robustness has received much attention recently. Prior defenses and attacks for non-parametric classifiers have been developed on a classifier-specific basis. In this post, we take a holistic view and present a defense and an attack algorithm that are applicable across many non-parametric classifiers. Our defense algorithm, adversarial pruning, works by preprocessing the dataset so the data is better separated. It can be interpreted as a finite sample approximation to the optimally robust classifier. The attack algorithm, region-based attack, works by decomposing the feature space into convex regions. We show that our defense and attack have good empirical performance over a range of datasets.Adversarial Robustness Through Local Lipschitzness2020-05-04T17:00:00+00:002020-05-04T17:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/05/04/adversarial-robustness-through-local-lipschitzness<p>Neural networks are very susceptible to adversarial examples, a.k.a., small perturbations of normal inputs that cause a classifier to output the wrong label.
The standard defense against adversarial examples is <a href="https://arxiv.org/abs/1706.06083">Adversarial Training</a>, which trains a classifier using adversarial examples close to training inputs.
This improves test accuracy on adversarial examples, but it often lowers clean accuracy, sometimes by a lot.</p>
<p>Several recent papers investigate whether an accuracy-robustness trade-off is necessary.
Some <a href="https://arxiv.org/abs/1805.12152">pessimistic work</a> says that unfortunately this may be the case, possibly <a href="https://arxiv.org/abs/1801.02774">due to high-dimensionality</a> or <a href="https://arxiv.org/abs/1805.10204">computational infeasibility</a>.</p>
<p>If a trade-off is unavoidable, then we have a dilemma: should we aim for higher accuracy or robustness or somewhere in between?
Our <a href="https://arxiv.org/abs/2003.02460">recent paper</a> explores an optimistic perspective: we posit that robustness and accuracy should be attainable together for real image classification tasks.</p>
<p>The main idea is that we should use a locally smooth classifier, one that doesn’t change its value too quickly around the data. Let’s walk through some theory about why this is a good idea. Then, we will explore how to use this in practice.</p>
<h3 id="the-problem-with-natural-training">The problem with natural training</h3>
<p>The reason why we see a trade-off between robustness and accuracy is due to training methods. The best neural network optimization methods lead to functions that change very rapidly, as this allows the network to closely fit the data.</p>
<p>Since we care about robustness, we actually want to move as slowly as possible from class to class. This is especially true for separated data. Think about an image dataset. Cats look different than dogs, and pandas look different than gibbons. Quantitatively, different animals should be far apart (for example, in $\ell_{\infty}$ and $\ell_2$ distance). It follows that we should be able to classify them robustly. If we are very confident in our prediction, then as long as we don’t modify a true image too much, we should output the same, correct label.</p>
<p>So why do adversarial perturbations lead to a high error rate? This is a very active area of research, and there’s no easy answer.
As a step towards a better understanding, we present theoretical results on achieving perfect accuracy and robustness by using a locally smooth function. We also explore how well this works in practice.</p>
<p>As a motivating example, consider a simple 2D binary classification dataset. The goal is to find a decision boundary that has 100% training accuracy without passing closely to any individual input.
The orange curve in the following picture shows such a boundary. In contrast, the black curve comes very close to some data points. Even though both boundaries correctly classify all of the examples, the black curve is susceptible to adversarial examples, while the orange curve is not.</p>
<p style="text-align: center;"><img src="/assets/2020-05-04-local-lip/wig_boundary.png" width="40%" /></p>
<h3 id="perfect-accuracy-and-robustness-at-least-in-theory">Perfect accuracy and robustness, at least in theory</h3>
<p>We propose designing a classifier using the sign of a relatively smooth function. For separated data, this ensures that it’s impossible to change the label by slightly perturbing a true input. In other words, if the function value doesn’t change very quickly, then neither does the label.</p>
<p>More formally, we consider classifiers $g(x) = \mathsf{sign}(f(x))$, and we highlight the local Lipschitzness of $f$ as an important quantity. Simply put, the Lipschitz constant of a function measures how fast a function changes by dividing the difference between function values by the distance between inputs:
$\frac{|f(x) - f(y)|}{d(x,y)}.$
Here $d(x,y)$ can be any metric; it is most common to use $d(x,y) = \|x - y\|$ for some norm on $\mathbb{R}^d$.
Previous works (<a href="https://arxiv.org/abs/1811.05381">1</a>, <a href="https://arxiv.org/abs/1807.09705">2</a>) show that enforcing global Lipschitzness is too strict. Instead, we consider functions $f$ that are $L$-locally Lipschitz, meaning that $f$ changes slowly, at rate at most $L$, within a small neighborhood of radius $r$ around each point.</p>
<div class="definition">
A function $f: \mathcal{X} \rightarrow \mathbb{R}$ is $L$-Locally Lipschitz in a radius $r$ around a point $x \in \mathcal{X}$, if for all $x'$ such that $d(x,x') \leq r$, we have
$ |f(x) - f(x')| \leq L \cdot d(x, x').$
</div>
<p>Previous work by <a href="https://arxiv.org/abs/1705.08475">Hein and Andriushchenko</a> has shown that local Lipschitzness indeed guarantees robustness.
In fact, variants of Lipschitzness have been the main tool in certifying robustness with <a href="https://arxiv.org/abs/1902.02918">randomized smoothing</a> as well.
However, we are the first to identify a natural condition (data separation) that ensures both robustness and high test accuracy.</p>
<p>Our main theoretical result says that if the two classes are separated – in the sense that points from different classes are distance at least $2r$ apart, then there exists a $1/r$-locally Lipschitz function that is both robust to perturbations of distance $r$ and also 100% accurate.</p>
<p>For many real world datasets, the separation assumption in fact holds.</p>
<div style="text-align: center;">
<img src="/assets/2020-05-04-local-lip/cifar10_linf_hist.png" width="48%" style="margin: 0 auto" />
<img src="/assets/2020-05-04-local-lip/resImgNet_linf_hist.png" width="48%" style="margin: 0 auto" />
</div>
<p>For example, consider the CIFAR-10 and Restricted ImageNet datasets (for the latter, we removed a handful of images that appeared twice with different labels).
The figure shows the histogram of the $\ell_\infty$ distance of each training example to its closest differently-labeled example.
From the figure we can see that the dataset is $0.21$-separated, indicating that there exists a solution that is both robust and accurate for perturbations of size up to $0.105$.
Perhaps surprisingly, most work on adversarial examples considers small perturbations of size $0.031$ for CIFAR-10 and $0.005$ for Restricted ImageNet, which are both much less than the observed separation in these histograms.</p>
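<p>The separation statistic behind these histograms is straightforward to compute: for each example, take the $\ell_\infty$ distance to its closest differently-labeled example, and take the minimum over examples. A minimal numpy sketch on toy data (helper name and data are our own):</p>

```python
import numpy as np

def separation_linf(X, y):
    """For each example, the l_inf distance to the nearest example with a
    different label; the dataset's separation is the minimum over examples."""
    dists = np.max(np.abs(X[:, None, :] - X[None, :, :]), axis=2)
    dists[y[:, None] == y[None, :]] = np.inf  # ignore same-label pairs
    per_example = dists.min(axis=1)
    return per_example, per_example.min()

# Two toy classes in 2-D.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 0.3], [0.7, 0.2]])
y = np.array([0, 0, 1, 1])
per_example, sep = separation_linf(X, y)
# A 2r-separated dataset admits a robust, accurate classifier for
# perturbations of l_inf radius up to r (here r = sep / 2).
```

For large datasets the quadratic pairwise computation would be replaced by a nearest-neighbor index, but the statistic is the same.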
<div class="theorem">
If the data is $2r$-separated, then there always exists a classifier that is perfectly robust and accurate, which is based on a function with local Lipschitz constant $1/r$.
</div>
<p>We basically use a scaled version of the 1-nearest neighbor classifier in the infinite sample limit. The proof just uses the data separation along with a few applications of the triangle inequality. The next figure shows our theorem in action on the Spiral dataset. The classifier $g(x) = \mathsf{sign}(f(x))$ has high adversarial and clean accuracy, while the small local Lipschitz constant ensures that it gradually changes near the decision boundaries.</p>
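<p>In the spirit of that construction, one concrete function with the right properties (our own sketch, not the paper's exact proof) takes the difference of distances to the two classes, scaled by $1/(2r)$ and clipped to $[-1,1]$. Each point-to-set distance is 1-Lipschitz, so the clipped difference is $(1/r)$-Lipschitz, and on $2r$-separated data its sign is correct on the training points and stable under perturbations of norm less than $r$.</p>

```python
import numpy as np

def make_robust_classifier(X, y, r):
    """f(x) = clip((d(x, S_-) - d(x, S_+)) / (2r), -1, 1), where S_+ and S_-
    are the two classes and d is the Euclidean point-to-set distance."""
    S_pos, S_neg = X[y == 1], X[y == -1]
    def f(x):
        d_pos = np.linalg.norm(S_pos - x, axis=1).min()
        d_neg = np.linalg.norm(S_neg - x, axis=1).min()
        return np.clip((d_neg - d_pos) / (2 * r), -1.0, 1.0)
    return f

# Two classes that are 1.8 apart, comfortably more than 2r = 1.0.
X = np.array([[0.0, 0.0], [0.2, 0.0], [2.0, 0.0], [2.2, 0.0]])
y = np.array([1, 1, -1, -1])
f = make_robust_classifier(X, y, r=0.5)
```

On the training points $f$ saturates at $\pm 1$, and it interpolates gradually in between, exactly the behavior shown for the Spiral dataset above.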
<figure class="image" style="text-align: center;">
<img src="/assets/2020-05-04-local-lip/spiral.png" width="40%" style="margin: 0 auto" />
<figcaption>
Function and resulting classifier from our theorem.
The prediction is confident most of the time, and it gradually changes between classes (orange to blue).
</figcaption>
</figure>
<h3 id="encouraging-the-smoothness-of-neural-networks">Encouraging the smoothness of neural networks</h3>
<p>Now that we’ve made a big deal of local Lipschitzness, and provided some theory to back it up, we want to see how well this holds up in practice. Two questions drive our experiments:</p>
<ul>
<li>Is local Lipschitzness correlated with robustness and accuracy in practice?</li>
<li>Which training methods produce locally Lipschitz functions?</li>
</ul>
<p>We also need to explain how we measure Lipschitzness on real data. For simplicity, we consider the average local Lipschitzness, computed using</p>
<p>\[
\frac{1}{n}\sum_{i=1}^n\max_{x_i'\in\mathsf{Ball}(x_i,\epsilon)}\frac{|f(x_i)-f(x_i')|}{\|x_i-x_i'\|_\infty}.
\]</p>
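<p>The average local Lipschitzness above can be approximated numerically. Below is a minimal Monte-Carlo sketch: the inner maximization over the $\ell_\infty$ ball is done by random sampling here for simplicity, whereas in practice it is usually done with a PGD-style gradient ascent. The helper name and toy function are our own.</p>

```python
import numpy as np

def local_lipschitz(f, X, eps, n_samples=200, rng=None):
    """Monte-Carlo estimate of average local Lipschitzness: for each x_i,
    approximate max over the l_inf ball of radius eps of
    |f(x_i) - f(x')| / ||x_i - x'||_inf by random sampling."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for x in X:
        deltas = rng.uniform(-eps, eps, size=(n_samples, x.size))
        ratios = [abs(f(x) - f(x + d)) / np.abs(d).max() for d in deltas]
        total += max(ratios)
    return total / len(X)

# Sanity check on a function whose Lipschitz constant is exactly 3.
f = lambda x: 3.0 * x[0]
X = np.array([[0.0], [1.0]])
est = local_lipschitz(f, X, eps=0.1, rng=0)
```

Because the estimate averages the per-point maxima, a few points with large local variation raise it only proportionally to their share of the data.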
<p>Averaging reflects that we only need the function to be smooth on average; a few outliers are tolerated.
One of the best methods for adversarial examples is <a href="https://arxiv.org/abs/1901.08573">TRADES</a>, which encourages local Lipschitzness by minimizing the following loss function:</p>
<p>\[
\min_{f} \mathbb{E} \Big\{\mathcal{L}(f(X),Y)+\beta\max_{X'\in\mathsf{Ball}(X,\epsilon)} \mathcal{L}(f(X),f(X'))\Big\}.
\]</p>
<p>TRADES is different than <a href="https://arxiv.org/abs/1706.06083">Adversarial Training (AT)</a>, which optimizes the following:</p>
<p>\[
\min_{f} \mathbb{E} \Big\{\max_{X'\in\mathsf{Ball}(X,\epsilon)}\mathcal{L}(f(X'),Y)\Big\}.
\]</p>
<p>AT directly optimizes over adversarial examples, while TRADES encourages $f(X)$ and $f(X’)$ to be similar when $X$ and $X’$ are close to each other. The TRADES parameter $\beta$ controls the local smoothness (larger $\beta$ means a smaller Lipschitz constant).</p>
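<p>To make the difference between the two objectives concrete, here is a schematic numpy sketch for a toy scalar model: the inner maximization is done by grid search over the $\ell_\infty$ ball instead of PGD, and a squared loss stands in for the surrogate losses used in the papers (all simplifications ours).</p>

```python
import numpy as np

def inner_max(loss_fn, x, eps, n_grid=21):
    """Approximate the max over the l_inf ball by grid search
    (in practice this is done with PGD)."""
    grid = np.linspace(-eps, eps, n_grid)
    candidates = np.array(np.meshgrid(*[grid] * x.size)).reshape(x.size, -1).T
    return max(loss_fn(x + d) for d in candidates)

def at_loss(f, x, y, eps):
    """Adversarial training: loss of the worst-case perturbed input."""
    return inner_max(lambda xp: (f(xp) - y) ** 2, x, eps)

def trades_loss(f, x, y, eps, beta):
    """TRADES: clean loss plus beta times the worst-case output deviation,
    which penalizes a large local Lipschitz constant of f around x."""
    clean = (f(x) - y) ** 2
    smooth = inner_max(lambda xp: (f(x) - f(xp)) ** 2, x, eps)
    return clean + beta * smooth

# Toy "model" and a correctly classified point: the AT loss is driven
# entirely by the worst perturbation, while the TRADES loss separates
# the clean fit from the beta-weighted smoothness penalty.
f = lambda x: float(x.sum())
x, y, eps = np.array([0.5, 0.5]), 1.0, 0.1
```

Raising $\beta$ scales only the smoothness term, which is why larger $\beta$ in the tables below corresponds to smaller measured Lipschitz constants.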
<p>We also consider two other plausible methods for achieving accuracy and robustness, along with local Lipschitzness:
<a href="https://arxiv.org/abs/1907.02610">Local Linear Regularization (LLR)</a>
and <a href="https://arxiv.org/abs/1905.11468">Gradient Regularization (GR)</a>.</p>
<h3 id="comparing-five-different-training-methods">Comparing five different training methods</h3>
<p>Here we provide experimental results for CIFAR-10 and Restricted ImageNet. See our paper for other datasets (MNIST and SVHN).</p>
<table>
<thead>
<tr>
<th style="text-align: left">CIFAR-10</th>
<th style="text-align: center">train accuracy</th>
<th style="text-align: center">test accuracy</th>
<th style="text-align: center">adv test accuracy</th>
<th style="text-align: center">test lipschitz</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Natural</td>
<td style="text-align: center">100.00</td>
<td style="text-align: center">93.81</td>
<td style="text-align: center">0.00</td>
<td style="text-align: center">425.71</td>
</tr>
<tr>
<td style="text-align: left">GR</td>
<td style="text-align: center">94.90</td>
<td style="text-align: center">80.74</td>
<td style="text-align: center">21.32</td>
<td style="text-align: center">28.53</td>
</tr>
<tr>
<td style="text-align: left">LLR</td>
<td style="text-align: center">100.00</td>
<td style="text-align: center">91.44</td>
<td style="text-align: center">22.05</td>
<td style="text-align: center">94.68</td>
</tr>
<tr>
<td style="text-align: left">AT</td>
<td style="text-align: center">99.84</td>
<td style="text-align: center">83.51</td>
<td style="text-align: center">43.51</td>
<td style="text-align: center">26.23</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=1)</td>
<td style="text-align: center">99.76</td>
<td style="text-align: center">84.96</td>
<td style="text-align: center">43.66</td>
<td style="text-align: center">28.01</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=3)</td>
<td style="text-align: center">99.78</td>
<td style="text-align: center">85.55</td>
<td style="text-align: center">46.63</td>
<td style="text-align: center">22.42</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=6)</td>
<td style="text-align: center">98.93</td>
<td style="text-align: center">84.46</td>
<td style="text-align: center">48.58</td>
<td style="text-align: center">13.05</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: left">Restricted ImageNet</th>
<th style="text-align: center">train accuracy</th>
<th style="text-align: center">test accuracy</th>
<th style="text-align: center">adv test accuracy</th>
<th style="text-align: center">test lipschitz</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Natural</td>
<td style="text-align: center">97.72</td>
<td style="text-align: center">93.47</td>
<td style="text-align: center">7.89</td>
<td style="text-align: center">32228.51</td>
</tr>
<tr>
<td style="text-align: left">GR</td>
<td style="text-align: center">91.12</td>
<td style="text-align: center">88.51</td>
<td style="text-align: center">62.14</td>
<td style="text-align: center">886.75</td>
</tr>
<tr>
<td style="text-align: left">LLR</td>
<td style="text-align: center">98.76</td>
<td style="text-align: center">93.44</td>
<td style="text-align: center">52.65</td>
<td style="text-align: center">4795.66</td>
</tr>
<tr>
<td style="text-align: left">AT</td>
<td style="text-align: center">96.22</td>
<td style="text-align: center">90.33</td>
<td style="text-align: center">82.25</td>
<td style="text-align: center">287.97</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=1)</td>
<td style="text-align: center">97.39</td>
<td style="text-align: center">92.27</td>
<td style="text-align: center">79.90</td>
<td style="text-align: center">2144.66</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=3)</td>
<td style="text-align: center">95.74</td>
<td style="text-align: center">90.75</td>
<td style="text-align: center">82.28</td>
<td style="text-align: center">396.67</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=6)</td>
<td style="text-align: center">93.34</td>
<td style="text-align: center">88.92</td>
<td style="text-align: center">82.13</td>
<td style="text-align: center">200.90</td>
</tr>
</tbody>
</table>
<p>For both datasets, we see correlation between accuracy, Lipschitzness, and adversarial accuracy. For example, on CIFAR-10, we see that TRADES($\beta$=6) achieves the highest adversarial test accuracy (48.58), and also the lowest Lipschitz constant (13.05). TRADES may not always perform better than AT, but it seems like a very effective method to produce classifiers with small local Lipschitz constants. One issue is that the training accuracy isn’t as high as it could be, and there are some issues with tuning the methods to prevent underfitting. In general, we focus on understanding the role of Lipschitzness.</p>
<p>Natural training has the lowest adversarial accuracy, and also the highest Lipschitz constant. GR has a fairly low training accuracy (possibly due to underfitting).
For LLR, AT, and TRADES, we see that smoother classifiers have higher adversarial test accuracy as well. However, this only holds up to a point: decreasing the local Lipschitz constant helps, but pushing it too low makes the networks underfit, which costs accuracy, as with TRADES($\beta$=6).</p>
<h3 id="robustness-requires-some-local-lipschitzness">Robustness requires some local Lipschitzness</h3>
<p>Our experimental results provide many insights into the role that Lipschitzness plays in classifier accuracy and robustness.</p>
<ul>
<li>
<p>A clear takeaway is that <em>very high</em> Lipschitz constants imply that the classifier is vulnerable to adversarial examples. We see this most clearly with natural training, but it is also evidenced by GR and LLR.</p>
</li>
<li>
<p>For both CIFAR and Restricted ImageNet, the experiments show that minimizing the Lipschitzness goes hand-in-hand with maximizing the adversarial accuracy. This highlights that Lipschitzness is just as important as training with adversarial examples when it comes to improving the adversarial robustness.</p>
</li>
<li>
<p>TRADES always leads to significantly smaller Lipschitz constants than most methods, and the smoothness increases with the TRADES parameter $\beta$. However, the correlation between smoothness and robustness suffers from diminishing returns. It is not optimal to minimize the Lipschitzness as much as possible.</p>
</li>
<li>
<p>The main downside of AT and TRADES is that the clean accuracy suffers. This issue may not be inherent to robustness, but rather it may be possible to achieve the best of both worlds. For example, LLR is consistently more robust than natural training, while simultaneously achieving state-of-the-art clean test accuracy. This leaves open the possibility of combining the benefits of both LLR and AT/TRADES into a classifier that does well across the board. This is the main future work!</p>
</li>
</ul>
<h3 id="more-details">More Details</h3>
<p>See <a href="https://arxiv.org/abs/2003.02460">our paper on arxiv</a> or <a href="https://github.com/yangarbiter/robust-local-lipschitz">our repository</a>.</p><a href='https://sites.google.com/site/cyrusrashtchian/'>Cyrus Rashtchian</a> and <a href='http://yyyang.me/'>Yao-Yuan Yang</a>Robustness often leads to lower test accuracy, which is undesirable. We prove that (i) if the dataset is separated, then there always exists a robust and accurate classifier, and (ii) this classifier can be obtained by rounding a locally Lipschitz function. Empirically, we verify that popular datasets (MNIST, CIFAR-10, and ImageNet) are separated, and we show that neural networks with a small local Lipschitz constant indeed have high test accuracy and robustness.