<p>UCSD Machine Learning Group: research updates from the UCSD community, with a focus on machine learning, data science, and applied algorithms.</p>
<h2>Location Trace Privacy Under Conditional Priors</h2>
<p>2021-05-10</p>
<p>Imagine a mobile app that repeatedly records your geolocation over a short period of time – say a day. We call this sequence of locations a <em>location trace</em>. Ideally, the app would like to use these locations to send recommendations, ads, or even reminders. But there is the issue of privacy: many people, including myself, would feel uncomfortable if our exact locations were shared with and recorded by apps. One option is to completely shut off all location services. But is it possible to have a happy medium? In other words, can we obscure a user’s location trace enough to protect privacy while still retaining some utility?</p>
<h3 id="rigorous-privacy-definitions-differential-and-inferential">Rigorous Privacy Definitions: Differential and Inferential</h3>
<p>Before we get to what privacy means in this case, let us look at how rigorous privacy definitions work. Broadly speaking, the literature has two main philosophies of rigorous statistical privacy definitions — differential and inferential privacy. Differential privacy is an elegant privacy definition designed by cryptographers Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith in 2006. The philosophy here is that the participation of a single person in the data should not make a big difference to the probability of any outcome; this, in turn, implies that an adversary watching the output of a differentially private algorithm cannot determine for sure whether a certain person is in the dataset or not. Differential privacy has many elegant properties, such as robustness to auxiliary information, graceful composition, and post-processing invariance.</p>
<p>Inferential privacy, in contrast, means that an adversary with a certain prior knowledge does not gain much extra knowledge after seeing the output of a private algorithm. While this notion is older than differential privacy, it was formalized by <a href="https://users.cs.duke.edu/~ashwin/pubs/pufferfish_TODS.pdf"> Kifer and Machanavajjhala in 2012 as the Pufferfish privacy framework</a>. Inferential privacy does not always have the elegant properties of differential privacy, but it tends to be more flexible in the sense that it can obscure specific events. Moreover, some inferential privacy frameworks and algorithms do compose gracefully and are robust to certain kinds of auxiliary information. There is, however, a no-free-lunch theorem stating that inferential privacy against arbitrary auxiliary information implies no utility — so there is a limit to how far this can extend.</p>
<h3 id="a-privacy-framework-for-location-traces">A Privacy Framework for Location Traces</h3>
<p>Coming back to the privacy of location traces, let us now think about some options on how to model them in a rigorous privacy framework. There are two interesting aspects about location traces. First, location is continuous spatial data — and for both privacy and utility, we may need to obscure it up to a certain distance. We call this the <em>spatiality aspect</em>. But the more challenging aspect is correlation. My location at 10am is highly correlated with my location at 10:05, and not building this into a privacy framework may lead to privacy leaks.</p>
<p>Our first option is to use local differential privacy (LDP), which is basically differential privacy applied to a single person’s data. This will mean that two traces — one in New York and one in California — will be almost indistinguishable. However, this involves adding considerable noise to each trace — so much so as to render them completely useless. We will have very good privacy, but almost no utility whatsoever.</p>
<p>Our second option is to realize that while most people may be uncomfortable sharing fine-grained location information, they may be okay with coarse-grained data. For example, since I work at UCSD, which is in La Jolla, CA, I may not mind someone knowing that I spend most of my working hours in La Jolla; but I would not want them to know my precise location. This is known as <em>geo-indistinguishability</em>, and is achieved by adding independent noise calibrated to a radius $r$ to each location. This improves utility if we are releasing a single location, but traces remain a challenge: if an adversary averages the private locations released at 10am and 10:05am, they get a better estimate of the true location, since the underlying true locations are highly correlated.</p>
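<p>To make the correlation leak concrete, here is a minimal numpy sketch (not the mechanism from any of the papers discussed; the noise scale and coordinates are illustrative) of adding independent isotropic Gaussian noise to each point of a trace, and of how averaging two noisy reports of nearly the same true location reduces the adversary's error.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(trace, sigma):
    """Add independent isotropic Gaussian noise to each (lat, lon) point."""
    return trace + rng.normal(0.0, sigma, size=trace.shape)

# Two reports of (essentially) the same true location, 5 minutes apart.
true_loc = np.array([32.8801, -117.2340])   # illustrative coordinates
trace = np.tile(true_loc, (2, 1))           # perfectly correlated trace
sigma = 0.01

noisy = perturb(trace, sigma)
# Averaging the two independently-noised reports halves the noise
# variance, so the adversary localizes the user better than from
# either report alone -- exactly the leak described above.
averaged = noisy.mean(axis=0)
```

<p>Repeating this many times, the mean squared error of <code>averaged</code> is about half that of a single noisy report.</p>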
<p style="text-align: center;">
<img src="/assets/2021-05-10-location-priv/plausible_solutions.png" width="80%" />
</p>
<h5 style="text-align: left;">Tradeoffs of three privacy definitions for location data: while LDP prevents inference from correlations, it offers almost no utility for individual traces. Geo-indistinguishability works well for a single location, but cannot prevent an adversary from correlating points close together in time. Our definition (conditional inferential privacy) provides a middle ground: it prevents inference against a class of priors while still offering useful accuracy.</h5>
<h3 id="conditional-inferential-privacy">Conditional Inferential Privacy</h3>
<p>This brings us to our framework, Conditional Inferential Privacy (CIP). Here we aim to obscure each location to within a radius $r$, while taking into account correlation across time through a Gaussian Process prior. Gaussian processes effectively model a sequence of $n$ random variables as an $n$-dimensional vector drawn from a multivariate normal distribution (see <a href="http://www.gaussianprocess.org/gpml/chapters/RW2.pdf">Rasmussen Ch. 2</a> for more detail). In the location setting, the correlation between two locations increases with their proximity in time. Gaussian processes are frequently used to model trajectories (<a href="https://ieeexplore.ieee.org/document/7102794">Chen ‘15</a>, <a href="https://ieeexplore.ieee.org/document/1237448">Liang & Hass ‘03</a>, <a href="https://ieeexplore.ieee.org/document/709453">Liu ‘98</a>, <a href="https://ieeexplore.ieee.org/document/6126365">Kim ‘11</a>), so this serves as a good model for a prior. Through directly modeling correlations, we can ensure that we can obscure locations up to a radius $r$, even in the presence of these correlations.</p>
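<p>As a concrete (and purely illustrative) sketch of such a prior, the following builds the covariance matrix of a Gaussian process with an RBF kernel over one coordinate of a trace; the lengthscale is a made-up parameter, and the point is only that nearby times are strongly correlated while distant times are nearly independent.</p>

```python
import numpy as np

def rbf_kernel(times, lengthscale=10.0, var=1.0):
    """GP prior covariance over one location coordinate: the correlation
    between two times decays smoothly with their separation."""
    t = np.asarray(times, dtype=float).reshape(-1, 1)
    return var * np.exp(-0.5 * (t - t.T) ** 2 / lengthscale ** 2)

# 10:00am, 10:05am, and 11:00am, in minutes from the first sample.
K = rbf_kernel([0.0, 5.0, 60.0])
```

<p>Here <code>K[0, 1]</code> (5 minutes apart) is about 0.88, while <code>K[0, 2]</code> (an hour apart) is essentially zero.</p>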
<p>Formally, our framework builds upon the Pufferfish inferential privacy framework. We have a set of basic secrets $S$ consisting of events $s_{x,t}$, which denotes “User was at location $x$ at time $t$”. These are the kinds of events that we would like to hide. In practice, we may choose to hide more complicated events — such as “User was at home at 10am and at the coffee shop at 10:05am”; these are modeled by a set of compound events $C$, which is essentially a set of tuples of the form $(s_{x_1, t_1}, s_{x_2, t_2}, …)$.</p>
<p>We then have the set of secret pairs $P$ which is a subset of $C \times C$ — these are the pairs of secrets that the adversary should not be able to distinguish between. Finally we have a set of priors $\Theta$, which is a set of Gaussian processes that presumably represents the adversary’s prior.</p>
<p>A mechanism $M$ is said to satisfy $(P, \Theta)$-CIP with parameters $(\lambda, \epsilon)$ if, for all $\theta \in \Theta$ and all pairs $(s, s’) \in P$, we have:</p>
\[D_{\text{Renyi}, \lambda} \Big(\Pr(M(X) = Z | s, \theta ) , \Pr(M(X) = Z | s’, \theta)\Big) \leq \epsilon\]
<p>where $D_{\text{Renyi}, \lambda}$ is the Renyi divergence of order $\lambda$ (see <a href="https://arxiv.org/abs/1702.07476"> Mironov ‘17 </a> for background on Renyi divergence and its use in the privacy literature). Essentially, this means that the output distributions of the mechanism $M$ are similar under the secrets $s$ and $s’$, where similar means low Renyi divergence.</p>
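<p>For intuition about the divergence bound, the Renyi divergence between two equal-variance Gaussians has a simple closed form, $D_{\text{Renyi},\lambda}\big(\mathcal{N}(\mu_0,\sigma^2),\mathcal{N}(\mu_1,\sigma^2)\big) = \lambda(\mu_0-\mu_1)^2/(2\sigma^2)$ (see Mironov ‘17). The sketch below checks the formula against the definition by grid integration; all values are arbitrary.</p>

```python
import numpy as np

def renyi_equal_var_gaussians(mu0, mu1, sigma, lam):
    """Closed-form Renyi divergence of order lam between
    N(mu0, sigma^2) and N(mu1, sigma^2)."""
    return lam * (mu0 - mu1) ** 2 / (2.0 * sigma ** 2)

# Numerical check against the definition
#   D_lam(P || Q) = 1/(lam - 1) * log E_Q[(p/q)^lam]
# by Riemann-sum integration of p^lam * q^(1 - lam).
lam, mu0, mu1, sigma = 2.0, 0.0, 1.0, 1.0
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu0) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
q = np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
numeric = np.log(np.sum(p ** lam * q ** (1 - lam)) * dx) / (lam - 1)
```

<p>For these parameters both the closed form and the numerical integral come out to 1.0.</p>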
<p>There are a couple of interesting things to note here. First, note that unlike differential privacy, here the privacy is over both the prior and the randomness in the mechanism; this is quite standard for inferential privacy. Second, observe that we use Renyi divergence in the definition instead of the probability ratios or max divergence that are used in the standard differential privacy and Pufferfish privacy definitions. This is because Renyi divergences have a natural synergy with Gaussians and Gaussian processes, which we use as our priors and mechanisms.</p>
<p>While not as elegant as differential privacy, this definition also has some good properties. We can show that we can get graceful decay of privacy for two trajectories of the same person from different time intervals — which is analogous to what is called parallel composition in the privacy literature. We also show that there is some robustness to side information. Details are in our paper.</p>
<p style="text-align: left;"><img src="/assets/2021-05-10-location-priv/three_traces.png" width="100%" /></p>
<h5 style="text-align: left;">Example of how CIP maintains high uncertainty at secret locations (times). Left: <a href="https://www.nytimes.com/interactive/2018/12/10/business/location-data-privacy-apps.html">a real location trace unknowingly collected from an NYC mayoral staff member by apps on their phone</a>. The red dots indicate sensitive locations. Middle: demonstration of how Geoindistinguishability (adding independent isotropic Gaussian noise to each location, as in the red trace) allows for high certainty of true location by correlation. The green envelope shows the posterior uncertainty of a Bayesian adversary with a Gaussian process prior (a <em>GP adversary</em>). Right: demonstration of how a CIP mechanism efficiently thwarts the same adversary's posterior at sensitive locations, given the same utility budget. The mechanism achieves this by both concentrating the noise budget near sensitive locations and by strategically correlating the added noise.</h5>
<h4 id="related-work">Related Work</h4>
<p>It is worth noting that we are in no way the first to attempt to offer meaningful location privacy. However, our method is distinguished in that it works in a continuous spatiotemporal domain, offers local privacy within a radius $r$ for sensitive locations, and has a semantically meaningful inferential guarantee. A mechanism offered by <a href="https://ieeexplore.ieee.org/document/7546522"> Bindschaedler & Shokri</a> releases synthesized traces satisfying the notion of plausible deniability, but this is distinctly different from providing a radius of privacy in the local setting, as we do. Meanwhile, the frameworks proposed by <a href="https://arxiv.org/abs/1410.5919">Xiao & Xiong (2015)</a> and <a href="http://export.arxiv.org/pdf/1810.09152">Cao et al. (2019)</a> nicely characterize the risk of inference in location traces, but use only first-order Markov models of correlation between points, do not offer a radius of indistinguishability as in this work, and are not suited to continuous-valued spatiotemporal traces.</p>
<h3 id="results">Results</h3>
<p>With the definition in place, we can now measure the privacy loss of different mechanisms. The most basic mechanism is to add zero-mean isotropic Gaussian noise with equal standard deviation to every location in the trace and publish the result; if the added noise has standard deviation $\sigma$, then we can calculate the privacy loss under CIP, as well as the mean square error utility. If a certain utility is desired, we can calibrate $\sigma$ to it and obtain a certain privacy loss.</p>
<p>A more sophisticated mechanism is to add zero-mean Gaussian noise with different covariances to locations at different time points. It turns out that we can choose the covariances to minimize privacy loss for a given utility, and this can be done by solving a Semi-Definite Program. The derivation and more details are in our paper.</p>
<p>We provide below a snapshot of our results. On the x-axis, we plot a measure of how correlated the prior is. If the prior is highly correlated, then it is easy to leak privacy for mechanisms that add noise — and hence correlated priors are worse for privacy. On the y-axis, we plot the size of the adversary’s posterior confidence interval — higher means more privacy. Both mechanisms are calibrated to the same mean-square error, and hence the privacy-utility tradeoff is better when the y-axis value is higher. From the figure, we see that our SDP-based mechanism does lead to a better privacy-utility tradeoff, and as expected, the privacy offered declines as the correlations grow stronger.</p>
<p style="text-align: left;"><img src="/assets/2021-05-10-location-priv/experiments.png" width="100%" /></p>
<h5 style="text-align: left;">Our inferentially private mechanism (blue line) maintains higher posterior uncertainty for a Bayesian adversary with a Gaussian process prior (a <em>GP adversary</em>) as compared to two Geoindistinguishability-based baselines (orange and green). The x-axis indicates the degree of correlation anticipated by the GP adversary. The left panel shows the posterior uncertainty for a single basic secret. The middle panel shows uncertainty for a compound secret. The right panel shows posterior uncertainty when we design our mechanism to maintain privacy at every location (all basic secrets). The gray window shows a range of realistic degrees of dependence (correlation) gathered from human mobility data. </h5>
<p style="text-align: left;"><img src="/assets/2021-05-10-location-priv/covariance.png" width="100%" /></p>
<h5 style="text-align: left;">Examples of the noise covariance chosen by our mechanism: Each frame is a covariance matrix optimized by our SDP mechanism to thwart inference at either a single location basic secret or a compound secret of two locations. Noise drawn from a multivariate normal with this covariance is added along the 50 point trace. The two frames on the left show covariance chosen to thwart a GP prior with an RBF kernel. The two frames on the right show covariance chosen to thwart a GP prior with a periodic kernel.</h5>
<h3 id="conclusion">Conclusion</h3>
<p>In conclusion, we take a stab at a long-standing challenge in offering location privacy — temporal correlations — and we provide a way to model them cleanly and flexibly through Gaussian Process priors. This gives us a way to quantify the privacy loss for correlated location trajectories and devise new mechanisms for sanitizing them. Our experiments show that our mechanisms offer better privacy-accuracy tradeoffs than standard baselines.</p>
<p>There are many open problems, particularly in the space of mechanism design. Can we improve the privacy-utility tradeoff offered by our mechanisms through other means, such as subsampling the traces or interpolation? Can we make our definition and our methods more robust to side information? Finally, location traces are only one example of correlated and structured data; a remaining challenge is to build upon the methodology developed here to design privacy frameworks for more complex and structured data.</p>
<p><a href='https://cseweb.ucsd.edu/~kamalika/'>Kamalika Chaudhuri</a> and Casey Meehan</p>
<p>Providing meaningful privacy to users of location based services is particularly challenging when multiple locations are revealed in a short period of time. This is primarily due to the tremendous degree of dependence that can be anticipated between points. We propose a Renyi divergence based privacy framework, "Conditional Inferential Privacy", that quantifies this privacy risk given a class of priors on the correlation between locations. Additionally, we demonstrate an SDP-based mechanism for achieving this privacy under any Gaussian process prior. This framework both exemplifies why dependent data is so challenging to protect and offers a strategy for preserving privacy to within a fixed radius for sensitive locations in a user’s trace.</p>
<h2>The Expressive Power of Normalizing Flow Models</h2>
<p>2020-11-16</p>
<h3 id="background-generative-models-and-normalizing-flows">Background: Generative Models and Normalizing Flows</h3>
<p><a href="https://en.wikipedia.org/wiki/Generative_model">Generative models</a> are one kind of unsupervised learning model in machine learning. Given a set of training data – such as pictures of dogs, audio clips of human speakers, and articles from certain websites – a generative model aims to generate samples that look/sound like they are samples from the dataset, but are not exactly any one of them. We usually train a generative model by maximizing the probability, or likelihood, of the samples under the model.</p>
<p>To understand complicated training data, generative models usually use very large neural networks (so they are also called deep generative models). Popular deep generative models include <a href="https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf">generative adversarial networks</a> (GANs) and <a href="https://arxiv.org/pdf/1606.05908.pdf">variational autoencoders</a> (VAEs), which have achieved state-of-the-art performance on many generative tasks. Below are examples showing that <a href="https://arxiv.org/abs/1812.04948">styleGAN</a> (left) and <a href="https://arxiv.org/abs/1906.00446">VQ-VAE</a> (right) can generate amazing high resolution images!</p>
<p style="text-align: center;"><img src="/assets/2020-11-16-nf/stylegan_demo.png" width="41%" />
<img src="/assets/2020-11-16-nf/vqvae_demo.png" width="55.2%" /></p>
<p>One might ask: as we already have powerful generative models, is everything done? No! There are many aspects in which we want to improve these models. Below are two points related to this blog.</p>
<p>First, we want to compute exact likelihood if possible. Both GANs and VAEs generate samples by applying a neural network transformation on a latent random variable $z$, which is usually a Gaussian. In this case, the sample likelihood <i> cannot </i> be exactly computed because complicated neural networks may map different $z$’s to the same output.</p>
<p>This is the reason why <a href="https://arxiv.org/abs/1908.09257">normalizing flows</a> (NFs) were proposed. An NF learns an <b>invertible</b> function $f$ (which is also a neural network) to convert a source distribution, such as a Gaussian, to the distribution of the training data. Since $f$ is invertible, we can <i> precisely </i> compute the likelihood through the change-of-variable formula! <a href="http://akosiorek.github.io/ml/2018/04/03/norm_flows.html">This post</a> includes the detailed math of the computation. Different from the decoder in VAEs and the generator in GANs (which usually transform a lower dimensional latent variable to the data distribution), the NF $f$ keeps the data dimension and $f^{-1}$ can map a sample back to the source distribution.</p>
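<p>As a toy illustration of the change-of-variable computation (with a one-dimensional affine flow rather than a neural network), the exact log-likelihood of $x = f(z) = az + b$ with $z \sim \mathcal{N}(0,1)$ is $\log q(f^{-1}(x)) - \log|a|$, and it matches the $\mathcal{N}(b, a^2)$ density it must equal:</p>

```python
import numpy as np

def affine_flow_logpdf(x, a, b):
    """Exact log-density of x = f(z) = a*z + b with z ~ N(0, 1), via the
    change-of-variable formula: log p(x) = log q(f^{-1}(x)) - log|a|."""
    z = (x - b) / a                                    # f^{-1}(x)
    log_q = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)    # standard normal
    return log_q - np.log(np.abs(a))                   # |det Jacobian| = |a|

def normal_logpdf(x, mu, sigma):
    """Direct N(mu, sigma^2) log-density, for comparison."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-5.0, 5.0, 101)
flow_vals = affine_flow_logpdf(xs, a=2.0, b=1.0)
direct_vals = normal_logpdf(xs, mu=1.0, sigma=2.0)
```

<p>The two arrays agree everywhere; for a GAN or VAE decoder, which is not invertible, no such exact computation is available.</p>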
<p>Second, we want a theoretical guarantee that these deep generative models are <i> potentially </i> able to learn an arbitrarily complicated data distribution. Without such theory, an <i> empirically </i> successful generative model might fail in another scenario, and we don’t want this risk to always exist! Despite its importance, this problem is super challenging due to the complicated structure of neural networks. For example, <a href="https://papers.nips.cc/paper/2018/file/9bd5ee6fe55aaeb673025dbcb8f939c1-Paper.pdf">this paper</a> analyzes GANs in transforming between very simple distributions.</p>
<p>This blog addresses the above two points by presenting a theoretical analysis of NFs. We provide a theoretical guarantee for NFs on $\mathbb{R}$ and some negative (impossibility) results for NFs on $\mathbb{R}^d$ where the dimension $d>1$.</p>
<h3 id="structure-of-normalizing-flows">Structure of Normalizing Flows</h3>
<p>In general, to model complex training data like images, the normalizing flow $f$ needs to be a very complicated function. In practice, $f$ is usually constructed via a sequence of simple, invertible transformations, which we call base flow layers. The figure below illustrates the middle stages within the transformation from a simple source distribution to a complicated target distribution (figure from <a href="https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html">this link</a>).</p>
<p style="text-align: center;"><img src="/assets/2020-11-16-nf/nf_model.png" width="80%" /></p>
<p>Examples of base flow layers include</p>
<ul>
<li>
<p><a href="https://arxiv.org/abs/1908.09257">planar layers</a>: $f_{\text{pf}}(z)=z+uh(w^{\top}z+b)$, where $u,w,z\in\mathbb{R}^d,b\in\mathbb{R}$;</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/1908.09257">radial layers</a>: $f_{\text{rf}}(z)=z+\frac{\beta}{\alpha+\|z-z_0\|}(z-z_0)$, where $z,z_0\in\mathbb{R}^d,\alpha,\beta\in\mathbb{R}$;</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/1803.05649">Sylvester layers</a>: $f_{\text{syl}}(z)=z+Ah(B^{\top}z+b)$, where $A,B\in\mathbb{R}^{d\times m}, z\in\mathbb{R}^d, b\in\mathbb{R}^m$;</p>
</li>
<li>
<p>and <a href="https://arxiv.org/abs/1611.09630">Householder layers</a>: $f_{\text{hh}}(z)=z-2vv^{\top}z$, where $v,z\in\mathbb{R}^d, v^{\top}v=1$.</p>
</li>
</ul>
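<p>Householder layers are a nice concrete case: since $f_{\text{hh}}$ is a reflection, it is its own inverse and preserves volume ($|\det \text{Jacobian}| = 1$). A small numpy sketch:</p>

```python
import numpy as np

def householder(z, v):
    """Householder layer f(z) = z - 2 v v^T z with v^T v = 1: a reflection
    across the hyperplane orthogonal to v, hence an involution."""
    v = v / np.linalg.norm(v)     # enforce the unit-norm constraint
    return z - 2.0 * v * (v @ z)

rng = np.random.default_rng(1)
v = rng.normal(size=4)
z = rng.normal(size=4)
reflected = householder(z, v)     # applying the layer twice recovers z
```

<p>Stacking such layers composes reflections, which is one reason a large number of them is needed in practice.</p>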
<p>The number of layers is usually very large in practice. For instance, in the MNIST dataset experiments, <a href="https://arxiv.org/abs/1908.09257">this paper</a> uses 80 planar layers, and <a href="https://arxiv.org/abs/1803.05649">this paper</a> uses 16 Sylvester layers.</p>
<h3 id="defining-the-expressivity-of-normalizing-flows">Defining the Expressivity of Normalizing Flows</h3>
<p>The invertibility of NFs may hugely restrict their expressive power, but to what extent? Our <a href="http://proceedings.mlr.press/v108/kong20a/kong20a.pdf">recent paper</a> analyzes this through the following two questions:</p>
<ul>
<li>
<p><b>Q</b>1 (Exact transformation): Under what conditions is it possible to <b>exactly</b> transform the source distribution $q$ (e.g., a standard Gaussian) into the target distribution $p$ with a finite number of base flow layers?</p>
</li>
<li>
<p><b>Q</b>2 (Approximation): Since sometimes exact transformation may be hard, when is it possible to <b>approximate</b> the target distribution $p$ in <a href="https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures">total variation distance</a>? Do we need an incredibly large number of layers?</p>
</li>
</ul>
<p>Our findings:</p>
<ul>
<li>
<p>If $p$ and $q$ are defined on $\mathbb{R}$, then universal approximation can be achieved. That is, we can always transform $q$ to be arbitrarily close to any $p$.</p>
</li>
<li>
<p>If $p$ and $q$ are defined on $\mathbb{R}^d$ where $d>1$, both exact transformation and approximation may be hard. Having a large number of layers is a necessary (but not a sufficient) condition.</p>
</li>
</ul>
<h3 id="challenges">Challenges</h3>
<p>Our problem is very related to the universal approximation property: the ability of a function class to be arbitrarily close to any target function. Although we have this property for <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.7873&rep=rep1&type=pdf">shallow neural networks</a>, <a href="https://arxiv.org/abs/1709.02540">fully connected networks</a>, and <a href="https://arxiv.org/abs/1806.10909">residual networks</a>, these results do not apply to NFs. Why? Because of the <b>invertibility</b>.</p>
<ul>
<li>
<p>First, the fact that a function class has the universal approximation property does <b>not</b> imply that its invertible subset can approximate between any pair of distributions. For instance, take the set of piecewise constant functions. Its invertible subset is the empty set!</p>
</li>
<li>
<p>On the other hand, the fact that a function class has limited capacity does <b>not</b> imply that its invertible subset <b>cannot</b> transform between any pair of distributions. For instance, take the set of triangular maps, which can perform powerful Knothe–Rosenblatt rearrangements (see page 17 of <a href="https://ljk.imag.fr/membres/Emmanuel.Maitre/lib/exe/fetch.php?media=b07.stflour.pdf">this book</a>).</p>
</li>
</ul>
<p><b>The way to get around this challenge:</b> instead of looking at the capacity of a function class in the function space, we directly analyze input–output distribution pairs.</p>
<h3 id="universal-approximation-when-d1">Universal Approximation When $d=1$</h3>
<p>As a warm-up, let us look at the one-dimensional case. We show that planar layers can approximate between arbitrary pairs of distributions under mild assumptions. We analyze a specific kind of planar layer with the ReLU activation:
\[f_{\text{pf}}(z)=z+u\ \mathrm{ReLU}(wz+b)\]
where $u,w,b,z\in\mathbb{R}$, and $\text{ReLU}(x)=\max(x,0)$. The effect of this transformation on a density is first splitting its graph into two pieces, and then scaling one piece while keeping the other one unchanged. For example, in the figure below the first planar layer splits the blue line into the solid part and the dashed part, and scales the dashed part to the orange line. Similarly, the second planar layer splits the orange line into the solid part and the dashed part, and scales the dashed part to the green line.</p>
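<p>A minimal sketch of this 1-D layer: with $w > 0$ and $1 + uw > 0$ it is strictly increasing (slope $1$ on the inactive piece, $1 + uw$ on the active one), so it is invertible in closed form. The parameter values below are arbitrary.</p>

```python
import numpy as np

def relu_planar(z, u, w, b):
    """1-D ReLU planar layer f(z) = z + u * max(w*z + b, 0)."""
    return z + u * np.maximum(w * z + b, 0.0)

def relu_planar_inv(x, u, w, b):
    """Closed-form inverse, assuming w > 0 and 1 + u*w > 0. The split
    point is fixed by f(-b/w) = -b/w; on the active piece,
    x = (1 + u*w) * z + u*b."""
    split = -b / w
    return np.where(x <= split, x, (x - u * b) / (1.0 + u * w))

u, w, b = 0.5, 1.0, 0.3       # 1 + u*w = 1.5 > 0, so f is invertible
z = np.linspace(-3.0, 3.0, 201)
roundtrip = relu_planar_inv(relu_planar(z, u, w, b), u, w, b)
```

<p>The round trip recovers $z$ exactly, which is what makes the exact likelihood computation possible.</p>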
<p style="text-align: center;"><img src="/assets/2020-11-16-nf/tail_consistent_pwg.png" width="60%" /></p>
<p>In particular, if the blue line is a Gaussian, then the orange line and the green line are also pieces of Gaussian distributions. We call this a piecewise Gaussian distribution. Additionally, it has the consistency property: the transformed density must still integrate to 1.</p>
<p>How does this relate to approximation? Here we use a fundamental result in real analysis: <a href="https://en.wikipedia.org/wiki/Lebesgue_integration">Lebesgue-integrable functions</a> can be approximated by piecewise constant functions. Given a piecewise constant distribution $q_{\text{pwc}}$ that is close to the target distribution $p$, we can iteratively construct a piecewise Gaussian distribution $q_{\text{pwg}}$ with the same group of pieces. We can additionally require $q_{\text{pwg}}$ to be very close to $q_{\text{pwc}}$ by carefully selecting the parameters $u,w,b$. Finally, as the pieces become smaller, $q_{\text{pwc}}\rightarrow p$ and $q_{\text{pwg}}\rightarrow q_{\text{pwc}}$, which implies $q_{\text{pwg}}\rightarrow p$.</p>
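<p>The first step of this argument can be seen numerically. Below is a hypothetical sketch (the target density, interval, and bin counts are all arbitrary choices) that approximates a standard normal density by a piecewise-constant function with $n$ equal-width pieces and watches the $L_1$ error shrink as the pieces get smaller.</p>

```python
import numpy as np

def pwc_l1_error(n_pieces, lo=-5.0, hi=5.0, n_grid=100001):
    """L1 error of a piecewise-constant approximation (bin-wise average)
    to the standard normal density on [lo, hi]."""
    x = np.linspace(lo, hi, n_grid)
    dx = x[1] - x[0]
    p = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    edges = np.linspace(lo, hi, n_pieces + 1)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_pieces - 1)
    piece_vals = np.array([p[idx == k].mean() for k in range(n_pieces)])
    return float(np.sum(np.abs(p - piece_vals[idx])) * dx)

errors = [pwc_l1_error(n) for n in (5, 20, 100)]
```

<p>The errors decrease monotonically as the number of pieces grows, mirroring $q_{\text{pwc}}\rightarrow p$.</p>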
<p>In the following example, we demonstrate such approximation with 50 (top) and 300 (bottom) ReLU planar layers, respectively.</p>
<p style="text-align: center;"><img src="/assets/2020-11-16-nf/1d_ReLU_50.png" width="60%" />
<img src="/assets/2020-11-16-nf/1d_ReLU_300.png" width="60%" /></p>
<h3 id="exact-transformation-when-d1">Exact Transformation When $d>1$</h3>
<p>Next, we look at the more general case in higher-dimensional space, which is usually quite different from the one-dimensional case. We show that exact transformation between distributions can be quite hard. Specifically, we analyze Sylvester layers, a matrix-form generalization of planar layers (note that on $\mathbb{R}$, planar layers and Sylvester layers are equivalent):
\[f_{\text{syl}}(z)=z+Ah(B^{\top}z+b)\]
where $A,B\in\mathbb{R}^{d\times m},z\in\mathbb{R}^d,b\in\mathbb{R}^m$ for some integer $m$. In particular, we call $m$ the number of neurons of $f_{\text{syl}}$ because its form is identical to a residual block with $m$ neurons in the hidden layer.</p>
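<p>A hedged numpy sketch of a single Sylvester layer as defined above (dimensions, parameter scales, and the $\tanh$ non-linearity are illustrative choices):</p>

```python
import numpy as np

def sylvester(z, A, B, b, h=np.tanh):
    """Sylvester layer f(z) = z + A h(B^T z + b): a residual block whose
    hidden layer has m = A.shape[1] neurons."""
    return z + A @ h(B.T @ z + b)

rng = np.random.default_rng(2)
d, m = 5, 2
A = 0.1 * rng.normal(size=(d, m))
B = rng.normal(size=(d, m))
b = rng.normal(size=m)
z = rng.normal(size=d)
out = sylvester(z, A, B, b)   # a d-dimensional output
```

<p>With $A = 0$ the layer is the identity, and with $d = m = 1$ it reduces to a planar layer.</p>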
<p>Now suppose we stack a number of Sylvester layers with $M$ neurons in total, and these layers sequentially transform an input distribution $q$ to output distribution $p$. For convenience, let $f$ be the function composed of all these Sylvester layers. We show that the distribution pairs $(q,p)$ must obey some necessary (but not sufficient) condition, which we call the <b>topology matching</b> condition.</p>
<ul>
<li><b>$h$ is a smooth function</b></li>
</ul>
<p>Let $L(z)=\log p(f(z))-\log q(z)$ be the log-det Jacobian term. Then, the topology matching condition says that the dimension of the set of gradients of $L$ is no more than the total number of neurons. Formally,
\[\dim\{\nabla_z L(z):z\in\mathbb{R}^d\}\leq M\]
In other words, if $M$ is less than the above dimensionality then exact transformation is impossible no matter what smooth non-linearities $h$ are selected.
Since it is not easy to plot $\{\nabla_z L(z):z\in\mathbb{R}^d\}$, we demonstrate $L(z)$ in a few examples below. Each row is a group, containing plots of $q$, $p$, and $L$ from left to right. In these examples, $M=1$, so $\nabla_z L(z)$ is a multiple of a constant vector.</p>
<p style="text-align: center;"><img src="/assets/2020-11-16-nf/general_topo_1.png" width="60%" /><br /><br />
<img src="/assets/2020-11-16-nf/general_topo_2.png" width="60%" /><br /><br />
<img src="/assets/2020-11-16-nf/general_topo_3.png" width="60%" /><br /><br />
<img src="/assets/2020-11-16-nf/general_topo_4.png" width="60%" /><br /><br /></p>
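<p>A hypothetical numerical illustration of why one neuron forces a low-dimensional gradient set: if $L(z) = g(w^{\top}z)$ for a smooth $g$ (here $g = \tanh$, an arbitrary stand-in), then $\nabla_z L(z) = g'(w^{\top}z)\,w$, so every gradient is a multiple of the single vector $w$ and a matrix of sampled gradients has rank 1.</p>

```python
import numpy as np

def grad_L(z, w):
    """Gradient of the illustrative L(z) = tanh(w^T z):
    always a scalar multiple of w."""
    return (1.0 - np.tanh(w @ z) ** 2) * w

rng = np.random.default_rng(3)
d = 6
w = rng.normal(size=d)
grads = np.stack([grad_L(rng.normal(size=d), w) for _ in range(200)])
rank = int(np.linalg.matrix_rank(grads, tol=1e-8))
```

<p>If the target pair $(q, p)$ forces $\{\nabla_z L(z)\}$ to have dimension above $M$, no such flow can realize the exact transformation.</p>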
<p>Based on the topology matching condition, it can be shown that if the number of neurons $M$ is less than the dimension $d$, it may even be hard to transform between simple Gaussian distributions.</p>
<ul>
<li><b>When $h=\text{ReLU}$</b></li>
</ul>
<p>We then restrict to ReLU Sylvester layers. In this case, $f$ in fact performs a piecewise linear transformation in $\mathbb{R}^d$. As a result, for almost every $z\in\mathbb{R}^d$ (except for boundary points), $f$ is linear around $z$. This leads to the following (pointwise) topology matching condition: there exists a constant matrix $C$ (which is the Jacobian matrix of $f(z)$) around $z$ such that
\[C^{\top}\nabla_z\log p(f(z))=\nabla_z\log q(z)\]</p>
<p>We demonstrate this result with two examples below, where each row is a $(q,p)$ distribution pair. The red points ($z$) on the left are transformed to those ($f(z)$) on the right by $f$. Notice that these red points are peaks of $q$ and $p$, respectively. In these cases, both $\nabla_z\log p(f(z))$ and $\nabla_z\log q(z)$ are zero vectors, which is compatible with the topology matching condition.</p>
<p style="text-align: center;"><img src="/assets/2020-11-16-nf/ReLU_topo_1.png" width="60%" /><br /><br />
<img src="/assets/2020-11-16-nf/ReLU_topo_2.png" width="60%" /><br /><br /></p>
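<p>The pointwise condition can be checked exactly when $f$ is globally linear (a sketch under that simplifying assumption, not a ReLU Sylvester flow). For $f(z) = Cz + \mu$ pushing $q = \mathcal{N}(0, I)$ forward to $p = \mathcal{N}(\mu, CC^{\top})$, the Jacobian is the constant matrix $C$, and $C^{\top}\nabla\log p(f(z)) = -C^{\top}(CC^{\top})^{-1}Cz = -z = \nabla\log q(z)$.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
C = rng.normal(size=(d, d)) + 3.0 * np.eye(d)   # an invertible Jacobian
mu = rng.normal(size=d)

def grad_log_q(z):
    """Score of the source q = N(0, I)."""
    return -z

def grad_log_p(x):
    """Score of the pushforward p = N(mu, C C^T) of q under f(z) = C z + mu."""
    return -np.linalg.solve(C @ C.T, x - mu)

z = rng.normal(size=d)
lhs = C.T @ grad_log_p(C @ z + mu)   # C^T grad log p(f(z))
rhs = grad_log_q(z)                  # grad log q(z)
```

<p>The two sides agree at every $z$, as the topology matching condition requires.</p>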
<p>As a corollary, we conclude that ReLU Sylvester layers generally do not transform between product distributions or mixtures of Gaussians, except in very special cases.</p>
<h3 id="approximation-capacity-when-d1">Approximation Capacity When $d>1$</h3>
<p>It is not surprising that exact transformation between distributions is difficult. What if we loosen our goal to approximation between distributions, where we can use transformations from a certain class $\mathcal{F}$? We show that unfortunately, this is still hard under certain conditions.</p>
<p>The way to look at this problem is to bound the minimum depth needed to approximate between $q$ and $p$. In other words, if we use fewer than this number of transformations, then it is impossible to approximate $p$ given $q$ as the source, no matter what transformations in $\mathcal{F}$ are selected. Formally, for $\epsilon>0$, we define the minimum depth as
\[T_{\epsilon}(p,q,\mathcal{F})=\inf\{n: \exists \{f_i\}_{i=1}^n\in\mathcal{F}\text{ such that }\mathrm{TV}((f_1\circ\cdots\circ f_n)(q),p)\leq\epsilon\}\]
where $\mathrm{TV}$ is the total variation distance.</p>
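<p>For one-dimensional densities, $\mathrm{TV}(p, q) = \frac{1}{2}\int|p - q|$ can be estimated by grid integration; a small sketch (interval and grid size are arbitrary choices):</p>

```python
import numpy as np

def tv_distance_1d(pdf_p, pdf_q, lo=-20.0, hi=20.0, n=400001):
    """Total variation distance (1/2) * integral |p - q|, on a grid."""
    x = np.linspace(lo, hi, n)
    dx = x[1] - x[0]
    return float(0.5 * np.sum(np.abs(pdf_p(x) - pdf_q(x))) * dx)

def normal_pdf(mu, sigma):
    return lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (
        sigma * np.sqrt(2 * np.pi))

# TV(N(0,1), N(1,1)) = 2*Phi(1/2) - 1, approximately 0.3829.
tv = tv_distance_1d(normal_pdf(0.0, 1.0), normal_pdf(1.0, 1.0))
```

<p>Identical densities give a distance of 0, and essentially disjoint ones give a distance near 1 — the scale on which the $\epsilon$ in the minimum-depth definition lives.</p>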
<p>We conclude that if $\mathcal{F}$ is the set of $(i)$ planar layers $f_{\text{pf}}$ with bounded parameters and popular non-linearities including $\tanh$, sigmoid, and $\arctan$, or $(ii)$ all Householder layers $f_{\text{hh}}$, then $T_{\epsilon}(p,q,\mathcal{F})$ is not small. In detail, for any $\kappa>0$, there exists a pair of distributions $(q,p)$ on $\mathbb{R}^d$ and a constant $\epsilon$ (e.g., 0.5) such that
\[T_{\epsilon}(p,q,\mathcal{F})=\tilde{\Omega}(d^{\kappa})\]
Although this lower bound is polynomial in the dimension $d$, in many practical problems the dimension can be very large, so the minimum depth is still an enormous number. This result tells us that planar layers and Householder layers are provably not very expressive under certain conditions.</p>
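The planar layers discussed above are treated abstractly in the paper. As a concrete illustration, here is a minimal NumPy sketch of the standard planar transformation $f(z)=z+u\,h(w^\top z+b)$ (the form introduced by Rezende and Mohamed) with $h=\tanh$, together with its log-determinant $\log|1+h'(w^\top z+b)\,u^\top w|$; the function and parameter names below are ours, not the paper's.

```python
import numpy as np

def planar_layer(z, u, w, b, h=np.tanh, h_prime=lambda a: 1 - np.tanh(a) ** 2):
    """Apply one planar layer f(z) = z + u * h(w^T z + b) to a batch of points.

    Returns the transformed points and the log |det Jacobian| at each point,
    which for this rank-one update is log |1 + h'(w^T z + b) * u^T w|.
    """
    a = z @ w + b                      # (n,) pre-activations
    f_z = z + np.outer(h(a), u)        # (n, d) transformed points
    log_det = np.log(np.abs(1.0 + (u @ w) * h_prime(a)))
    return f_z, log_det

rng = np.random.default_rng(0)
d = 2
z = rng.normal(size=(5, d))            # samples from a source q = N(0, I)
u, w, b = rng.normal(size=d), rng.normal(size=d), 0.1
f_z, log_det = planar_layer(z, u, w, b)
print(f_z.shape, log_det.shape)        # (5, 2) (5,)
```

A deep planar flow is just a composition of such layers, which is exactly the setting in which the depth lower bound above applies.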
<h3 id="open-problems">Open Problems</h3>
<p>This is the end of <a href="http://proceedings.mlr.press/v108/kong20a/kong20a.pdf">our paper</a>, but is clearly just the beginning of the story. There are a large number of open problems on the expressive power of even simple normalizing flow transformations. Below are some potential directions.</p>
<ul>
<li>Just like neural networks, planar and Sylvester layers use non-linearities in their expressions. Is it possible that a certain combination of non-linearities (at different layers) can significantly improve capacity?</li>
<li>Our paper does not provide a result for very deep Sylvester flows (e.g., $>d$ layers) with smooth non-linearities. It would therefore be interesting to provide some insights into deep Sylvester flows.</li>
<li>A more general problem is to understand whether the universal approximation property holds for a certain class of normalizing flows when converting between distributions. The result is meaningful even if we assume the depth can be arbitrarily large.</li>
<li>On the other hand, it is also helpful to analyze what these normalizing flows are good at. A good example is to show that they can easily transform between distributions in a certain class, especially by an elegant construction.</li>
</ul>
<h3 id="more-details">More Details</h3>
<p>See <a href="http://proceedings.mlr.press/v108/kong20a/kong20a.pdf">our paper</a> or <a href="https://arxiv.org/abs/2006.00392">the full paper on arxiv</a>.</p><a href='https://cseweb.ucsd.edu/~z4kong'>Zhifeng Kong</a> and <a href='http://cseweb.ucsd.edu/~kamalika'>Kamalika Chaudhuri</a>Normalizing flows have received a great deal of recent attention as they allow flexible generative modeling as well as easy likelihood computation. However, there is little formal understanding of their representation power. In this work, we study some basic normalizing flows and show that (1) they may be highly expressive in one dimension, and (2) in higher dimensions their representation power may be limited.Explainable 2-means Clustering: Five Lines Proof2020-10-16T18:00:00+00:002020-10-16T18:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/10/16/explain_2_means<p><strong>TL;DR:</strong> we will show <em>why</em> only one feature is enough to define a good $2$-means clustering. And we will do it using only 5 inequalities (!)
In a <a href="explain_k_means.html">previous post</a>, we explained what an explainable clustering is.</p>
<h3 id="explainable-clustering">Explainable clustering</h3>
<p>In a <a href="explain_k_means.html">previous post</a>, we discussed why explainability is important, defined it as a small decision tree, and suggested an algorithm to find such a clustering. But why is the resulting clustering any good? We measure “good” by the <a href="https://en.wikipedia.org/wiki/K-means_clustering">$k$-means cost</a>. The cost of a clustering $C$ is defined as the sum of squared Euclidean distances of each point $x$ to its center $c(x)$. Formally,
\begin{equation}
cost(C)=\sum_x \|x-c(x)\|^2,
\end{equation} where the sum is over all points $x$ in the dataset.</p>
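To make this cost concrete, here is a minimal NumPy sketch that evaluates the $2$-means cost of a given assignment of points to two clusters; the function and array names are our own.

```python
import numpy as np

def two_means_cost(X, labels):
    """2-means cost: sum of squared Euclidean distances of each point
    to the mean of its own cluster."""
    cost = 0.0
    for c in (0, 1):
        cluster = X[labels == c]
        if len(cluster) > 0:
            cost += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return cost

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
print(two_means_cost(X, np.array([0, 0, 1, 1])))  # 1.0: each cluster pays 2 * 0.5^2
print(two_means_cost(X, np.array([0, 1, 0, 1])))  # 100.0: mixing the two groups is much worse
```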
<p>In this post, we focus on the $2$-means problem, where there are only two clusters. We want to show that for every dataset there is <strong>one</strong> feature $i$ and <strong>one</strong> threshold $\theta$ such that the following simple clustering $C^{i,\theta}=(C^{i,\theta}_1,C^{i,\theta}_2)$ has a low cost:
\begin{equation}
\text{if } x_i\leq\theta \text{ then } x\in C^{i,\theta}_1 \text{ else } x\in C^{i,\theta}_2.
\end{equation}
We call such a clustering a <em>threshold cut</em>. There might be many threshold cuts that are good, bad, or somewhere in between. We want to show that there is at least one that is good (i.e., low cost). In the <a href="https://arxiv.org/abs/2002.12538">paper</a>, we prove that there is always a threshold cut, $C^{i,\theta}$, that is almost as good as the optimal clustering:
\begin{equation}
cost(C^{i,\theta})\leq4\cdot cost(opt),
\end{equation}
where $cost(opt)$ is the cost of the optimal 2-means clustering. This means that there is a simple explainable clustering $C^{i,\theta}$ that is only $4$ times worse than the optimal one. This factor is independent of the dimension and the number of points. Sounds crazy, right? Let’s see how we can prove it!</p>
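On a concrete dataset, the best threshold cut can be found by brute force: try every feature and every threshold between consecutive data values, and keep the cheapest cut. The sketch below is our own illustration of that search, not the paper's algorithm.

```python
import numpy as np

def best_threshold_cut(X):
    """Search all features i and thresholds theta (midpoints between
    consecutive data values) for the threshold cut of minimum 2-means cost."""
    n, d = X.shape
    best = (np.inf, None, None)
    for i in range(d):
        vals = np.unique(X[:, i])
        for theta in (vals[:-1] + vals[1:]) / 2:   # both sides are always nonempty
            labels = (X[:, i] > theta).astype(int)
            cost = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                       for c in (0, 1))
            if cost < best[0]:
                best = (cost, i, theta)
    return best   # (cost, feature, threshold)

X = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 0.0], [11.0, 1.0]])
cost, i, theta = best_threshold_cut(X)
print(cost, i, theta)   # the cheapest cut splits feature 0 between the two groups
```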
<h3 id="the-minimal-mistakes-threshold-cut">The minimal-mistakes threshold cut</h3>
<p>We want to compare two clusterings: the optimal clustering and the best threshold cut. The best threshold cut is hard to analyze, so we introduce an intermediate clustering: <em>the minimal-mistakes threshold cut</em>, $\widehat{C}$. Even though this clustering will not be the best threshold cut, it will be good enough. In the paper we prove that $cost(\widehat{C})$ is at most $4cost(opt)$. For simplicity, in this post, we will show a slightly worse bound of $11cost(opt)$ instead of $4cost(opt)$.</p>
<!--Let's define what the minimal-mistakes cut is. -->
<p>We define the number of mistakes of a threshold cut $C^{i,\theta}$ as the number of points $x$ that are not in the same cluster as their optimal center $c(x)$ in $C^{i,\theta}$, i.e., the number of points $x$ such that<br />
\begin{equation}
sign(\theta-x_i) \neq sign(\theta-c(x)_i).
\end{equation}
The <em>minimal-mistakes clustering</em> is the threshold cut that has the minimal number of mistakes. Take a look at the next figure for an example.</p>
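Counting mistakes is easy to mechanize. Below is a small NumPy sketch that counts the mistakes of a given threshold cut, assuming we already know each point's optimal center; all names here are our own.

```python
import numpy as np

def num_mistakes(X, centers, assign, i, theta):
    """Count points x that land on the opposite side of the cut
    (feature i, threshold theta) from their optimal center c(x)."""
    point_side = X[:, i] <= theta
    center_side = centers[assign, i] <= theta
    return int((point_side != center_side).sum())

X = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 0.0], [11.0, 1.0]])
centers = np.array([[0.5, 0.5], [10.5, 0.5]])   # optimal 2-means centers
assign = np.array([0, 0, 1, 1])                 # index of c(x) for each point
print(num_mistakes(X, centers, assign, i=0, theta=5.5))  # 0: the cut separates the clusters cleanly
print(num_mistakes(X, centers, assign, i=0, theta=0.5))  # 1: point (1, 1) is cut off from its center
```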
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_2_means/mistakes_example.png" width="30%" style="margin: 0 auto" />
<figcaption>
Two optimal clusters are in red and blue. Centers are the stars. Split (in yellow) with one mistake. This is a minimal-mistakes threshold cut, as any threshold cut has at least $1$ mistake.
</figcaption>
</figure>
<h3 id="playing-with-cost-warm-up">Playing with cost: warm-up</h3>
<p>Before we present the proof, let’s familiarize ourselves with the $k$-means cost and explore several of its properties. It will be helpful later on!</p>
<h4 id="changing-centers">Changing centers</h4>
<p>If we change the centers of a clustering from their means (which are their optimal centers) to different centers $c=(c_1, c_2)$, then the cost can only increase. Putting this into math, denote by $cost(C,c)$ the cost of clustering $C=(C_1,C_2)$ when $c_1$ is the center of cluster $C_1$ and $c_2$ is the center of cluster $C_2$, then</p>
<p>\begin{align}
cost(C) &= \sum_{x\in C_1} \|x-mean(C_1)\|^2 + \sum_{x\in C_2} \|x-mean(C_2)\|^2 \newline &\leq \sum_{x\in C_1} \|x-c_1\|^2 + \sum_{x\in C_2} \|x-c_2\|^2 = cost(C,c).
\end{align}
What if we further want to change the centers from some arbitrary centers $(c_1, c_2)$ to other arbitrary centers $(m_1, m_2)$? How does the cost change? Can we bound it? To our rescue comes the (almost) triangle inequality, which states that for any two vectors $x,y$:
\begin{equation}
\|x+y\|^2 \leq 2\|x\|^2+2\|y\|^2.
\end{equation}
This implies that the cost of changing the centers from $c=(c_1, c_2)$ to $m=(m_1, m_2)$ is bounded by
\begin{equation}
cost(C,c)\leq 2cost(C,m)+2|C_1|\|c_1-m_1\|^2+2|C_2|\|c_2-m_2\|^2.
\end{equation}</p>
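This bound holds for every choice of centers, so it can be sanity-checked numerically. Here is a small NumPy check on random clusters (an illustration, not a proof; the data and names are invented).

```python
import numpy as np

rng = np.random.default_rng(1)

def cost_with_centers(C1, C2, c1, c2):
    """Cost of the clustering (C1, C2) when c1, c2 serve as the centers."""
    return ((C1 - c1) ** 2).sum() + ((C2 - c2) ** 2).sum()

# random clusters and two arbitrary pairs of centers
C1, C2 = rng.normal(size=(20, 3)), rng.normal(size=(30, 3)) + 4.0
c = rng.normal(size=(2, 3))
m = rng.normal(size=(2, 3))

lhs = cost_with_centers(C1, C2, c[0], c[1])
rhs = (2 * cost_with_centers(C1, C2, m[0], m[1])
       + 2 * len(C1) * ((c[0] - m[0]) ** 2).sum()
       + 2 * len(C2) * ((c[1] - m[1]) ** 2).sum())
assert lhs <= rhs   # guaranteed by the almost-triangle inequality
print(lhs, rhs)
```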
<h4 id="decomposing-the-cost">Decomposing the cost</h4>
<p>The cost can be easily decomposed with respect to the data points and the features. Let’s start with the data points. For any partition of the points in $C$ into $S_1$ and $S_2$, the cost can be rewritten as
\begin{equation}
cost(C,c)=cost(C \cap S_1,c)+cost(C \cap S_2,c).
\end{equation}
The cost can also be decomposed with respect to the features, because we are using the squared Euclidean distance. To be more specific, the cost incurred by the $i$-th feature is $cost_i(C,c)=\sum_{x}(x_i-c(x)_i)^2,$ and the total cost is equal to
\begin{equation}
cost(C,c)=\sum_i cost_i(C,c).
\end{equation}
If the last equation is unclear, just recall the definition of the cost ($c(x)$ is the center of a point $x$):
\begin{equation}
cost(C,c)=\sum_{x}\|x-c(x)\|^2=\sum_i\sum_{x}(x_i-c(x)_i)^2=\sum_icost_i(C,c).
\end{equation}</p>
<h3 id="the-5-line-proof">The 5-line proof</h3>
<p>Now we are ready to prove that $\widehat{C}$ is only a constant factor worse than the optimal $2$-means clustering:
\begin{equation}
cost(\widehat{C})\leq 11\cdot cost(opt).
\end{equation}</p>
<p>To prove that the minimal-mistakes threshold cut $\widehat{C}$ gives a low-cost clustering, we will do something that might look strange at first: we analyze the quality of the clustering $\widehat{C}$ with the centers of the optimal clustering, not with the optimal centers for $\widehat{C}$ itself. This step can only increase the cost, so why do it? Because it eases our analysis, and if there are not many mistakes, then the centers do not change much (as in the previous figure), so the increase is small. So, here comes the first step: change the centers of $\widehat{C}$ to the optimal centers $c^*=(mean(C^*_1),mean(C^*_2))$. Recall from the warm-up that this can only increase the cost:
\begin{equation}
cost(\widehat{C})\leq cost(\widehat{C},c^{*}) \quad (1)
\end{equation}
Next we use one of the decomposition properties of the cost. We partition the dataset into the set of points that are correctly labeled, $X^{cor}$, and those that are not, $X^{wro}$.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_2_means/mistakes_example_wrong.png" width="30%" style="margin: 0 auto" />
<figcaption>
The same dataset and split as before. Point with a grey circle is in the wrong cluster and is the only member in $X^{wro}$. All other points have the same cluster assignment as the optimal clustering and are in $X^{cor}$.
</figcaption>
</figure>
<p>Thus, we can rewrite the last term as
\begin{equation}
cost(\widehat{C},c^{*})=cost(\widehat{C}\cap X^{cor},c^{*})+cost(\widehat{C}\cap X^{wro},c^{*}) \quad (2)
\end{equation}</p>
<p>Let’s look at this sum. The first term contains all the points that have their correct center in $c^*$ (which is either $mean(C^*_1)$ or $mean(C^*_2)$). Hence, the first term in (2) is easy to bound: it’s at most $cost(opt)$. So from now on, we focus on the second term.</p>
<p>In the second term, all points are in $X^{wro}$, which means they were assigned to the incorrect optimal center. So let’s change the centers once more, so that $X^{wro}$ will have the correct centers. The correct centers of $X^{wro}$ are the same centers $c^*$, but the order is reversed, i.e., all points assigned to center $mean(C^*_1)$ are now assigned to $mean(C^*_2)$ and vice versa. Using the “changing centers” property of the cost we discussed earlier, we have <!--, the second term in (2) is at most--></p>
<p>\begin{equation}
cost(\widehat{C},c^{*}) \leq 3cost(opt)+2|X^{wro}|\cdot\|c^{*}_1-c^{*}_2\|^2 \quad (3)
\end{equation}</p>
<p>Now we’ve reached the main step in the proof. We show that the second term in (3) is bounded by $8cost(opt)$. We first decompose $cost(opt)$ using the features. Then, all we need to show is that:</p>
<p>\begin{equation}
cost_i(opt)\geq\left(\frac{|c^{*}_{1,i}-c^{*}_{2,i}|}{2}\right)^2|X^{wro}| \quad (4)
\end{equation}</p>
<p>The trick is, for each feature, to focus on the threshold cut defined by the middle point between the two optimal centers. Since $\widehat{C}$ is the minimal-mistakes clustering, we know that every threshold cut makes at least $|X^{wro}|$ mistakes. Each mistake contributes at least the square of half the distance between the two centers in that feature.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_2_means/IMM_blog_pic_4.png" width="30%" style="margin: 0 auto" />
<figcaption>
Proving step (4). Projecting to feature $i$. Points in blue belong to the first cluster, and points in red to the second. We focus on the cut at the mid-point between the two optimal centers.
</figcaption>
</figure>
<p>This figure shows how to prove step (4). We see that there is $1$ mistake, which is the minimum possible. This means that even the optimal clustering must pay at least the square of half the distance between the centers for each of these mistakes. This gives us a lower bound on $cost_i(opt)$ in this feature. Then we can sum over all the features to see that the second term of (3) is at most $8cost(opt)$, which is what we wanted. <!--Since the whole expression in (3) is at most $10cost(opt)$, and we lose another $cost(opt)$ from the first term of (2), we can put these together to get-->
<!--Summing everything together we achieve our goal:-->
Putting everything together, we get exactly what we wanted to prove in this post:
\begin{equation}
cost(\widehat{C})\leq 11\cdot cost(opt) \quad (5)
\end{equation}
<!--That's it!--></p>
<h3 id="epilogue-improvements">Epilogue: improvements</h3>
<p>The bound that we got, $11$, is not the best possible. With more tricks we can get a bound of $4$. One of them is using Hall’s theorem. Similar ideas provide a $2$-approximation to the optimal $2$-medians clustering as well.
To complement our upper bounds, we also prove lower bounds showing that any threshold cut must incur an approximation factor of almost $3$ for $2$-means and almost $2$ for $2$-medians. You can read all about it in our <a href="https://proceedings.icml.cc/paper/2020/file/8e489b4966fe8f703b5be647f1cbae63-Paper.pdf">paper</a>.</p><a href='https://sites.google.com/view/michal-moshkovitz'>Michal Moshkovitz</a>, <a href='mailto:navefrost@mail.tau.ac.il'>Nave Frost</a>, <a href='https://sites.google.com/site/cyrusrashtchian/'>Cyrus Rashtchian</a>In a previous post, we discussed tree-based clustering and how to develop explainable clustering algorithms with provable guarantees. Now we will show why only one feature is enough to define a good 2-means clustering. And we will do it using only 5 inequalities (!)Explainable k-means Clustering2020-10-16T17:00:00+00:002020-10-16T17:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/10/16/explain_k_means<p><strong>TL;DR:</strong>
Explainable AI has gained a lot of interest in the last few years, but effective methods for unsupervised learning are scarce. And the rare methods that do exist do not have provable guarantees. We present a new algorithm for explainable clustering that is provably good for $k$-means clustering — the Iterative Mistake Minimization (IMM) algorithm. Specifically, we want to build a clustering defined by a small decision tree. Overall, this post summarizes our new paper: <a href="https://arxiv.org/pdf/2002.12538.pdf">Explainable $k$-Means and $k$-Medians clustering</a>.</p>
<h3 id="explainability-why">Explainability: why?</h3>
<p>Machine learning models are mostly “black boxes”. They give good results, but their reasoning is unclear. These days, machine learning is entering fields like healthcare (e.g., for a better understanding of <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6543980/#:~:text=In%20the%20medical%20field%2C%20clustering,in%20labeled%20and%20unlabeled%20datasets.&text=The%20aim%20is%20to%20provide,AD%20based%20on%20their%20similarity.">Alzheimer’s Disease</a> and <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118453#sec013">Breast Cancer</a>), transportation, and law. In these fields, quality is not the only objective. No matter how well a computer makes its predictions, we cannot imagine blindly following a computer’s suggestion. Can you imagine blindly medicating or performing surgery on a patient just because a computer said so? Instead, it would be much better to provide insight into which parts of the data the algorithm used to make its prediction.</p>
<h3 id="tree-based-explainable-clustering">Tree-based explainable clustering</h3>
<!--Despite the popularity of explainability, there is limited work in unsupervised learning. To remedy it, -->
<p>We study a prominent problem in unsupervised learning, $k$-means clustering. We are given a dataset, and the goal is to partition it into $k$ clusters such that the <a href="https://en.wikipedia.org/wiki/K-means_clustering">$k$-means cost</a> is minimal. The cost of a clustering $C=(C^1,\ldots,C^k)$ is the sum of squared distances of all points from their optimal centers, $mean(C^i)$:</p>
<p>\[cost(C)=\sum_{i=1}^k\sum_{x\in C^i} \lVert x-mean(C^i)\rVert ^2.\]</p>
<p>For any cluster, $C^i$, one possible explanation of this cluster is $mean(C^i)$. In a low-cost clustering, the center is close to its points, and they are close to each other. For example, see the next figure.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_k_means/intro_IMM_blog_pic_1.png" width="40%" style="margin: 0 auto" />
<figcaption>
Near optimal 5-means clustering
</figcaption>
</figure>
<p>Unfortunately, this explanation is not as useful as it could be. The centers themselves may depend on all the data points and all the features in a complicated way. We instead aim to develop a clustering method that is explainable by design. To explain why a point is in a cluster, we will only need to look at a small number of features, and we will just evaluate a threshold for each feature one by one. This allows us to extract information about which features cause a point to go to one cluster compared to another. This method also means that we can derive an explanation that does not depend on the centers.</p>
<p>More formally, at each step we test if $x_i\leq \theta$ or not, for some feature $i$ and threshold $\theta$. We call this test a <strong>split</strong>. According to the test’s result, we decide on the next step. In the end, the algorithm returns the cluster identity. This procedure is exactly a decision tree where the leaves correspond to clusters.</p>
<p>Importantly, for the tree to be explainable it should be <strong>small</strong>. The smallest decision tree has $k$ leaves since each cluster must appear in at least one leaf. We call a clustering defined by a decision tree with $k$ leaves a <strong>tree-based explainable clustering</strong>. See the next tree for an illustration.</p>
<table style="margin: 0 auto; text-align: center;">
<tr>
<td> <img src="/assets/2020-10-16-explain_k_means/intro_IMM_blog_pic_2.png" width="40%" style="margin: 0 auto" /> </td>
<td> <img src="/assets/2020-10-16-explain_k_means/intro_IMM_blog_pic_3.png" width="40%" style="margin: 0 auto" /> </td>
</tr>
</table>
<!--
{:refdef: style="text-align: center;"}
<figure class="image">
<img src="/assets/2020-06-06/intro_IMM_blog_pic_2.png" width="40%" style="margin: 0 auto">
<figcaption>
Decision tree
</figcaption>
</figure>
{:refdef}
{:refdef: style="text-align: center;"}
<figure class="image">
<img src="/assets/2020-06-06/intro_IMM_blog_pic_3.png" width="40%" style="margin: 0 auto">
<figcaption>
Geometric representation of the decision tree
</figcaption>
</figure>
{:refdef}
-->
<p>On the left, we see a decision tree that defines a clustering with $5$ clusters. On the right, we see the geometric representation of this decision tree. We see that the decision tree imposes a partition to $5$ clusters aligned to the axis. The clustering looks close to the optimal clustering that we started with. Which is great. But can we do it for all datasets? How?</p>
<p>Several algorithms, like <a href="https://link.springer.com/chapter/10.1007/11362197_5">CLTree</a> and <a href="https://www.researchgate.net/profile/Ricardo_Fraiman/publication/47744381_Clustering_using_Unsupervised_Binary_Trees_CUBT/links/09e41508aeaf39a453000000/Clustering-using-Unsupervised-Binary-Trees-CUBT.pdf">CUBT</a>, try to find a tree-based explainable clustering, but we are the first to give formal guarantees. We first need to define the quality of an algorithm. Unsupervised learning problems are commonly <a href="http://cseweb.ucsd.edu/~dasgupta/papers/kmeans.pdf">NP-hard</a>, and clustering is no exception, so it is common to settle for an approximate solution. A bit more formally, an algorithm that returns a tree-based clustering $T$ is an <em>$a$-approximation</em> if $cost(T)\leq a\cdot cost(opt),$ where $opt$ is the clustering that minimizes the $k$-means cost.</p>
<h3 id="general-scheme">General scheme</h3>
<p>Many supervised learning algorithms learn a decision tree; can we use one of them here? Yes, after we transform the problem into a supervised learning problem! How, you might ask? We can use any clustering algorithm that will return a good, but not explainable, clustering. This will form the labeling. Next, we can use a supervised algorithm that learns a decision tree. Let’s summarize these three steps:</p>
<ol>
<li>Find a clustering using some clustering algorithm</li>
<li>Label each example according to its cluster</li>
<li>Call a supervised algorithm that learns a decision tree</li>
</ol>
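As a rough sketch of this three-step scheme, one could use scikit-learn's KMeans for step 1 and a generic CART learner (DecisionTreeClassifier) as a stand-in for step 3; the synthetic dataset and all parameters below are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# step 1: find a (non-explainable) reference clustering on synthetic data
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in [(-2, 0), (2, 0), (0, 3)]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # step 2: cluster labels

# step 3: fit a decision tree with k leaves on the induced classification problem
tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(X, labels)
agreement = (tree.predict(X) == labels).mean()
print(f"tree/k-means agreement: {agreement:.2f}")
```

On well-separated data this naive scheme works; the discussion that follows shows why an impurity-driven tree learner can nonetheless fail badly.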
<p>Which algorithm can we use in step 3? Maybe the popular ID3 algorithm?</p>
<h3 id="can-we-use-the-id3-algorithm">Can we use the ID3 algorithm?</h3>
<p>Short answer: no.</p>
<p>One might hope that in step 3, in the previous scheme, the known <a href="https://link.springer.com/content/pdf/10.1007/BF00116251.pdf">ID3</a> algorithm can be used (or one of its variants like <a href="https://link.springer.com/article/10.1007/BF00993309">C4.5</a>). We will show that this does not work. There are datasets where ID3 will perform poorly. Here is an example:</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_k_means/intro_IMM_blog_pic_4.png" width="40%" style="margin: 0 auto" />
<figcaption>
ID3 performs poorly on this dataset
</figcaption>
</figure>
<p>The dataset is composed of three clusters, as you can see in the figure above. Two large clusters (0 and 1 in the figure) have centers (-2, 0) and (2, 0), respectively, plus small noise. The third cluster (2 in the figure) is composed of only two points that are very, very (very) far away from clusters 0 and 1. Given these data, ID3 will prefer to maximize the information gain and split between clusters 0 and 1. Recall that the final tree has only three leaves. This means that in the final tree, one point in cluster 2 must be grouped with cluster 0 or cluster 1. Thus the cost is enormous.
To solve this problem, we design a new algorithm called <a href="https://proceedings.icml.cc/paper/2020/file/8e489b4966fe8f703b5be647f1cbae63-Paper.pdf"><em>Iterative Mistake Minimization (IMM)</em></a>.</p>
<h3 id="imm-algorithm-for-explainable-clustering">IMM algorithm for explainable clustering</h3>
<p>We learned that the ID3 algorithm cannot be used in step 3 at the general scheme. Before we give up on this scheme, can we use a different decision-tree algorithm? Well, since we wrote this post, you probably know the answer: there is such an algorithm, the IMM algorithm.</p>
<p>We build the tree greedily from top to bottom. At each step, we take the split (i.e., feature and threshold) that minimizes a new quantity called the number of <strong>mistakes</strong>. A point $x$ is a mistake for node $u$ if $x$ and its center $c(x)$ reach $u$ and are then separated by $u$’s split. See the next figure for an example of a split with one mistake.</p>
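Here is a quadratic-time sketch of this split-selection step: assign each point to its nearest center, then scan every feature and candidate threshold for the split with the fewest mistakes. This is our own illustration (the paper gets $O(n\log n)$ per feature with dynamic programming), and the names are ours.

```python
import numpy as np

def find_min_mistake_split(points, centers):
    """Scan every feature i and candidate threshold theta for the split with
    the fewest mistakes (points separated from their nearest center)."""
    # assign each point to its nearest center
    assign = np.argmin(((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    best = (np.inf, None, None)
    for i in range(points.shape[1]):
        lo, hi = centers[:, i].min(), centers[:, i].max()
        for theta in np.unique(points[:, i]):
            if not (lo <= theta < hi):    # the split must separate at least two centers
                continue
            mistakes = int(((points[:, i] <= theta) !=
                            (centers[assign, i] <= theta)).sum())
            if mistakes < best[0]:
                best = (mistakes, i, theta)
    return best   # (mistakes, feature, threshold)

points = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 0.0], [11.0, 1.0], [9.0, 5.0]])
centers = np.array([[0.5, 0.5], [10.5, 0.5]])
m, i, theta = find_min_mistake_split(points, centers)
print(m, i, theta)   # a zero-mistake split on feature 0 exists for this toy data
```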
<figure class="image" style="text-align: center;">
<img src="/assets/2020-10-16-explain_k_means/mistakes_example.png" width="40%" style="margin: 0 auto" />
<figcaption>
Split (in yellow) with one mistake. Two optimal clusters are in red and blue. Centers are the stars.
</figcaption>
</figure>
<!--For another example of the mistakes concept, let's go back to the previous dataset where ID3 failed. Focus on the first split again. The ID3 split has one mistake since one of the points in cluster $2$ will be separated from its center. On the other hand, the horizontal split has $0$ mistakes: the two large clusters will go with their centers to one side of the tree, and the small cluster will go with its center to the other side of the tree. -->
<p>To summarize, the high-level description of the IMM algorithm:
<!--<center>
<span style="font-family:Papyrus; font-size:2em;align-self: center;">As long as there is more than one center
<br> find the split with minimal number of mistakes</span>
</center>
--></p>
<center>
<span style="font-size:larger;">
As long as there is more than one center
<br /> find the split with minimal number of mistakes
</span>
</center>
<p> </p>
<!--What if there are no mistakes.
The main definition that we need is a mistake:
Creare a different figure that explains a mistake with small number of points
-->
<!--
<center>
<span style="font-family:Papyrus; font-size:2em;align-self: center;">If a point and its center diverge,
<br> then it counts as a mistake</span>
</center>
<div class="definition"> [mistake at node $u$].
If a point and its center end up at different leafs, then it counts as a mistake.
</div>
... Explain what is a split early on ...
-->
<!---
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">IMM</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">centers</span><span class="p">):</span>
<span class="n">node</span> <span class="o">=</span> <span class="n">new</span> <span class="n">Node</span><span class="p">()</span>
<span class="k">if</span> <span class="o">|</span><span class="n">centers</span><span class="o">|</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">i</span><span class="p">,</span> <span class="n">theta</span> <span class="o">=</span> <span class="n">find_split</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">centers</span><span class="p">)</span>
<span class="n">node</span><span class="p">.</span><span class="n">condition</span> <span class="o">=</span> <span class="s">'x_i <= theta'</span>
<span class="n">points_left_mask</span> <span class="o">=</span> <span class="n">points</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span> <span class="o"><=</span> <span class="n">theta</span>
<span class="n">centers_left_mask</span> <span class="o">=</span> <span class="n">centers</span><span class="p">[:,</span><span class="n">i</span><span class="p">]</span> <span class="o"><=</span> <span class="n">theta</span>
<span class="n">node</span><span class="p">.</span><span class="n">left</span> <span class="o">=</span> <span class="n">IMM</span><span class="p">(</span><span class="n">points</span><span class="p">[</span><span class="n">points_left_mask</span><span class="p">],</span> <span class="n">centers</span><span class="p">[</span><span class="n">centers_left_mask</span><span class="p">])</span>
<span class="n">node</span><span class="p">.</span><span class="n">right</span> <span class="o">=</span> <span class="n">IMM</span><span class="p">(</span><span class="n">points</span><span class="p">[</span><span class="o">~</span><span class="n">points_left_mask</span><span class="p">],</span> <span class="n">centers</span><span class="p">[</span><span class="o">~</span><span class="n">centers_left_mask</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">node</span><span class="p">.</span><span class="n">label</span> <span class="o">=</span> <span class="n">centers</span>
<span class="k">return</span> <span class="n">node</span>
<span class="k">def</span> <span class="nf">find_split</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">centers</span><span class="p">):</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">d</span><span class="p">):</span>
<span class="n">l</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">centers</span><span class="p">[:,</span><span class="n">i</span><span class="p">])</span>
<span class="n">r</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">centers</span><span class="p">[:,</span><span class="n">i</span><span class="p">])</span>
<span class="n">i</span><span class="p">,</span><span class="n">theta</span> <span class="o">=</span> <span class="n">argmin_</span><span class="p">{</span><span class="n">i</span><span class="p">,</span><span class="n">l</span> <span class="o"><=</span> <span class="n">theta</span> <span class="o"><</span> <span class="n">r</span><span class="p">}</span> <span class="n">mistakes</span><span class="p">(</span><span class="n">i</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span>
<span class="k">return</span> <span class="n">i</span><span class="p">,</span><span class="n">theta</span></code></pre></figure>
-->
<p>Here is an illustration of the IMM algorithm. We use $k$-means++ with $k=5$ to find a clustering for our dataset. Each point is colored with its cluster label. At each node in the tree, we choose a split with a minimal number of mistakes. We stop when each of the $k=5$ centers is in its own leaf. This defines the explainable clustering on the left.</p>
<center>
<img src="/assets/2020-10-16-explain_k_means/imm_example_slow.gif" width="600" height="320" />
</center>
<p>The algorithm is guaranteed to perform well. For any dataset. See the next theorem.</p>
<div class="theorem">
IMM is an $O(k^2)$-approximation to the optimal $k$-means clustering.
</div>
<p>This theorem shows that we can always find a small tree, with $k$ leaves, such that the tree-based clustering is only $O(k^2)$ times worse in terms of the cost. IMM efficiently finds this explainable clustering. Importantly, this approximation is independent of the dimension and the number of points. A proof for the case $k=2$ will appear in a <a href="explain_2_means.html">follow-up post</a>, and you can read the proof for general $k$ in the paper. Intuitively, we discovered that the number of mistakes is a good indicator of the $k$-means cost, and so minimizing the number of mistakes is an effective way to find a low-cost clustering. <!-- Surprisingly, we can also use a tree with $k$ leaves, which means that IMM produces an explainable clustering.--></p>
<h4 id="running-time">Running Time</h4>
<p>What is the running time of the IMM algorithm? With an efficient implementation using dynamic programming, the running time is $O(kdn\log(n)).$ Why? For each of the $k-1$ inner nodes and each of the $d$ features, we can find the split that minimizes the number of mistakes for that node and feature in time $O(n\log(n)).$</p>
<p>For $2$-means one can do better than running IMM: go over all possible $(n-1)d$ cuts and find the best one. The running time is $O(nd^2+nd\log(n))$.</p>
<h3 id="results-summary">Results Summary</h3>
<p>In each cell in the following table, we write the approximation factor. We want this value to be small for the upper bounds and large for the lower bounds. In $2$-medians, the upper and lower bounds are pretty tight, about $2$. But, there is a large gap for $k$-means and $k$-median: the lower bound is $\log(k)$, while the upper bound is $\mathsf{poly}(k)$.</p>
<center>
<table style="text-align: center">
<thead>
<tr>
<th></th>
<th colspan="2" style="text-align: center">$k$-medians</th>
<th colspan="2" style="text-align: center">$k$-means</th>
</tr>
<tr>
<th></th>
<th> $k=2$ </th>
<th> $k>2$ </th>
<th> $k=2$ </th>
<th> $k>2$ </th>
</tr>
</thead>
<tbody>
<tr>
<td> <strong>Lower</strong> </td>
<td> $2-\frac1d$ </td>
<td> $\Omega(\log k)$ </td>
<td> $3\left(1-\frac1d\right)^2$ </td>
<td> $\Omega(\log k)$ </td>
</tr>
<tr>
<td> <strong>Upper</strong> </td>
<td> $2$ </td>
<td> $O(k)$ </td>
<td> $4$ </td>
<td> $O(k^2)$ </td>
</tr>
</tbody>
</table>
</center>
<h3 id="whats-next">What’s next</h3>
<ol>
<li>IMM exhibits excellent results in practice on many datasets, see <a href="https://arxiv.org/abs/2006.02399">this</a>. Its running time is comparable to KMeans as implemented in sklearn. Our implementation of the IMM algorithm is available <a href="https://github.com/navefr/ExKMC">here</a>. Try it yourself.</li>
<li>We plan several posts on explainable clustering; here is the <a href="explain_2_means.html">second</a> in the series. Stay tuned for more!</li>
<li>In a follow-up work, we explore the tradeoff between explainability and accuracy. If we allow a slightly larger tree, can we get a lower cost? We introduce the <a href="https://arxiv.org/abs/2006.02399">ExKMC</a>, “Expanding Explainable $k$-Means Clustering”, algorithm that builds on IMM.</li>
<li>Found cool applications of IMM? Let us know!</li>
</ol><a href='https://sites.google.com/view/michal-moshkovitz'>Michal Moshkovitz</a>, <a href='mailto:navefrost@mail.tau.ac.il'>Nave Frost</a>, <a href='https://sites.google.com/site/cyrusrashtchian/'>Cyrus Rashtchian</a>Popular algorithms for learning decision trees can be arbitrarily bad for clustering. We present a new algorithm for explainable clustering that has provable guarantees --- the Iterative Mistake Minimization (IMM) algorithm. This algorithm exhibits good results in practice. Its running time is comparable to KMeans implemented in sklearn. So our method gives you explanations basically for free. Our code is available on github.Towards Physics-informed Deep Learning for Turbulent Flow Prediction2020-08-23T00:00:00+00:002020-08-23T00:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/08/23/TF-Net<h3 id="prediction-visualization">Prediction Visualization</h3>
<p>We propose a novel hybrid model for turbulence prediction, $\texttt{TF-Net}$, that unifies a popular <a href="https://en.wikipedia.org/wiki/Computational_fluid_dynamics">Computational fluid dynamics (CFD)</a> technique, RANS-LES coupling, with a custom-designed U-net. The following two videos show the ground truth and the predicted U (left) and V (right) velocity fields from $\texttt{TF-Net}$ and the three best baselines. We see that the predictions by $\texttt{TF-Net}$ are the closest to the target based on the shape and the frequency of the motions. The baselines generate smooth predictions and miss the details of small-scale motion.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/U_prediction.gif" width="49%" style="margin: 0 auto" />
<img src="/assets/2020-08-23-TF-Net/V_prediction.gif" width="49%" style="margin: 0 auto" />
</div>
<p><br /></p>
<h3 id="introduction">Introduction</h3>
<p>Modeling spatiotemporal dynamics over a wide range of space and time scales is a fundamental task in science, especially atmospheric science, marine science and aerodynamics. <a href="https://en.wikipedia.org/wiki/Computational_fluid_dynamics">Computational fluid dynamics (CFD)</a> is at the heart of climate modeling and has direct implications for understanding and predicting climate change. Recently, deep learning has demonstrated great success in the <a href="https://www.nature.com/articles/s41586-019-0912-1">automation, acceleration, and streamlining of highly compute-intensive workflows for science</a>. We hope deep learning can accelerate turbulence simulation, since current CFD is purely physics-based and computationally intensive, requiring significant computational resources and expertise.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/imgs.png" width="90%" style="margin: 0 auto" />
</div>
<p><br />
But purely data-driven methods are mainly statistical, incorporate no underlying physical knowledge, and have yet to prove successful at capturing and accurately predicting complex physical systems. Incorporating physics knowledge into deep learning models can improve not only prediction accuracy but, more importantly, physical consistency. Thus, developing deep learning methods that can incorporate physical laws in a systematic manner is a key element in advancing AI for the physical sciences.</p>
<p><a href="https://uknowledge.uky.edu/me_textbooks/2/">Computational techniques</a> are at the core of present-day turbulence investigation, a branch of fluid mechanics that uses numerical methods to analyze and predict fluid flows. In physics, the following <a href="https://en.wikipedia.org/wiki/Navier%E2%80%93Stokes_equations">Navier–Stokes equations</a> describe the motion of a viscous fluid.</p>
\[\nabla \cdot \pmb{w} = 0 \qquad\qquad\qquad\qquad\qquad\qquad\qquad \text{Continuity Equation}\]
\[\frac{\partial \pmb{w}}{\partial t} + (\pmb{w} \cdot \nabla) \pmb{w} = -\frac{1}{\rho_0} \nabla p + \nu \nabla^2 \pmb{w} + f \quad\text{Momentum Equation}\]
\[\frac{\partial T}{\partial t} + (\pmb{w} \cdot \nabla) T = \kappa \nabla^2 T \qquad\qquad\qquad\quad \text{Temperature Equation}\]
<p>where $\pmb{w}(t)$ is the vector velocity field of the flow, which is what we want to predict; $p$ and $T$ are pressure and temperature, respectively; $\kappa$ is the coefficient of heat conductivity; $\rho_0$ is the density at the initial temperature; $\alpha$ is the coefficient of thermal expansion; $\nu$ is the kinematic viscosity; and $f$ is the body force due to gravity.</p>
<p><br /></p>
<h3 id="turbulent-flow-net">Turbulent-Flow Net</h3>
<p>For turbulent flows, the range of length scales and complexity of phenomena involved in turbulence make <a href="https://en.wikipedia.org/wiki/Direct_numerical_simulation">Direct Numerical Simulation (DNS)</a> approaches prohibitively expensive. Great emphasis was then placed on the alternative approaches including Large-Eddy Simulation (LES), Reynolds-averaged Navier Stokes (RANS) as well as <a href="https://link.springer.com/article/10.1007/s10494-017-9828-8">Hybrid RANS-LES Coupling</a> that combines both RANS and LES approaches in order to take advantage of both methods. These methods decompose the fluid flow into different scales in order to directly simulate large scales while model small ones.</p>
<p>Hybrid RANS-LES Coupling decomposes the flow velocity into three scales: mean flow, resolved fluctuations and unresolved fluctuations. It applies the spatial filtering operator $S$ and the temporal average operator $T$ sequentially.</p>
\[\pmb{w^*}(\pmb{x},t) = S \ast\pmb{w} = \sum_{\pmb{\xi}} S(\pmb{x}|\pmb{\xi})\pmb{w}(\pmb{\xi},t)\]
\[\pmb{\bar{w}}(\pmb{x},t) = T \ast \pmb{w^*} = \frac{1}{n}\sum_{s = t-n}^tT(s) \pmb{w^*} (\pmb{x}, s)\]
<p>then $\pmb{\tilde{w}}$ can be defined as the difference between $\pmb{w^*}$ and $\pmb{\bar{w}}$:</p>
\[\pmb{\tilde{w}} = \pmb{w^*} - \pmb{\bar{w}}, \quad \pmb{w'} = \pmb{w} - \pmb{w^{*}}\]
<p>Finally we can have the three-level decomposition of the velocity field.</p>
<p>\begin{equation}
\pmb{w} = \pmb{\bar{w}} + \pmb{\tilde{w}} + \pmb{w'}
\end{equation}</p>
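<p>In $\texttt{TF-Net}$ both filters are trainable; as a fixed-filter illustration only, the following NumPy sketch applies a moving average in space for $S$ and a mean over the last $n$ frames for $T$ (the filter widths <code>s</code> and <code>n</code> are arbitrary choices, not the model's learned filters):</p>

```python
import numpy as np

def decompose(w, s=5, n=4):
    """Three-scale decomposition of a velocity sequence w (time, space).

    Spatial filter S: moving average of width s along space.
    Temporal filter T: mean over the last n spatially filtered frames.
    Returns (w_bar, w_tilde, w_prime) with w = w_bar + w_tilde + w_prime.
    """
    kernel = np.ones(s) / s
    # w* = S * w : spatially smoothed field, frame by frame
    w_star = np.stack([np.convolve(f, kernel, mode="same") for f in w])
    # w_bar = T * w* : temporal average of the last n filtered frames
    w_bar = np.stack([w_star[max(0, t - n + 1): t + 1].mean(axis=0)
                      for t in range(len(w))])
    w_tilde = w_star - w_bar   # resolved fluctuations
    w_prime = w - w_star       # unresolved (small-scale) fluctuations
    return w_bar, w_tilde, w_prime
```

<p>By construction the three components sum back to the original field, which is the identity the decomposition relies on.</p>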
<p>The figure below shows this three-level decomposition in wavenumber space. $k$ is the wavenumber, the spatial frequency in the Fourier domain. $E(k)$ is the energy spectrum describing how much kinetic energy is contained in eddies with wavenumber $k$. Small $k$ corresponds to large eddies that contain most of the energy. The slope of the spectrum is negative and indicates the transfer of energy from large scales of motion to the small scales.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/decompose.png" width="40%" style="margin: 0 auto" />
</div>
<p><br />
Inspired by the hybrid RANS-LES Coupling, we propose a hybrid deep learning framework, $\texttt{TF-Net}$, based on the multilevel spectral decomposition. Specifically, we decompose the velocity field into three scales using the spatial filter $S$ and the temporal filter $T$. Unlike traditional CFD, both filters in $\texttt{TF-Net}$ are trainable neural networks. The motivation for this design is to explicitly guide the DL model to learn the non-linear dynamics of both large and small eddies. We design three identical convolutional encoders to encode the three scale components separately and use a shared convolutional decoder to learn the interactions among these three components and generate the final prediction. The figure below shows the overall architecture of our hybrid model $\texttt{TF-Net}$.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/model.png" width="98%" style="margin: 0 auto" />
</div>
<p><br /></p>
<p>Since the turbulent flow under investigation has zero divergence, we include $\Vert\nabla \cdot \pmb{w}\Vert^2$ as a regularizer to constrain the predictions, leading to a constrained TF-Net, $\texttt{Con TF-Net}$.</p>
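<p>A minimal sketch of such a divergence-free regularizer, using NumPy central differences on a 2-D velocity field (in the actual model this penalty is computed on the network's predictions inside the training loss; this standalone version is for illustration):</p>

```python
import numpy as np

def divergence_penalty(u, v):
    """Squared-divergence regularizer ||div w||^2 for a 2-D velocity field.

    u, v : arrays of shape (H, W) holding the two velocity components.
    Central finite differences approximate du/dx + dv/dy; the mean of
    its square is added to the prediction loss to push the model
    toward divergence-free outputs.
    """
    div = np.gradient(u, axis=1) + np.gradient(v, axis=0)
    return float(np.mean(div ** 2))
```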
<p><br />
<br /></p>
<h3 id="results">Results</h3>
<p>We compare our model with four purely data-driven deep learning models, including <a href="https://arxiv.org/abs/1512.03385">$\texttt{ResNet}$</a>, <a href="https://arxiv.org/abs/1506.04214">$\texttt{ConvLSTM}$</a>, <a href="https://arxiv.org/abs/1505.04597">$\texttt{U-net}$</a> and <a href="https://arxiv.org/abs/1406.2661">$\texttt{GAN}$</a>, and two hybrid physics-informed models, including <a href="https://arxiv.org/abs/1801.06637">$\texttt{DHPM}$</a> and <a href="https://arxiv.org/abs/1711.07970">$\texttt{SST}$</a>. All models are trained to make one-step-ahead predictions given the historic frames, and we apply them autoregressively to generate multi-step forecasts.</p>
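<p>The autoregressive evaluation can be sketched as follows; <code>one_step_model</code> and the frame shapes here are placeholders, not the actual TF-Net interface:</p>

```python
import numpy as np

def rollout(one_step_model, history, horizon):
    """Multi-step forecast by feeding a one-step model its own predictions.

    history : list of the most recent input frames; one_step_model maps a
    stacked array of len(history) frames to the next frame.
    """
    ctx = len(history)
    frames = list(history)
    preds = []
    for _ in range(horizon):
        nxt = one_step_model(np.stack(frames[-ctx:]))
        preds.append(nxt)
        frames.append(nxt)  # the prediction becomes part of the next input
    return np.stack(preds)
```

<p>Errors compound under this scheme, which is why the RMSE curves below are reported as a function of the prediction horizon.</p>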
<p>$\textbf{Accuracy}$ The following figure shows the growth of RMSE with prediction horizon, up to 60 time steps ahead. We can see that $\texttt{TF-Net}$ consistently outperforms all baselines, and that constraining it with the divergence-free regularizer further improves performance.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/rmse_horizon.png" width="55%" style="margin: 0 auto" />
</div>
<p><br /></p>
<p>$\textbf{Physical Consistency}$ The left figure below shows the average absolute divergence over all pixels at each prediction step, and the right figure shows the energy spectrum curves. $\texttt{TF-Net}$ predictions are in fact much closer to the target even without the additional divergence-free constraint, which suggests that $\texttt{TF-Net}$ generates predictions that are physically consistent with the ground truth.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/divergence.png" width="48%" style="margin: 0 auto" />
<img src="/assets/2020-08-23-TF-Net/spec_ci_square.png" width="48%" style="margin: 0 auto" />
</div>
<p><br /></p>
<p>$\textbf{Efficiency}$ This figure shows the average time each model takes to produce one 64 × 448 2D velocity field on a single V100 GPU. We can see that $\texttt{TF-Net}$, $\texttt{U-net}$ and $\texttt{GAN}$ are faster than the numerical Lattice Boltzmann method. $\texttt{TF-Net}$'s speed advantage grows with the resolution of the data.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/avg_time.png" width="60%" style="margin: 0 auto" />
</div>
<p><br />
$\textbf{Ablation Study}$ We also perform an ablation study to understand each component of $\texttt{TF-Net}$ and to investigate whether the model has actually learned flows at different scales. During inference, we apply each small U-net in $\texttt{TF-Net}$ to the entire input domain, with the other two encoders zeroed out. The video below shows $\texttt{TF-Net}$ predictions along with the outputs of each small $\texttt{U-net}$. We observe that each small $\texttt{U-net}$ outputs the flow at a different scale, which demonstrates that $\texttt{TF-Net}$ learns multi-scale behaviors.</p>
<div style="text-align: center;">
<img src="/assets/2020-08-23-TF-Net/Ablation_Study.gif" width="70%" style="margin: 0 auto" />
</div>
<p><br />
<br /></p>
<h3 id="conclusion-and-future-work">Conclusion and Future Work</h3>
<p>We presented a novel hybrid deep learning model, $\texttt{TF-Net}$, that unifies representation learning and turbulence simulation techniques. $\texttt{TF-Net}$ exploits the multi-scale behavior of turbulent flows to design trainable scale-separation operators that model different ranges of scales individually. We provide exhaustive comparisons of $\texttt{TF-Net}$ and the baselines and observe significant improvement in both the prediction error and desired physical quantities, including divergence, turbulence kinetic energy and energy spectrum. Future work includes extending these techniques to very high-resolution, 3D turbulent flows and incorporating additional physical variables, such as pressure and temperature, and additional physical constraints, such as conservation of momentum, to improve the accuracy and faithfulness of deep learning models.</p>
<h3 id="more-details">More Details</h3>
<h4 id="see-our-paper-or-our-repository">See <a href="https://arxiv.org/abs/1911.08655">our paper</a> or our <a href="https://github.com/Rose-STL-Lab/Turbulent-Flow-Net">repository</a>.</h4><a href='mailto:ruw020@ucsd.edu'>Rui Wang</a>, <a href='mailto:kkashinath@lbl.gov'>Karthik Kashinath</a>, <a href='mailto:mmustafa@lbl.gov'>Mustafa Mustafa</a>, <a href='mailto:aalbert@lbl.gov'>Adrian Albert</a> and <a href='mailto:roseyu@eng.ucsd.edu'>Rose Yu</a>While deep learning has shown tremendous success in a wide range of domains, it remains a grand challenge to incorporate physical principles in a systematic manner to the design, training, and inference of such models. In this paper, we aim to predict turbulent flow by learning its highly nonlinear dynamics from spatiotemporal velocity fields of large-scale fluid flow simulations of relevance to turbulence modeling and climate modeling. We adopt a hybrid approach by marrying two well-established turbulent flow simulation techniques with deep learning. Specifically, we introduce trainable spectral filters in a coupled model of Reynolds-averaged Navier-Stokes (RANS) and Large Eddy Simulation (LES), followed by a specialized U-net for prediction. Our approach, which we call Turbulent-Flow Net (TF-Net), is grounded in a principled physics model, yet offers the flexibility of learned representations. We compare our model, TF-Net, with state-of-the-art baselines and observe significant reductions in error for predictions 60 frames ahead. 
Most importantly, our method predicts physical fields that obey desirable physical characteristics, such as conservation of mass, whilst faithfully emulating the turbulent kinetic energy field and spectrum, which are critical for accurate prediction of turbulent flows.How to Detect Data-Copying in Generative Models2020-08-03T19:00:00+00:002020-08-03T19:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/08/03/how-to-detect-data-copying-in-generative-models<p>In our <a href="https://arxiv.org/abs/2004.05675">AISTATS 2020 paper</a>, professors <a href="https://cseweb.ucsd.edu/~kamalika/">Kamalika Chaudhuri</a>, <a href="https://cseweb.ucsd.edu/~dasgupta/">Sanjoy Dasgupta</a>, and I propose some new definitions and test statistics for conceptualizing and measuring overfitting by generative models.</p>
<p>Overfitting is a basic stumbling block of any learning process. Take learning to cook for example. In quarantine, I’ve attempted ~60 new recipes and can recreate ~45 of them consistently. The recipes are my training set and the fraction I can recreate is a sort of training accuracy. While this training accuracy is not exactly impressive, if you ask me to riff on these recipes and improvise, the result (i.e. dinner) will be dramatically worse.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/supervised_overfitting_2.png" width="75%" /></p>
<p>It is well understood that our models tend to do the same – deftly regurgitating their training data, yet struggling to generalize to unseen examples similar to the training data. Learning theory has nicely formalized this in the supervised setting. Our classification and regression models start to overfit when we observe a gap between training and (held-out) test prediction error, as in the above figure for the overly complex models.</p>
<p>This notion of overfitting relies on being able to measure prediction error or perhaps log likelihood of the labels, which is rarely a barrier in the supervised setting; supervised models generally output low dimensional, simple predictions. Such is not the case in the generative setting where we ask models to output original, high dimensional, complex entities like images or natural language. Here, we certainly lack any notion of prediction error and likelihoods are intractable for many of today’s generative models like VAEs and GANs: VAEs only provide a lower bound of the data likelihood, and GANs only leave us with their samples. This prevents us from simply measuring the gap between train and test accuracy/likelihood and calling it a day as we do with supervised models.</p>
<p>Instead, we evaluate generative models by comparing their generated samples with those of the true distribution, as in the following figure. Here, a two-sample test only uses a training sample and a generated sample. A three-sample test uses an additional held out test sample from the true distribution.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/unsupervised_setting_2.png" width="75%" /></p>
<p>This practice is well established by existing two-sample generative model tests like the <a href="https://arxiv.org/abs/1706.08500">Frechet Inception Distance</a>, <a href="https://arxiv.org/abs/1611.04488">Kernel MMD</a>, and <a href="https://arxiv.org/abs/1806.00035">Precision & Recall test</a>. But in the absence of ground truth labels, what exactly are we testing for? We argue that unlike supervised models, generative models exhibit two varieties of overfitting: <strong>over-representation</strong> and <strong>data-copying</strong>.</p>
<h3 id="data-copying-vs-over-representation">Data-copying vs. Over-representation</h3>
<p>Most generative model tests like those listed above check for over-representation: the tendency of a model to over-emphasize certain regions of the instance space by assigning more probability mass there than it should. Consider a data distribution $P$ over an instance space $\mathcal{X}$ of cat cartoons. Region $\mathcal{C} \subset \mathcal{X}$ specifically contains cartoons of cats with bats. Using training set $T \sim P$, we train a generative model $Q$ from which we draw a sample $Q_m \sim Q$.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/overrepresentation.png" width="95%" /></p>
<p>Evidently, the model $Q$ really likes region $\mathcal{C}$, generating an undue share of cats with bats. More formally, we say $Q$ is over-representing some region $\mathcal{C}$ when</p>
<p>\[ \Pr_{x \sim Q}[x \in \mathcal{C}] \gg \Pr_{x \sim P}[x \in \mathcal{C}] \]</p>
<p>This can be measured with a simple two-sample hypothesis test, as was done in Richardson & Weiss’s <a href="https://arxiv.org/abs/1805.12462">2018 paper</a> demonstrating the efficacy of Gaussian mixture models in high dimension.</p>
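<p>As a toy illustration of such a two-sample check, here is a simple pooled two-proportion $z$-test for a fixed region $\mathcal{C}$ (a sketch for intuition, not the binned test used in that paper):</p>

```python
import numpy as np

def over_representation_z(p_in_c, q_in_c):
    """Two-sample z-test for over-representation of a region C.

    p_in_c / q_in_c : boolean arrays marking which true samples (P) and
    which generated samples (Q) landed in C. A large positive z means Q
    puts significantly more mass in C than P does.
    """
    n, m = len(p_in_c), len(q_in_c)
    p_hat, q_hat = np.mean(p_in_c), np.mean(q_in_c)
    pooled = (p_hat * n + q_hat * m) / (n + m)       # pooled proportion
    se = np.sqrt(pooled * (1 - pooled) * (1 / n + 1 / m))
    return float((q_hat - p_hat) / se)
```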
<p>Data-copying, on the other hand, occurs when $Q$ produces samples that are <em>closer to training set $T$</em> than they should be. To test for this, we equip ourselves with a held-out test sample $P_n \sim P$ in addition to some distance metric $d(x,T)$ that measures proximity to the training set of any $x \in \mathcal{X}$. We then say that $Q$ is data-copying training set $T$ when examples $x \sim Q$ are on average closer to $T$ than are $x \sim P$.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/data_copying_1_.png" width="95%" /></p>
<p>We define proximity to training set $d(x,T)$ to be the distance between $x$ and its nearest neighbor in $T$ according to some metric $d_\mathcal{X}:\mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$. Specifically</p>
<p>\[ d(x,T) = \min_{t \in T}d_\mathcal{X}(x,t) \]</p>
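<p>Over a batch of samples, this nearest-neighbor distance can be computed with a brute-force NumPy sketch (assuming Euclidean $d_\mathcal{X}$; in practice one would apply it to embedded points):</p>

```python
import numpy as np

def dist_to_train(X, T):
    """d(x, T): Euclidean distance from each row of X to its nearest
    neighbor in the training set T (brute force)."""
    # pairwise squared distances via ||x - t||^2 = ||x||^2 - 2 x.t + ||t||^2
    sq = (np.sum(X ** 2, axis=1)[:, None]
          - 2 * X @ T.T
          + np.sum(T ** 2, axis=1)[None, :])
    return np.sqrt(np.maximum(sq, 0).min(axis=1))
```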
<p>At a first glance, the generated samples in the above figure look perfectly fine, representing the different regions nicely. But taken alongside its training and test sets, we see that it has effectively copied the cat with bat in the lower right corner (for visualization, we let Euclidean distance $d_\mathcal{X}$ be a proxy for similarity).</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/data_copying_2.png" width="95%" /></p>
<p>More formally, $Q$ is data-copying $T$ in some region $\mathcal{C} \subset \mathcal{X}$ when</p>
<p>\[ \Pr_{x \sim Q, z \sim P}[d(x,T) < d(z,T) \mid x,z \in \mathcal{C}] \gg \frac{1}{2}\]</p>
<p>The key takeaway here is that data-copying and over-representation are <em>orthogonal failure modes</em> of generative models. A model that exhibits over-representation may or may not data-copy and vice versa. As such, it is critical that we test for both failure modes when designing and training models.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/orthogonal_concepts_2_.png" width="70%" /></p>
<p>Returning to my failed culinary ambitions, I tend to both data-copy recipes I’ve tried <em>and</em> over-represent certain types of cuisine. If you look at the ‘true distribution’ of recipes online, you will find that there is a tremendous diversity of cooking styles and cuisines. However, put in the unfortunate circumstance of having me cook for you, I will most likely produce some slight variation of a recipe I’ve recently tried. And, even though I have attempted a number of Indian, Mexican, Italian, and French dishes, I tend to over-represent bland pastas and salads when left to my own devices. To cook truly original food, one must both be creative enough to go beyond the recipes they’ve seen <em>and</em> versatile enough to make a variety of cuisines. So, be sure to test for both data-copying and over-representation, and do not ask me to cook for you.</p>
<h3 id="a-three-sample-test-for-data-copying">A Three-Sample Test for Data-Copying</h3>
<p>Adding another test to one’s modeling pipeline is tedious. The good news is that data-copying can be tested with a single snappy three-sample hypothesis test. It is non-parametric and concentrates nicely as both the test-set and generated-sample sizes grow.</p>
<p>As described in the previous section, we use a training sample $T \sim P$, a held-out test sample $P_n \sim P$, and a generated sample $Q_m \sim Q$. We additionally need some distance metric $d_\mathcal{X}(x,z)$. In practice, we choose $d_\mathcal{X}(x,z)$ to be the Euclidean distance between $x$ and $z$ after being embedded by $\phi$ into some lower-dimensional perceptual space: $d_\mathcal{X}(x,z) = \| \phi(x) - \phi(z) \|_2$. The use of such embeddings is common practice in testing generative models as exhibited by several existing over-representation tests like <a href="https://arxiv.org/abs/1706.08500">Frechet Inception Distance</a> and <a href="https://arxiv.org/abs/1806.00035">Precision & Recall</a>.</p>
<p>Following intuition, it is tempting to check for data-copying by simply differencing the expected distance to training set:</p>
<div>
$$
\mathbb{E}_{x \sim Q} [d(x,T)] - \mathbb{E}_{x \sim P} [d(x,T)] \approx \frac{1}{m} \sum_{x_i \in Q_m} d(x_i, T) - \frac{1}{n} \sum_{x_i \in P_n}d(x_i, T) \ll 0
$$
</div>
<p>where, to reiterate, $d(x,T)$ is the distance $d_\mathcal{X}$ between $x$ and its nearest neighbor in $T$. This statistic — an expected distance — is a little too finicky: the variance is far out of our control, influenced by both the choice of distance metric and by outliers in both $P_n$ and $Q_m$. So, instead of probing for how <em>much</em> closer $Q$ is to $T$ than $P$ is, we probe for how <em>often</em> $Q$ is closer to $T$ than $P$ is:</p>
<div>
$$
\mathbb{E}_{x \sim Q, z \sim P} [\mathbb{1}_{d(x,T) > d(z,T)}] \approx \frac{1}{nm} \sum_{x_i \in Q_m, z_j \in P_n} \mathbb{1} \big( d(x_i, T) > d(z_j, T) \big) \ll \frac{1}{2}
$$
</div>
<p>This statistic — a probability — is closer to what we want to measure, and is more stable. It tells us how much more likely samples in $Q_m$ are to fall near samples in $T$ relative to the held out samples in $P_n$. If it is much less than a half, then significant data-copying is occurring. This statistic is much more robust to outliers and is lower variance. Additionally, by measuring a probability instead of an expected distance, the value of this statistic is interpretable. Regardless of the data domain or distance metric, less than half is overfit, half is good, and over half is underfit (in the sense that the generated samples are further from the training set than they should be). We are also able to show that this indicator statistic has nice concentration properties agnostic to the chosen distance metric.</p>
<p>It turns out that the above test is an instantiation of the <a href="https://en.wikipedia.org/wiki/Mann-Whitney_U_test">Mann-Whitney hypothesis test</a>, proposed in 1947, for which there are computationally efficient implementations in packages like <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html">SciPy</a>. By $Z$-scoring the Mann-Whitney statistic, we normalize its mean to zero and variance to one. We call this statistic $Z_U$. As such, a generative model $Q$ with $Z_U \ll 0$ is heavily data-copying and a score $Z_U \gg 0$ is underfitting. Near 0 is ideal.</p>
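<p>A minimal NumPy sketch of $Z_U$, using the normal approximation to the Mann-Whitney statistic (<code>scipy.stats.mannwhitneyu</code> computes the same $U$ with tie corrections; the names below are ours):</p>

```python
import numpy as np

def z_u(d_q, d_p):
    """Z-scored Mann-Whitney statistic for data-copying.

    d_q : distances d(x, T) for generated samples x ~ Q.
    d_p : distances d(z, T) for held-out test samples z ~ P.
    U counts pairs with d_q > d_p (ties count half); under the null
    (no copying) U has mean m*n/2 and variance m*n*(m+n+1)/12.
    Z_U << 0 flags data-copying; Z_U >> 0 flags underfitting.
    """
    d_q, d_p = np.asarray(d_q, float), np.asarray(d_p, float)
    m, n = len(d_q), len(d_p)
    gt = (d_q[:, None] > d_p[None, :]).sum()
    eq = (d_q[:, None] == d_p[None, :]).sum()
    u = gt + 0.5 * eq
    mu, sigma = m * n / 2, np.sqrt(m * n * (m + n + 1) / 12)
    return float((u - mu) / sigma)
```

<p>When the two distance samples come from the same distribution the statistic sits near 0; generated samples that hug the training set push it strongly negative.</p>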
<h3 id="handling-heterogeneity">Handling Heterogeneity</h3>
<p>An operative phrase that you may have noticed in the above definition of data-copying is “on average”. Is the generative model closer to the training data than it should be <em>on average</em>? This, unfortunately, is prone to false negatives. If $Z_U \ll 0$, then $Q$ is certainly data-copying in some region $\mathcal{C} \subset \mathcal{X}$. However, if $Z_U \geq 0$, it may still be excessively data-copying in one region and significantly underfitting in another, leading to a test score near 0.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/bins_1.png" width="33%" /></p>
<p>For example, let the $\times$’s denote training samples and the red dots denote generated samples. Even without observing a held-out test sample, it is clear that $Q$ is data-copying in the pink region and underfitting in the green region. $Z_U$ will fall near 0, suggesting the model is performing well despite this highly undesirable behavior.</p>
<p>To prevent this misreading, we employ an algorithmic tool seen frequently in non-parametric testing: binning. Break the instance space into a partition $\Pi$ consisting of $k$ ‘bins’ or ‘cells’ $\pi \in \Pi$, and compute $Z_U^\pi$ within each cell $\pi$.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/bins_2.png" width="33%" /></p>
<p>The statistic maintains its concentration properties within each cell. The more test and generated samples we have ($n$ and $m$), the more bins we can construct, and the more we can precisely pinpoint a model’s data-copying behavior. The ‘goodness’ of model’s fit is an inherently multidimensional entity, and it is informative to explore the range of $Z_U^\pi$ values seen in all cells $\pi \in \Pi$. Our experiments indicate that VAEs and GANs both tend to data-copy in some cells and underfit in others. However, to boil all this down into a single statistic for model comparisons, we simply take an average of the $Z_U^\pi$ values weighted by the number of test samples in the cell:</p>
<div>
$$
C_T = \sum_{\pi \in \Pi} \frac{\#\{P_n \in \pi\}}{n} Z_U^\pi
$$
</div>
<p>(In practice, we restrict ourselves to cells with a sufficient number of generated samples; see the <a href="https://arxiv.org/abs/2004.05675">paper</a>.) Intuitively, this statistic tells us whether the model tends to data-copy in the regions most heavily emphasized by the true distribution. It does not tell us whether or not the model $Q$ data-copies <em>somewhere</em>.</p>
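<p>Putting the pieces together, here is a self-contained sketch of $C_T$ given precomputed distances $d(\cdot,T)$ and cell assignments (e.g. the index of each sample's nearest $k$-means centroid of the training set); the cutoff <code>min_q</code> on generated samples per cell is an illustrative choice:</p>

```python
import numpy as np

def c_t(d_q, d_p, cells_q, cells_p, k, min_q=20):
    """Weighted data-copying statistic C_T over a k-cell partition.

    d_q / d_p : distances d(., T) for generated and held-out test samples.
    cells_q / cells_p : partition cell (0..k-1) each sample falls in.
    Cells with fewer than min_q generated samples are skipped; each
    remaining cell's Z_U is weighted by its share of test samples.
    """
    d_q, d_p = np.asarray(d_q, float), np.asarray(d_p, float)
    cells_q, cells_p = np.asarray(cells_q), np.asarray(cells_p)

    def z_u(a, b):  # z-scored Mann-Whitney statistic within one cell
        m, n = len(a), len(b)
        u = ((a[:, None] > b[None, :]).sum()
             + 0.5 * (a[:, None] == b[None, :]).sum())
        return (u - m * n / 2) / np.sqrt(m * n * (m + n + 1) / 12)

    total, weight = 0.0, 0
    for pi in range(k):
        dq, dp = d_q[cells_q == pi], d_p[cells_p == pi]
        if len(dq) < min_q or len(dp) == 0:
            continue
        total += len(dp) * z_u(dq, dp)
        weight += len(dp)
    return float(total / weight)
```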
<h3 id="experiments-data-copying-in-the-wild">Experiments: data-copying in the wild</h3>
<p>Observing data-copying in VAEs and GANs indicates that the $C_T$ statistic above serves as an instructive tool for model selection. For a more methodical interrogation of the $C_T$ statistic and comparison with baseline tests, be sure to check out the <a href="https://arxiv.org/abs/2004.05675">paper</a>.</p>
<p>To test how VAE complexity relates to data-copying, we train 20 VAEs on MNIST of increasing width, as indicated by the latent dimension. For each model $Q$, we draw a sample of generated images $Q_m$ and compare with a held-out test set $P_n$ to measure $C_T$. Our distance metric is given by the 64d latent space of an autoencoder we trained with the VGG perceptual loss of <a href="https://arxiv.org/abs/1801.03924">Zhang et al.</a>. The purpose of this alternative latent space is to provide an embedding that both provides a perceptual distance between images and is independent of the VAE embeddings. For partitioning, we simply take the Voronoi cells induced by the $k$ centroids found by $k$-means run on the embedded training dataset.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/VAE_overfitting.png" width="49%" />
<img src="/assets/2020-08-03-data-copying/VAE_gen_gap.png" width="46%" /></p>
<h5 style="text-align: center;">The data-copying $C_T$ statistic (left) captures overfitting in overly complex VAEs. The train/test gap in ELBO (right), meanwhile, does not.</h5>
<p>Recall that $C_T \ll 0$ indicates data-copying and $C_T \gg 0$ indicates underfitting. We see (above, left) that overly complex models (towards the left of the plot) tend to copy their training set, and simple models (towards the right of the plot) tend to underfit, just as we might expect. Furthermore, $C_T = 0$ approximately coincides with the maximum ELBO, the VAE’s likelihood lower bound. For comparison, take the generalization gap of the VAEs’ ELBO on the training and test sets (above, right). The gap remains large for both overly complex models ($d > 50$) and simple models ($d < 50$). With the ELBO being a lower bound to the likelihood, it is difficult to interpret precisely why this happens. Regardless, it is clear that the ELBO gap is a comparatively imprecise measure of overfitting.</p>
<p>While the VAEs exhibit increasing data-copying with model complexity <em>on average</em>, most of them have cells that are over- and underfit. Poking into the individual cells $\pi \in \Pi$, we can take a look at the difference between a $Z_U^\pi \ll 0$ cell and a $Z_U^\pi \gg 0$ cell:</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/VAE_cells.png" width="90%" /></p>
<h5 style="text-align: center;"> A VAE's datacopied (left) vs. underfit (right) cells of the MNIST instance space.</h5>
<p>The two strips exhibit two regions of the same VAE. The bottom row of each shows individual generated samples from the cell, and the top row shows their training nearest neighbors. We immediately see that the data-copied region (left, $Z_U^\pi = -8.54$) practically produces blurry replicas of its training nearest neighbors, while the underfit region (right, $Z_U^\pi = +3.3)$ doesn’t appear to produce samples that look like any training image.</p>
<p>Extending these tests to a more complex and practical domain, we check the ImageNet-trained <a href="https://arxiv.org/abs/1809.11096">BigGAN</a> model for data-copying. Being a conditional GAN that can output images of any single ImageNet12 class, we condition on three separate classes and treat them as three separate models: Coffee, Soap Bubble, and Schooner. Here, it is not so simple to re-train GANs of varying degrees of complexity as we did before with VAEs. Instead, we modulate the model’s ‘truncation threshold’: a level beyond which all latent inputs are resampled. A larger truncation threshold allows for higher variance latent input, and thus higher variance outputs.</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/GAN_overfitting.png" width="60%" /></p>
<h5 style="text-align: center;"> BigGan, an ImageNet12 conditional GAN, appears to significantly data-copy for all but its highest truncation levels, which are said to trade off between variety and fidelity. </h5>
<p>Low truncation thresholds restrict the model to producing samples near the mode – those it is most confident in. However, it appears that in all image classes, this also leads to significant data-copying. Not only are the samples less diverse, but they hang closer to the training set than they should. This contrasts with the BigGAN authors’ suggestion that truncation level trades off between ‘variety and fidelity’. It appears that it might trade off between ‘copying and not copying’ the training set.</p>
<p>Again, even the least copying models with maximized truncation (=2) exhibit data-copying in <em>some</em> cells $\pi \in \Pi$:</p>
<p style="text-align: center;"><img src="/assets/2020-08-03-data-copying/GAN_cells.png" width="95%" /></p>
<h5 style="text-align: center;"> Examples from BigGan's data-copied (left) and underfit (right) cells of the 'coffee' (top) and 'soap bubble' (bottom) classes.</h5>
<p>The left two strips show data-copied cells of the coffee and bubble instance spaces (low $Z_U^\pi$), and the right two strips show underfit cells (high $Z_U^\pi$). The bottom row of each strip shows a subset of generated images from that cell, and the top row shows training images from the cell. To show the diversity of the cell, these are not necessarily the generated samples’ training nearest neighbors as they were in the MNIST example.</p>
<p>We see that the data-copied cells on the left tend to confidently produce samples of one variety that linger too closely to specific examples caught in the training set. In the coffee case, it is the teacup/saucer combination. In the bubble case, it is the single large suspended bubble with a blurred background. Meanwhile, the slightly underfit cells on the right arguably perform better in a ‘generative’ sense. The samples, albeit slightly distorted, are more original. According to the inception-space distance metric, they hug the training set less closely.</p>
<h3 id="data-copying-is-a-real-failure-mode-of-generative-models">Data-copying is a real failure mode of generative models</h3>
<p>The moral of these experiments is that data-copying indeed occurs in contemporary generative models. This failure mode has significant consequences for user privacy and for model generalization. With that said, it is a failure mode not identified by most prominent generative model tests in the literature today.</p>
<ul>
<li>
<p>Data-copying is <em>orthogonal to</em> over-representation; both should be tested when designing and training generative models.</p>
</li>
<li>
<p>Data-copying is straightforward to test efficiently when equipped with a decent distance metric.</p>
</li>
<li>
<p>Having identified this failure mode, it would be interesting to see modeling techniques that actively try to minimize data-copying in training.</p>
</li>
</ul>
<p>So be sure to start probing your models for data-copying, and don’t be afraid to venture off-recipe every once in a while!</p>
<h3 id="more-details">More Details</h3>
<p>Check out <a href="https://arxiv.org/abs/2004.05675">our AISTATS paper on arxiv</a>, and <a href="https://github.com/casey-meehan/data-copying">our data-copying test code on GitHub</a>.</p><a href='mailto:cmeehan@eng.ucsd.edu'>Casey Meehan</a>What does it mean for a generative model to overfit? We formalize the notion of 'data-copying', when a generative model produces only slight variations of the training set and fails to express the diversity of the true distribution. To catch this form of overfitting, we propose a three-sample hypothesis test that is entirely model agnostic. Our experiments indicate that several standard tests condone data-copying, and contemporary generative models like VAEs and GANs can commit data-copying.The Power of Comparison: Reliable Active Learning2020-07-27T17:00:00+00:002020-07-27T17:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/07/27/rel-comp<p>With the surge of widely available massive online datasets, we have become <em>very</em> good at building algorithms which distinguish between the following:</p>
<p style="text-align: center;"><img src="/assets/2020-07-27-rel-comp/cat-dog.png" width="90%" /></p>
<p>But what about solving classification problems which may have large unlabeled datasets, but whose labeling requires expert advice? What about situations like examining an MRI scan or recognizing a pedestrian, where a mistake in classification could be the difference between life and death? In these situations, it would be great if we could build a classification algorithm with the following properties:</p>
<ol>
<li>The algorithm requires <strong>very few labeled data points</strong> to train.</li>
<li>The algorithm <strong>never makes a mistake</strong>.</li>
</ol>
<p>This raises an obvious question: is building an efficient algorithm with such strong guarantees even possible? It turns out the answer is <strong>yes</strong>—just not in the standard learning model. In <a href="https://arxiv.org/pdf/1907.03816.pdf">recent joint work</a> with <a href="https://cseweb.ucsd.edu/~dakane/">Daniel Kane</a> and <a href="https://cseweb.ucsd.edu/~slovett/home.html">Shachar Lovett</a>, we show that while it is impossible to have such guarantees using only the <em>labels</em> of data points, achieving the goal becomes easy if you give the algorithm <strong>a little more power</strong>.</p>
<h3 id="comparison-queries">Comparison Queries</h3>
<p>Our work explores the additional power of algorithms which are allowed to <strong>compare data</strong>. In slightly more detail, imagine points in $\mathbb{R}^d$ are labeled by a linear classifier: that is, $\text{sign}(f)$ for some affine linear function $f(x) = \langle x, w \rangle + b$.</p>
<p style="text-align: center;"><img src="/assets/2020-07-27-rel-comp/linear-classifier.png" width="80%" /></p>
<p>A comparison between two data points $x,y \in \mathbb{R}^d$ asks which point is <em>closer</em> to the decision boundary (e.g. the purple line in Figure 2). More formally, a <strong>comparison query</strong> asks:
\[
f(x) - f(y) \overset{?}{\geq} 0.
\]
On the other hand, a standard <strong>label query</strong> on $x \in \mathbb{R}^d$ only asks which <em>side</em> of the decision boundary $x$ lies on, i.e.
\[
f(x) \overset{?}{\geq} 0.
\]
Comparison queries are natural from a human perspective—think how often throughout your day you compare objects, ideas, or alternatives. In fact, it has even been shown that in many practical circumstances, we may be <a href="https://link.springer.com/chapter/10.1007/978-3-642-14125-6_4#:~:text=The%20learning%20by%20pairwise%20comparison,preference%20modeling%20and%20decision%20making.&text=We%20explain%20how%20to%20approach,within%20the%20framework%20of%20LPC.">better at accurately comparing objects</a> than we are at labeling them! Since we are allowing our algorithm access to an expert (possibly human) oracle, it makes sense to allow the algorithm to ask the expert to compare data.</p>
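<p>The two query types are easy to picture in code. Here is a toy sketch with a hypothetical hidden classifier $f(x) = \langle x, w \rangle + b$ (the specific $w$ and $b$ are made up for illustration); note that the offset $b$ cancels in a comparison:</p>

```python
import numpy as np

# Hypothetical hidden linear classifier f(x) = <x, w> + b (unknown to the learner).
w, b = np.array([1.0, -2.0]), 0.5

def label_query(x):
    # Which side of the decision boundary is x on?  (sign of f(x))
    return 1 if np.dot(w, x) + b >= 0 else -1

def comparison_query(x, y):
    # Sign of f(x) - f(y); the offset b cancels out.
    return 1 if np.dot(w, x - y) >= 0 else -1
```

<p>The learner never sees $w$ or $b$; it only observes the signs returned by these two oracles.</p>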
<h3 id="the-algorithm">The Algorithm</h3>
<p>How can we use comparisons to learn with <strong>few queries</strong> and <strong>no mistakes</strong>? It turns out that a remarkably simple algorithm suffices! Imagine you are given a finite sample $S \subset \mathbb{R}^d$, and would like to find the label of every point in $S$ without making any errors. Consider the following basic procedure:</p>
<ol>
<li>Draw a small subsample $S’ \subset S$</li>
<li>Send $S’$ to the oracle to learn both labels and comparisons</li>
<li>Remove points from $S$ whose labels are learned in Step 2, and repeat.</li>
</ol>
<p>How exactly does Step 2 “learn labels”? Formally, this is done through a linear program whose constraints are given by the oracle responses on $S’$. Informally, this has a nice geometric interpretation. Let’s consider first the two dimensional case, originally studied by <a href="https://arxiv.org/abs/1704.03564">Kane, Lovett, Moran, and Zhang</a> (KLMZ). Figure 3 shows how comparisons allow us to infer the labels of points in $S$ by building cones (one red, one blue) based on the query results on $S’$. In essence, comparison queries allow us to find the points in $S’$ closest to the decision boundary (Figure 3(c)), which we call <strong>minimal</strong>. By drawing a cone stemming from a minimal point to others of the same label (Figure 3(d)), we can infer that every point <em>inside</em> the cone must share the same label as well (Figure 3(e)).</p>
<p style="text-align: center;"><img src="/assets/2020-07-27-rel-comp/Infer.png" width="90%" /></p>
<p>Why does this process satisfy our guarantees? Let’s first discuss why we never mislabel a point, which follows from the fact that our cones stem from minima. Because our classifier is linear, this guarantees that the edges of our cones do not cross the decision boundary (i.e. change labels). Thus, the label of any point inside such a cone must be the same as its base point! Notice that this only remains true so long as the base point of our cone is minimal, which explains why comparison queries, the mechanism through which we find minima, are crucial to the algorithm.</p>
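<p>To make the linear program in Step 2 concrete, here is a hedged sketch (not KLMZ's implementation; the hidden classifier, point set, and scale normalization are all illustrative assumptions). Oracle answers become linear constraints on $(w, b)$, and a point's label is inferred whenever $f$ has the same sign at that point for <em>every</em> consistent classifier, which two LPs can check:</p>

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical hidden classifier f(x) = <x, w> + b with w = (1, 0), b = 0,
# used only to simulate oracle answers on the subsample S'.
S = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, 1.0]])
f_vals = S @ np.array([1.0, 0.0])
labels = np.sign(f_vals).astype(int)

# Oracle answers as linear constraints on v = (w, b), written as A_ub @ v <= b_ub.
# Label query:  y_i * (<x_i, w> + b) > 0, normalized (by scaling v) to >= 1.
# Comparison:   sign(f(x_i) - f(x_j)) * <x_i - x_j, w> >= 0  (b cancels).
A_ub, b_ub = [], []
for x, y in zip(S, labels):
    A_ub.append(-y * np.append(x, 1.0)); b_ub.append(-1.0)
for i in range(len(S)):
    for j in range(i + 1, len(S)):
        s = 1.0 if f_vals[i] >= f_vals[j] else -1.0
        A_ub.append(-s * np.append(S[i] - S[j], 0.0)); b_ub.append(0.0)

def infer(z):
    # z's label is determined iff f(z) has the same sign for every (w, b)
    # consistent with the oracle answers: check min and max of f(z) via two LPs.
    c, bounds = np.append(z, 1.0), [(None, None)] * 3
    lo = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    if lo.status == 0 and lo.fun > 1e-9:   # min f(z) > 0 over all consistent f
        return 1
    hi = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    if hi.status == 0 and hi.fun > 1e-9:   # max f(z) = -hi.fun < 0
        return -1
    return None  # not inferable from these queries alone
```

<p>Points deep inside a cone are inferred with certainty, while points near the (unknown) boundary come back as not inferable, exactly the behavior the geometric picture predicts.</p>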
<p>The second guarantee, ensuring that we make few queries overall, is a bit more subtle, and requires the combinatorial theory of <em>inference dimension</em>.</p>
<h3 id="inference-and-average-inference-dimension">Inference and Average Inference Dimension</h3>
<p>Inference dimension is a complexity parameter introduced by KLMZ to measure how large a subsample $S’$ must be in order to learn a constant fraction of $S$.</p>
<div class="definition">
Given a set $X\subseteq \mathbb{R}^d$ and a family of classifiers $H$, the inference dimension of the pair $(X,H)$ is the smallest $k$ such that any sample $S'$ of size $k$ contains a point $x$ whose label may be inferred by queries on $S' \setminus \{x\}$. In other words, $x$ lies in a cone stemming from some minimal point to other points of the same label (as seen above in Figure 3, or below in Figure 4)
</div>
<p>Let’s take a look at an example: linear classifiers in two dimensions. Figure 4 shows that this class has inference dimension at most 7. Why? A sample of size 7 will always have at least 4 points with the same label, and the label of one of these points can always be inferred from labels and comparisons on the rest!</p>
<p style="text-align: center;"><img src="/assets/2020-07-27-rel-comp/inf-dim.png" width="90%" /></p>
<p>KLMZ show that by picking the size of $S’$ to be just a constant times larger than the inference dimension, the resulting cones will usually cover a constant fraction of our distribution. In other words, in two dimensions, every round of our algorithm infers a constant fraction of the remaining points, which means we only need $O(\log(|S|))$ rounds before our algorithm has labeled everything!</p>
<p>Unfortunately, in 3+ dimensions, linear classifiers become harder to deal with—indeed KLMZ show that their inference dimension is infinite. In <a href="https://arxiv.org/pdf/1907.03816.pdf">our recent work</a>, we circumvent this issue by applying a standard assumption from the data science and learning literature: we assume that our sample $S$ is drawn from some restricted range of natural distributions. The core idea of our analysis is then based off of a simple lemma, which informally states that even if $(X,H)$ has infinite inference dimension, samples from $X$ may still have small inference dimension with high probability!</p>
<div class="lemma">
If the probability that a sample $S$ of size $k$ contains no point which may be inferred from the rest is at most $g(k)$, then size $n$ finite samples have inference dimension $k$ with probability:
</div>
<p>\[
\Pr[\text{Inference dimension of} \ (S,H) \leq k] \geq 1-{ n \choose k}g(k).
\]</p>
<p>The main technical difficulty then becomes showing that $g(k)$, which we term <em>average inference dimension</em>, is indeed small over natural distributions. We confirm that this is the case for the class of s-concave distributions $(s \geq -\frac{1}{2d+3})$, a wide-ranging generalization of Gaussians that includes fatter-tailed distributions like the Pareto and t-distributions.</p>
<div class="theorem">
If $S \subseteq \mathbb{R}^d$, $|S|=k$, is drawn from an s-concave distribution, the probability that $S$ contains no point which may be inferred from the rest is at most:
</div>
<p>\[
g(k) \leq 2^{-\tilde{\Omega}\left(\frac{k^2}{d}\right)}.
\]</p>
<p>Plugging this into our observation, we see that as long as $S$ is reasonably large, it will have inference dimension $\tilde{O}(d\log(|S|))$ with high probability! This allows us to efficiently learn the labels of $S$ through the algorithm we discussed before, so long as the distribution is s-concave. As a corollary, we get the following result:</p>
<div class="theorem">
Using comparisons, the process described in Figure 3 learns the labels of a sample $S \subset \mathbb{R}^d$ with respect to any linear classifier in only
</div>
<p>\[
\tilde{O}(d\log(|S|)^2)
\]
<em>expected queries, as long as $S$ is drawn from an s-concave distribution.</em></p>
<p>While we have focused in the above on learning finite samples, it turns out satisfying similar guarantees over all of $\mathbb{R}^d$ (under natural distributions) is also possible via the same argument. In this case, rather than trying to learn the label of every point, we allow our algorithm to respond “I don’t know” on an $\varepsilon$ fraction of samples. This type of algorithm goes by many names in the literature, perhaps the catchiest of which is a <a href="http://icml2008.cs.helsinki.fi/papers/627.pdf">“Knows What It Knows”</a> (KWIK) learner. The above then (with a bit of work) more or less translates into a KWIK-learner<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> that uses only $\tilde{O}(d\log^2(1/\varepsilon))$ calls to the oracle.</p>
<h3 id="lower-bound">Lower Bound</h3>
<p>Why doesn’t this algorithm work with only labels? It’s long been known that even in two dimensions, <a href="https://cseweb.ucsd.edu/~dasgupta/papers/sample.pdf">achieving these learning guarantees is impossible</a> for certain adversarial distributions such as $S^1$. Let’s take a look at how our results match up on a less adversarial example with a long history in learning theory: the d-dimensional unit ball.</p>
<div class="theorem">
Using comparisons, the process described in Figure 3 KWIK-learns linear classifiers over the d-dimensional unit ball in only
</div>
<p>\[
\tilde{O}(d\log^2(1/\varepsilon))
\]
oracle calls. On the other hand, using only labels takes at least
\[
\left (\frac{1}{\varepsilon}\right)^{\Omega(d)}
\]
oracle calls<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>
<p>This simple example shows <strong>the exponential power of comparisons</strong>: moving KWIK-learning from intractable to highly efficient. As a final note, implementing the algorithm amounts to running a series of small linear programs, and as a result is computationally efficient as well, only taking about $\text{poly}\left(\frac{1}{\varepsilon},d\right)$ time.</p>
<p>It remains to be seen whether this type of efficient, comparison-based KWIK-learning will be useful in practice. Comparison queries have <a href="https://arxiv.org/abs/1206.4674">already been shown</a> to provide practical improvements over labels for similar learning problems, and have been used to great effect in other areas such as <a href="https://dl.acm.org/doi/10.1007/11823865_4">recommender systems</a>, <a href="https://papers.nips.cc/paper/4381-randomized-algorithms-for-comparison-based-search">search</a>, and <a href="https://arxiv.org/abs/1606.08842">ranking</a> as well. Since we have recently extended our results to more realistic noisy scenarios in <a href="https://arxiv.org/abs/2001.05497">joint work</a> with <a href="https://gomahajan.github.io/">Gaurav Mahajan</a>, we are optimistic that our techniques will remain as powerful in practice as they are in theory.</p>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>It’s worth noting that we have abused notation here. Formally, our algorithms do not fall into the KWIK-model, but are built in a similar learning-theoretic framework called <a href="https://people.csail.mit.edu/rivest/pubs/RS88b.pdf">RPU-learning</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>This follows from a standard cap packing argument—we can divide up the ball into many disjoint caps, and note that any KWIK-learner must query a point in at least half of them to be successful. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div><a href='http://cseweb.ucsd.edu/~nmhopkin/'>Max Hopkins</a>In the world of big data, large but costly to label datasets dominate many fields. Active learning, a semi-supervised alternative to the standard PAC-learning model, was introduced to explore whether adaptive labeling could learn concepts with exponentially fewer labeled samples. Unfortunately, it is well known that in standard models, active learning provides little improvement over passive learning for the foundational classes such as linear separators. We discuss how empowering the learner to compare points resolves not only this issue, but also allows us to build efficient algorithms which make no errors at all!Adversarial Robustness for Non-Parametric Classifiers2020-07-20T17:00:00+00:002020-07-20T17:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/07/20/adversarial-pruning<p>In a previous <a href="/jekyll/update/2020/05/04/adversarial-robustness-through-local-lipschitzness.html">post</a>,
we discussed the relationship between accuracy and robustness
for separated data.
A classifier trained on $r$-separated data can be both accurate and robust with radius $r$.
What if the data are not $r$-separated?
In our <a href="https://arxiv.org/abs/1906.03310">recent paper</a>, we look at how to deal with this case.</p>
<p>Many datasets with natural data like images or audio are $r$-separated [<a href="https://arxiv.org/abs/2003.02460">1</a>].
In contrast, datasets with artificially-extracted features are often not.
Non-parametric methods like nearest neighbors,
random forest, etc. perform well on these kinds of datasets.
In this post, we focus on non-parametric methods on non-$r$-separated datasets.</p>
<p>We first present a defense algorithm – adversarial pruning –
that can increase the robustness of many non-parametric methods.
Then we dive into how adversarial pruning deals with non-$r$-separated data.
Finally, we present a generic attack algorithm that works well across many non-parametric methods
and use it to evaluate adversarial pruning.</p>
<h3 id="defense">Defense</h3>
<p>Let us start by visualizing the
decision boundaries of a $1$-nearest neighbor ($1$-NN) and a random forest (RF) classifier on a toy dataset.</p>
<p style="text-align: center;"><img src="/assets/2020-07-20-adversarial-pruning/moon_1nn.png" width="28%" />
<img src="/assets/2020-07-20-adversarial-pruning/moon_rf.png" width="28%" /></p>
<p>We see that the decision boundaries are highly non-smooth, and lie close to many data points,
resulting in a non-robust classifier.
This is caused by the fact that many differently-labeled examples are near each
other.
Next, let us consider a modified dataset in which the red and blue examples are more separated.</p>
<p style="text-align: center;"><img src="/assets/2020-07-20-adversarial-pruning/moon_1nn_ap30.png" width="28%" />
<img src="/assets/2020-07-20-adversarial-pruning/moon_rf_ap30.png" width="28%" /></p>
<p>Notice that the boundaries become smoother as examples move
further away from the boundaries.
This makes the classifier more robust as the predicted label stays the same
if data are perturbed a little.</p>
<h4 id="adversarial-pruning">Adversarial Pruning</h4>
<p>From these figures, we can see that these non-parametric methods are
more robust when data are better separated.
Given a dataset, to make it more separated, we need to remove examples.
To preserve information in the dataset, we do not want to remove too many examples.
We design our defense algorithm to minimally remove examples from the dataset
so that differently-labeled examples are well-separated from each other.
After this modification, we can train a non-parametric classifier on it.
We call this defense algorithm <em>adversarial pruning (AP)</em>.</p>
<p>More formally, given a robustness radius $r$ and a training set $\mathcal{S}$, AP computes
a maximum subset $\mathcal{S}^{AP} \subseteq \mathcal{S}$ such that differently-labeled
examples in $\mathcal{S}^{AP}$ have distance at least $r$.
We show that known graph algorithms can be used to efficiently compute $\mathcal{S}^{AP}$.
We build a graph $G=(V, E)$ as follows.
First, each training example is a vertex in the graph.
We connect pairs of differently-labeled examples (vertices) $\mathbf{x}$ and $\mathbf{x}’$ with an edge whenever $\|\mathbf{x} - \mathbf{x}’\| \leq 2r$.
Then, computing $\mathcal{S}^{AP}$ is reduced to removing as few examples as possible so that no more edges remain.
This is equivalent to solving the <a href="https://mathworld.wolfram.com/VertexCover.html">minimum vertex cover</a> problem.
When dealing with a binary classification problem, the graph $G$ is bipartite and
standard algorithms like the <a href="https://en.wikipedia.org/wiki/Hopcroft%E2%80%93Karp_algorithm">Hopcroft–Karp algorithm</a>
can be used to solve this problem.
With multi-class classification, minimum vertex cover is NP-hard in general, and
<a href="https://networkx.github.io/documentation/stable/_modules/networkx/algorithms/approximation/vertex_cover.html">approximation algorithms</a>
have to be applied.</p>
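<p>A minimal sketch of AP might look as follows, assuming Euclidean distance. For brevity it uses the simple matching-based $2$-approximation to minimum vertex cover rather than an exact bipartite solver such as Hopcroft–Karp, so it may remove up to twice as many points as necessary; the toy dataset is made up:</p>

```python
import numpy as np

def adversarial_prune(X, y, r):
    # Conflict graph: an edge joins each differently-labeled pair of training
    # examples within (Euclidean) distance 2r of each other.
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if y[i] != y[j] and dist[i, j] <= 2 * r]
    # Matching-based 2-approximation to minimum vertex cover: greedily take
    # both endpoints of any edge that is not yet covered.
    cover = set()
    for i, j in edges:
        if i not in cover and j not in cover:
            cover.update((i, j))
    return [i for i in range(n) if i not in cover]  # indices of S^AP

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
y = np.array([0, 1, 1, 1])
kept = adversarial_prune(X, y, r=0.5)  # the conflicting close pair is removed
```

<p>After pruning, every surviving pair of differently-labeled examples is more than $2r$ apart, so any non-parametric method can be trained on the pruned set.</p>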
<h4 id="theoretical-justification">Theoretical Justification</h4>
<p>It happens that adversarial pruning has a nice theoretical interpretation:
it can be viewed as a finite sample approximation to the optimally robust and accurate classifier.
To understand this, let us first consider what the goal of robust classification is.
We assume the data is sampled from a distribution $\mu$ on $\mathcal{X} \times [C]$, where $\mathcal{X}$ is the feature
space and $C$ is the number of classes.
Normally, the ultimate limit of accurate classification is the Bayes optimal classifier which maximizes the accuracy on the underlying data distribution.
But the Bayes optimal may not be very robust.</p>
<p>Let us look at the figure below. The blue curve is the decision boundary of the Bayes optimal classifier.
We see that this blue curve is close to the data distribution and thus not the most robust.
An alternative decision boundary is the black curve, which is further away from the distribution while still being accurate.</p>
<figure class="image" style="text-align: center;">
<span>
<img src="/assets/2020-07-20-adversarial-pruning/r-opt.png" width="60%" style="margin: 0 auto" />
</span>
</figure>
<p>We define the astuteness of a classifier as its accuracy on examples where it is robust with
radius $r$.
The objective of a robust classifier is to maximize the
astuteness under $\mu$, which is the probability that the classifier is both $r$-robust and accurate for a new sample $(\mathbf{x}, y)$ [<a href="https://arxiv.org/abs/1706.06083">1</a>, <a href="https://arxiv.org/abs/1706.03922">2</a>].</p>
<div class="definition" style="overflow-x: auto;">
Let $\mathbb{B}(\mathbf{x}, r)$ be the ball with radius $r$ around $\mathbf{x}$ and
$S_j(f,r) := \{\mathbf{x} \in \mathcal{X} \mid f(\mathbf{x}') = j \text{ for all } \mathbf{x}' \in \mathbb{B}(\mathbf{x}, r)\}$.
For distribution $\mu$ on $\mathcal{X} \times [C]$, the astuteness is defined as
$$
ast_\mu(f,r) = \sum_{j=1}^{C} \int_{\mathbf{x} \in S_j(f,r)} Pr(y = j \mid \mathbf{x}) d \mu.
$$
</div>
<p>Next, we present the $r$-optimal classifier, which achieves optimal astuteness.
Compared with the classic Bayes optimal classifier, which
achieves optimal accuracy, the $r$-optimal classifier is a <em>robust analogue of the Bayes optimal classifier</em>.</p>
<div style="width: 100%; overflow-x: auto;">
<table style="">
<tr>
<th>$r$-optimal classifier (black curve)</th>
<th>Bayes optimal classifier (blue curve)</th>
</tr>
<tr>
<td>Optimal astuteness</td>
<td>Optimal accuracy</td>
</tr>
<tr>
<td>
\begin{split}
\max_{S_1,\ldots, S_c} & \sum_{j=1}^{c} \int_{\mathbf{x} \in S_j} Pr(y = j \mid \mathbf{x}) d\mu \\
\mbox{ s.t. } \quad & d(S_j, S_{j'}) \geq 2r \quad \forall j \neq j' \\
& d(S_j, S_{j'}) := \min_{u \in S_j, v \in S_{j'}} \| u-v\|_p
\end{split}
</td>
<td>
\begin{split}
\max_{S_1,\ldots, S_c} & \sum_{j=1}^{c} \int_{\mathbf{x} \in S_j} Pr(y = j \mid \mathbf{x}) d\mu \\
\end{split}
</td>
</tr>
</table>
</div>
<p>We observe that
AP can be interpreted as a finite sample approximation to the $r$-optimal classifier.
If the $S_j$ are restricted to sets of training examples, then
the $r$-optimal objective is maximized by maximum subsets of the
training data in which differently-labeled examples are $2r$ apart.
As long as the training set $S$ is representative of $\mu$, these subsets ($S_j$) approximate
the optimal subsets ($S^*_j$).
Hence, we posit that non-parametric methods trained
on $S^{AP}$ should approximate the $r$-optimal classifier.</p>
<p>For more about the $r$-optimal classifier,
please refer to this <a href="https://arxiv.org/abs/2003.06121">paper</a>.</p>
<h4 id="adversarial-pruning-generates-r-separated-datasets">Adversarial pruning generates $r$-separated datasets</h4>
<p>What AP does is remove the minimum number of examples so that the dataset
becomes $r$-separated.
In our previous
<a href="/jekyll/update/2020/05/04/adversarial-robustness-through-local-lipschitzness.html">post</a>,
we show that there is no intrinsic trade-off between robustness and accuracy when the
dataset is $r$-separated.
This means that there exists a classifier that achieves
perfect robustness and accuracy.
However, the resulting classifier may make mistakes on the examples removed by AP, and we can
think of the removed examples as the trade-off between robustness and accuracy.</p>
<h3 id="evaluating-ap-an-attack-method">Evaluating AP: An Attack Method</h3>
<p>In this section, we provide an attack algorithm to evaluate the robustness
of non-parametric methods.
For parametric classifiers such as neural networks, generic gradient-based attacks exist.
Our goal is to develop an analogous general attack method, which applies to and
works well for multiple non-parametric classifiers.</p>
<p>The attack algorithm is called region-based attack (RBA).
Given an example $\mathbf{x}$, RBA finds the closest example to $\mathbf{x}$ with a different prediction;
in other words, RBA is an optimal attack.
In addition, RBA can be applied to many non-parametric methods while
many prior attacks for non-parametric methods
[<a href="https://arxiv.org/abs/1605.07277">1</a>, <a href="https://arxiv.org/abs/1509.07892">2</a>] are classifier-specific:
<a href="https://arxiv.org/abs/1605.07277">1</a> only applies to $1$-nearest neighbor classifiers and
<a href="https://arxiv.org/abs/1509.07892">2</a> only applies to tree-based classifiers.</p>
<p style="text-align: center;"><img src="/assets/2020-07-20-adversarial-pruning/moon_1nn_voronoi.png" width="30%" />
<img src="/assets/2020-07-20-adversarial-pruning/moon_dt_regions.png" width="30%" />
<img src="/assets/2020-07-20-adversarial-pruning/region_pert_2.png" width="28%" /></p>
<p>To understand how RBA works, let us look at the figures above,
which show the decision boundaries of a $1$-NN classifier and a decision tree on a toy dataset.
We see that the feature space is divided into many regions, where
examples in the same region have the same prediction
(meaning we can assign a label to each region).
These regions are convex for nearest neighbors and tree-based classifiers.</p>
<p>Suppose the example we want to attack is $\mathbf{x}$ and $y$ is its label.
RBA works as follows.
Suppose we could find the region $P_i$ that is the closest to $\mathbf{x}$ and
its label is not $y$.
Then, the closest example in $P_i$ to $\mathbf{x}$ would be the optimal adversarial example.
RBA finds the closest region $P_i$ by iterating through each region that is labeled differently from $y$.
More formally, given the set of regions and their corresponding labels $(P_i, y_i)$, RBA solves
the following optimization problem:</p>
<div style="overflow-x: auto;">
\[
\underset{i : f(\mathbf{x}) \neq y_i }{\textcolor{red}{min}} \
\underset{\mathbf{x}_{adv} \in P_i}{\textcolor{ForestGreen}{min}} \|\mathbf{x} - \mathbf{x}_{adv}\|_p
\]
</div>
<p>The $\textcolor{red}{\text{outer $min$}}$ can be solved by iterating through all regions.
The $\textcolor{ForestGreen}{\text{inner $min$}}$ can be solved with standard linear programming when $p=1$ or $\infty$, and quadratic programming when $p=2$.
When this optimization problem is solved exactly, we call it RBA-Exact.</p>
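<p>As a concrete illustration, here is a sketch of RBA-Exact for a $1$-NN classifier under the $\ell_\infty$ norm (a toy implementation under stated assumptions, not the paper's released code; the example dataset is made up). Each candidate region is the Voronoi cell of a differently-labeled training point, which is a polytope described by linear constraints, so each inner minimization is a linear program:</p>

```python
import numpy as np
from scipy.optimize import linprog

def rba_1nn_linf(x, X_train, y_train):
    # Region-based attack on 1-NN under the l_inf norm.  Each candidate region
    # is the (Euclidean) Voronoi cell of a differently-labeled training point
    # x_i, the polytope { z : 2 (x_j - x_i) . z <= |x_j|^2 - |x_i|^2 for all j }.
    # Inner min over a cell: an LP in (z, t) minimizing t s.t. |z - x|_inf <= t.
    sq = (X_train ** 2).sum(axis=1)
    pred = y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]  # current label
    d = len(x)
    c = np.append(np.zeros(d), 1.0)  # objective: minimize t
    best_dist, best_adv = np.inf, None
    for i in np.where(y_train != pred)[0]:  # outer min: every opposing region
        A_ub, b_ub = [], []
        for j in range(len(X_train)):
            if j != i:  # z must be closer to x_i than to every other x_j
                A_ub.append(np.append(2.0 * (X_train[j] - X_train[i]), 0.0))
                b_ub.append(sq[j] - sq[i])
        for k in range(d):  # encode |z_k - x_k| <= t
            e = np.zeros(d + 1); e[k], e[d] = 1.0, -1.0
            A_ub.append(e); b_ub.append(x[k])
            e2 = np.zeros(d + 1); e2[k], e2[d] = -1.0, -1.0
            A_ub.append(e2); b_ub.append(-x[k])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * d + [(0, None)])
        if res.status == 0 and res.fun < best_dist:
            best_dist, best_adv = res.fun, res.x[:d]
    return best_dist, best_adv

X_train = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y_train = np.array([-1, 1, 1])
dist, adv = rba_1nn_linf(np.array([-0.5, 0.0]), X_train, y_train)
```

<p>For this toy instance the optimal $\ell_\infty$ perturbation reaches the boundary of the nearest differently-labeled Voronoi cell; restricting the outer loop to a fixed number of nearby regions turns this same routine into an approximate attack.</p>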
<p>Interestingly, concurrent works [<a href="https://arxiv.org/abs/1810.07481">1</a>,
<a href="https://arxiv.org/abs/1809.03008">2</a>, <a href="https://arxiv.org/abs/1711.07356">3</a>,
<a href="https://arxiv.org/abs/1903.08778">4</a>] have also shown that the decision regions of
ReLU networks decompose into convex regions, and have developed attacks based on this property.</p>
<p><strong>Speeding up RBA.</strong>
Different non-parametric methods divide the feature space into different numbers of regions.
When attacking $k$-NN, there would be $O(\binom{N}{k})$ regions, where $N$ is the number of training
examples.
When attacking RF, the number of regions grows exponentially with the number of trees.
It is computationally infeasible to solve RBA-Exact when the number of regions is large.</p>
<p>We develop an approximate version of RBA (RBA-Approx.) to speed up the process and make our algorithm applicable
to real datasets.
We relax the $\textcolor{red}{\text{outer $min$}}$ by iterating over only a fixed number of regions based on
the following two criteria.
First, a region has to have at least one training example in it to be considered.
Second, if $\mathbf{x}_i$ is the training example in the region $P_i$, then the
regions with smaller $\|\mathbf{x}_i - \mathbf{x}\|_p$ are considered first
until we exceed the number of regions we want to search.
Empirically, we found that searching $50$ regions chosen by these two criteria finds
adversarial examples very close to the target example.</p>
<h3 id="empirical-results">Empirical Results</h3>
<p>We empirically evaluate the performance of our attack (RBA) and defense (AP) algorithms.</p>
<p><strong>Evaluation criteria for attacks.</strong>
We use the distance between an input $\mathbf{x}$ and its generated adversarial example
$\mathbf{x}_{adv}$ to evaluate the performance of the attack algorithm.
We call this criterion <em>empirical robustness (ER)</em>.
The lower the ER, the better the attack algorithm.
We calculate the average ER over correctly predicted test examples.</p>
<p><strong>Evaluation criteria for defenses.</strong>
To evaluate the performance of a defense algorithm,
we use the ratio between the distance from an input $\mathbf{x}$ to its closest
adversarial example after the defense algorithm is applied and the same distance before.
We call this criterion <em>defense score ($\text{defscore}$)</em>.
More formally,</p>
<div style="overflow-x: auto;">
$$
\text{defscore}(\mathbf{x}) =
\frac{\text{defended dist. from } \mathbf{x} \text{ to } \mathbf{x}_{adv}}{\text{undefended dist. from } \mathbf{x} \text{ to } \mathbf{x}_{adv}}
= \frac{\text{ER w/ defense}}{\text{ER w/o defense}}.
$$
</div>
<p>We calculate the average defscore over the correctly predicted test examples.
A larger defscore means that the attack algorithm needs a larger perturbation to change the label,
and hence that the defense algorithm is more effective.
If the defscore is larger than one, then the defense has effectively made
the classifier more robust.</p>
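For concreteness, the averaged defscore can be computed as follows. This is a minimal sketch, assuming the per-example attack distances before and after the defense have already been measured:

```python
def average_defscore(defended_dists, undefended_dists):
    """Average, over correctly predicted test examples, of the ratio
    (distance to the closest adversarial example with the defense) /
    (distance without the defense).  A value above 1 means the defense
    forces the attacker to use larger perturbations."""
    scores = [d / u for d, u in zip(defended_dists, undefended_dists)]
    return sum(scores) / len(scores)
```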
<p>We consider the following non-parametric classifiers:
$1$-nearest neighbor ($1$-NN), $3$-nearest neighbor ($3$-NN), and random forest (RF).</p>
<p><strong>Attacks.</strong>
To evaluate RBA, we compare with other attack algorithms for non-parametric methods.
<a href="https://arxiv.org/abs/1605.07277">Direct attack</a> is designed to attack nearest neighbor classifiers.
<a href="https://arxiv.org/abs/1807.04457">Black box attack (BBox)</a> is another algorithm that applies to many
non-parametric methods.
However, as a black-box attack, it does not exploit the
internal structure of the classifier.
To our knowledge, BBox is the state-of-the-art algorithm for attacking non-parametric methods.</p>
<div style="width: 100%; overflow-x: auto;">
<table style="font-size: 80%; ">
<tr>
<th colspan="1"></th>
<th colspan="4">$1$-NN</th>
<th colspan="3">$3$-NN</th>
<th colspan="2">RF</th>
</tr>
<tr>
<th>Dataset</th>
<th>Direct</th> <th>BBox</th> <th>RBA Exact</th> <th>RBA Approx.</th>
<th>Direct</th> <th>BBox</th> <th>RBA Approx.</th>
<th>BBox</th> <th>RBA Approx.</th>
</tr>
<tr>
<td>cancer</td>
<td>.223</td> <td>.364</td> <td style="font-weight: bold">.137</td> <td style="font-weight: bold">.137</td>
<td>.329</td> <td>.376</td> <td style="font-weight: bold">.204</td>
<td>.451</td> <td style="font-weight: bold">.383</td>
</tr>
<tr>
<td>covtype</td>
<td>.130</td> <td>.130</td> <td style="font-weight: bold">.066</td> <td>.067</td>
<td>.200</td> <td>.259</td> <td style="font-weight: bold">.108</td>
<td>.233</td> <td style="font-weight: bold">.214</td>
</tr>
<tr>
<td>diabetes</td>
<td>.074</td> <td>.112</td> <td style="font-weight: bold">.035</td> <td style="font-weight: bold">.035</td>
<td>.130</td> <td>.143</td> <td style="font-weight: bold">.078</td>
<td style="font-weight: bold">.181</td> <td>.184</td>
</tr>
<tr>
<td>halfmoon</td>
<td>.070</td> <td>.070</td> <td style="font-weight: bold">.058</td> <td style="font-weight: bold">.058</td>
<td>.105</td> <td>.132</td> <td style="font-weight: bold">.096</td>
<td>.182</td> <td style="font-weight: bold">.149</td>
</tr>
</table>
</div>
<p>From these results, we see that the RBA algorithm performs well across many non-parametric methods
and datasets (for results on more datasets and classifiers, please refer to our
<a href="https://arxiv.org/abs/1906.03310">paper</a>).
For $1$-NN, RBA-Exact performed the best, as expected, since it is optimal.
For $3$-NN and RF, RBA-Approx. also performed the best in nearly all cases.</p>
<p><strong>Defenses.</strong>
As baselines, we consider <a href="https://arxiv.org/abs/1706.03922">WJC</a> for defending $1$-NN and
<a href="https://arxiv.org/abs/1902.10660">robust splitting (RS)</a> for tree-based classifiers.
Another baseline is <a href="https://arxiv.org/abs/1706.06083">adversarial training (AT)</a>,
which has had a lot of success with parametric classifiers.
To compute the defscore, we use RBA-Exact to attack $1$-NN and RBA-Approx. to attack $3$-NN and RF.</p>
<div style="width: 100%; overflow-x: auto;">
<table style="width:100%; font-size: 80%;">
<tr>
<th colspan="1"></th>
<th colspan="3">$1$-NN</th>
<th colspan="2">$3$-NN</th>
<th colspan="3">RF</th>
</tr>
<tr>
<th>Dataset</th>
<th>AT</th> <th>WJC</th> <th>AP</th>
<th>AT</th> <th>AP</th>
<th>AT</th> <th>RS</th> <th>AP</th>
</tr>
<tr>
<td>cancer</td>
<td>0.82</td> <td>1.05</td> <td style="font-weight: bold">1.41</td>
<td>1.06</td> <td style="font-weight: bold">1.39</td>
<td>0.87</td> <td style="font-weight: bold">1.54</td> <td>1.26</td>
</tr>
<tr>
<td>covtype</td>
<td>0.61</td> <td style="font-weight: bold">4.38</td> <td style="font-weight: bold">4.38</td>
<td>0.88</td> <td style="font-weight: bold">3.31</td>
<td>1.02</td> <td>1.01</td> <td style="font-weight: bold">2.13</td>
</tr>
<tr>
<td>diabetes</td>
<td>0.83</td> <td style="font-weight: bold">4.69</td> <td style="font-weight: bold">4.69</td>
<td>0.87</td> <td style="font-weight: bold">2.97</td>
<td>1.19</td> <td>1.25</td> <td style="font-weight: bold">2.22</td>
</tr>
<tr>
<td>halfmoon</td>
<td>1.05</td> <td>2.00</td> <td style="font-weight: bold">2.78</td>
<td>0.93</td> <td style="font-weight: bold">1.92</td>
<td>1.04</td> <td>1.01</td> <td style="font-weight: bold">1.82</td>
</tr>
</table>
</div>
<p>From the table, we see that AP performs well across different classifiers.
AP always achieves a defscore above $1.0$, which means the classifier becomes more robust after the defense.
This shows that AP is applicable to many non-parametric classifiers, as opposed to
WJC and RS, which are classifier-specific defenses.
AT performs poorly for non-parametric classifiers (this is aligned with previous
<a href="https://arxiv.org/abs/1902.10660">findings</a>).
This result demonstrates that AP can serve as a good baseline defense for any new non-parametric
classifier.</p>
<h3 id="conclusion">Conclusion</h3>
<p>In this blog post, we considered adversarial examples for non-parametric
classifiers and presented generic defense and attack algorithms.
The defense algorithm – adversarial pruning – bridges the gap between
$r$-separated and non-$r$-separated data by removing the minimum number of examples
to make the data well-separated.
Adversarial pruning can be interpreted as a finite sample approximation to the
$r$-optimal classifier, which is the most robust classifier under attack radius $r$.
The attack algorithm – region-based attack – finds the closest adversarial example
and achieves the optimal attack.
On the experiment side, we show that both these algorithms are able to perform well across
multiple non-parametric classifiers.
They can be good candidates for baseline evaluation of robustness for newly designed
non-parametric classifiers.</p>
<h3 id="more-details">More Details</h3>
<p>See <a href="https://arxiv.org/abs/1906.03310">our paper on arxiv</a> or <a href="https://github.com/yangarbiter/adversarial-nonparametrics">our repository</a>.</p><a href='http://yyyang.me/'>Yao-Yuan Yang</a>Adversarial robustness has received much attention recently. Prior defenses and attacks for non-parametric classifiers have been developed on a classifier-specific basis. In this post, we take a holistic view and present a defense and an attack algorithm that are applicable across many non-parametric classifiers. Our defense algorithm, adversarial pruning, works by preprocessing the dataset so the data is better separated. It can be interpreted as a finite sample approximation to the optimally robust classifier. The attack algorithm, region-based attack, works by decomposing the feature space into convex regions. We show that our defense and attack have good empirical performance over a range of datasets.Adversarial Robustness Through Local Lipschitzness2020-05-04T17:00:00+00:002020-05-04T17:00:00+00:00https://ucsdml.github.io//jekyll/update/2020/05/04/adversarial-robustness-through-local-lipschitzness<p>Neural networks are very susceptible to adversarial examples, a.k.a., small perturbations of normal inputs that cause a classifier to output the wrong label.
The standard defense against adversarial examples is <a href="https://arxiv.org/abs/1706.06083">Adversarial Training</a>, which trains a classifier using adversarial examples close to training inputs.
This improves test accuracy on adversarial examples, but it often lowers clean accuracy, sometimes by a lot.</p>
<p>Several recent papers investigate whether an accuracy-robustness trade-off is necessary.
Some <a href="https://arxiv.org/abs/1805.12152">pessimistic work</a> says that unfortunately this may be the case, possibly <a href="https://arxiv.org/abs/1801.02774">due to high-dimensionality</a> or <a href="https://arxiv.org/abs/1805.10204">computational infeasibility</a>.</p>
<p>If a trade-off is unavoidable, then we have a dilemma: should we aim for higher accuracy or robustness or somewhere in between?
Our <a href="https://arxiv.org/abs/2003.02460">recent paper</a> explores an optimistic perspective: we posit that robustness and accuracy should be attainable together for real image classification tasks.</p>
<p>The main idea is that we should use a locally smooth classifier, one that doesn’t change its value too quickly around the data. Let’s walk through some theory about why this is a good idea. Then, we will explore how to use this in practice.</p>
<h3 id="the-problem-with-natural-training">The problem with natural training</h3>
<p>The reason why we see a trade-off between robustness and accuracy is due to training methods. The best neural network optimization methods lead to functions that change very rapidly, as this allows the network to closely fit the data.</p>
<p>Since we care about robustness, we actually want to move as slowly as possible from class to class. This is especially true for separated data. Think about an image dataset. Cats look different than dogs, and pandas look different than gibbons. Quantitatively, different animals should be far apart (for example, in $\ell_{\infty}$ and $\ell_2$ distance). It follows that we should be able to classify them robustly. If we are very confident in our prediction, then as long as we don’t modify a true image too much, we should output the same, correct label.</p>
<p>So why do adversarial perturbations lead to a high error rate? This is a very active area of research, and there’s no easy answer.
As a step towards a better understanding, we present theoretical results on achieving perfect accuracy and robustness by using a locally smooth function. We also explore how well this works in practice.</p>
<p>As a motivating example, consider a simple 2D binary classification dataset. The goal is to find a decision boundary that has 100% training accuracy without passing closely to any individual input.
The orange curve in the following picture shows such a boundary. In contrast, the black curve comes very close to some data points. Even though both boundaries correctly classify all of the examples, the black curve is susceptible to adversarial examples, while the orange curve is not.</p>
<p style="text-align: center;"><img src="/assets/2020-05-04-local-lip/wig_boundary.png" width="40%" /></p>
<h3 id="perfect-accuracy-and-robustness-at-least-in-theory">Perfect accuracy and robustness, at least in theory</h3>
<p>We propose designing a classifier using the sign of a relatively smooth function. For separated data, this ensures that it’s impossible to change the label by slightly perturbing a true input. In other words, if the function value doesn’t change very quickly, then neither does the label.</p>
<p>More formally, we consider classifiers $g(x) = \mathsf{sign}(f(x))$, and we highlight the local Lipschitzness of $f$ as an important quantity. Simply put, the Lipschitz constant of a function measures how fast a function changes by dividing the difference between function values by the distance between inputs:
$\frac{|f(x) - f(y)|}{d(x,y)}.$
Here $d(x,y)$ can be any metric. It is most common to use $d(x,y) = \|x - y\|$ for some norm on $\mathbb{R}^d$.
Previous works (<a href="https://arxiv.org/abs/1811.05381">1</a>, <a href="https://arxiv.org/abs/1807.09705">2</a>) show that enforcing global Lipschitzness is too strict. Instead, we consider functions $f$ that are $L$-locally Lipschitz, meaning that $f$ changes slowly, at rate at most $L$, within a small neighborhood of radius $r$ around each point.</p>
<div class="definition">
A function $f: \mathcal{X} \rightarrow \mathbb{R}$ is $L$-Locally Lipschitz in a radius $r$ around a point $x \in \mathcal{X}$, if for all $x'$ such that $d(x,x') \leq r$, we have
$ |f(x) - f(x')| \leq L \cdot d(x, x').$
</div>
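The definition is easy to test numerically. The sketch below checks the inequality on sampled neighbors of $x$; sampling can only refute the bound, never certify it, and the function names here are ours, not the paper's.

```python
def local_lipschitz_holds(f, x, L, r, d, samples):
    """Check |f(x) - f(x')| <= L * d(x, x') for every sampled x' with
    d(x, x') <= r.  Returns False if any sample violates the bound."""
    return all(abs(f(x) - f(xp)) <= L * d(x, xp)
               for xp in samples if d(x, xp) <= r)
```

For example, with $f(t) = 2t$ and $d(x, x') = |x - x'|$, the bound holds for $L = 2$ but fails for $L = 1$.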
<p>Previous work by <a href="https://arxiv.org/abs/1705.08475">Hein and Andriushchenko</a> has shown that local Lipschitzness indeed guarantees robustness.
In fact, variants of Lipschitzness have been the main tool in certifying robustness with <a href="https://arxiv.org/abs/1902.02918">randomized smoothing</a> as well.
However, we are the first to identify a natural condition (data separation) that ensures both robustness and high test accuracy.</p>
<p>Our main theoretical result says that if the two classes are separated – in the sense that points from different classes are distance at least $2r$ apart, then there exists a $1/r$-locally Lipschitz function that is both robust to perturbations of distance $r$ and also 100% accurate.</p>
<p>For many real world datasets, the separation assumption in fact holds.</p>
<div style="text-align: center;">
<img src="/assets/2020-05-04-local-lip/cifar10_linf_hist.png" width="48%" style="margin: 0 auto" />
<img src="/assets/2020-05-04-local-lip/resImgNet_linf_hist.png" width="48%" style="margin: 0 auto" />
</div>
<p>For example, consider the CIFAR-10 and Restricted ImageNet datasets (for the latter, we removed a handful of images that appeared twice with different labels).
The figure shows the histogram of the $\ell_\infty$ distance of each training example to its closest differently-labeled example.
From the figure, we can see that the dataset is $0.21$-separated, indicating that there exists a solution that is both robust and accurate for perturbations of distance up to $0.105$.
Perhaps surprisingly, most work on adversarial examples considers small perturbations of size $0.031$ for CIFAR-10 and $0.005$ for Restricted ImageNet, which are both much less than the observed separation in these histograms.</p>
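The separation statistic behind these histograms is straightforward to compute. Below is a brute-force $O(n^2)$ sketch over labeled points; the actual experiments compute the same per-example nearest-other-class $\ell_\infty$ distances on the full image datasets.

```python
def linf(a, b):
    """l_inf distance between two points given as sequences."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

def separation(examples):
    """Minimum l_inf distance between any pair of differently-labeled
    examples; `examples` is a list of (point, label) pairs.  A brute
    force O(n^2) sketch of the statistic plotted in the histograms."""
    return min(linf(x, y)
               for x, lx in examples
               for y, ly in examples
               if lx != ly)
```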
<div class="theorem">
If the data is $2r$-separated, then there always exists a classifier that is perfectly robust and accurate, which is based on a function with local Lipschitz constant $1/r$.
</div>
<p>We basically use a scaled version of the 1-nearest neighbor classifier in the infinite sample limit. The proof just uses the data separation along with a few applications of the triangle inequality. The next figure shows our theorem in action on the Spiral dataset. The classifier $g(x) = \mathsf{sign}(f(x))$ has high adversarial and clean accuracy, while the small local Lipschitz constant ensures that it gradually changes near the decision boundaries.</p>
<figure class="image" style="text-align: center;">
<img src="/assets/2020-05-04-local-lip/spiral.png" width="40%" style="margin: 0 auto" />
<figcaption>
Function and resulting classifier from our theorem.
The prediction is confident most of the time, and it gradually changes between classes (orange to blue).
</figcaption>
</figure>
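One way to realize a function in the spirit of the theorem is a signed distance-to-class score. This is a sketch of the idea only (the paper's construction is a scaled nearest-neighbor function in the infinite sample limit): each distance-to-class term is 1-Lipschitz when $d$ is induced by a norm, so halving their difference keeps the score 1-Lipschitz.

```python
def nearest_dist(x, points, d):
    """Distance from x to the closest point in `points` under metric d."""
    return min(d(x, p) for p in points)

def smooth_nn_score(x, pos, neg, d):
    """Signed score: positive when x is closer to the positive class,
    negative otherwise; sign(score) is a 1-NN-style classifier.
    A sketch of the idea, not the paper's exact construction."""
    return (nearest_dist(x, neg, d) - nearest_dist(x, pos, d)) / 2
```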
<h3 id="encouraging-the-smoothness-of-neural-networks">Encouraging the smoothness of neural networks</h3>
<p>Now that we’ve made a big deal of local Lipschitzness, and provided some theory to back it up, we want to see how well this holds up in practice. Two questions drive our experiments:</p>
<ul>
<li>Is local Lipschitzness correlated with robustness and accuracy in practice?</li>
<li>Which training methods produce locally Lipschitz functions?</li>
</ul>
<p>We also need to explain how we measure Lipschitzness on real data. For simplicity, we consider the average local Lipschitzness, computed using</p>
<p>\[
\frac{1}{n}\sum_{i=1}^n\max_{x_i’\in\mathsf{Ball}(x_i,\epsilon)}\frac{|f(x_i)-f(x_i’)|}{\|x_i-x_i’\|_\infty}.
\]</p>
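In practice the inner maximization must be approximated; our experiments use gradient-based search, but a cheap random-sampling stand-in already illustrates the quantity. The sketch below is for illustration only and can only lower-bound the true maximum.

```python
import random

def linf(a, b):
    return max(abs(ai - bi) for ai, bi in zip(a, b))

def avg_local_lipschitz(f, xs, eps, trials=200, seed=0):
    """Monte-Carlo estimate of the average local Lipschitz quantity:
    for each x, sample x' in the l_inf ball of radius eps and keep the
    largest observed ratio |f(x) - f(x')| / ||x - x'||_inf.  Random
    sampling lower-bounds the inner max; gradient-based search is used
    in the actual experiments."""
    rng = random.Random(seed)
    total = 0.0
    for x in xs:
        best = 0.0
        for _ in range(trials):
            xp = [xi + rng.uniform(-eps, eps) for xi in x]
            dist = linf(x, xp)
            if dist > 0:
                best = max(best, abs(f(x) - f(xp)) / dist)
        total += best
    return total / len(xs)
```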
<p>The benefit of averaging is that we only require the function to be smooth on average, which tolerates a few outlier examples.
One of the best methods for adversarial examples is <a href="https://arxiv.org/abs/1901.08573">TRADES</a>, which encourages local Lipschitzness by minimizing the following loss function:</p>
<p>\[
\min_{f} \mathbb{E} \Big\{\mathcal{L}(f(X),Y)+\beta\max_{X’\in\mathsf{Ball}(X,\epsilon)} \mathcal{L}(f(X),f(X’))\Big\}.
\]</p>
<p>TRADES is different than <a href="https://arxiv.org/abs/1706.06083">Adversarial Training (AT)</a>, which optimizes the following:</p>
<p>\[
\min_{f} \mathbb{E} \Big\{\max_{X’\in\mathsf{Ball}(X,\epsilon)}\mathcal{L}(f(X’),Y)\Big\}.
\]</p>
<p>AT directly optimizes over adversarial examples, while TRADES encourages $f(X)$ and $f(X’)$ to be similar when $X$ and $X’$ are close to each other. The TRADES parameter $\beta$ controls the local smoothness (larger $\beta$ means a smaller Lipschitz constant).</p>
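To make the difference concrete, here is a toy 1-D binary-classification version of both objectives, with the inner maximization replaced by a grid search over the perturbation interval. This is a sketch, not how either method is implemented for neural networks (which use PGD-style inner optimization).

```python
import math

def ce(q, p):
    """Cross-entropy of predicted probability q against a (possibly
    soft) target probability p, clipped for numerical safety."""
    q = min(max(q, 1e-7), 1 - 1e-7)
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

def at_loss(f, x, y, eps, grid=21):
    """Adversarial training: worst-case loss against the label y over
    the interval [x - eps, x + eps], approximated by a grid search."""
    ts = (eps * (2 * i / (grid - 1) - 1) for i in range(grid))
    return max(ce(f(x + t), y) for t in ts)

def trades_loss(f, x, y, eps, beta, grid=21):
    """TRADES: natural loss plus beta times the worst-case mismatch
    between f(x') and f(x) over the same interval."""
    ts = (eps * (2 * i / (grid - 1) - 1) for i in range(grid))
    return ce(f(x), y) + beta * max(ce(f(x + t), f(x)) for t in ts)
```

For a constant predictor $f \equiv 0.5$ and label $y = 1$, the AT loss equals $\log 2$, while the TRADES loss adds $\beta$ times the consistency term.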
<p>We also consider two other plausible methods for achieving accuracy and robustness, along with local Lipschitzness:
<a href="https://arxiv.org/abs/1907.02610">Local Linear Regularization (LLR)</a>
and <a href="https://arxiv.org/abs/1905.11468">Gradient Regularization (GR)</a>.</p>
<h3 id="comparing-five-different-training-methods">Comparing five different training methods</h3>
<p>Here we provide experimental results for CIFAR-10 and Restricted ImageNet. See our paper for other datasets (MNIST and SVHN).</p>
<table>
<thead>
<tr>
<th style="text-align: left">CIFAR-10</th>
<th style="text-align: center">train accuracy</th>
<th style="text-align: center">test accuracy</th>
<th style="text-align: center">adv test accuracy</th>
<th style="text-align: center">test lipschitz</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Natural</td>
<td style="text-align: center">100.00</td>
<td style="text-align: center">93.81</td>
<td style="text-align: center">0.00</td>
<td style="text-align: center">425.71</td>
</tr>
<tr>
<td style="text-align: left">GR</td>
<td style="text-align: center">94.90</td>
<td style="text-align: center">80.74</td>
<td style="text-align: center">21.32</td>
<td style="text-align: center">28.53</td>
</tr>
<tr>
<td style="text-align: left">LLR</td>
<td style="text-align: center">100.00</td>
<td style="text-align: center">91.44</td>
<td style="text-align: center">22.05</td>
<td style="text-align: center">94.68</td>
</tr>
<tr>
<td style="text-align: left">AT</td>
<td style="text-align: center">99.84</td>
<td style="text-align: center">83.51</td>
<td style="text-align: center">43.51</td>
<td style="text-align: center">26.23</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=1)</td>
<td style="text-align: center">99.76</td>
<td style="text-align: center">84.96</td>
<td style="text-align: center">43.66</td>
<td style="text-align: center">28.01</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=3)</td>
<td style="text-align: center">99.78</td>
<td style="text-align: center">85.55</td>
<td style="text-align: center">46.63</td>
<td style="text-align: center">22.42</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=6)</td>
<td style="text-align: center">98.93</td>
<td style="text-align: center">84.46</td>
<td style="text-align: center">48.58</td>
<td style="text-align: center">13.05</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: left">Restricted ImageNet</th>
<th style="text-align: center">train accuracy</th>
<th style="text-align: center">test accuracy</th>
<th style="text-align: center">adv test accuracy</th>
<th style="text-align: center">test lipschitz</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Natural</td>
<td style="text-align: center">97.72</td>
<td style="text-align: center">93.47</td>
<td style="text-align: center">7.89</td>
<td style="text-align: center">32228.51</td>
</tr>
<tr>
<td style="text-align: left">GR</td>
<td style="text-align: center">91.12</td>
<td style="text-align: center">88.51</td>
<td style="text-align: center">62.14</td>
<td style="text-align: center">886.75</td>
</tr>
<tr>
<td style="text-align: left">LLR</td>
<td style="text-align: center">98.76</td>
<td style="text-align: center">93.44</td>
<td style="text-align: center">52.65</td>
<td style="text-align: center">4795.66</td>
</tr>
<tr>
<td style="text-align: left">AT</td>
<td style="text-align: center">96.22</td>
<td style="text-align: center">90.33</td>
<td style="text-align: center">82.25</td>
<td style="text-align: center">287.97</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=1)</td>
<td style="text-align: center">97.39</td>
<td style="text-align: center">92.27</td>
<td style="text-align: center">79.90</td>
<td style="text-align: center">2144.66</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=3)</td>
<td style="text-align: center">95.74</td>
<td style="text-align: center">90.75</td>
<td style="text-align: center">82.28</td>
<td style="text-align: center">396.67</td>
</tr>
<tr>
<td style="text-align: left">TRADES($\beta$=6)</td>
<td style="text-align: center">93.34</td>
<td style="text-align: center">88.92</td>
<td style="text-align: center">82.13</td>
<td style="text-align: center">200.90</td>
</tr>
</tbody>
</table>
<p>For both datasets, we see correlation between accuracy, Lipschitzness, and adversarial accuracy. For example, on CIFAR-10, we see that TRADES($\beta$=6) achieves the highest adversarial test accuracy (48.58), and also the lowest Lipschitz constant (13.05). TRADES may not always perform better than AT, but it seems like a very effective method to produce classifiers with small local Lipschitz constants. One issue is that the training accuracy isn’t as high as it could be, and there are some issues with tuning the methods to prevent underfitting. In general, we focus on understanding the role of Lipschitzness.</p>
<p>Natural training has the lowest adversarial accuracy and also the highest Lipschitz constant. GR has a fairly low training accuracy (possibly due to underfitting).
For LLR, AT, and TRADES, we see that smoother classifiers have higher adversarial test accuracy as well. However, this is only true up to some point. Increased local Lipschitzness helps, but with very high local Lipschitzness, the neural networks start underfitting which leads to loss in accuracy, for example, with TRADES($\beta$=6).</p>
<h3 id="robustness-requires-some-local-lipschitzness">Robustness requires some local Lipschitzness</h3>
<p>Our experimental results provide many insights into the role that Lipschitzness plays in classifier accuracy and robustness.</p>
<ul>
<li>
<p>A clear takeaway is that <em>very high</em> Lipschitz constants imply that the classifier is vulnerable to adversarial examples. We see this most clearly with natural training, but it is also evidenced by GR and LLR.</p>
</li>
<li>
<p>For both CIFAR and Restricted ImageNet, the experiments show that minimizing the Lipschitzness goes hand-in-hand with maximizing the adversarial accuracy. This highlights that Lipschitzness is just as important as training with adversarial examples when it comes to improving the adversarial robustness.</p>
</li>
<li>
<p>TRADES consistently leads to significantly smaller Lipschitz constants than most other methods, and the smoothness increases with the TRADES parameter $\beta$. However, the correlation between smoothness and robustness suffers from diminishing returns: it is not optimal to minimize the Lipschitz constant as much as possible.</p>
</li>
<li>
<p>The main downside of AT and TRADES is that the clean accuracy suffers. This issue may not be inherent to robustness, but rather it may be possible to achieve the best of both worlds. For example, LLR is consistently more robust than natural training, while simultaneously achieving state-of-the-art clean test accuracy. This leaves open the possibility of combining the benefits of both LLR and AT/TRADES into a classifier that does well across the board. This is the main future work!</p>
</li>
</ul>
<h3 id="more-details">More Details</h3>
<p>See <a href="https://arxiv.org/abs/2003.02460">our paper on arxiv</a> or <a href="https://github.com/yangarbiter/robust-local-lipschitz">our repository</a>.</p><a href='https://sites.google.com/site/cyrusrashtchian/'>Cyrus Rashtchian</a> and <a href='http://yyyang.me/'>Yao-Yuan Yang</a>Robustness often leads to lower test accuracy, which is undesirable. We prove that (i) if the dataset is separated, then there always exists a robust and accurate classifier, and (ii) this classifier can be obtained by rounding a locally Lipschitz function. Empirically, we verify that popular datasets (MNIST, CIFAR-10, and ImageNet) are separated, and we show that neural networks with a small local Lipschitz constant indeed have high test accuracy and robustness.