The kernel density estimator (KDE) is a simple and popular tool for nonparametric density estimation. In one dimension it is given by

\[ \hat{p}_{KDE}(x) = \frac{1}{Nh} \sum_{i=1}^NK\left(\frac{x-X_i}{h}\right). \]

\(K\) is a kernel (say, a density with variance 1 for simplicity). The KDE has a simple closed form, and there is an extensive literature on its theoretical justification. One conceptual difficulty with the KDE is that it is not presented as the solution to an optimization problem. Most statistics and ML algorithms, from PCA to SVM to k-means, are either formulated as an optimization or can alternatively be formulated as the solution to one. For me, an optimization formulation gives better intuition and often provides a decision-theoretic justification for the procedure. For example, AdaBoost, the first practical boosting algorithm, benefited from new insights and extensions once it was discovered to be essentially a greedy algorithm for minimizing the exponential classification loss.
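
For concreteness, here is a minimal sketch of the KDE formula above in Python; the Gaussian kernel, the bandwidth, and the simulated sample are arbitrary illustrative choices of mine, not anything prescribed by the formula.

```python
import numpy as np

# Minimal sketch of the 1-D KDE formula above. The Gaussian kernel, the
# bandwidth h, and the simulated sample are arbitrary illustrative choices.

def kde(x, X, h):
    """Evaluate (1/(N h)) * sum_i K((x - X_i) / h) with a standard Gaussian K."""
    u = (np.asarray(x)[..., None] - X[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # variance-1 Gaussian kernel
    return K.sum(axis=-1) / (len(X) * h)

rng = np.random.default_rng(0)
X = rng.normal(size=500)                       # sample X_1, ..., X_N
print(kde(np.linspace(-3, 3, 7), X, h=0.3))    # estimates at a few grid points
```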

A few years ago I came across the paper What do Kernel Density Estimators Optimize? by Koenker et al. It draws some interesting connections between the heat equation and KDEs, but the theorem I find most interesting is the following:

Theorem 1: Let \(\mathbb{E}_n[\cdot]\) denote the expectation operator with respect to the empirical distribution of a sample \(X_1,\ldots,X_N\). The solution to

\[\min_f \left\{ -\mathbb{E}_n[f(X,\lambda)]+ \frac{1}{2} \intop f(x,\lambda)^2 dx +\frac{\lambda}{2} \intop \left(\frac{\partial f(x,\lambda)}{\partial x}\right)^2 dx \right\}\]

is given by

\[ \hat{p}(x) = \frac{1}{2N\sqrt{\lambda}} \sum_{i=1}^N \exp \{ -| x-X_i | /\sqrt{\lambda} \}. \]

This is precisely the KDE with a Laplace kernel. In Section 4 of the paper, the authors show that changing the third term in the optimization to a different penalty yields a KDE with a different choice of kernel. As with other kernel methods, such as the kernel SVM, each kernel corresponds to a norm in a reproducing kernel Hilbert space (RKHS), and changing the penalty term to that norm results in a kernel estimator with the kernel of that RKHS. For example, we could change the penalty from

\( \intop \left(\frac{\partial f(x,\lambda)}{\partial x}\right)^2 dx \) to \( \intop \left(\frac{\partial^2 f(x,\lambda)}{\partial x^2}\right)^2 dx\), and we would get a KDE with a kernel of the form \(K_\sigma(x)=\frac{1}{2\sigma}\exp\{-| x/\sigma |/\sqrt{2}\}\sin\left(| x/\sigma |/\sqrt{2}+\pi/4\right)\).

This is a fascinating result: the KDE is in fact the solution to an optimization problem. Moreover, this optimization has some deep foundations, which I will describe in the next section.
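
Before moving on, here is a quick numerical sanity check of Theorem 1. It is my own sketch (the sample, the grid, and the value of \(\lambda\) are arbitrary choices), not code from the paper: discretize the objective on a grid, solve the resulting quadratic problem exactly, and compare the minimizer with the closed-form Laplace-kernel KDE. The two should agree up to discretization error.

```python
import numpy as np

# Sanity check of Theorem 1 (my own sketch, not code from the paper):
# discretize -E_n[f] + (1/2)∫f² dx + (λ/2)∫(f')² dx on a grid, solve the
# resulting quadratic problem exactly, and compare with the Laplace-kernel KDE.

rng = np.random.default_rng(0)
X = rng.normal(size=200)                # sample X_1, ..., X_N
N, lam = len(X), 0.05                   # λ; the implied bandwidth is sqrt(λ)

# grid extending well past the data so boundary effects are negligible
x = np.linspace(X.min() - 5.0, X.max() + 5.0, 1201)
dx = x[1] - x[0]

# -E_n[f] term: put mass 1/N at the grid point nearest each X_i
b = np.zeros(len(x))
j = np.clip(np.searchsorted(x, X), 1, len(x) - 1)
j = np.where(np.abs(x[j - 1] - X) < np.abs(x[j] - X), j - 1, j)
np.add.at(b, j, 1.0 / N)

# quadratic terms via forward differences: the minimizer solves A f = b
D = (np.eye(len(x), k=1) - np.eye(len(x)))[:-1] / dx
A = dx * np.eye(len(x)) + lam * dx * D.T @ D
f_opt = np.linalg.solve(A, b)           # minimizer of the discretized objective

# closed-form solution: Laplace-kernel KDE with bandwidth sqrt(λ)
h = np.sqrt(lam)
f_kde = np.exp(-np.abs(x[:, None] - X[None, :]) / h).sum(axis=1) / (2 * N * h)

print("max |f_opt - f_kde| :", np.max(np.abs(f_opt - f_kde)))
print("max f_kde           :", f_kde.max())   # for scale; the gap should be small
```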

Scoring rules

A scoring rule is a function \( s(x,Q)\) which measures how well a distribution \(Q\) predicts an observation \(x\) drawn from an unknown distribution \(P\). A scoring rule is proper if the expected score is minimized by the choice \(Q=P\):

\[ P = \text{argmin}_Q \mathbb{E}_{X\sim P} [s(X,Q)].\]

Every proper scoring rule has a corresponding entropy:

\[ H(P) = \mathbb{E}_{X\sim P} [s(X,P)], \]

and a divergence:

\[ D(P,Q) =  \mathbb{E}_{X\sim P}[ s(X,Q)] -  \mathbb{E}_{X\sim P}[ s(X,P)], \]

so there are intimate ties between proper scoring rules and information theory. I won’t get into too much detail about proper scoring rules here; I discuss them in depth in my dissertation, including examples of several estimators which correspond to optimizing particular proper scoring rules. But in general, an estimator which optimizes a proper scoring rule is going to be consistent (this can be made rigorous). From the formulas above, minimizing a proper scoring rule with respect to the data, \( \min_Q\mathbb{E}_n[s(X,Q)]\), is the same as finding a \(Q\) which minimizes the divergence to the empirical distribution \(\hat{P}\), \(\min_Q D(\hat{P},Q)\), since the entropy term \(H(\hat{P})\) does not depend on \(Q\). Basically, any proper scoring rule will find an estimate “close” to the empirical distribution of the data, but the measure of closeness differs by scoring rule.
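
As a quick illustration of properness, here is a toy example of my own (the distribution \(P\) and the random competitors \(Q\) are arbitrary choices): for the log score \(s(x,Q) = -\log q(x)\) on a three-outcome distribution, the expected score under \(P\) is minimized at \(Q = P\), so the divergence is nonnegative.

```python
import numpy as np

# Toy illustration of properness: for the log score s(x, Q) = -log q(x) on a
# three-outcome distribution, E_{X~P}[s(X, Q)] is minimized at Q = P, so the
# divergence D(P, Q) is nonnegative. P and the random Q's are arbitrary choices.

rng = np.random.default_rng(1)
P = np.array([0.5, 0.3, 0.2])

def expected_log_score(P, Q):
    return -(P * np.log(Q)).sum()            # E_{X~P}[-log q(X)]

H_P = expected_log_score(P, P)               # the score's entropy (Shannon entropy)
for _ in range(5):
    Q = rng.dirichlet(np.ones(3))            # a random competing distribution
    D = expected_log_score(P, Q) - H_P       # D(P, Q); equals KL(P, Q) here
    print(f"Q = {np.round(Q, 3)},  D(P, Q) = {D:.4f}")
```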

By far the most common proper scoring rule is the log score,

\[ l(x,Q) = -\log q(x), \]

where \(q = Q'\). Minimizing the empirical log score is the same as minimizing the Kullback-Leibler divergence, \(KL(P,Q)= \mathbb{E}_P[\log p(X)/q(X)]\), to the empirical distribution. More familiarly, this is just the ubiquitous maximum likelihood. Maximum likelihood is so popular because under mild conditions it has very strong guarantees (asymptotically unbiased, efficient, and normal). Other scoring rules may produce a sub-optimal estimator, but there are often situations (such as computational limitations) which lead to considering another scoring rule.
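
Here is a tiny worked example of my own (an arbitrary Bernoulli model) of the equivalence between minimizing the empirical log score and maximum likelihood: over a grid of candidate success probabilities, the empirical log score is minimized at (the grid point nearest) the sample mean, which is the MLE.

```python
import numpy as np

# Minimizing the empirical log score = maximum likelihood, on a toy Bernoulli
# model. The data-generating probability 0.7 and the grid are arbitrary choices.

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.7, size=1000)          # data from Bernoulli(0.7)

qs = np.linspace(0.01, 0.99, 99)             # candidate models Q = Bernoulli(q)
emp_score = [-np.mean(x * np.log(q) + (1 - x) * np.log(1 - q)) for q in qs]

print("argmin of empirical log score:", qs[np.argmin(emp_score)])
print("sample mean (the MLE):        ", x.mean())   # agrees up to grid resolution
```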

Tsallis score

One class of proper scoring rules is the Tsallis score:

\[ t_{\gamma}(x,Q) = (\gamma-1)\intop q(y)^\gamma dy - \gamma q(x)^{\gamma-1}.\]

Up until now, the only interesting thing I’ve known about the Tsallis score is that its entropy is the Tsallis entropy used in statistical mechanics. I had never seen it used in statistical inference. One curious property of the Tsallis score is that it is not local: the score at an observation \(x\) depends on evaluating the density at points other than \(x\) (through the integral over \(q\)).
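
To see the non-locality concretely, here is a small numerical sketch of my own (the normal densities and the integration grid are arbitrary choices): the \(\gamma = 2\) score \(t_2(x,Q) = \intop q(y)^2 dy - 2q(x)\) requires an integral of \(q\) over the whole line, and its expectation under \(P\) is still minimized at \(Q = P\).

```python
import numpy as np
from scipy.stats import norm

# The γ = 2 Tsallis score, t_2(x, Q) = ∫q(y)² dy - 2 q(x), is non-local: it needs
# q everywhere, not just at x. Toy check (arbitrary normal densities) that the
# expected score under P is still minimized at Q = P.

y = np.linspace(-10, 10, 20001)              # integration grid
dy = y[1] - y[0]

def expected_t2(p_pdf, q_pdf):
    """E_{X~P}[t_2(X, Q)] = ∫q² dy - 2 ∫ p q dy, by numerical integration."""
    q = q_pdf(y)
    return (q**2).sum() * dy - 2 * (p_pdf(y) * q).sum() * dy

P = norm(0, 1).pdf
for mu, sd in [(0.0, 1.0), (0.5, 1.0), (0.0, 1.5), (1.0, 2.0)]:
    print(f"Q = N({mu}, {sd}^2):  E_P[t_2(X, Q)] = {expected_t2(P, norm(mu, sd).pdf):.4f}")
# the first case (Q = P) gives the smallest expected score
```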

Now, by a simple observation (spelled out at the end of this section), we can show that the KDE is the solution to

\[ \min_q \left\{ \frac{1}{2} \mathbb{E}_n [ t_2(X,q) ] + \frac{\lambda}{2} R(q) \right\}, \]

where \(R(q)\) is a roughness penalty given by a squared norm in a reproducing kernel Hilbert space (RKHS); in Theorem 1, \(R(q)=\intop (q'(x))^2 dx\). In words, the KDE minimizes the empirical Tsallis score of the data plus a roughness penalty. This shows a close connection to other common density estimators, such as the penalized maximum likelihood estimator, which minimizes the penalized log score:

\[ \min_q \left\{-\mathbb{E}_n [ \log q(X) ] + \frac{\lambda}{2} R(q)\right\}, \]

with the added constraint that \(\intop q(x) dx =1\).
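
To spell out the “simple observation”: setting \(\gamma = 2\) in the Tsallis score gives \(t_2(x,q) = \intop q(y)^2 dy - 2 q(x)\), so that

\[ \frac{1}{2}\mathbb{E}_n[t_2(X,q)] = \frac{1}{2}\intop q(y)^2 dy - \mathbb{E}_n[q(X)], \]

which is exactly the first two terms of the objective in Theorem 1, with the roughness penalty \(R(q) = \intop (q'(x))^2 dx\) supplying the third.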

Conclusion

There are two interesting insights to be learned here. The first is that the KDE has solid decision-theoretic foundations, though I’ve never seen them stated explicitly. The KDE is usually interpreted as a “smoothed histogram”, which is true, but it has deeper justifications. That the KDE pops out as the solution to optimizing the Tsallis score, a seemingly arbitrary scoring rule I had never seen used before, is perplexing.

The second insight is that optimizing the regularized Tsallis score gives an estimator with a closed form. I describe several other regularized scoring-rule estimators in my dissertation, but they don’t have such a simple solution. Granted, finding the solution required solving a PDE (the heat equation), but I wonder if there are more insights and connections to be made with some knowledge of PDEs and variational calculus.
