-
A CLT for the Differential Entropy of High-Dimensional Gaussian Distributions
Most of my research in graduate school was broadly in the area of multivariate analysis, of which covariance estimation is an important subject, and I gave a seminar talk on this paper at the time. Results in random matrix theory can often be quite complicated, but the proofs for this problem are surprisingly elegant.
The differential entropy is defined for a density \(p\) as
\[ H(p) = -\mathbb{E}_p[\log p(X)] . \]
For a \(D\)-dimensional Gaussian \(N(\mu,\Sigma)\), this is given by the formula
\[ H(p) = \frac{D}{2}+\frac{D\log (2\pi)}{2} +\frac{\log \mid \Sigma \mid}{2},\]
where \( \mid \cdot \mid\) denotes the determinant. So for the Gaussian problem, estimating entropy amounts to estimating the log-determinant of the covariance matrix. Note that one representation of the log-determinant is as the sum of the log-eigenvalues:
\[ \log \mid \hat{\Sigma} \mid = \sum_i \log \lambda_i ,\]
where \(\lambda_i\) are the eigenvalues of the estimate \(\hat{\Sigma}\).
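As a quick sanity check of the entropy formula, here is a minimal numpy sketch (my own illustration, not from the paper) that computes the log-determinant as a sum of log-eigenvalues and compares the resulting entropy against scipy's built-in value:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 5
A = rng.standard_normal((D, D))
Sigma = A @ A.T + np.eye(D)  # an arbitrary positive-definite covariance

# log-determinant as the sum of the log-eigenvalues
log_det = np.sum(np.log(np.linalg.eigvalsh(Sigma)))

# differential entropy of N(mu, Sigma), in nats
H = 0.5 * D + 0.5 * D * np.log(2 * np.pi) + 0.5 * log_det

# agrees with scipy's closed-form entropy for the same Gaussian
print(H, multivariate_normal(mean=np.zeros(D), cov=Sigma).entropy())
```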
-
Pitman Closeness, a strange alternative to risk
Here is the curious story of a one-time alternative to the accepted notions of statistical optimality. Today, when we talk about decision theory, we think of the risk, the expected loss of a particular decision rule. However, at one point in the history of Statistics, there was another candidate. Pitman Closeness makes a lot of sense conceptually, and it generated quite a bit of interest in past decades. However, it can lead you to some strange conclusions, and as such it has not stood the test of time.
Statistical decision theory begins by considering an observation \(x\) drawn from a distribution \(F(x\mid \theta)\) parametrized by \(\theta\), a decision rule \(\delta\), which is a measurable function of the data \(x\), and a loss function \(L(\theta,\delta(x))\), which measures the loss from taking the action \(\delta(x)\). The risk is defined as the expected loss,
\[ R(\theta,\delta) = \mathbb{E}_\theta\left[ L(\theta,\delta(X)) \right] . \]
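To make the contrast concrete, here is a small Monte Carlo sketch (a toy setup of my own, not from the original post) that estimates the squared-error risk of two estimators of a normal mean, along with the probability that one lands strictly closer to \(\theta\) than the other, which is the quantity Pitman Closeness compares to 1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 100_000

# reps independent samples of size n from N(theta, 1)
X = rng.normal(theta, 1.0, size=(reps, n))

delta1 = X.mean(axis=1)        # sample mean
delta2 = np.median(X, axis=1)  # sample median, as a competitor

# Monte Carlo estimates of the risk under squared-error loss
risk_mean = np.mean((delta1 - theta) ** 2)    # close to 1/n
risk_median = np.mean((delta2 - theta) ** 2)  # larger than the mean's risk

# how often the sample mean is strictly closer to theta than the median
closeness = np.mean(np.abs(delta1 - theta) < np.abs(delta2 - theta))

print(risk_mean, risk_median, closeness)
```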
-
The kernel density estimator minimizes the regularized Tsallis score
The kernel density estimator (KDE) is a simple and popular tool for nonparametric density estimation. In one dimension it is given by
\[ \hat{p}_{KDE}(x) = \frac{1}{Nh} \sum_{i=1}^NK\left(\frac{x-X_i}{h}\right). \]
\(K\) is a kernel (let’s say a variance-1 density for simplicity). The estimator has a simple closed form, and there is an extensive literature on its theoretical justification. One conceptual difficulty with the KDE is that it is not represented as the solution to an optimization problem. Most statistics and ML algorithms, from PCA to SVM to k-means, are either formulated as an optimization or can be recast as the solution to one. This gives me better intuition, and it often provides a decision-theoretic justification for the method. For example, AdaBoost, the first boosting algorithm, benefited from new insights and extensions once it was discovered to be essentially a greedy algorithm for optimizing the exponential classification loss.
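Before turning to the paper below, here is a minimal numpy sketch of the one-dimensional estimator written above, with a Gaussian kernel; the bandwidth and data are arbitrary choices for illustration:

```python
import numpy as np

def kde(x, data, h):
    """Gaussian-kernel density estimate at the points x with bandwidth h."""
    z = (x[:, None] - data[None, :]) / h          # shape (len(x), N)
    K = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)  # standard normal kernel
    # (1 / (N h)) * sum_i K((x - X_i) / h)
    return K.mean(axis=1) / h

rng = np.random.default_rng(0)
X = rng.normal(size=500)            # sample from N(0, 1)
grid = np.linspace(-4.0, 4.0, 201)
p_hat = kde(grid, X, h=0.4)         # estimated density on the grid
```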
A few years ago I came across the paper What do Kernel Density Estimators Optimize? by Koenker et al. It draws some interesting connections between the heat equation and KDEs, but the theorem I find most interesting is the following: