In graduate school most of my research was broadly in the area of multivariate analysis, of which covariance estimation is an important subject. I gave a seminar talk on this paper when I was in grad school. Often results in random matrix theory can be quite complicated, but the proofs for this problem are surprisingly elegant.

The differential entropy is defined for a density \(p\) as

\[ H(p) = -\mathbb{E}_p[\log p(X)] . \]

For a \(D\)-dimensional Gaussian \(N(\mu,\Sigma)\), this is given by the formula

\[ H(p) = \frac{D}{2}+\frac{D\log (2\pi)}{2} +\frac{\log \mid \Sigma \mid}{2},\]

where \( \mid \cdot \mid\) denotes the determinant. So for the Gaussian problem, estimating entropy amounts to estimating the log-determinant of the covariance matrix. Note that one representation for the log-determinant is as the sum of the log-eigenvalues:

\[ \log \mid \hat{\Sigma} \mid = \sum_i \log \lambda_i .\]
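As a quick sanity check (this snippet is mine, not from the paper), we can verify the entropy formula numerically against scipy's built-in Gaussian entropy; the covariance below is arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sanity check of the entropy formula (arbitrary covariance).
D = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)                 # a positive-definite covariance

# log |Sigma| computed stably (equivalently, the sum of log-eigenvalues).
_, logdet = np.linalg.slogdet(Sigma)
H = 0.5 * D + 0.5 * D * np.log(2 * np.pi) + 0.5 * logdet

print(H, multivariate_normal(np.zeros(D), Sigma).entropy())  # the two agree
```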

Since the determinant depends on all of the eigenvalues of the random matrix, which are generally dependent, we might expect that obtaining a limiting distribution would be difficult. In fact, we can derive a relatively simple finite-sample representation of the log-determinant, which naturally leads to a CLT.

The determinant and log-determinant of a Gaussian sample covariance matrix appear frequently in multivariate analysis, for example in multivariate ANOVA for comparing multivariate samples and in quadratic discriminant analysis, a generative method for classification.

The sample covariance for a sample of i.i.d. random Gaussian vectors \(X_1,\ldots,X_N\) is given by

\[ \hat{\Sigma} = \frac{1}{N} \sum_{i=1}^N (X_i - \bar{X})(X_i - \bar{X})^\top. \]

Our estimator of the log-determinant of the population covariance is simply the log-determinant of the sample covariance:

\[\hat{T} = \log \mid \hat{\Sigma} \mid.\]
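To make the setup concrete, here is a minimal sketch of the plug-in estimator on simulated data (the choices of \(N\), \(D\), and \(\Sigma\) are arbitrary):

```python
import numpy as np

# A minimal sketch of the plug-in estimator (arbitrary N, D, and Sigma).
rng = np.random.default_rng(1)
N, D = 500, 10
Sigma = np.diag(np.linspace(1.0, 2.0, D))       # true covariance
X = rng.multivariate_normal(np.zeros(D), Sigma, size=N)

Xc = X - X.mean(axis=0)                         # center the sample
Sigma_hat = Xc.T @ Xc / N                       # sample covariance, 1/N convention
T_hat = np.linalg.slogdet(Sigma_hat)[1]         # estimator of log |Sigma|

# Compare with the truth; the estimate is biased downward on average.
print(T_hat, np.linalg.slogdet(Sigma)[1])
```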

The main theorem of the paper is a central limit theorem for the log-determinant of the sample covariance matrix.

Theorem 1

Suppose that \(D\leq N\). Then, as \(N\rightarrow \infty\), the log-determinant of the sample covariance satisfies

\[ \frac{\log \mid \hat{\Sigma}\mid - \tau_{N,D} - \log \mid \Sigma \mid}{\sigma_{N,D}} \rightarrow N(0,1), \]

where \(\tau_{N,D} = \sum_{k=1}^D \left(\psi\left(\frac{N-k+1}{2}\right) - \log\left(\frac{N}{2}\right)\right) \), \(\psi\) is the digamma function, and \(\sigma_{N,D}^2=\sum_{k=1}^D \frac{2}{N-k+1}\).
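Here is a small simulation of Theorem 1 (my own check, not code from the paper). To keep the chi-square degrees of freedom exactly \(N-k+1\), it uses the known-mean form of the sample covariance that appears in the proof sketch below, i.e. \(\frac{1}{N}\sum_i X_iX_i^\top\) without centering:

```python
import numpy as np
from scipy.special import digamma

# Monte Carlo check of Theorem 1. Sigma = I here, so log|Sigma| = 0.
rng = np.random.default_rng(2)
N, D, reps = 200, 20, 2000
k = np.arange(1, D + 1)
tau = np.sum(digamma((N - k + 1) / 2) - np.log(N / 2))
sigma = np.sqrt(np.sum(2.0 / (N - k + 1)))

stats = []
for _ in range(reps):
    X = rng.standard_normal((N, D))                 # rows are i.i.d. N(0, I_D)
    logdet_hat = np.linalg.slogdet(X.T @ X / N)[1]  # log-det of the (uncentered) sample covariance
    stats.append((logdet_hat - tau) / sigma)

print(np.mean(stats), np.std(stats))                # close to 0 and 1
```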

Comments

The first interesting observation is the bias of the log-determinant of the sample covariance: since \(\tau_{N,D}\) is negative, the sample log-determinant systematically underestimates \(\log \mid \Sigma \mid\). When \(D\) is fixed as \(N\) grows, the bias disappears asymptotically. In particular, if \(D\) is fixed, the bias and standard deviation are approximately

\[ \tau_{N,D} \approx -\frac{D(D+1)}{2N}, \qquad \sigma_{N,D} \approx \sqrt{2D/N}. \]
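A quick numerical check of these approximations against the exact expressions in Theorem 1 (with arbitrary \(N\) and \(D\)):

```python
import numpy as np
from scipy.special import digamma

# Compare the Theorem 1 expressions with the fixed-D approximations.
N, D = 1000, 5
k = np.arange(1, D + 1)
tau = np.sum(digamma((N - k + 1) / 2) - np.log(N / 2))
sigma = np.sqrt(np.sum(2.0 / (N - k + 1)))

print(tau, -D * (D + 1) / (2 * N))    # both about -0.015
print(sigma, np.sqrt(2 * D / N))      # both about 0.1
```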

Proof (Sketch)

Note that \(\hat{\Sigma}\) has the same distribution as a sum \(\frac{1}{N}\sum_{i} Z_iZ_i^\top\), where \(Z_i\) are i.i.d. \(N(0,\Sigma)\). So we have

\[\begin{array}{lll} \log \mid \hat{\Sigma}\mid - \log \mid \Sigma \mid &=& \log \mid \Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} \mid \\ &=& \log \mid \frac{1}{N}\sum_i Y_i Y_i^\top \mid \\ &=& \log \mid \hat{I} \mid, \end{array}\]

where \(Y_i = \Sigma^{-1/2} Z_i\) are i.i.d. \(N(0,I_D)\) and \(\hat{I} = \frac{1}{N}\sum_i Y_i Y_i^\top\). The scaled matrix \(N\hat{I} = \sum_i Y_i Y_i^\top\) is a standard Wishart matrix, and the Bartlett decomposition gives its Cholesky factorization

\[ N\hat{I} = AA^\top, \]

where \(A\) is a lower-triangular matrix with independent random entries: the entries below the diagonal are standard normal, and the squared diagonal entries satisfy \(A_{ii}^2 \sim \chi^2_{N-i+1}\). Since the Cholesky factor is triangular, its determinant is the product of its diagonal entries, so \(\mid N\hat{I} \mid = \mid A \mid^2 = \prod_i A_{ii}^2\), and the log-determinant is a sum of independent (though not identically distributed) log-chi-square random variables:

\[ \log \det (N \hat{I}) \sim \sum_{i=1}^D \log (\chi_{N-i+1}^2). \]
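This identity is easy to check by simulation. The sketch below (mine, with arbitrary \(N\) and \(D\)) compares the first two moments of \(\log\det(N\hat{I})\) obtained directly against the sum of independent log-chi-square draws:

```python
import numpy as np
from scipy.stats import chi2

# Check: log-det of the standard Wishart matrix N * I_hat vs. a sum of
# independent log-chi-square variables.
rng = np.random.default_rng(3)
N, D, reps = 50, 10, 20000

# Direct simulation: log det of a sum of N outer products of N(0, I_D) vectors.
direct = []
for _ in range(reps):
    Y = rng.standard_normal((N, D))
    direct.append(np.linalg.slogdet(Y.T @ Y)[1])

# Bartlett form: sum of log chi-square draws with N - i + 1 degrees of freedom.
dfs = N - np.arange(1, D + 1) + 1
bartlett = np.log(chi2.rvs(dfs, size=(reps, D), random_state=rng)).sum(axis=1)

print(np.mean(direct), np.mean(bartlett))   # means agree
print(np.var(direct), np.var(bartlett))     # variances agree
```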

Obtaining the particular forms of the bias and standard deviation in the different regimes then amounts to computing the expected value and variance of this random variable, which is worked out in Section 5.1 of the paper.
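For reference, the computation rests on two standard identities for a chi-square variable: if \(X\sim\chi^2_m\), then

\[ \mathbb{E}[\log X] = \psi\!\left(\tfrac{m}{2}\right) + \log 2, \qquad \mathrm{Var}[\log X] = \psi'\!\left(\tfrac{m}{2}\right) \approx \frac{2}{m}. \]

Summing over \(m = N-i+1\) for \(i=1,\ldots,D\) and subtracting \(D\log N\) from the mean recovers \(\tau_{N,D}\), and summing the variances gives \(\sigma_{N,D}^2\) (up to the trigamma-versus-\(2/m\) approximation).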

Theorem 2

In addition to the CLT, the authors derive a non-asymptotic bound on the risk of the bias-corrected estimator \(\hat{T} - \tau_{N,D}\) with respect to squared-error loss.

If \(D \leq N\), the estimator satisfies

\[ \mathbb{E}[(\hat{T}-\tau_{N,D}-\log \mid \Sigma\mid)^2] \leq -2\log(1-D/N)+\frac{10D}{3N(N-D)}. \]

The paper has some interesting insights on the risk for this problem. If \(D/N\rightarrow 0\), then this estimator asymptotically achieves the minimax risk, \(2D/N\), both in rate and in constant: it is asymptotically optimal in the minimax sense. If \(D/N\rightarrow r\in (0,1]\), the minimax risk is non-vanishing, so it is not possible to consistently estimate the log-determinant.
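To see the two regimes concretely, one can simply evaluate the upper bound from Theorem 2 (the values of \(N\) and \(D\) below are made up):

```python
import numpy as np

# Evaluating the Theorem 2 upper bound in two regimes.
def risk_bound(N, D):
    return -2 * np.log(1 - D / N) + 10 * D / (3 * N * (N - D))

# D fixed: the bound behaves like 2D/N and vanishes as N grows.
for N in (100, 1000, 10000):
    print(N, risk_bound(N, 5), 2 * 5 / N)

# D/N fixed at 1/2: the bound approaches -2*log(1/2), which does not vanish.
for N in (100, 1000, 10000):
    print(N, risk_bound(N, N // 2), -2 * np.log(0.5))
```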

Finally, a surprising result. One might think that we can improve these results in the high-dimensional setting by adding constraints on the form of \(\Sigma\) and designing a specialized estimator of \(\Sigma\) for that class. There have been many such proposals in the literature, such as assuming sparsity of \(\Sigma\) or of \(\Sigma^{-1}\).

Theorem 5

Suppose \(\Sigma\) belongs to the collection of bounded diagonal matrices. Then the minimax risk is lower bounded by \(c D/N\), where \(0<c\leq 2\).

Since any class of covariance matrices of interest includes bounded diagonal matrices as a subset, we cannot hope to consistently estimate the log-det if \(D/N \not\rightarrow 0\). This is a disappointing result, because there is a huge literature on high-dimensional estimators of covariance matrices. While those estimates themselves may be consistent under some notion of risk, they cannot be “plugged in” to estimate the log-det consistently.

I have been to talks where it has been stated, “often we are primarily interested in the high-dimensional covariance matrix as an input to something else, like a QDA classifier”, but the theory suggests this is misguided.