Extremal bounds for Gaussian trace estimation
-
Abstract: This work derives extremal tail bounds for the Gaussian trace estimator applied to a real symmetric matrix. We define a partial ordering on the eigenvalues, so that when a matrix has greater spectrum under this ordering, its estimator will have worse tail bounds. This is done for two families of matrices: positive semidefinite matrices with bounded effective rank, and indefinite matrices with bounded 2-norm and fixed Frobenius norm. In each case, the tail region is defined rigorously and is constant for a given family.
1 Introduction
Let be a symmetric matrix with eigenvalues , and let be i.i.d standard Gaussian vectors. The Gaussian trace estimator is given by
(1) |
It is well known that this estimator is unbiased and has variance .
Often, it is useful to know how many samples are needed to produce an estimate that satisfies a given error tolerance. Cortinovis and Kressner [1] have used concentration inequalities to derive good results on this problem, with slight improvements by Persson, Cortinovis, and Kressner in [6].
Theorem 1 ([1], Thm. 1).
Let be symmetric. Then for all ,
For nonzero symmetric positive semidefinite (SPSD) matrices, this result can be turned into a relative error estimate. The trace estimator will generally be more accurate on matrices with tightly clustered eigenvalues.
Definition 1.
The effective rank of a nonzero SPSD matrix is
The goal of this paper is to tighten these bounds as much as possible using techniques from [10, 9, 8]. The practical benefit may be marginal, since the above bounds are already quite good. Still, I found the techniques interesting.
1.1 Extremal bounds
Words like tight and optimal can be slippery. To be precise, I mean to find extremal tail probabilities: ones that can be expressed as the supremum or infimum over some set. The error bounds of Theorem 1 and Corollary 1 use only certain information about ; respectively, they use and . It is therefore worth considering the sets of all matrices with the same summary statistics:
where is the set of all symmetric matrices of any dimension, and is the set of all SPSD matrices of any dimension. For , the parameter is a lower bound on the effective rank of ; we can fix without loss of generality since the relative error of is scale invariant.
For each of the above sets , we define a partial ordering:
-
•
For , the partial ordering is the vector majorization order on the eigenvalues. This paper will show that implies that has worse upper and lower tail bounds than .
-
•
For , the partial ordering is related to the majorization order but is a little more complicated. This paper will show that implies that has worse upper tail bounds and that has worse lower tail bounds.
With these partial orderings, has a maximum element and has both maximum and minimum elements.111Strictly speaking, the maximum is unique only if we treat matrices as equivalent when they have the same nonzero eigenvalues.
1.2 The tail
It is not particularly useful to show that if then one trace estimate eventually has worse tail bounds than the other—this much can already be deduced just by comparing the asymptotic behavior of the probability distributions.
The critical result, then, is that we can define the “tail” regions so that their value depends only on . For error tolerances within these regions, the partial ordering works as advertised on all elements of .
Unfortunately, it has so far been beyond my ability to prove exactly where these tail regions begin. This paper contains some pessimistic bounds, but otherwise is restricted to conjecture.
1.3 The extremal matrices
For the relative error, the matrix with the worst tail bounds is
(2) |
For the absolute error, the matrix with the worst upper tail bounds is
(3) |
where is an upper bound on the stable rank of . By symmetry, the matrix with the worst lower tail bounds is .
Definition 2.
The stable rank of a nonzero matrix is
1.3.1 Extension to Gamma random variables
When and are not integers, the distributions of the trace estimators for (2) and (3) are not quite as elegant as I would have liked them to be. We can, however, relax the problem by considering more general linear combinations of Gamma random variables, as opposed to only those whose distribution can be expressed as the Gaussian trace estimate of some matrix as in (1).
1.4 Related work
This paper is primarily a spiritual successor to the works [10, 9, 8]. Székely [10] gives thorough tail bounds for the relative error when is a nonnegative sum of chi-squared r.v’s with no further restrictions (the worst case is typically ) and provides a majorization result as a corollary. Roosta-Khorasani, Székely, and Ascher [9] extend the main result to where is a nonnegative sum of Gamma r.v’s with arbitrary shape and scale parameters. This allows them to consider the effects of using a larger sampling number for trace estimation. Finally, Roosta-Khorasani and Székely [8] generalize the worst-case bounds of [9] by obtaining a majorization result for nonnegative sums of Gamma r.v’s. The present work’s Theorem 2, in particular, is more or less a restatement of a result from [8].
This current paper is novel in two main ways. First, it considers how the above bounds might be strengthened when the matrix in question has a large effective rank. Second, it obtains absolute error bounds for indefinite matrices —that is, when the sum of Gamma random variables is not necessarily nonnegative.
2 Majorization
In [8], Roosta-Khorasani and Székely use the majorization order on the eigenvalues of a matrix as a measure of the “skewness” of its spectrum. Their general observation is that the Gaussian trace estimator performs worse when the spectrum is highly skewed. In other words: the majorization order is the partial ordering that they (and we) use to determine for which matrices the Gaussian trace estimator yields the worst relative error bounds.
Given a nonnegative vector , let denote its indices in sorted order, so that . We say that majorizes , denoted , if
(6a) | ||||
(6b) |
Similarly, we say that weakly majorizes , denoted , just as long as (6a) holds. If and do not have the same length, pad the shorter one with zeroes as necessary.
If weakly majorizes but does not majorize it, then . For our proofs, it will be useful to identify the indices for which the inequalities in (6a) are strict.
Definition 3.
Given , the leading slack index is the smallest number such that
This index has the property that (and no larger entry of ) can be increased by a nonzero amount without violating the majorization condition.
In deriving their main results, Roosta-Khorasani and Székely make use of the following lemma [7, 12.5a].
Lemma 1.
If , then there exists a finite sequence of vectors
so that and differ in two coordinates only for .
This lemma lets us assume without loss of generality that and differ in two coordinates only, satisfying
2.1 Indefinite majorization
For indefinite matrices with fixed Frobenius norm as in , the majorization condition is somewhat more complicated. Here is why: a Gamma random variable is nonnegative, so its lower tail can’t get too far from the mean. The average of many Gamma random variables, however, looks more like a normal distribution which has tails in both directions. This suggests that will have worse absolute upper tail bounds when the nonnegative eigenvalues are highly skewed and when the negative eigenvalues are tightly clustered.
We will also see that has worse upper tail bounds than , for indefinite . Analogous results for lower tail bounds follow by symmetry.
Now we describe the majorization condition more precisely.
Definition 4.
For a vector , define
where the min and max operations are done elementwise.
Definition 5.
For vectors and , we say that if
(7a) | ||||
(7b) | ||||
(7c) |
Condition (7a) specifies that the positive entries of are more skewed, and perhaps have more total weight, than those of . Condition (7b) specifies the opposite for the negative entries. Condition (7c) means that the matrices associated with and have the same Frobenius norm. This last condition also means that the associated trace estimators have the same variance.
Lemma 2.
If , then there exists a finite sequence of vectors
so that and differ in two coordinates only for . Furthermore, the difference between consecutive vectors may be assumed to take one of the following forms:
-
1.
;
-
2.
;
-
3.
.
Once again, we may assume without loss of generality that and differ in two coordinates only, this time satisfying .
3 Gamma random variables
In this section we cover some relevant properties of Gamma random variables, and explain how to reformulate the original trace estimation problem in terms of such variables.
Since real symmetric matrices are orthogonally diagonalizable and Gaussian vectors are rotationally invariant, the Gaussian trace estimator (1) can be written in the form
(8) |
where is a chi-squared variable with degrees of freedom.
The chi-squared distribution is a special case of the Gamma distribution, and so we can generalize (8) as follows:
Definition 6.
Given a vector , denote
When and are clear from context, we will use for short.
With the notation above, has the same distribution as .
3.1 Basic properties
Here are a few other facts about the Gamma distribution:
-
•
A Gamma random variable with shape and rate has the PDF
(9) The tail decays more slowly than the tail of a normal distribution.
-
•
If then and . Furthermore, is unimodal with mode if and 0 otherwise.
-
•
Gamma random variables follow a scaling law: if and , then .
-
•
Gamma random variables with the same rate parameter have an additive property: if and , then .
-
•
Conversely, Gamma random variables are infinitely divisible: if then for any positive integer , has the same distribution as , where .
3.2 Generalizing the distribution families
In order to define and as extensions of and , we first generalize the notions of 2-norm and effective rank.
Definition 7.
For a random variable , we define the scale of as
It is worth mentioning that is a property of the distribution itself, not just its representation in terms of .
Definition 8.
For a random variable with nonnegative weights , we define the effective shape of as
Now we can extend the definition of :
Definition 9.
For fixed and , define
where the weights are nonnegative.
The parameter in is a lower bound on the effective shape of . This definition generalizes the case of matrix trace estimation since
The definition of can be extended similarly:
Definition 10.
For fixed and , define
In this case, we have
3.3 Infinite division
Our strategy for relaxing the original trace estimation problem is to use the fact that Gamma random variables are infinitely divisible (see Section 3.1). Any Gamma random variable may be rewritten as a sum of arbitrarily many Gamma random variables with smaller shape parameters , and this fact enables us to nest some distribution families within others.
Lemma 3.
For any integer , it holds that has the same distribution as , and consequently that
By considering the limit as , we can effectively do away with the constraint of sharing a single scale parameter, and in doing so obtain the more general sets and promised in (4) and (5).
Definition 11.
For , let be the set of all finite linear combinations of Gamma random variables
subject to the constraints and .
Definition 12.
For and , let be the set of all finite linear combinations of Gamma random variables
subject to the constraints and .
For any fixed and , for sufficiently large the set contains distributions arbitrarily close to any given distribution in ; the same goes for and . Thus any bounds we derive for the former set will also apply to the latter.
3.4 Fixed shape and scale parameters
The main results assume that parameters and are fixed in order to keep the majorization definitions as simple as possible. If and were allowed to differ for each random variable in the mixture , we could still map to a random variable that takes on values with probability , then make comparisons using the stochastic ordering. I don’t feel that the added complexity is worth it, so won’t pursue this approach further.
4 Majorization theorems
This section presents the main majorization results. The relative error result is essentially a restatement of results from [8]; to the best of my knowledge the absolute error result is novel.
4.1 Relative error
As mentioned above, this first result was essentially proved as part of [8, Thm. 1]. I’ve still included the proof (Appendix A.1) for the sake of completeness.
Theorem 2.
Fix , and . Define and to be the extreme values of the mode of the PDF of the distribution
where are exponential r.v’s independent of . If satisfy , then
That is, has worse upper and lower tail bounds.
4.1.1 Locating the tail
But what are the values of and ? In the most general case , [8, Thm. 1] shows that and . These values necessarily serve as bounds whenever , but I would like to obtain stronger results for distributions with greater effective shape.
Theorem 3 gives a pessimistic result: it tells us where the tails are not, rather than where they are.
Theorem 3.
If , the points and as defined in Theorem 2 satisfy
Proof.
For the bound on , set with . The distribution is , so its mode is as desired while .
For the bound on , take the same , but pad it with eigenvalues instead. ∎
Unfortunately, I have not so far been able to establish bounds in the opposite direction. I will therefore leave them as a conjecture:
Conjecture 1.
The points and as defined in Theorem 2 satisfy
4.2 Absolute error
The relative error analysis in the previous section has two significant limitations: first, it only applies when is SPSD. Second, may sometimes be significantly larger than , which is a better indicator of the quality of the Gaussian trace estimator since the variance of is .
We therefore turn to the case where may be indefinite but bounds on and are known. As before, it is useful to present the results in terms of more general Gamma random variables.
Here is the main result on the absolute error of the estimator (proof in Appendix A.2).
Theorem 4.
Fix , , and . Define to be the supremum over all inflection points of the PDF of the distribution
(10) |
where are exponential r.v’s independent of . If satisfy , then
This single equation (note the use of the absolute value in the condition) means that has worse upper tail bounds and that has worse lower tail bounds.
4.2.1 Locating the tail
Theorem 4 implies that the “tails” are the regions where all density functions of the form (10) are convex. For reference, a normal distribution has inflection points equal to the mean plus or minus one standard deviation.
How much further away could the tails begin? Theorem 5 gives a pessimistic result.
Theorem 5.
The point as defined in Theorem 4 satisfies
See Appendix A.3 for the proof. I’ll leave an upper bound as a conjecture.
Conjecture 2.
The point as defined in Theorem 4 satisfies
5 Application to trace estimation
In this section I’ll rephrase the main theorems (Theorems 2 and 4) more directly in terms of Gaussian trace estimation.
Here is the basic idea: the matrix from (2) (resp. from (3)) majorizes every other element of (resp. ). Thus by the main theorems, these two matrices have the worst tail bounds. Employing the infinite division strategy of Section (3.3) shows that the Gaussian trace estimators for these matrices are in turn tail-bounded by the Gamma distributions in (4) and (5), respectively. Finally, increasing the sampling number will make the tails take up a larger portion of the distribution, increasing the domain over which the bounds in this paper are effective.
First up is the relative error bound. Note that a little unit conversion is used here: if , then .
Theorem 6.
For nonzero SPSD and sampling number , let . Then there exists such that for ,
where is defined in (2) and .
Conjecture 3.
.
For comparison, Corollary 1 holds for all , but with marginally weaker bounds.
Next is the absolute error bound. Again there is a bit of unit conversion: if and , then and .
Theorem 7.
Let be symmetric with and . For fixed sampling number , there exists such that for ,
where is defined in (3) and with .
Conjecture 4.
.
For comparison, Theorem 1 holds for all , but with marginally weaker bounds.
The factor of 2 appearing in the bounds in Theorem 7 is due to the fact that the result uses for the upper tail bound and for the lower tail bound (resp. and ). Note also that in Conjecture 4 the terms and scale differently as the sampling number increases. If true, the conjecture implies that converges to as .
6 Conclusion
So what exactly are the practical implications of all of this? One takeaway, following from Theorem 6, is that the relative error bound for SPSD matrices depends on , so having a large effective rank is just as beneficial as using a large sampling number. One particular application is in estimating the Frobenius norm of a matrix. The worst-case bounds of [9] hold when the matrix has rank one, but with a lower bound on the stable rank of these bounds can be tightened. This could in turn reduce the number of samples required to estimate the Frobenius norm to a given tolerance, as in [5, Lemma 2.2].
Another takeaway is that the bounds by Cortinovis and Kressner in Theorem 1 and Corollary 1 are already quite good. The improvements made in this paper, which establishes bounds that are tight given certain information about , are fairly minor. If you want any further improvements to these tail bounds, you will have to use more information about the spectrum of .
In practice, you probably shouldn’t use the Gaussian trace estimator on its own. If your matrix has a small stable rank or effective rank, you’d do better to combine the trace estimator with a low-rank approximation such as in [4, 5, 3]. Furthermore, you can reduce the variance by using the Hutchinson estimator (sampling vectors with random entries), or by normalizing the Gaussian vectors to have unit length [2]. The main proof techniques in this paper do not work on the Hutchinson estimator because it has a discrete distribution. The techniques could potentially be applied to the normalized Gaussian estimator, but the proofs will be more complicated.
Finally, this paper successfully applies the proof techniques of [10, 9, 8] to obtain a majorization result and tight tail bounds for matrices with fixed -norm and Frobenius norm. The application to trace estimation may not yield results significantly better than those that can be obtained through concentration inequalities, but the approach—comparing the CDFs of distributions to solve an optimization problem directly—is markedly different.
Appendix A Proofs
Proof of Lemma 2.
First suppose that the inequalities in (7a) and (7b) are strict (either both are, or neither). Find the leading slack indices (see Definition 3) and for each inequality. Increase the corresponding nonnegative coordinate222If no such coordinate exists, pad with zeros as needed. For example, if and , we would get the sequence . Such padding allows us to keep the sum of squares constant while changing the vectors continuously. of while shrinking the negative coordinate toward zero, keeping their sum-of-squares constant, until one of the slack inequalities becomes an equality. Repeat a finite number of times to eliminate all slack.
Then, apply Lemma (1) to the cases and separately. ∎
A.1 Relative error majorization theorem
As a reminder, most of this proof is substantially the same as the one given in [8]. I provide it here in part to show how the proof techniques relate to those used for the absolute error case.
Proof of Theorem 2.
Lemma 1 implies that we can without loss of generality assume that and differ in two coordinates only, satisfying . For , define
This interpolation satisfies and . Our goal is to show that for certain fixed values of , the CDF is monotonic for .
Take the Laplace transform of as
where , the Laplace transform of , satisfies
(11) |
Differentiating with respect to yields
Recall that (since ), and see that
(12) |
It follows that
Substitute and apply the inverse Laplace transform to get
where are i.i.d exponential r.v’s which are also independent of , and is the probability density function (PDF). Now by assumption, and for any it holds that as well. Thus for each the left-hand side has the same sign as , the derivative of the density function.
It is known that the distribution of any linear combination of Gamma random variables is unimodal (see [10, Thm. 4]), so by the definition of and , the density function is increasing on and decreasing on . Therefore, for any in either of these regions, is monotonic with respect to . Since and , the desired inequalities follow. ∎
A.2 Absolute error majorization theorem
Proof of Theorem 4.
Lemma 2 implies that we can without loss of generality assume that and differ in two coordinates, satisfying . For , define
Note that by design. As for the term , Lemma 2 implies that we need only consider cases where and are both nonpositive or both nonnegative. This slightly awkward term therefore ensures that and . We also note that for and , we have
(13) |
Our goal is again to show that for certain fixed values of , the CDF is monotonic for .
Take the Laplace transform of as
where , the Laplace transform of , satisfies
defined over the strip
(14) |
As compared with (11), the extra term comes from our using the centered variables . The region of convergence in (14) is a strip rather than a half-plane because may include both positive and negative weights .
Differentiating with respect to and using (13) yields
Recall that , and again use (12) to get
Taking the inverse transform then gives
where are i.i.d exponential r.v’s which are also independent of , and is the probability density function (PDF).
By design, . Checking the three cases considered in Lemma 2 shows also that for . Thus the sign of is always opposite that of , which is the convexity of the density function. By the definition of , the density function is convex on , and so for any in this region, decreases monotonically for . Since and , the desired upper-tail inequality follows.
The lower-tail bound follows by symmetry. ∎
A.3 Absolute error tail results
Proof of Theorem 5.
Let , where , and . This is a valid choice of distribution since
and
Then , and the r.v has distribution , or equivalently, .
Now in general, a random variable has density proportional to for (see (9)). The inflection points are where the second derivative is equal to zero. Since
the nontrivial inflection points are the two zeros of the quadratic term. The larger of the two zeros is given by
Substitute for and the values and , and subtract to get
(15) |
∎
References
- [1] Alice Cortinovis and Daniel Kressner. On randomized trace estimates for indefinite matrices with an application to determinants. Foundations of Computational Mathematics, 22(3):875–903, 2022.
- [2] Ethan N Epperly. Don’t use gaussians in stochastic trace estimation. https://www.ethanepperly.com/index.php/2024/01/28/, Jan 2024. Accessed: 2024-11-01.
- [3] Ethan N. Epperly, Joel A. Tropp, and Robert J. Webber. Xtrace: Making the most of every sample in stochastic trace estimation. SIAM Journal on Matrix Analysis and Applications, 45(1):1–23, 2024.
- [4] Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
- [5] David Persson, Alice Cortinovis, and Daniel Kressner. Improved variants of the hutch++ algorithm for trace estimation. SIAM Journal on Matrix Analysis and Applications, 43(3):1162–1185, 2022.
- [6] David Persson, Alice Cortinovis, and Daniel Kressner. Improved variants of the hutch++ algorithm for trace estimation (preprint), 2022.
- [7] Josip E Pečarić and Yung Liang Tong. Convex functions, partial orderings, and statistical applications. Academic Press, 1992.
- [8] Farbod Roosta-Khorasani and Gábor J. Székely. Schur properties of convolutions of gamma random variables. Metrika, 78(8):997–1014, 2015.
- [9] Farbod Roosta-Khorasani, Gábor J. Székely, and Uri M. Ascher. Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables. SIAM/ASA Journal on Uncertainty Quantification, 3(1):61–90, 2015.
- [10] Gábor J. Székely and Nail K. Bakirov. Extremal probabilities for gaussian quadratic forms. Probability Theory and Related Fields, 126(2):184–202, 2003.