Fast, robust approximate message passing
Abstract
We give a fast, spectral procedure for implementing approximate message passing (AMP) algorithms robustly. For any quadratic optimization problem over symmetric matrices with independent subgaussian entries, and any separable AMP algorithm , our algorithm performs a spectral pre-processing step and then mildly modifies the iterates of . If given the perturbed input for any supported on a principal minor, our algorithm outputs a solution which is guaranteed to be close to the output of on the uncorrupted , with where as depending only on .
1 Introduction
Approximate Message Passing (AMP) is a family of algorithmic methods which generalize matrix power iteration. Suppose we are given a symmetric matrix , and our goal is to maximize the quadratic form over vectors in some constraint set . The basic AMP algorithm starts from some initialization and computes iterates by setting , where the “denoiser” is a function (of the algorithm designers’ choosing) from applied coordinate-wise. (Footnote 1: The relation hides a lower-order additive term, the “Onsager correction,” which depends on . For the sake of simplicity we ignore this in the present discussion.) The goal of the “powering” action, , is to increase the quadratic form, while the denoiser is chosen to bring close to the constraint set .
AMP algorithms are extremely popular in high-dimensional statistics. In this context, given a prior distribution over the matrix , it is often possible to optimize the design of the denoisers in such a way that AMP gives an FPTAS, in that obtains an -optimal solution for large enough as a function of . Introduced initially as a generalization of Belief Propagation methods from statistical physics [Bol14, DMM09, BM11], AMP algorithms are now state-of-the-art for a variety of average-case optimization problems, including compressed sensing [DMM09], sparse Principal Components Analysis (PCA) [DM14], linear regression [DMM09, BM11, KMS+12], non-negative PCA [MR15], and more (many additional examples may be found in the surveys [Mon12, FVRS22]). One especially notable recent application is the breakthrough work of Montanari for optimizing the Sherrington-Kirkpatrick Hamiltonian, an average-case version of max-cut [Mon21].
One major drawback of AMP algorithms is that they are not robust. The NP-hardness of quadratic optimization means that, obviously, one cannot hope for the optimality of AMP on average-case inputs to generalize to arbitrary inputs . But even structured perturbations can throw AMP off [CZK14, RSFS19]; for example, an additive perturbation to by a rank- matrix of large norm, or planting a principal minor of uniform sign (as described in [IS24]).
Our prior work addressing this issue [IS24] shows that for a certain class of adversarial corruptions, AMP can be simulated robustly by polynomial-sized semidefinite programming relaxations in the “local statistics hierarchy.” While this result is a proof of concept that a robust version of AMP is possible, it is perhaps more interesting from a complexity-theoretic perspective than an algorithmic one: the semidefinite programs are of size , where is the number of AMP iterations. When AMP is an FPTAS, the algorithm of [IS24] gives a robust PTAS, but the running time is too slow to feasibly implement on any computer.
In the present work, we obtain simple and fast spectral algorithms which run in time , while not just matching but even improving on the robustness guarantees of [IS24]. In the “spectral algorithms from sum-of-squares analyses” line of work (initiated in [HSSS16]), our result stands out as giving a particularly dramatic reduction in running time, as well as in yielding a significantly simpler analysis.
1.1 Setup and definitions
We give some necessary definitions of AMP and the noise model that we consider.
Definition 1.1 (AMP algorithm).
An Approximate Message Passing algorithm is specified by a sequence of denoiser functions , with for each . It takes as input a symmetric matrix , a number of iterations , and produces a sequence of iterates , with and
where is applied coordinate-wise, and is the Onsager correction term for decreasing correlations between iterates and is fully determined by (see Definition 2.1). AMP algorithms often also come with a rounding procedure which is applied to the final iterate, in order to ensure it satisfies the optimization constraints.
We note that we are considering separable AMP algorithms (where the denoisers are applied coordinate-wise) with fixed starting point . In full generality AMP may relax both of these criteria, but the majority of AMP analyses are compatible with these assumptions.
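The iteration in Definition 1.1 can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions that are not taken from the paper: each denoiser depends only on the most recent iterate, the same denoiser is used at every step, the starting point is the all-ones vector, and the Onsager coefficient is the empirical average of the denoiser's derivative (rather than the expected divergence used in Definition 2.1). The names `sample_symmetric` and `amp` are hypothetical.

```python
import numpy as np

def sample_symmetric(n, rng):
    """Symmetric Gaussian matrix whose off-diagonal entries have variance 1/n."""
    G = rng.standard_normal((n, n)) / np.sqrt(n)
    return (G + G.T) / np.sqrt(2)

def amp(X, denoiser, denoiser_prime, num_iters):
    """Simplified separable AMP; returns the list of iterates [x^0, ..., x^T]."""
    n = X.shape[0]
    x = np.ones(n)             # assumed initialization (all-ones vector)
    prev_fx = np.zeros(n)      # no Onsager term on the first step
    iterates = [x]
    for _ in range(num_iters):
        fx = denoiser(x)                       # coordinate-wise denoiser
        b = np.mean(denoiser_prime(x))         # empirical normalized divergence
        x = X @ fx - b * prev_fx               # powering step minus Onsager correction
        prev_fx = fx
        iterates.append(x)
    return iterates

rng = np.random.default_rng(0)
X = sample_symmetric(200, rng)
iterates = amp(X, np.tanh, lambda z: 1.0 - np.tanh(z) ** 2, num_iters=3)
```

The `tanh` denoiser here is only a convenient Lipschitz choice for the demonstration; any coordinate-wise function with a known derivative could be passed instead.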
Example 1.2 (non-negative PCA).
In the non-negative principal components analysis (PCA) problem, one is given a matrix and asked to maximize over non-negative unit vectors . The AMP algorithm which starts from and uniformly chooses the separable denoiser , with , is an FPTAS for non-negative PCA on with i.i.d. subgaussian entries [MR15]. (Footnote 2: Technically may not be a unit vector nor non-negative, but AMP algorithms such as this one usually include a final “rounding” step; in this case, the rounding is just applying followed by projection to the unit ball.) In this case, up to the Onsager correction, AMP coincides with projected gradient ascent with “infinite” step size.
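The rounding step mentioned for this example can be sketched as follows. This is an illustrative guess rather than the authors' code: it assumes the rounding takes the coordinate-wise positive part and then rescales to unit norm (a common variant of projecting to the unit ball), and the helper names `round_nonneg_unit` and `objective` are hypothetical.

```python
import numpy as np

def round_nonneg_unit(x):
    """Take the positive part coordinate-wise, then rescale to unit norm."""
    p = np.maximum(x, 0.0)
    norm = np.linalg.norm(p)
    return p / norm if norm > 0 else p   # leave the zero vector unchanged

def objective(X, v):
    """Quadratic form v^T X v."""
    return float(v @ X @ v)

x_final = np.array([3.0, -1.0, 4.0])
v = round_nonneg_unit(x_final)           # positive part is [3, 0, 4], norm 5
value = objective(np.eye(3), v)          # equals ||v||^2 for the identity matrix
```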
We will allow adversarially-chosen perturbations in the following model.
Definition 1.3 (-principal minor corruption).
Given matrices , we say is an -principal minor corruption of if is supported on an -principal minor.
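To make the corruption model concrete, the sketch below builds a principal minor corruption and checks its support. The fraction `eps`, the all-fives block, and the function names are hypothetical choices made for illustration only; the definition itself allows an arbitrary (adversarial) block.

```python
import numpy as np

def corrupt_principal_minor(X, support, block):
    """Return X + E, where E equals `block` on support x support and 0 elsewhere."""
    Y = X.copy()
    Y[np.ix_(support, support)] += block
    return Y

def supported_on_principal_minor(E, support):
    """Check that E vanishes outside the rows/columns indexed by `support`."""
    outside = np.ones(E.shape, dtype=bool)
    outside[np.ix_(support, support)] = False
    return not np.any(E[outside])

rng = np.random.default_rng(1)
n, eps = 100, 0.1
G = rng.standard_normal((n, n)) / np.sqrt(n)
X = (G + G.T) / np.sqrt(2)                       # clean symmetric input
support = rng.choice(n, size=int(eps * n), replace=False)
block = 5.0 * np.ones((len(support), len(support)))
Y = corrupt_principal_minor(X, support, block)   # corrupted input
E = Y - X
```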
A mean- random variable is said to be -subgaussian if for each integer , . For example, a mean- Gaussian with variance is -subgaussian, and a uniformly random sign is -subgaussian. Note that rescaling a -subgaussian variable to for constant rescales the subgaussian parameter to .
1.2 Results
Our main theorem is the following.
Theorem 1.4 (Informal version of Theorem 3.1).
Suppose is a -step AMP algorithm with -Lipschitz or polynomial denoiser functions. Let be a symmetric matrix with i.i.d. -subgaussian entries having mean and variance , and let be the output of on . Then there exists an algorithm which when given access to an -principal minor corruption produces in time a vector satisfying
with probability over the randomness of , where if the denoisers are Lipschitz, and if the denoisers are degree polynomials.
In words, given access to an adversarially corrupted matrix , our algorithm can find a vector which is close to the output of AMP on the uncorrupted matrix . (Footnote 3: Since has bounded operator norm, this implies that has objective value within an additive of the objective of .) The result improves on that of [IS24] in that it (1) runs in time rather than , and (2) guarantees that for a function which is independent of (but does depend on ), whereas in [IS24] the function included a multiplicative factor of , and thus was trivial unless .
As noted in [IS24], an equivalent result is information-theoretically impossible under the stronger corruption model in which is supported on arbitrary entries (unless ).
As a direct corollary, we can robustly simulate Montanari’s algorithm [Mon21] for finding the ground state of the Sherrington-Kirkpatrick Hamiltonian—that is, an approximately optimal solution for Max-Cut with i.i.d. Gaussian edge weights.
Corollary 1.5 (Fast, robust Sherrington-Kirkpatrick).
Suppose is a symmetric matrix with entries sampled i.i.d. from . Then there is an algorithm which when run on an -principal minor corruption of , with probability produces in time a unit vector achieving objective value .
The value is the objective value achieved by Montanari’s AMP algorithm; modulo a widely-believed conjecture in statistical physics, approaches as . The corollary follows from Theorem 1.4 because Montanari’s denoisers are Lipschitz, and the rounding scheme applied to place the final iterate in the hypercube is also Lipschitz.
In Section 4, we give a simple proof (along similar lines as the proof of Theorem 1.4) that AMP is robust to adversarial perturbations of small spectral norm. This fact is folklore, but we feel our proof is quite simple and may be of interest.
1.3 Experiments
Our algorithm is fast enough that it can be easily implemented and run on a laptop. We have run some experiments to demonstrate the utility of our method. We consider the non-negative PCA objective described in Example 1.2. In [MR15], it was shown that AMP with denoiser function is an FPTAS for .
In Figure 1, we show the result for , with the adversarial corruption given by perturbing an principal minor by sampling two independent rank Wishart matrices, each normalized to have expected Frobenius norm , and adding one and subtracting the other. Without having taken pains to optimize the running time, the implementation in Python on a laptop takes less than 5 minutes. We have plotted (1) the correlation of our algorithm’s output, , with , and (2) the objective value of the output for the uncorrupted matrix , , as a function of the number of iterations. For comparison, we plot in Figure 1 the performance of (a) AMP on the corrupt matrix, , and (b) AMP on a “naive” spectral cleaning of , given by deleting all larger-than-expected eigenvalues. Our procedure performs much better than AMP on the corrupt input. Empirically, the naive cleaning performance is comparable to ours, but unlike our algorithm, the naive procedure does not come with provable guarantees for arbitrary perturbations (and we suspect the naive procedure may be succeeding due to a small- effect).
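The corruption and the “naive” baseline from this experiment can be sketched as below. The dimensions, rank, Frobenius norm, and threshold are placeholder values, not the ones used for Figure 1. As a simplification, each Wishart matrix is rescaled to have exactly the target Frobenius norm rather than that norm in expectation. The function names are hypothetical.

```python
import numpy as np

def wishart_difference(m, rank, frob_norm, rng):
    """Difference of two independent rank-`rank` Wishart matrices of equal Frobenius norm."""
    A = rng.standard_normal((m, rank))
    B = rng.standard_normal((m, rank))
    W1, W2 = A @ A.T, B @ B.T
    W1 *= frob_norm / np.linalg.norm(W1)   # np.linalg.norm defaults to Frobenius for matrices
    W2 *= frob_norm / np.linalg.norm(W2)
    return W1 - W2

def naive_spectral_clean(Y, threshold):
    """Delete every eigenvalue of Y whose magnitude exceeds `threshold`."""
    evals, evecs = np.linalg.eigh(Y)
    kept = np.where(np.abs(evals) <= threshold, evals, 0.0)
    return (evecs * kept) @ evecs.T

rng = np.random.default_rng(2)
n, m = 200, 20
G = rng.standard_normal((n, n)) / np.sqrt(n)
X = (G + G.T) / np.sqrt(2)
Y = X.copy()
Y[:m, :m] += wishart_difference(m, rank=2, frob_norm=10.0, rng=rng)
Y_naive = naive_spectral_clean(Y, threshold=2.5)
```

Unlike the row-and-column removal of Algorithm 3.7, this baseline changes entries throughout the matrix, which is one reason the analysis in this paper does not apply to it directly.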
1.4 Discussion
We give a fast spectral algorithm for simulating AMP under adversarial principal minor corruptions. Our algorithm is an implementation of the “spectral algorithms from sum-of-squares (SoS) analyses” strategy introduced in [HSSS16]. We find it to be a particularly striking example of this strategy—not only was the running time reduced from to , but also, the analysis very transparently mimics/distills that of [IS24] to yield a much cleaner argument. We draw a comparison to previous spectral-to-SoS analyses in robust statistics, most of which have been based on a “filtering” approach (e.g. [JLST21, DKK+19]); in the filtering algorithms, the non-SoS analysis required significant additional tools. Another fitting comparison is to recent works obtaining robust spectral algorithms for community recovery in the stochastic block model [MRW24, DdHS23, DdNS22], where it was important to have a very fine-grained understanding of the spectrum of specific matrices. In our case, we are able to get away with a much simpler analysis.
Though we have improved on the result in [IS24] in terms of running time and the robustness-accuracy tradeoff, we differ from our prior work in one aspect: we require a description of the denoisers used in the AMP algorithm , whereas the algorithm in [IS24] has access only to the low-degree moments of the joint distribution over . We find it unlikely that a fast algorithm could succeed without a description of , but we pose this as a question nonetheless.
Another question is whether our error guarantees are optimal, as a function of the number of AMP iterations . In our theorem, the hides factors that grow with the number of AMP iterations; however our experiments (Figure 1) seem to suggest that the error stabilizes—is this a small effect? Or perhaps an artifact of the specific perturbation from our experiments?
One clear direction for future work is making AMP robust when the input matrix has planted structure, rather than just having i.i.d. subgaussian entries. For example, AMP has been a successful algorithm for “spiked matrix models” in which with a Gaussian matrix and a rank-1 spike, the goal often being to find given . In this case, it is not completely clear which noise model to study. In some cases (e.g. when is sparse) a principal minor corruption could simply erase the spike . However, it is an interesting question whether our techniques can be extended to this case—currently, our algorithm incorporates information about i.i.d. subgaussian variables, which makes it inappropriate for planted models (the same is true of [IS24]).
Finally, it is interesting to consider alternative corruption models. The principal minor corruption is tractable to study, and the fact that it is adversarial makes it a powerful model. We know from [IS24] that a similar result is information-theoretically impossible under the strongest sparse adversarial corruption model, in which an arbitrary subset of entries is perturbed. However, it would be interesting to consider alternative corruption models that more faithfully model the distribution shift one expects to see in practice, for example in the application of compressed sensing.
1.5 Technical overview
Though the proof of Theorem 1.4 is not long, we briefly summarize the main ideas here. For the sake of simplicity, in this technical overview we pretend that the AMP iteration has the form , ignoring the Onsager correction and the dependence on more than one prior iterate.
Recall that we are given an -principal minor corruption of . The fact that has i.i.d. subgaussian entries of variance implies that with high probability, . The first step of our algorithm is a spectral procedure which removes rows and columns of , producing a matrix with . Then, we run a modified version of the AMP algorithm on the cleaned input matrix , producing iterates just as the original AMP algorithm would have, except that at each iteration we clip the entries , so that the magnitude of all entries of does not exceed the -quantile value of the entries in a typical iterate from a clean input matrix. (Footnote 4: In the proof we choose the threshold to not exactly correspond to the -quantile, but this choice would have also worked and is simpler for the sake of this overview.)
We argue that by induction on ; in the base case, , the iterates are identical as . Now for , suppose that is the (unobserved) iterate AMP would have produced on . Then
(1)
The spectral cleaning ensures that . To further bound the first term in (1), consider the illustrative case of the denoiser . Then for any vectors , , for the entrywise product. Thus we have
(2)
The first and second terms of (2) are bounded in a similar manner; we begin by explaining the first. Because of the clipping procedure, . Further, by the triangle inequality,
(3)
The first term on the right of (3) can be bounded by from the inductive hypothesis, because the function is -Lipschitz. The second term in (3) can be bounded by , because the distribution of ’s entries is known, and is roughly that of independent polynomials in Gaussian random variables. To bound the second term from (2), we separate the contribution of the entries of which are bounded by , to which we can apply an identical argument, and the entries which exceed this threshold, and then appeal to the fact that these integrate to a small total. A similar argument can be used for arbitrary polynomial (for Lipschitz , (1) can be bounded directly and the clipping is not necessary).
To bound the second term in (1), we use the fact that can be written as the sum of a matrix , supported on an principal minor, and a matrix which is equal to the support of on at most rows/columns—these are precisely the rows/columns of which were removed to form , but were not involved in the initial principal minor corruption. So, . Since is supported on columns,
Here again, because we know the order statistics of , and because is required to be a well-behaved function, the maximum norm of when restricted to a subset of coordinates is on the order of . Also, since is a submatrix of , .
The matrix can be split into the part supported on columns, for which the argument is identical to the case of above. But there is also a part supported on rows. Here, we have to take a different perspective: since is a restriction of to the rows indexed by some set with , we have that , which is an -sparse subset of the vector . But we understand the order statistics of this vector too! Hence we have that as desired.
Putting everything together, we have that . The argument is now finished by again using our knowledge of the distribution of to conclude that and are within constant scalings of each other.
Much of this analysis mirrors and simplifies the analysis in [IS24]. There, a semidefinite program is used to obtain a pseudoexpectation of a “cleaned” version of . The semidefinite program has formal variables for low-degree symmetric polynomials of . It adds constraints to try to enforce that , that be supported on a principal minor (by introducing indicator variables for “clean” rows and columns), as well as the constraint that some symmetric vector-valued polynomials in the entries of have entries which are no larger than corresponding polynomials in .
The high-level sequence of arguments mirrors those outlined in (1) and the subsequent lines. We introduce some additional structure/arguments because our spectral cleaning step (for which we design a natural-in-hindsight spectral cleaning algorithm) deletes rows and columns. One advantage of the present argument over that in [IS24] is that it is unclear how to make a semidefinite program leverage the order statistics of vector-valued polynomials, so in our prior work we crudely enforce a bound on the infinity norm of the vectors, which gives rise to factors. Here we are able to circumvent this because we clip our iterates by hand.
2 AMP preliminaries
To complete Definition 1.1 from the introduction, we must define the Onsager correction term.
Definition 2.1 (Onsager correction).
The Onsager correction term for the AMP algorithm defined by denoisers on input with iterates is the quantity
where is the normalized divergence of with respect to :
We remark that the Onsager correction is usually defined with the function in place of the constant (and in fact, generally one would estimate from data by computing ). For technical reasons it is easier for us to work with . As was previously noted in the literature [FVRS22, Remark 2.4], when the denoisers are well-behaved this is effectively without loss of generality because the iterates produced by using vs. are -close; we discuss this further in Appendix A.
Definition 2.2 (Pseudo-Lipschitz Functions).
A function is called Pseudo Lipschitz of order (or ) if
for all .
Note that a function is Lipschitz exactly when it is , and a polynomial of degree lies in . By a slight abuse of notation, we will say that constants lie in .
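As a numerical sanity check of these two remarks, the sketch below assumes the standard form of the pseudo-Lipschitz condition from the AMP literature for scalar functions, namely |f(x) − f(y)| ≤ L·|x − y|·(1 + |x|^(k−1) + |y|^(k−1)); the exact normalization used in this paper may differ. It estimates the smallest admissible L on random inputs. The function name is hypothetical.

```python
import numpy as np

def pseudo_lipschitz_ratio(f, x, y, k):
    """Largest observed value of |f(x)-f(y)| / (|x-y| (1 + |x|^(k-1) + |y|^(k-1)))."""
    num = np.abs(f(x) - f(y))
    den = np.abs(x - y) * (1.0 + np.abs(x) ** (k - 1) + np.abs(y) ** (k - 1))
    mask = den > 0                      # skip coincident pairs
    return np.max(num[mask] / den[mask])

rng = np.random.default_rng(3)
x = 3.0 * rng.standard_normal(10_000)
y = 3.0 * rng.standard_normal(10_000)

# Degree-2 polynomial: |x^2 - y^2| = |x - y| |x + y| <= |x - y| (1 + |x| + |y|).
ratio_square = pseudo_lipschitz_ratio(np.square, x, y, k=2)
# 1-Lipschitz function with k = 1: the denominator is 3 |x - y|.
ratio_tanh = pseudo_lipschitz_ratio(np.tanh, x, y, k=1)
```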
We will need information about the order statistics of the entries of our iterates, . When we run AMP with polynomial denoiser functions, each iterate is a symmetric (fixed by coordinate relabeling), vector-valued polynomial in the entries of . So each entry is a bounded-degree polynomial of independent subgaussian random variables.
While the entries of are not independent, they are sufficiently close to independent that for simple functions , the average concentrates fairly well around the expectation of on a polynomial of Gaussians. The same is true when the denoiser functions are Lipschitz. This fact is known as “state evolution” in the AMP literature. In the next corollary, we state a useful consequence that will allow us to control the order statistics of our iterates.
Corollary 2.3.
Suppose that is for and is with and . Suppose is an AMP iterate resulting from the application of Pseudo Lipschitz denoisers on input a symmetric matrix with i.i.d. -subgaussian entries having mean and variance . Furthermore, let be a constant (possibly depending on ). Then, the following hold:
- For any ,
- For every with ,
We prove this corollary in Appendix A.
Sometimes we will use the phrase “Almost-Triangle Inequality” to refer to the inequality .
3 Making AMP robust to principal minor corruptions
In this section, we prove our main theorem.
Theorem 3.1 (Main Theorem).
Let be an AMP iteration consisting of either Lipschitz or polynomial denoiser functions. Suppose that is a symmetric matrix with i.i.d. entries of mean , variance , and subgaussian parameter . Let denote the output of the -step AMP algorithm on input , and set to be the degree of as a polynomial, or if the denoisers are Lipschitz. (Footnote 5: This aligns with the pseudo-Lipschitz degree of , which functions similarly to the degree as a polynomial.) Then, with probability over the choice of , Algorithm 3.4 run on any -principal minor corruption of , produces in time a vector which satisfies
Our algorithm consists of a pre-processing step, followed by a “robust” simulation of AMP:
1. In the pre-processing step, we spectrally clean by removing rows and columns to produce a matrix with .
2. Then, we run AMP on , but with the following modification: after each iteration, we clip the iterate (coordinate-wise) to ensure all coordinates have not-too-large an absolute value.
The following definitions will help us to describe our algorithm.
Definition 3.2.
For , define for an appropriately large depending on , the total number of AMP iterations. (Footnote 6: In practice, is a reasonable value.) The “-clip” of is now defined to be
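In code, coordinate-wise clipping at a threshold is a one-liner. The sketch below treats the threshold `tau` as a free parameter, since the specific choice made in Definition 3.2 is not reproduced here; the function name `clip` is hypothetical.

```python
import numpy as np

def clip(x, tau):
    """Coordinate-wise clip: entries with |x_i| > tau are moved to sign(x_i) * tau."""
    return np.clip(x, -tau, tau)

x = np.array([-5.0, -0.5, 0.0, 2.0, 7.0])
clipped = clip(x, tau=3.0)   # -> [-3.0, -0.5, 0.0, 2.0, 3.0]
```

Clipping never increases the distance between two vectors (it is 1-Lipschitz), which is the property the proof of Proposition 3.10 relies on.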
Definition 3.3 (Matrix restriction).
Given a matrix , is an -restriction if there exists a set with such that zeroing out the rows and columns of with indices in yields .
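A restriction in the sense of Definition 3.3 can be computed as in the following sketch, where `removed` plays the role of the index set and the function name `restrict` is hypothetical.

```python
import numpy as np

def restrict(M, removed):
    """Zero out the rows and columns of M indexed by `removed`."""
    R = M.copy()
    R[removed, :] = 0.0
    R[:, removed] = 0.0
    return R

M = np.arange(16, dtype=float).reshape(4, 4)
M = (M + M.T) / 2                     # symmetric test matrix
R = restrict(M, removed=[1, 3])
```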
Pictorially, this is as follows:
Algorithm 3.4 (Robust AMP)
Input: A symmetric matrix .
Operation:
1. Compute a restriction of satisfying using Algorithm 3.7.
2. For , set to be the clipped AMP iteration
Output: The vector .
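Step 2 of Algorithm 3.4 can be sketched as below, taking the already-cleaned matrix from step 1 as input. This is a simplified illustration, not the authors' implementation: it assumes a single denoiser depending only on the latest iterate, an all-ones starting point, an empirical Onsager coefficient, and a user-supplied clipping threshold `tau`. The name `clipped_amp` is hypothetical.

```python
import numpy as np

def clipped_amp(X_hat, denoiser, denoiser_prime, tau, num_iters):
    """AMP on a cleaned matrix X_hat, clipping each new iterate to [-tau, tau]."""
    n = X_hat.shape[0]
    x = np.ones(n)                 # assumed initialization
    prev_fx = np.zeros(n)
    iterates = [x]
    for _ in range(num_iters):
        fx = denoiser(x)
        b = np.mean(denoiser_prime(x))          # empirical Onsager coefficient
        unclipped = X_hat @ fx - b * prev_fx    # ordinary AMP update
        x = np.clip(unclipped, -tau, tau)       # the only change relative to plain AMP
        prev_fx = fx
        iterates.append(x)
    return iterates

rng = np.random.default_rng(4)
n = 200
G = rng.standard_normal((n, n)) / np.sqrt(n)
X_hat = (G + G.T) / np.sqrt(2)     # stands in for the output of the cleaning step
relu = lambda z: np.maximum(z, 0.0)
relu_prime = lambda z: (z > 0).astype(float)
iterates = clipped_amp(X_hat, relu, relu_prime, tau=4.0, num_iters=5)
```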
Theorem 3.1 is a consequence of the following two lemmas, one for each step of Algorithm 3.4.
Lemma 3.5 (Efficient spectral cleaning).
Suppose is a symmetric matrix with i.i.d. entries of mean zero, variance , and subgaussian parameter . With probability over , Algorithm 3.7 run on any -principal minor corruption of with threshold value outputs in time a matrix which is a -restriction of and satisfies .
Lemma 3.6 (Success of AMP on restrictions).
Suppose is an matrix with i.i.d. entries of mean zero, variance , and subgaussian parameter . Suppose that is an -principal minor corruption of and is a -restriction of with . Then the clipped AMP iteration from Algorithm 3.4 on produces a vector such that with probability over the choice of .
When combined, these two lemmas immediately imply Theorem 3.1.
3.1 Spectral cleaning
The goal of this section is to prove Lemma 3.5. Here we present the algorithm to construct which is a -restriction of and has .
Algorithm 3.7 (Spectral cleaning of principal minor corruptions)
Input: A symmetric matrix , and a threshold value .
Operation:
1. Let .
2. While :
   (a) Let be the eigenvector of with eigenvalue of largest magnitude.
   (b) Sample with probability .
   (c) Zero out the -th row and column of .
Output: Matrix .
Note that critically we do not require that we exactly recover the corrupted rows and columns: all that matters is that we remove the indices that contribute the most to the spectral corruption.
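A direct sketch of Algorithm 3.7 follows. Two choices here are assumptions rather than statements from the paper: the index is sampled with probability proportional to the squared entry of the top eigenvector (consistent with the quantity bounded in the proof of Claim 3.8), and the top eigenvector is found with a full eigendecomposition instead of power iteration, which is slower but simpler. The function name and the demo parameters are hypothetical.

```python
import numpy as np

def spectral_clean(Y, threshold, rng):
    """Zero out rows/columns until the operator norm is at most `threshold`."""
    Y_hat = Y.copy()
    removed = []
    while True:
        evals, evecs = np.linalg.eigh(Y_hat)
        k = np.argmax(np.abs(evals))          # eigenvalue of largest magnitude
        if np.abs(evals[k]) <= threshold:
            break
        v = evecs[:, k]
        probs = v ** 2 / np.sum(v ** 2)       # assumed sampling distribution
        i = int(rng.choice(len(v), p=probs))
        Y_hat[i, :] = 0.0                     # zero out the i-th row ...
        Y_hat[:, i] = 0.0                     # ... and the i-th column
        removed.append(i)
    return Y_hat, removed

rng = np.random.default_rng(5)
n, m = 120, 10
G = rng.standard_normal((n, n)) / np.sqrt(n)
X = (G + G.T) / np.sqrt(2)
Y = X.copy()
Y[:m, :m] += 1.0                              # rank-one corruption of operator norm m
Y_hat, removed = spectral_clean(Y, threshold=2.5, rng=rng)
```

The returned `removed` list need not coincide with the corrupted indices, in line with the remark above that exact recovery of the corrupted set is not required.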
Proof of Lemma 3.5.
Certainly, the algorithm terminates since no index can be sampled more than once. We will show that with high probability, indices are removed. The runtime bound can be deduced from noting that we have to run power iteration at most once per index removal, each run taking time.
For convenience, let , and recall our threshold is . Since has independent subgaussian entries, with high probability . Let denote the set of corrupted indices in . Furthermore, let denote the matrix after each iteration of the while loop. Similarly define and (the non-zeroed out corrupted indices).
Note that if all indices in are removed, the while loop will terminate (it can terminate in other instances, but this is just one stopping condition). We show that with high probability we will reach within iterations (and thus is a -restriction of ) using a win-win analysis: either we reach a small operator norm before removing all of or we remove all of (which implies the remaining matrix has norm , because it is a principal minor of ). In particular, the crux of the argument is the following:
Claim 3.8
Let be the top eigenvector of and suppose . Then with high probability over ,
Note that this claim is equivalent to saying that at each iteration of the while loop there is at least a probability of removing some index from .
Proof of Claim.
Suppose that (as opposed to ). Let be such that all indices outside of are set to zero. Our goal is to lower bound . Notice by definition. Since is the top eigenvector of ,
where in the second to last step we used that and in the last step we used that, since is a principal minor of , w.h.p.
However, note that , which implies that . Since by assumption, this implies that . The proof of the case is identical up to a change of sign. ∎
To prove that our loop terminates in steps with high probability, define the stopping time . Now, let denote the indicator of whether the index removed between and was in , and note that each independently stochastically dominates a . Suppose that . Then, it follows that
which happens with exponentially small probability (this is equivalent to asking for the probability that ). Together, this implies that with high probability.∎
3.2 Analysis of clipped AMP on spectrally cleaned input
In this section we will prove Lemma 3.6. To begin, we examine the effect of a combination of principal minor and restriction corruptions. Suppose is an -principal minor corruption of , and suppose is a -restriction of . Let denote the set of rows in the support of , and let denote the set of rows in the support of . For simplicity, let (the set of corrupted rows which are not removed by the restriction). Then, the matrix evolves as follows:
In particular, if we let be the portion of the error matrix which survives the restriction, and then let be the remainder in , it follows that is either or . Furthermore, we will split into two sections: consisting of all entries in rows indexed by (in other words, ), and consisting of all entries in columns indexed by , except those covered by (in other words, . Pictorially, this can be represented via
As a warm-up, we show that each of these quantities is bounded in operator norm.
Proposition 3.9.
Suppose that . For the above definitions of , we have that
with high probability.
Proof.
Let us begin with . Begin by considering , which has each entry an independent subgaussian random variable. Applying standard matrix concentration arguments (e.g. Theorem 4.6.1 in [Ver18]), we have with high probability that . We can apply a similar argument to see that as well.
Now, consider . We then have that
as the operator norm of a principal minor is at most that of the original matrix. ∎
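The last step of this proof uses the fact that a principal minor has operator norm at most that of the full matrix. A quick numerical illustration, with arbitrary sizes chosen for the demo:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 80
G = rng.standard_normal((n, n)) / np.sqrt(n)
X = (G + G.T) / np.sqrt(2)

S = rng.choice(n, size=15, replace=False)   # indices of the principal minor
sub = X[np.ix_(S, S)]

norm_full = np.linalg.norm(X, 2)            # largest singular value of X
norm_sub = np.linalg.norm(sub, 2)           # largest singular value of the minor
```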
With this in mind, we are ready to prove Lemma 3.6, which we reprint here for clarity.
Lemma (Restatement of Lemma 3.6).
Suppose is an matrix with i.i.d. entries of mean zero, variance , and subgaussian parameter . Suppose that is an -principal minor corruption of and is a -restriction of with . Then the clipped AMP iteration from Algorithm 3.4 on produces a vector such that with probability over the choice of .
The proof follows from a few central claims. The first of these shows that clipping cannot substantially change how far we are from the true AMP iteration.
Proposition 3.10 (Clipping preserves error).
Define to be the unclipped version of (that is, the inner expression passed to ). Then,
with probability .
The next proposition aims to show that even though , and have constant operator norms, their row (or column) sparsity allow for controlling their effect on AMP iterates. Here we also introduce the shorthand .
Proposition 3.11 (Block-sparse corruptions have small error).
Suppose that and define . There exists a constant (independent of and but possibly dependent on ) such that each of , and are bounded by with probability .
The final proposition aims to show that applying polynomials to clipped AMP iterates cannot dramatically change closeness. Note that this is not true in general and requires both boundedness and state evolution to hold in our case.
Proposition 3.12 (Pseudo-Lipschitz functions preserve closeness of AMP iterates).
Suppose that , and let . Then, there exists a constant (independent of and but dependent on ) such that
with probability .
Together, these three propositions allow us to prove the lemma.
Proof of Lemma 3.6.
We prove by induction on the iteration . Certainly, so the base case is complete. Else, suppose we have proven the statement for all . We prove for .
By Proposition 3.10, we have that , so let us handle this first term. To decrease verbiage, let . By the Triangle Inequality and the definition of the AMP iteration, we have that
where for the last inequality we applied Proposition 3.11 and Proposition 3.12. Now, the Almost-Triangle Inequality and combining with Proposition 3.10 implies that
If the AMP iteration consists of Lipschitz denoisers, it follows that for all and thus . Else, notice that the power of can be at most , which completes the proof.
∎
We finish this section by proving the three propositions.
Proof of Proposition 3.10.
Note that is a -Lipschitz function. So, by the triangle inequality,
so it remains to bound this second quantity. We may apply Corollary 2.3 with (which is Lipschitz) and , which implies that
By choosing and taking , it follows that
from where the conclusion follows. ∎
Proof of Proposition 3.11.
Let be the indices in the support of , and let be the set of indices in the row-support of (and column-support of ), as in the figure at the beginning of the section. For a given vector , we will define to be the restriction of to . Then, note that
and similarly . To handle each of these, notice that . Therefore, we may apply Corollary 2.3 to deduce that
and similarly for . This implies the boundedness of and .
We cannot use the same argument for because is supported on all columns. Instead, let us recall that on its supported rows, so and we are trying to bound . Using the definition of the AMP iteration, we can rewrite
Therefore, and is a function of the iterates . Once more applying Corollary 2.3, it follows that
and we are done. ∎
Proof of Proposition 3.12.
We begin by applying the definition of . In particular, combined with the Almost-Triangle Inequality we find that
(4)
The last inequality holds because for each , . Therefore, it remains to handle the last sum.
We claim that
Indeed,
- If , then certainly the left side is bounded by the first term.
- Else, note that , where the absolute value protects against opposite signs. Therefore, in this latter case we have that the left side is bounded by the second term.
Summing over , it follows that
Applying Corollary 2.3 to this second term with , (which is Lipschitz), and , it follows that
by taking and having . Therefore, plugging this all back in to (4), we have that
as desired, assuming that . ∎
4 AMP is robust to small spectral perturbations
Here we argue that AMP is robust to spectral perturbations.
Lemma 4.1.
Suppose that has independent entries of mean , variance , and subgaussian parameter . Let be an AMP algorithm consisting of Lipschitz denoiser functions with Lipschitz constant at most , and let denote the output of the -step AMP algorithm on input , and denote the output of the same algorithm on input for any satisfying . Then there exists a universal constant such that with probability over ,
Since the starting iterate and has entries of variance , the scaling is of the correct order for reasonable denoisers, in which case the above implies that are -correlated.
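The qualitative content of Lemma 4.1 can be observed numerically. The sketch below is an illustration under assumed choices that do not come from the paper: a `tanh` denoiser (which is 1-Lipschitz), an all-ones starting point, an empirical Onsager coefficient, and a rank-one perturbation of operator norm `delta`. It compares the AMP outputs on the clean and perturbed inputs.

```python
import numpy as np

def amp_final(X, num_iters):
    """Final iterate of a simplified AMP with tanh denoiser and all-ones start."""
    n = X.shape[0]
    x, prev_fx = np.ones(n), np.zeros(n)
    for _ in range(num_iters):
        fx = np.tanh(x)
        b = np.mean(1.0 - fx ** 2)              # average derivative of tanh
        x, prev_fx = X @ fx - b * prev_fx, fx   # right-hand side uses the old prev_fx
    return x

rng = np.random.default_rng(7)
n, T, delta = 300, 5, 1e-6
G = rng.standard_normal((n, n)) / np.sqrt(n)
X = (G + G.T) / np.sqrt(2)

u = rng.standard_normal(n)
u /= np.linalg.norm(u)
E = delta * np.outer(u, u)                      # symmetric, operator norm delta

x_clean = amp_final(X, T)
x_pert = amp_final(X + E, T)
rel_err = np.linalg.norm(x_pert - x_clean) / np.linalg.norm(x_clean)
```

In this toy setting the relative error stays small for a small `delta`, though it can grow with the number of iterations, consistent with the iteration-dependent constant in the lemma.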
Proof.
Let us denote the iterates to be and for and , respectively, and let . When , so the statement trivially holds. Now assuming we have shown this for all , we will now show it for . We will use the shorthand . We may expand the expression for :
And since with high probability and from the subgaussianity of , and by assumption, for a constant sufficiently large,
To control the first term, we invoke the Lipschitzness of ,
For the second term, we have from Corollary 2.3 (applied with ) that . Combining these facts, we find that
and so
Acknowledgments
We thank Spencer Compton, Sam Hopkins and Andrea Montanari for helpful discussions.
References
- [BLM12] Mohsen Bayati, Marc Lelarge, and Andrea Montanari. Universality in polytope phase transitions and iterative algorithms. In Proceedings of the 2012 IEEE International Symposium on Information Theory, ISIT 2012, Cambridge, MA, USA, July 1-6, 2012, pages 1643–1647. IEEE, 2012.
- [BM11] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.
- [BMN20] Raphael Berthier, Andrea Montanari, and Phan-Minh Nguyen. State evolution for approximate message passing with non-separable functions. Information and Inference: A Journal of the IMA, 9(1):33–79, 2020.
- [Bol14] Erwin Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.
- [CL20] Wei-Kuo Chen and Wai-Kit Lam. Universality of approximate message passing algorithms. CoRR, abs/2003.10431, 2020.
- [CZK14] Francesco Caltagirone, Lenka Zdeborová, and Florent Krzakala. On convergence of approximate message passing. In 2014 IEEE International Symposium on Information Theory, pages 1812–1816. IEEE, 2014.
- [DdHS23] Jingqiu Ding, Tommaso d’Orsi, Yiding Hua, and David Steurer. Reaching Kesten-Stigum threshold in the stochastic block model under node corruptions. In The Thirty Sixth Annual Conference on Learning Theory, pages 4044–4071. PMLR, 2023.
- [DdNS22] Jingqiu Ding, Tommaso d’Orsi, Rajai Nasser, and David Steurer. Robust recovery for stochastic block models. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 387–394. IEEE, 2022.
- [DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019.
- [DM14] Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse PCA. In 2014 IEEE International Symposium on Information Theory, pages 2197–2201. IEEE, 2014.
- [DMM09] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
- [FVRS22] Oliver Y Feng, Ramji Venkataramanan, Cynthia Rush, and Richard J Samworth. A unifying tutorial on approximate message passing. Foundations and Trends in Machine Learning, 15(4):335–536, 2022.
- [HSSS16] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 178–191, 2016.
- [IS24] Misha Ivkov and Tselil Schramm. Semidefinite programs simulate approximate message passing robustly. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 348–357, 2024.
- [JLST21] Arun Jambulapati, Jerry Li, Tselil Schramm, and Kevin Tian. Robust regression revisited: Acceleration and improved estimation rates. Advances in Neural Information Processing Systems, 34:4475–4488, 2021.
- [JP24] Chris Jones and Lucas Pesenti. Diagram analysis of iterative algorithms. CoRR, abs/2404.07881, 2024.
- [KMS+12] Florent Krzakala, Marc Mézard, Francois Sausset, Yifan Sun, and Lenka Zdeborová. Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08009, 2012.
- [Mon12] Andrea Montanari. Graphical models concepts in compressed sensing. Compressed Sensing: Theory and Applications, page 394, 2012.
- [Mon21] Andrea Montanari. Optimization of the Sherrington–Kirkpatrick Hamiltonian. SIAM Journal on Computing, (0):FOCS19–1, 2021.
- [MR15] Andrea Montanari and Emile Richard. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Transactions on Information Theory, 62(3):1458–1484, 2015.
- [MRW24] Sidhanth Mohanty, Prasad Raghavendra, and David X Wu. Robust recovery for stochastic block models, simplified and generalized. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 367–374, 2024.
- [RSFS19] Sundeep Rangan, Philip Schniter, Alyson K Fletcher, and Subrata Sarkar. On the convergence of approximate message passing with arbitrary matrices. IEEE Transactions on Information Theory, 65(9):5339–5351, 2019.
- [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
Appendix A Statistics of AMP iterate entries
The most important theorem for us is known as state evolution, which intuitively states that the statistics of the entries of the iterates of an AMP iteration behave somewhat like the statistics of Gaussians. There are two versions of state evolution that will be important to us, corresponding to polynomial and Lipschitz iterations.
Theorem A.1 (Polynomial State Evolution (e.g. [BLM12, Theorem 4] or [JP24, Theorem 4.21 and Theorem 5.2])).
Suppose that is an AMP iteration corresponding to polynomial denoisers, with input a symmetric matrix having i.i.d. -subgaussian entries with mean and variance . Then, for any function ,
where form an appropriate Gaussian process with covariance independent of .
To prove a similar theorem in the Lipschitz setting, we note from a remark in [CL20] that [BLM12, Proposition 6] can be extended to handle Lipschitz denoiser functions, which implies the same universality as for polynomial denoisers.
Theorem A.2 (Lipschitz State Evolution (e.g. [FVRS22, Theorem 2.3])).
Suppose that is an AMP iteration corresponding to Lipschitz denoisers, with input a symmetric matrix having i.i.d. -subgaussian entries with mean and variance . Then, for any function ,
where form an appropriate Gaussian process with covariance independent of .
As noted by [FVRS22, Remark 2.4], state evolution holds for Lipschitz denoisers when we consider instead of in the AMP iteration. By adapting the version of this proof presented in [BMN20, Corollary 2], we may also substitute for when considering polynomial denoisers.
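To make the statement of state evolution concrete, the following minimal sketch compares the empirical second moment of each AMP iterate with the variance predicted by the standard scalar recursion for symmetric AMP, in which the next variance is the Gaussian second moment of the denoiser at the current variance. Everything specific here is an assumption made for illustration: Gaussian matrix entries, a `tanh` denoiser, an all-ones initialization, and the helper names `gaussian_second_moment`, `predicted_variances`, and `empirical_variances`.

```python
import numpy as np


def sample_symmetric(n, rng):
    """Symmetric matrix whose entries have mean 0 and variance 1/n (Gaussian here, as an assumption)."""
    g = rng.normal(size=(n, n)) / np.sqrt(n)
    upper = np.triu(g, k=1)
    return upper + upper.T + np.diag(np.diag(g))


def gaussian_second_moment(f, sigma, degree=80):
    """E[f(sigma * Z)^2] for Z ~ N(0, 1), computed by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(degree)
    return float(np.sum(weights * f(sigma * nodes) ** 2) / np.sqrt(2.0 * np.pi))


def predicted_variances(steps, f=np.tanh):
    """Scalar state evolution recursion; the first variance is f(1)^2 for the all-ones start."""
    variances = [float(f(1.0) ** 2)]
    for _ in range(steps - 1):
        variances.append(gaussian_second_moment(f, np.sqrt(variances[-1])))
    return variances


def empirical_variances(n, steps, seed=0):
    """Run AMP with Onsager correction and record the mean squared entry of each iterate."""
    rng = np.random.default_rng(seed)
    w = sample_symmetric(n, rng)
    x = np.ones(n)
    prev_f = np.zeros(n)
    observed = []
    for _ in range(steps):
        f = np.tanh(x)
        onsager = (1.0 - f ** 2).mean()  # average derivative of tanh at the current iterate
        x, prev_f = w @ f - onsager * prev_f, f
        observed.append(float(np.mean(x ** 2)))
    return observed


if __name__ == "__main__":
    for t, (p, e) in enumerate(zip(predicted_variances(3), empirical_variances(2000, 3)), start=1):
        print(f"t = {t}: predicted {p:.4f}, empirical {e:.4f}")
```

The agreement is only approximate at finite dimension; the theorems above concern the limit as the dimension grows.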
Consequences of state evolution for Pseudo-Lipschitz functions
We collect some facts about Pseudo-Lipschitz functions, after which we can prove Corollary 2.3, which implies the concentration we need for order statistics of the AMP iterates.
Proposition A.3.
Suppose that is with Lipschitz constant and is with Lipschitz constant . Furthermore, suppose that . Then, with Lipschitz constant .
Proof.
We may write
We can see that and by applying centeredness. Therefore, for the first term we obtain that
It remains to show that the product of the latter two terms is at most . We split into cases according to whether . If this is at most , the above product is bounded by . Else, suppose without loss of generality that and . Then, we can bound the product by . This implies that
and symmetrically for . Thus, we have shown that with Lipschitz constant .
∎
Finally, we prove Corollary 2.3.
Corollary (Restatement of Corollary 2.3).
Suppose that is for and is with and . Suppose is an AMP iterate resulting from the application of Pseudo Lipschitz denoisers on input a symmetric matrix with i.i.d. -subgaussian entries having mean and variance . Furthermore, let be a constant (possibly depending on ). Then, the following hold:
- For any ,
- For every with ,
Before the proof, note that a priori we have no control over (this is not Pseudo-Lipschitz at any degree). However, we show that we can approximate it above and below by Pseudo-Lipschitz functions and thus still reason about it.
Claim A.4.
There exists a sequence of Lipschitz functions and each having Lipschitz constant such that and as these bounding functions converge to .
Proof.
Define
which by definition satisfy the given constraint. ∎
This implies that state evolution holds for indicators (and we can treat them as if they are another Lipschitz function).
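One standard way to realize such a sandwich, shown here for the indicator of a threshold event, is a pair of clipped linear ramps. This is a minimal sketch and not necessarily the construction used in the proof above: the threshold `tau`, the slope parameter `k`, and the function names are assumptions introduced for illustration.

```python
import numpy as np


def indicator(x, tau):
    """Indicator of the event x >= tau."""
    return (np.asarray(x) >= tau).astype(float)


def lower_ramp(x, tau, k):
    """k-Lipschitz minorant: equals 0 up to tau and rises linearly to 1 at tau + 1/k."""
    return np.clip(k * (np.asarray(x) - tau), 0.0, 1.0)


def upper_ramp(x, tau, k):
    """k-Lipschitz majorant: equals 0 up to tau - 1/k and rises linearly to 1 at tau."""
    return np.clip(k * (np.asarray(x) - tau) + 1.0, 0.0, 1.0)


if __name__ == "__main__":
    xs = np.linspace(-3.0, 3.0, 6001)
    for k in (1, 10, 100):
        gap = np.mean(upper_ramp(xs, 0.5, k) - lower_ramp(xs, 0.5, k))
        print(f"k = {k}: average gap between the bounds on the grid = {gap:.4f}")
```

The two ramps differ only on an interval of width 2/k around the threshold, so they converge to the indicator away from the threshold as k grows, at the price of a Lipschitz constant that grows with k.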
Proof of Corollary 2.3.
We begin with the first bullet point. By state evolution, we have that
where is the covariance matrix of . Note that for any , (by case analysis on whether ).
Therefore, we find that
By Proposition A.3, we have that .
Define , which is also and centered with Lipschitz constant at most . In particular, we thus have that and
Now, we may compute that
where in the last two steps we used that .
Now, we may use this to prove the second bullet point. We can assume without loss of generality that : otherwise, note that , so at the cost of a factor of we may consider centering . Furthermore, note that we only need to consider the top quantile of indices to prove the statement of the corollary. Now, for any , consider writing
Therefore, dividing both sides by and taking the implies that
Our goal is to show that, by choosing , the latter expectation is , which would complete the proof. By the above result, we have that
(since ). Take and . From here, we find that
Thus, we have that
which is exactly as desired. ∎