Fast, robust approximate message passing

Misha Ivkov Stanford University. [email protected]. Supported by NSF Graduate Research Fellowship. Tselil Schramm Stanford University. [email protected]. Supported by NSF CAREER award # 2143246.

(November 5, 2024)

Abstract

We give a fast, spectral procedure for implementing approximate message passing (AMP) algorithms robustly. For any quadratic optimization problem over symmetric matrices $X$ with independent subgaussian entries, and any separable AMP algorithm $\mathcal{A}$, our algorithm performs a spectral pre-processing step and then mildly modifies the iterates of $\mathcal{A}$. If given the perturbed input $X+E\in\mathbb{R}^{n\times n}$ for any $E$ supported on an $\varepsilon n\times\varepsilon n$ principal minor, our algorithm outputs a solution $\hat{v}$ which is guaranteed to be close to the output of $\mathcal{A}$ on the uncorrupted $X$, with $\|\mathcal{A}(X)-\hat{v}\|_{2}\leqslant f(\varepsilon)\|\mathcal{A}(X)\|_{2}$, where $f(\varepsilon)\to 0$ as $\varepsilon\to 0$ depending only on $\varepsilon$.

1 Introduction

Approximate Message Passing (AMP) is a family of algorithmic methods which generalize matrix power iteration. Suppose we are given a symmetric matrix $X\in\mathbb{R}^{n\times n}$, and our goal is to maximize the quadratic form $v^{\top}Xv$ over vectors $v$ in some constraint set $K$. The basic AMP algorithm starts from some initialization $x^{(0)}\in\mathbb{R}^{n}$ and computes iterates $x^{(1)},x^{(2)},\ldots$ by setting $x^{(t+1)}\approx Xf(x^{(t)})$, where the “denoiser” $f$ is a function (of the algorithm designers’ choosing) from $\mathbb{R}\to\mathbb{R}$ applied coordinate-wise. (The $\approx$ relation hides a lower-order additive term, the “Onsager correction,” which depends on $x^{(t)}$; for the sake of simplicity we ignore this in the present discussion.) The goal of the “powering” action, $Xx^{(t)}$, is to increase the quadratic form, while the denoiser $f$ is chosen to bring $f(x^{(t)})$ close to the constraint set $K$.

AMP algorithms are extremely popular in high-dimensional statistics. In this context, given a prior distribution over the matrix $X$, it is often possible to optimize the design of the denoisers $f$ in such a way that AMP gives an FPTAS, in that $x^{(t)}$ obtains a $(1-\varepsilon)$-optimal solution for $t$ large enough as a function of $\varepsilon$. Introduced initially as a generalization of Belief Propagation methods from statistical physics [Bol14, DMM09, BM11], AMP algorithms are now state-of-the-art for a variety of average-case optimization problems, including compressed sensing [DMM09], sparse Principal Components Analysis (PCA) [DM14], linear regression [DMM09, BM11, KMS⁺12], non-negative PCA [MR15], and more (many additional examples may be found in the surveys [Mon12, FVRS22]). One especially notable recent application is the breakthrough work of Montanari for optimizing the Sherrington-Kirkpatrick Hamiltonian, an average-case version of max-cut [Mon21].

One major drawback of AMP algorithms is that they are not robust. The NP-hardness of quadratic optimization means that, obviously, one cannot hope for the optimality of AMP on average-case inputs to generalize to arbitrary inputs $X$ . But even structured perturbations can throw AMP off [CZK14, RSFS19]; for example, an additive perturbation to $X$ by a rank- $1$ matrix of large norm, or planting a principal minor of uniform sign (as described in [IS24]).

Our prior work addressing this issue [IS24] shows that for a certain class of adversarial corruptions, AMP can be simulated robustly by polynomial-sized semidefinite programming relaxations in the “local statistics hierarchy.” While this result is a proof of concept that a robust version of AMP is possible, it is perhaps more interesting from a complexity-theoretic perspective than an algorithmic one: the semidefinite programs are of size $n^{\exp(t)}$ , where $t$ is the number of AMP iterations. When AMP is an FPTAS, the algorithm of [IS24] gives a robust PTAS, but the running time is too slow to feasibly implement on any computer.

In the present work, we obtain simple and fast spectral algorithms which run in time $O(n^{3})$ , while not just matching but even improving on the robustness guarantees of [IS24]. In the “spectral algorithms from sum-of-squares analyses” line of work (initiated in [HSSS16]), our result stands out as giving a particularly dramatic reduction in running time, as well as in yielding a significantly simpler analysis.

1.1 Setup and definitions

We give some necessary definitions of AMP and the noise model that we consider.

Definition 1.1 (AMP algorithm).

An Approximate Message Passing algorithm is specified by a sequence of denoiser functions $\mathcal{F}=f_{0},f_{1},f_{2},\ldots$ , with $f_{t}:\mathbb{R}^{t+1}\to\mathbb{R}$ for each $t\in\mathbb{N}$ . It takes as input a symmetric $n\times n$ matrix $X$ , a number of iterations $T\in\mathbb{N}$ , and produces a sequence of iterates $x^{(0)},x^{(1)},\ldots,x^{(T)}$ , with $x^{(0)}=\vec{1}$ and

x^{(t+1)}=Xf_{t}(x^{(t)},x^{(t-1)},\ldots,x^{(0)})-\Delta_{t}(x^{(t)},x^{(t-1)},\ldots,x^{(0)}),

where $f_{t}$ is applied coordinate-wise, and $\Delta_{t}$ is the Onsager correction term for decreasing correlations between iterates and is fully determined by $\mathcal{F}$ (see Definition 2.1). AMP algorithms often also come with a rounding procedure which is applied to the final iterate, in order to ensure it satisfies the optimization constraints.

We note that we are considering separable AMP algorithms (where the denoisers are applied coordinate-wise) with fixed starting point $x^{(0)}=\vec{1}$ . In full generality AMP may relax both of these criteria, but the majority of AMP analyses are compatible with these assumptions.

Example 1.2 (non-negative PCA).

In the non-negative principal components analysis (PCA) problem, one is given a matrix $X\in\mathbb{R}^{n\times n}$ and asked to maximize $v^{\top}Xv$ over non-negative unit vectors $v\geqslant 0$. The AMP algorithm which starts from $x^{(0)}=\vec{1}$ and uniformly chooses the separable denoiser $f_{s}(x^{(s)},\ldots,x^{(0)})=f(x^{(s)})$, with $f(x)=\max(x,0)$, is an FPTAS for non-negative PCA on $X$ with i.i.d. subgaussian entries [MR15]. (Technically $x^{(t)}$ may be neither a unit vector nor non-negative, but AMP algorithms such as this one usually include a final “rounding” step; in this case, the rounding is just applying $f(x)=\max(x,0)$ followed by projection to the unit ball.) In this case, up to the Onsager correction, AMP coincides with projected gradient ascent with “infinite” step size.
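The iteration in this example can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the helper names `sample_symmetric_gaussian` and `amp_nonneg_pca` are hypothetical, the input is Gaussian, and the Onsager coefficient is estimated empirically from the current iterate (the fraction of positive coordinates, which is the average derivative of $\max(x,0)$) rather than taken as an expectation.

```python
import numpy as np

def sample_symmetric_gaussian(n, rng):
    """Symmetric matrix whose off-diagonal entries are N(0, 1/n)."""
    G = rng.standard_normal((n, n))
    return (G + G.T) / np.sqrt(2 * n)

def amp_nonneg_pca(X, T):
    """Run T AMP steps with the denoiser f(x) = max(x, 0), then round."""
    n = X.shape[0]
    relu = lambda z: np.maximum(z, 0.0)
    x_prev = None                      # holds x^{(t-1)}
    x = np.ones(n)                     # x^{(0)} = all-ones vector
    for _ in range(T):
        fx = relu(x)
        if x_prev is None:
            onsager = np.zeros(n)      # no correction on the first step
        else:
            b = np.mean(x > 0)         # empirical average derivative of f at x^{(t)}
            onsager = b * relu(x_prev) # b_t * f(x^{(t-1)})
        x_prev, x = x, X @ fx - onsager
    v = relu(x)                        # rounding: apply f, then normalize
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
X = sample_symmetric_gaussian(500, rng)
v = amp_nonneg_pca(X, T=10)
objective = v @ X @ v                  # quadratic form attained by the rounded output
```

Because $\max(x,0)$ is positively homogeneous, leaving the iterates unnormalized only changes their scale, which the final normalization removes.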

We will allow adversarially-chosen perturbations in the following model.

Definition 1.3 ( $\varepsilon$ -principal minor corruption).

Given matrices $X,Y\in\mathbb{R}^{n\times n}$, we say $Y$ is an $\varepsilon$-principal minor corruption of $X$ if $Y-X$ is supported on an $\varepsilon n\times\varepsilon n$ principal minor.
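To make the corruption model concrete, the following sketch builds one $\varepsilon$-principal minor corruption. The function name `principal_minor_corruption`, the randomly chosen index set `S`, and the constant block `E_block` are illustrative assumptions; an adversary may pick any index set of size $\varepsilon n$ and any block.

```python
import numpy as np

def principal_minor_corruption(X, S, E_block):
    """Return Y = X + E, where E is zero outside the principal minor indexed by S."""
    Y = X.copy()
    Y[np.ix_(S, S)] += E_block
    return Y

rng = np.random.default_rng(1)
n, eps = 200, 0.05
m = round(eps * n)                          # side length of the corrupted minor
G = rng.standard_normal((n, n))
X = (G + G.T) / np.sqrt(2 * n)              # symmetric, off-diagonal variance 1/n
S = rng.choice(n, size=m, replace=False)    # index set (random here, adversarial in general)
E_block = 10.0 * np.ones((m, m))            # e.g. a planted block of uniform sign
Y = principal_minor_corruption(X, S, E_block)

support = np.argwhere(Y != X)               # positions where Y and X differ
```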

A mean- $0$ random variable $\bm{X}$ is said to be $\sigma$ -subgaussian if for each integer $k\in\mathbb{N}$ , $\operatorname*{\mathbf{E}}[|\bm{X}|^{k}]\leqslant\sigma^{k}k^{k/2}$ . For example, a mean- $0$ Gaussian with variance $\sigma^{2}$ is $\sigma$ -subgaussian, and a uniformly random sign $\in\{\pm 1\}$ is $1$ -subgaussian. Note that rescaling a $\sigma$ -subgaussian variable $\bm{X}$ to $C\bm{X}$ for constant $C$ rescales the subgaussian parameter to $C\sigma$ .
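As a sanity check of the two examples just given, the moment condition can be verified numerically for small $k$. The sketch below uses the standard closed form $\mathbf{E}|g|^{k}=\sigma^{k}2^{k/2}\Gamma(\frac{k+1}{2})/\sqrt{\pi}$ for a centered Gaussian; the function names and the range $k\leqslant 20$ are arbitrary illustrative choices.

```python
import math

def gaussian_abs_moment(k, sigma=1.0):
    """E|g|^k for g ~ N(0, sigma^2), via 2^{k/2} Gamma((k+1)/2) / sqrt(pi)."""
    return sigma**k * 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)

def moment_bound(k, sigma=1.0):
    """Right-hand side sigma^k * k^{k/2} of the subgaussian moment condition."""
    return sigma**k * k ** (k / 2)

ks = range(1, 21)
gaussian_ok = all(gaussian_abs_moment(k) <= moment_bound(k) for k in ks)
sign_ok = all(1.0 <= moment_bound(k) for k in ks)   # |X| = 1 for a uniformly random sign
```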

1.2 Results

Our main theorem is the following.

Theorem 1.4 (Informal version of Theorem 3.1).

Suppose $\mathcal{A}$ is a $T$ -step AMP algorithm with $O(1)$ -Lipschitz or polynomial denoiser functions. Let $X$ be a symmetric $n\times n$ matrix with i.i.d. $\frac{O(1)}{\sqrt{n}}$ -subgaussian entries having mean $0$ and variance $\frac{1}{n}$ , and let $v_{\mathrm{AMP}}(X)$ be the output of $\mathcal{A}$ on $X$ . Then there exists an algorithm which when given access to an $\varepsilon$ -principal minor corruption $Y$ produces in time $O(\varepsilon n^{3}\log n)$ a vector $\hat{v}(Y)$ satisfying

\|\hat{v}(Y)-v_{\mathrm{AMP}}(X)\|^{2}\leqslant O(\varepsilon\log^{d}\tfrac{1}{\varepsilon})\cdot\|v_{\mathrm{AMP}}(X)\|^{2},

with probability $1-o(1)$ over the randomness of $X$ , where $d=1$ if the denoisers are Lipschitz, and $d=k^{T}$ if the denoisers are degree $\leqslant k$ polynomials.

In words, given access to an adversarially corrupted matrix $Y$, our algorithm can find a vector $\hat{v}(Y)$ which is close to the output of AMP on the uncorrupted matrix $X$. (Since $X$ has bounded operator norm, this implies that $\hat{v}(Y)$ has objective value $\hat{v}^{\top}X\hat{v}$ within an additive $\tilde{O}(\sqrt{\varepsilon})$ of the objective of $v_{\mathrm{AMP}}(X)$.) The result improves on that of [IS24] in that it (1) runs in time $O(\varepsilon n^{3}\log n)$ rather than $n^{\exp(T)}$, and (2) guarantees that $\|\mathcal{A}(X)-\hat{v}(Y)\|\leqslant f(\varepsilon)\|\mathcal{A}(X)\|$ for a function $f(\varepsilon)\to_{\varepsilon}0$ which is independent of $n$ (but does depend on $T$), whereas in [IS24] the function $f(\varepsilon)$ included a multiplicative factor of $\mathrm{poly}\log(n)$, and thus was trivial unless $\varepsilon=o(1)$.

As noted in [IS24], an equivalent result is information-theoretically impossible under the stronger corruption model in which $X-Y$ is supported on $\varepsilon n^{2}$ arbitrary entries (unless $\varepsilon=o(n^{-1/2})$ ).

As a direct corollary, we can robustly simulate Montanari’s algorithm [Mon21] for finding the ground state of the Sherrington-Kirkpatrick Hamiltonian—that is, an approximately optimal solution for Max-Cut with i.i.d. Gaussian edge weights.

Corollary 1.5 (Fast, robust Sherrington Kirkpatrick).

Suppose $X$ is a symmetric matrix with entries sampled i.i.d. from $\mathcal{N}(0,\frac{1}{n})$. Then there is an algorithm which when run on an $\varepsilon$-principal minor corruption $Y$ of $X$, with probability $1-o(1)$ produces in time $O(\varepsilon n^{3}\log n)$ a unit vector $\hat{v}(Y)\in\{\pm 1/\sqrt{n}\}^{n}$ achieving objective value $\hat{v}(Y)^{\top}X\hat{v}(Y)\geqslant\mathrm{OBJ_{AMP}}-O(\sqrt{\varepsilon\log\frac{1}{\varepsilon}})$.

The value $\mathrm{OBJ_{AMP}}$ is the objective value achieved by Montanari’s AMP algorithm; modulo a widely-believed conjecture in statistical physics, $\mathrm{OBJ_{AMP}}$ approaches $\mathrm{OPT}=\max_{v\in\{\pm 1/\sqrt{n}\}^{n}}v^{\top}Xv\approx 1.52$ as $T\to\infty$. The corollary follows from Theorem 1.4 because Montanari’s denoisers are Lipschitz, and the rounding scheme applied to place the final iterate in the hypercube is also Lipschitz.

In Section 4, we give a simple proof (along similar lines as the proof of Theorem 1.4) that AMP is robust to adversarial perturbations of small spectral norm. This fact is folklore, but we feel our proof is quite simple and may be of interest.

1.3 Experiments

Our algorithm is fast enough that it can be easily implemented and run on a laptop. We have run some experiments to demonstrate the utility of our method. We consider the non-negative PCA objective described in Example 1.2. In [MR15], it was shown that AMP with denoiser function $f(x)=\max(0,x)$ is an FPTAS for $\mathrm{OPT}=\max_{v\geqslant 0,\|v\|=1}v^{\top}Xv=\sqrt{2}$ .

In Figure 1, we show the result for $n=3000,\varepsilon=0.02$ , with the adversarial corruption given by perturbing an $\varepsilon n\times\varepsilon n$ principal minor by sampling two independent rank $50=\frac{5}{6}\varepsilon n$ Wishart matrices, each normalized to have expected Frobenius norm $100$ , and adding one and subtracting the other. Without having taken pains to optimize the running time, the implementation in Python on a laptop takes less than 5 minutes. We have plotted (1) the correlation of our algorithm’s output, $\hat{v}(Y)$ , with $v_{\mathrm{AMP}}(X)$ , and (2) the objective value of the output for the uncorrupted matrix $X$ , $\hat{v}(Y)^{\top}X\hat{v}(Y)$ , as a function of the number of iterations. For comparison, we plot in Figure 1 the performance of (a) AMP on the corrupt matrix, $v_{\mathrm{AMP}}(Y)$ , and (b) AMP on a “naive” spectral cleaning $\tilde{Y}$ of $Y$ , given by deleting all larger-than-expected eigenvalues. Our procedure performs much better than AMP on the corrupt input. Empirically, the naive cleaning performance is comparable to ours, but unlike our algorithm, the naive procedure does not come with provable guarantees for arbitrary perturbations (and we suspect the naive procedure may be succeeding due to a small- $n$ effect).

Figure 1: Plot of the correlation of the vector $\hat{v}(Y)$ with the output of AMP on the “clean” matrix $X$, and of the objective value attained by $\hat{v}(Y)$ on the clean matrix $X$.
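The corruption used in this experiment can be sketched as follows. This is a hedged reconstruction from the description above, not the experiment's actual code: the function name `wishart_difference_block` is hypothetical, and each Wishart matrix is rescaled so that its realized Frobenius norm equals $100$, as a simple stand-in for normalizing the expected Frobenius norm.

```python
import numpy as np

def wishart_difference_block(m, rank, frob, rng):
    """m x m block W1 - W2, each W a rank-`rank` Wishart matrix rescaled to Frobenius norm `frob`."""
    def scaled_wishart():
        A = rng.standard_normal((m, rank))
        W = A @ A.T                              # rank-`rank` Wishart matrix
        return frob * W / np.linalg.norm(W, "fro")
    return scaled_wishart() - scaled_wishart()

n, eps = 3000, 0.02
m = round(eps * n)                               # side length of the corrupted minor
E_block = wishart_difference_block(m, rank=50, frob=100.0,
                                   rng=np.random.default_rng(0))
```

The block `E_block` would then be added to an $\varepsilon n\times\varepsilon n$ principal minor of $X$ to form $Y$.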

1.4 Discussion

We give a fast spectral algorithm for simulating AMP under adversarial principal minor corruptions. Our algorithm is an implementation of the “spectral algorithms from sum-of-squares (SoS) analyses” strategy introduced in [HSSS16]. We find it to be a particularly striking example of this strategy—not only was the running time reduced from $n^{\exp(T)}$ to $O(n^{3})$ , but also, the analysis very transparently mimics/distills that of [IS24] to yield a much cleaner argument. We draw a comparison to previous spectral-to-SoS analyses in robust statistics, most of which have been based on a “filtering” approach (e.g. [JLST21, DKK⁺19]); in the filtering algorithms, the non-SoS analysis required significant additional tools. Another fitting comparison is to recent works obtaining robust spectral algorithms for community recovery in the stochastic block model [MRW24, DdHS23, DdNS22], where it was important to have a very fine-grained understanding of the spectrum of specific matrices. In our case, we are able to get away with a much simpler analysis.

Though we have improved on the result in [IS24] in terms of running time and the robustness-accuracy tradeoff, we differ from our prior work in one aspect: we require a description of the denoisers $\mathcal{F}$ used in the AMP algorithm $\mathcal{A}$ , whereas the algorithm in [IS24] has access only to the low-degree moments of the joint distribution over $X,\mathcal{A}(X)$ . We find it unlikely that a fast algorithm could succeed without a description of $\mathcal{F}$ , but we pose this as a question nonetheless.

Another question is whether our error guarantees are optimal, as a function of the number of AMP iterations $T$ . In our theorem, the $\tilde{O}(\sqrt{\varepsilon})$ hides factors that grow with the number of AMP iterations; however our experiments (Figure 1) seem to suggest that the error stabilizes—is this a small $n$ effect? Or perhaps an artifact of the specific perturbation from our experiments?

One clear direction for future work is making AMP robust when the input matrix $X$ has planted structure, rather than just having i.i.d. subgaussian entries. For example, AMP has been a successful algorithm for “spiked matrix models” in which $X=G+\lambda uu^{\top}$ with $G$ a Gaussian matrix and $uu^{\top}$ a rank-1 spike, the goal often being to find $u$ given $X$ . In this case, it is not completely clear which noise model to study. In some cases (e.g. when $u$ is sparse) a principal minor corruption could simply erase the spike $uu^{\top}$ . However, it is an interesting question whether our techniques can be extended to this case—currently, our algorithm incorporates information about i.i.d. subgaussian variables, which makes it inappropriate for planted models (the same is true of [IS24]).

Finally, it is interesting to consider alternative corruption models. The principal minor corruption is tractable to study, and the fact that it is adversarial makes it a powerful model. We know from [IS24] that a similar result is information-theoretically impossible under the strongest sparse adversarial corruption model, in which an arbitrary subset of $\varepsilon n^{2}$ entries is perturbed. However, it would be interesting to consider alternative corruption models that more faithfully model the distribution shift one expects to see in practice, for example in the application of compressed sensing.

1.5 Technical overview

Though the proof of Theorem 1.4 is not long, we briefly summarize the main ideas here. For the sake of simplicity, in this technical overview we pretend that the AMP iteration has the form $x^{(t)}=Xf(x^{(t-1)})$ , ignoring the Onsager correction and the dependence on more than one prior iterate.

Recall that we are given an $\varepsilon$-principal minor corruption $Y$ of $X$. The fact that $X$ has i.i.d. subgaussian entries of variance $\frac{1}{n}$ implies that with high probability, $\|X\|_{\operatorname{\mathsf{op}}}=O(1)$. The first step of our algorithm is a spectral procedure which removes $O(\varepsilon n)$ rows and columns of $Y$, producing a matrix $\hat{Y}$ with $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}=O(1)$. Then, we run a modified version of the AMP algorithm on the cleaned input matrix $\hat{Y}$, producing iterates $y^{(1)},y^{(2)},\ldots$ just as the original AMP algorithm would have, except that at each iteration we clip the entries $y^{(t)}=\hat{Y}f(\operatorname{\mathsf{clip}}(y^{(t-1)}))$, so that the magnitude of all entries of $\operatorname{\mathsf{clip}}(y^{(t-1)})$ does not exceed the $\varepsilon$-quantile value $O(\mathrm{poly}\log\frac{1}{\varepsilon})$ of the entries in a typical iterate $x^{(t-1)}$ from a clean input matrix. (In the proof we choose the threshold to not exactly correspond to the $\varepsilon$-quantile, but this choice would have also worked and is simpler for the sake of this overview.)

We argue that $\|y^{(t)}-x^{(t)}\|\leqslant\tilde{O}(\sqrt{\varepsilon})\|x^{(t)}\|$ by induction on $t$. In the base case, $t=0$, the iterates are identical as $x^{(0)}=\vec{1}=y^{(0)}$. Now for $t\geqslant 1$, suppose that $x^{(t)}$ is the (unobserved) iterate AMP would have produced on $X$. Then

\begin{align}
\left\|y^{(t)}-x^{(t)}\right\| &=\left\|\hat{Y}f(\operatorname{\mathsf{clip}}(y^{(t-1)}))-Xf(x^{(t-1)})\right\| \notag\\
&\leqslant\left\|\hat{Y}(f(\operatorname{\mathsf{clip}}(y^{(t-1)}))-f(x^{(t-1)}))\right\|+\left\|(\hat{Y}-X)f(x^{(t-1)})\right\| \notag\\
&\leqslant\left\|\hat{Y}\right\|_{\operatorname{\mathsf{op}}}\left\|f(\operatorname{\mathsf{clip}}(y^{(t-1)}))-f(x^{(t-1)})\right\|+\left\|(\hat{Y}-X)f(x^{(t-1)})\right\| \tag{1}
\end{align}

The spectral cleaning ensures that $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}=O(1)$. To further bound the first term in (1), consider the illustrative case of the denoiser $f(x)=x^{2}$. Then for any vectors $a,b$, $f(a)-f(b)=(a+b)\circ(a-b)$, for $\circ$ the entrywise product. Thus we have

\begin{align}
\left\|f(\operatorname{\mathsf{clip}}(y^{(t-1)}))-f(x^{(t-1)})\right\| &=\left\|(\operatorname{\mathsf{clip}}(y^{(t-1)})+x^{(t-1)})\circ(\operatorname{\mathsf{clip}}(y^{(t-1)})-x^{(t-1)})\right\| \notag\\
&\leqslant\left\|\operatorname{\mathsf{clip}}(y^{(t-1)})\right\|_{\infty}\cdot\left\|\operatorname{\mathsf{clip}}(y^{(t-1)})-x^{(t-1)}\right\|+\left\|x^{(t-1)}\circ(\operatorname{\mathsf{clip}}(y^{(t-1)})-x^{(t-1)})\right\| \tag{2}
\end{align}

The first and second terms of (2) are bounded in a similar manner; we begin by explaining the first. Because of the clipping procedure, $\|\operatorname{\mathsf{clip}}(y^{(t-1)})\|_{\infty}=O(\mathrm{poly}\log\frac{1}{\varepsilon})$. Further, by the triangle inequality,

\begin{equation}
\|\operatorname{\mathsf{clip}}(y^{(t-1)})-x^{(t-1)}\|\leqslant\|\operatorname{\mathsf{clip}}(y^{(t-1)})-\operatorname{\mathsf{clip}}(x^{(t-1)})\|+\|\operatorname{\mathsf{clip}}(x^{(t-1)})-x^{(t-1)}\|. \tag{3}
\end{equation}

The first term on the right of (3) can be bounded by $\tilde{O}(\sqrt{\varepsilon})\cdot\|x^{(t-1)}\|$ from the inductive hypothesis, because the $\operatorname{\mathsf{clip}}$ function is $1$ -Lipschitz. The second term in (3) can be bounded by $\tilde{O}(\sqrt{\varepsilon})\cdot\|x^{(t-1)}\|$ , because the distribution of $x^{(t-1)}$ ’s entries is known, and is roughly that of independent polynomials in Gaussian random variables. To bound the second term from (2), we separate the contribution of the entries of $x^{(t-1)}$ which are bounded by $O(\mathrm{poly}\log\frac{1}{\varepsilon})$ , to which we can apply an identical argument, and the entries which exceed this threshold, and then appeal to the fact that these integrate to a small total. A similar argument can be used for arbitrary polynomial $f$ (for Lipschitz $f$ , (1) can be bounded directly and the clipping is not necessary).

To bound the second term in (1), we use the fact that $\hat{Y}-X$ can be written as the sum of a matrix $E$, supported on an $\varepsilon n\times\varepsilon n$ principal minor, and a matrix $F$ which is equal to the restriction of $-X$ to at most $O(\varepsilon n)$ rows/columns; these are precisely the rows/columns of $Y$ which were removed to form $\hat{Y}$, but were not involved in the initial principal minor corruption. So, $\|(\hat{Y}-X)f(x^{(t-1)})\|\leqslant\|Ef(x^{(t-1)})\|+\|Ff(x^{(t-1)})\|$. Since $E$ is supported on $\varepsilon n$ columns,

\|Ef(x^{(t-1)})\|\leqslant\|E\|_{\operatorname{\mathsf{op}}}\cdot\max_{I\subset[n],|I|=\varepsilon n}\Bigl(\sum_{i\in I}f(x^{(t-1)})_{i}^{2}\Bigr)^{1/2}.

Here again, because we know the order statistics of $x^{(t-1)}$, and because $f$ is required to be a well-behaved function, the maximum norm of $f(x^{(t-1)})$ when restricted to a subset of $\varepsilon n$ coordinates is on the order of $\tilde{O}(\sqrt{\varepsilon})\|x^{(t-1)}\|$. Also, since $E$ is a submatrix of $\hat{Y}-X$, $\|E\|_{\operatorname{\mathsf{op}}}\leqslant\|\hat{Y}\|_{\operatorname{\mathsf{op}}}+\|X\|_{\operatorname{\mathsf{op}}}\leqslant 12$.

The matrix $F$ can be split into the part $F_{1}$ supported on $O(\varepsilon n)$ columns, for which the argument is identical to the case of $Ef(x^{(t-1)})$ above. But there is also a part $F_{2}$ supported on $O(\varepsilon n)$ rows. Here, we have to take a different perspective: since $F_{2}$ is a restriction of $-X$ to the rows indexed by some set $T\subset[n]$ with $|T|=\varepsilon n$, we have that $F_{2}f(x^{(t-1)})=(-Xf(x^{(t-1)}))_{T}$, which is an $\varepsilon n$-sparse subset of the vector $-Xf(x^{(t-1)})$. But we understand the order statistics of this vector too! Hence we have that $\|F_{2}f(x^{(t-1)})\|=\tilde{O}(\sqrt{\varepsilon})\|x^{(t-1)}\|$ as desired.

Putting everything together, we have that $\|y^{(t)}-x^{(t)}\|\leqslant\tilde{O}(\sqrt{\varepsilon})\cdot\|x^{(t-1)}\|$ . The argument is now finished by again using our knowledge of the distribution of $x^{(t-1)}$ to conclude that $\|x^{(t-1)}\|$ and $\|x^{(t)}\|$ are within constant scalings of each other.

Much of this analysis mirrors and simplifies the analysis in [IS24]. There, a semidefinite program is used to obtain a pseudoexpectation of a “cleaned” version $\hat{X}$ of $Y$ . The semidefinite program has formal variables for low-degree symmetric polynomials of $\hat{X}$ . It adds constraints to try to enforce that $\|\hat{X}\|_{\operatorname{\mathsf{op}}}=O(1)$ , that $\hat{X}-Y$ be supported on a principal minor (by introducing indicator variables for “clean” rows and columns), as well as the constraint that some symmetric vector-valued polynomials in the entries of $\hat{X}$ have entries which are no larger than corresponding polynomials in $X$ .

The high-level sequence of arguments mirrors those outlined in (1) and the subsequent lines. We introduce some additional structure/arguments because our spectral cleaning step (for which we design a natural-in-hindsight spectral cleaning algorithm) deletes rows and columns. One advantage of the present argument over that in [IS24] is that it is unclear how to make a semidefinite program leverage the order statistics of vector-valued polynomials, so in our prior work we crudely enforce a bound on the infinity norm of the vectors, which gives rise to $\mathrm{poly}\log n$ factors. Here we are able to circumvent this because we clip our iterates by hand.

2 AMP preliminaries

To complete Definition 1.1 from the introduction, we must define the Onsager correction term.

Definition 2.1 (Onsager correction).

The Onsager correction term for the AMP algorithm defined by denoisers $\mathcal{F}=f_{0},f_{1},\ldots$ on input $X$ with iterates $x^{(0)},x^{(1)},\ldots$ is the quantity

\Delta_{t}(x^{(t)},\ldots,x^{(0)})=\sum_{j=1}^{t}B_{t,j}\cdot f_{j-1}(x^{(j-1)},\ldots,x^{(0)}),

where $B_{t,j}=\operatorname*{\mathbf{E}}_{X}[b_{t,j}]$ and $b_{t,j}$ is the normalized divergence of $f_{t}$ with respect to $x^{(j)}$:

b_{t,j}=\frac{1}{n}\sum_{i=1}^{n}\left.\frac{\partial f_{t}(x_{i}^{(t)},\ldots,u_{i}^{(j)},\ldots,x_{i}^{(0)})}{\partial u_{i}^{(j)}}\right|_{u^{(j)}\rightarrow x^{(j)}}.

We remark that the Onsager correction is usually defined with the function $b_{t,j}$ in place of the constant $B_{t,j}$ (and in fact, generally one would estimate $B_{t,j}$ from data by computing $b_{t,j}$ ). For technical reasons it is easier for us to work with $B_{t,j}$ . As was previously noted in the literature [FVRS22, Remark 2.4], when the denoisers are well-behaved this is effectively without loss of generality because the iterates produced by using $b_{t,j}$ vs. $B_{t,j}$ are $o(1)$ -close; we discuss this further in Appendix A.

Definition 2.2 (Pseudo-Lipschitz Functions).

A function $\varphi:\mathbb{R}^{t}\rightarrow\mathbb{R}$ is called Pseudo Lipschitz of order $k$ (or $\operatorname{PL}(k)$) if there exists a constant $L>0$ such that

|\varphi(x)-\varphi(y)|\leqslant L\left(1+\|x\|_{2}^{k-1}+\|y\|_{2}^{k-1}\right)\|x-y\|_{2}

for all $x,y\in\mathbb{R}^{t}$ .

Note that a function is Lipschitz exactly when it is $\operatorname{PL}(1)$ , and a polynomial of degree $k$ lies in $\operatorname{PL}(k)$ . By a slight abuse of notation, we will say that constants lie in $\operatorname{PL}(0)$ .
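As a quick worked instance of the definition (an illustration, not part of the original argument), the scalar function $\varphi(x)=x^{2}$ with $t=1$ is $\operatorname{PL}(2)$ with constant $L=1$:

```latex
|x^{2}-y^{2}| = |x+y|\,|x-y| \leqslant \left(|x|+|y|\right)|x-y| \leqslant \left(1+|x|^{2-1}+|y|^{2-1}\right)|x-y|.
```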

We will need information about the order statistics of the entries of our iterates, $x^{(t)}$ . When we run AMP with polynomial denoiser functions, each iterate $x^{(t)}$ is a symmetric (fixed by coordinate relabeling), vector-valued polynomial in the entries of $X$ . So each entry is a bounded-degree polynomial of independent subgaussian random variables.

While the entries of $x^{(t)}$ are not independent, they are sufficiently close to independent that for simple functions $g:\mathbb{R}\to\mathbb{R}$ , the average $\frac{1}{n}\sum_{i=1}^{n}g(x^{(t)}_{i})$ concentrates fairly well around the expectation of $g$ on a polynomial of Gaussians. The same is true when the denoiser functions are Lipschitz. This fact is known as “state evolution” in the AMP literature. In the next corollary, we state a useful consequence that will allow us to control the order statistics of our iterates.

Corollary 2.3.

Suppose that $f:\mathbb{R}^{t+1}\rightarrow\mathbb{R}$ is $\operatorname{PL}(k)$ for $k\geqslant 0$ and $g:\mathbb{R}^{t+1}\rightarrow\mathbb{R}$ is $\operatorname{PL}(\ell)$ with $g(\vec{0})=0$ and $\ell\geqslant 1$ . Suppose $\vec{x}=x^{(t)}$ is an AMP iterate resulting from the application of Pseudo Lipschitz denoisers on input $X$ a symmetric matrix with i.i.d. $\frac{O(1)}{\sqrt{n}}$ -subgaussian entries having mean $0$ and variance $\frac{1}{n}$ . Furthermore, let $C>0$ be a constant (possibly depending on $t$ ). Then, the following hold:

• For any $r\gg\max(t,k)$,

\operatorname*{p-lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[g(\vec{x}_{i})^{2}>\theta]\leqslant\frac{1}{\theta^{r}}\cdot C^{2r}\cdot(3\ell r)^{\ell r}.

• For every $I\subseteq[n]$ with $|I|\leqslant\varepsilon n$,

\operatorname*{p-lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i\in I}f(\vec{x}_{i})^{2}\leqslant C\varepsilon\log^{k}\frac{1}{\varepsilon}.

We prove this corollary in Appendix A.
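The second bound can be illustrated numerically in the simplest case $f(x)=x$ (so $k=1$), using an i.i.d. standard Gaussian vector as a stand-in for an AMP iterate. This is only an assumption-laden illustration: real iterates are not exactly independent Gaussians, and the function name `top_fraction_energy` is hypothetical. The maximum over index sets $I$ with $|I|\leqslant\varepsilon n$ is attained by the $\varepsilon n$ largest squared entries.

```python
import numpy as np

def top_fraction_energy(x, eps):
    """(1/n) * max over |I| <= eps*n of sum_{i in I} x_i^2, attained by the largest entries."""
    n = x.size
    m = int(eps * n)
    largest = np.sort(x**2)[n - m:]        # the m largest squared entries
    return largest.sum() / n

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)           # Gaussian stand-in for an iterate
results = {eps: (top_fraction_energy(x, eps), eps * np.log(1 / eps))
           for eps in (0.1, 0.01, 0.001)}  # (empirical value, eps * log(1/eps))
```

For Gaussian entries the two numbers in each pair should differ by only a small constant factor, matching the $C\varepsilon\log\frac{1}{\varepsilon}$ scaling.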

Sometimes we will use the phrase “Almost-Triangle Inequality” to refer to the inequality $(a+b)^{2}\leqslant 2a^{2}+2b^{2}$ .
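For completeness, the inequality follows from expanding the square:

```latex
2a^{2}+2b^{2}-(a+b)^{2} = a^{2}-2ab+b^{2} = (a-b)^{2} \geqslant 0.
```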

3 Making AMP robust to principal minor corruptions

In this section, we prove our main theorem.

Theorem 3.1 (Main Theorem).

Let $\mathcal{F}$ be an AMP iteration consisting of either Lipschitz or polynomial denoiser functions. Suppose that $X$ is a symmetric matrix with i.i.d. entries of mean $0$, variance $\frac{1}{n}$, and subgaussian parameter $\frac{O(1)}{\sqrt{n}}$. Let $v_{\mathrm{AMP}}(X)$ denote the output of the $T$-step AMP algorithm on input $X$, and set $d$ to be the degree of $v_{\mathrm{AMP}}(X)$ as a polynomial, or $1$ if the denoisers are Lipschitz. (This aligns with the pseudo-Lipschitz degree of $v_{\mathrm{AMP}}(X)$, which functions similarly to the degree as a polynomial.) Then, with probability $1-o(1)$ over the choice of $X$, Algorithm 3.4 run on any $\varepsilon$-principal minor corruption $Y$ of $X$ produces in time $O(\varepsilon n^{3}\log n)$ a vector $\hat{v}(Y)$ which satisfies

\left\|\hat{v}(Y)-v_{\mathrm{AMP}}(X)\right\|_{2}^{2}\leqslant O\left(\varepsilon\log^{d}\frac{1}{\varepsilon}\right)\cdot\|v_{\mathrm{AMP}}(X)\|_{2}^{2}.

Our algorithm consists of a pre-processing step, followed by a “robust” simulation of AMP:

1. In the pre-processing step, we spectrally clean $Y$ by removing rows and columns to produce a matrix $\hat{Y}$ with $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}=O(1)$.
2. Then, we run AMP on $\hat{Y}$, but with the following modification: after each iteration, we clip the iterate (coordinate-wise) to ensure all coordinates have not-too-large an absolute value.

The following definitions will help us to describe our algorithm.

Definition 3.2.

For $\varepsilon>0$, define $\operatorname{\mathsf{cutoff}}(\varepsilon)=\sqrt{C_{T}\log\frac{1}{\varepsilon}}$ for an appropriately large $C_{T}$ depending on $T$, the total number of AMP iterations. (In practice, $C_{T}=16$ is a reasonable value.) The “$\varepsilon$-clip” of $y\in\mathbb{R}$ is now defined to be

\operatorname{\mathsf{clip}}^{\varepsilon}(y)=\begin{cases}y&|y|\leqslant\operatorname{\mathsf{cutoff}}(\varepsilon)\\ \mathsf{sign}(y)\cdot\operatorname{\mathsf{cutoff}}(\varepsilon)&|y|>\operatorname{\mathsf{cutoff}}(\varepsilon)\end{cases}
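Applied coordinate-wise to a vector, this definition is a one-line NumPy operation. The sketch below assumes the practical value $C_{T}=16$ mentioned above; the names `cutoff` and `clip_eps` are illustrative.

```python
import numpy as np

def cutoff(eps, C_T=16.0):
    """cutoff(eps) = sqrt(C_T * log(1/eps))."""
    return np.sqrt(C_T * np.log(1.0 / eps))

def clip_eps(y, eps, C_T=16.0):
    """Coordinate-wise eps-clip: entries with |y_i| > cutoff move to sign(y_i) * cutoff."""
    c = cutoff(eps, C_T)
    return np.clip(y, -c, c)

y = np.array([-20.0, -1.0, 0.0, 3.0, 50.0])
clipped = clip_eps(y, eps=0.02)   # only the first and last entries are changed
```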

Definition 3.3 (Matrix restriction).

Given a matrix $Y\in\mathbb{R}^{n\times n}$ , $\hat{Y}$ is an $\varepsilon$ -restriction if there exists a set $S\subseteq[n]$ with $|S|\leqslant\varepsilon n$ such that zeroing out the rows and columns of $Y$ with indices in $S$ yields $\hat{Y}$ .

Pictorially, this is as follows:

Y=\begin{bmatrix}Y_{S,S}&Y_{S,\overline{S}}\\ Y_{\overline{S},S}&Y_{\overline{S},\overline{S}}\end{bmatrix}\longrightarrow\hat{Y}=\begin{bmatrix}\mathbf{0}_{S,S}&\mathbf{0}_{S,\overline{S}}\\ \mathbf{0}_{\overline{S},S}&Y_{\overline{S},\overline{S}}\end{bmatrix}.
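In code, a restriction is obtained by zeroing the rows and columns indexed by $S$; a minimal sketch (the function name `restrict` is illustrative):

```python
import numpy as np

def restrict(Y, S):
    """Zero out the rows and columns of Y indexed by S (returns a copy)."""
    Y_hat = Y.copy()
    Y_hat[S, :] = 0.0
    Y_hat[:, S] = 0.0
    return Y_hat

Y = np.arange(16, dtype=float).reshape(4, 4)
Y_hat = restrict(Y, S=[0, 2])     # only entries with both indices in {1, 3} survive
```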

Algorithm 3.4 (Robust AMP)

Input: A symmetric $n\times n$ matrix $Y$ .

Operation:

1. Compute a restriction $\hat{Y}$ of $Y$ satisfying $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}\leqslant 5\cdot\operatorname*{\mathbf{E}}[\|X\|_{\operatorname{\mathsf{op}}}]$ using Algorithm 3.7.

2. Set $y^{(0)}=\vec{1}$. For $t=1,\ldots,T$, set $y^{(t)}$ to be the clipped AMP iteration

y^{(t)}=\operatorname{\mathsf{clip}}^{\varepsilon}\left(\hat{Y}f_{t}(y^{(t-1)},\ldots,y^{(0)})-\sum_{j=1}^{t}B_{t,j}\cdot f_{j-1}(y^{(j-1)},\ldots,y^{(0)})\right).

Output: The vector $\hat{v}=y^{(T)}$ .
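The clipped iteration can be sketched as follows for the special case of a single memory-one denoiser $f$ applied at every step. This is a simplified illustration under explicit assumptions, not the general algorithm: the Onsager coefficients `B[t]` are taken as precomputed inputs, the indexing of the correction follows Definition 1.1 (the correction at a step multiplies the denoised iterate from two steps back), $C_{T}=16$, and the name `robust_amp`, the choice $f=\tanh$, and the coefficient values in the usage lines are hypothetical.

```python
import numpy as np

def robust_amp(Y_hat, f, B, T, eps, C_T=16.0):
    """Clipped AMP on a cleaned matrix Y_hat with a memory-one denoiser f.

    B[t] is an (assumed precomputed) Onsager coefficient used when forming y^{(t)}.
    """
    n = Y_hat.shape[0]
    c = np.sqrt(C_T * np.log(1.0 / eps))       # cutoff(eps)
    y_prev, y = None, np.ones(n)               # y^{(0)} = all-ones vector
    for t in range(1, T + 1):
        step = Y_hat @ f(y)
        if y_prev is not None:
            step = step - B[t] * f(y_prev)     # Onsager correction
        y_prev, y = y, np.clip(step, -c, c)    # coordinate-wise eps-clip
    return y

rng = np.random.default_rng(0)
n, T, eps = 100, 5, 0.05
G = rng.standard_normal((n, n))
Y_hat = (G + G.T) / np.sqrt(2 * n)             # stand-in for a cleaned input
B = [0.0] + [0.5] * T                          # placeholder coefficients
v_hat = robust_amp(Y_hat, np.tanh, B, T, eps)
```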

Theorem 3.1 is a consequence of the following two lemmas, one for each step of Algorithm 3.4.

Lemma 3.5 (Efficient spectral cleaning).

Suppose $X$ is a symmetric $n\times n$ matrix with i.i.d. entries of mean zero, variance $\frac{1}{n}$ , and subgaussian parameter $\frac{O(1)}{\sqrt{n}}$ . With probability $1-o(1)$ over $X$ , Algorithm 3.7 run on any $\varepsilon$ -principal minor corruption $Y$ of $X$ with threshold value $K=5\operatorname*{\mathbf{E}}[\|X\|_{\operatorname{\mathsf{op}}}]$ outputs in time $O(\varepsilon n^{3}\log n)$ a matrix $\hat{Y}$ which is a $4\varepsilon$ -restriction of $Y$ and satisfies $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}=O(1)$ .

Lemma 3.6 (Success of AMP on restrictions).

Suppose $X$ is an $n\times n$ matrix with i.i.d. entries of mean zero, variance $\frac{1}{n}$ , and subgaussian parameter $\frac{O(1)}{\sqrt{n}}$ . Suppose that $Y$ is an $\varepsilon$ -principal minor corruption of $X$ and $\hat{Y}$ is a $4\varepsilon$ -restriction of $Y$ with $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}=O(1)$ . Then the clipped AMP iteration from Algorithm 3.4 on $\hat{Y}$ produces a vector $\hat{v}$ such that $\|\hat{v}-v_{\mathrm{AMP}}(X)\|_{2}^{2}\leqslant O(\varepsilon\log^{d}\frac{1}{\varepsilon})\|v_{\mathrm{AMP}}(X)\|_{2}^{2}$ with probability $1-o_{n}(1)$ over the choice of $X$ .

When combined, these two lemmas immediately imply Theorem 3.1.

3.1 Spectral cleaning

The goal of this section is to prove Lemma 3.5. Here we present the algorithm to construct $\hat{Y}$ which is a $4\varepsilon$ -restriction of $Y$ and has $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}=O(1)$ .

Algorithm 3.7 (Spectral cleaning of principal minor corruptions)

Input: A symmetric $n\times n$ matrix $Y$ , and a threshold value $K\geqslant 0$ .

Operation:

1.

Let $\hat{Y}=Y$ .
2.
While $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}>K$ :
  (a)
  
  Let $v$ be the eigenvector of $\hat{Y}$ with eigenvalue of largest magnitude.
  (b)
  
  Sample $i\in[n]$ with probability $v_{i}^{2}$ .
  (c)
  
  Zero out the $i$ -th row and column of $\hat{Y}$ .

Output: Matrix $\hat{Y}$ .
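The following Python sketch mirrors the steps of Algorithm 3.7. Two implementation choices here are assumptions rather than part of the algorithm as stated: it uses a full eigendecomposition (`numpy.linalg.eigh`) in place of power iteration, which is simpler but slower, and it explicitly masks already-removed indices to guard against floating-point round-off. The demo corruption (adding 50 to a $10\times 10$ principal minor) and the threshold `K=10.0` are illustrative values.

```python
import numpy as np


def spectral_clean(Y, K, rng=None):
    """Zero out rows/columns of Y until the operator norm is at most K."""
    rng = np.random.default_rng() if rng is None else rng
    Y_hat = np.array(Y, dtype=float, copy=True)
    removed = []
    while True:
        eigvals, eigvecs = np.linalg.eigh(Y_hat)
        top = int(np.argmax(np.abs(eigvals)))      # eigenvalue of largest magnitude
        if abs(eigvals[top]) <= K:
            break
        p = eigvecs[:, top] ** 2                   # sampling weights v_i^2
        if removed:
            p[removed] = 0.0                       # guard against round-off
        p = p / p.sum()
        i = int(rng.choice(p.size, p=p))
        Y_hat[i, :] = 0.0                          # zero out the i-th row ...
        Y_hat[:, i] = 0.0                          # ... and the i-th column
        removed.append(i)
    return Y_hat, removed


rng = np.random.default_rng(1)
n, m = 200, 10
G = rng.standard_normal((n, n))
X = (G + G.T) / np.sqrt(2 * n)     # symmetric, off-diagonal entries of variance 1/n
Y = X.copy()
Y[:m, :m] += 50.0                  # corrupt an m x m principal minor
Y_hat, removed = spectral_clean(Y, K=10.0, rng=rng)
```

In this example the top eigenvector is concentrated almost entirely on the corrupted block, so the sampled indices are overwhelmingly corrupted ones, in line with Claim 3.8.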

Note that critically we do not require that we exactly recover the corrupted rows and columns: all that matters is that we remove the indices that contribute the most to the spectral corruption.

Proof of Lemma 3.5.

Certainly, the algorithm terminates since no index can be sampled more than once. We will show that with high probability, $O(\varepsilon)n$ indices are removed. The runtime bound can be deduced from noting that we have to run power iteration at most once per index removal, each run taking $O(n^{2}\log n)$ time.

For convenience, let $\alpha=\operatorname*{\mathbf{E}}[\|X\|_{\operatorname{\mathsf{op}}}]$ , and recall our threshold is $K=5\alpha$ . Since $X$ has independent subgaussian entries, with high probability $\|X\|_{\operatorname{\mathsf{op}}}\leqslant\alpha+o(1)$ . Let $Q$ denote the set of corrupted indices in $Y$ . Furthermore, let $Y^{(0)}=Y,Y^{(1)},\ldots,Y^{(t)}$ denote the matrix $\hat{Y}$ after each iteration of the while loop. Similarly define $E^{(t)}$ and $Q^{(t)}$ (the non-zeroed out corrupted indices).

Note that if all indices in $Q$ are removed, the while loop will terminate (it can terminate in other instances, but this is just one stopping condition). We show that with high probability we will reach $\|Y^{(t)}\|_{\operatorname{\mathsf{op}}}\leqslant 5\alpha$ within $4\varepsilon n$ iterations (and thus $\hat{Y}$ is a $4\varepsilon$ -restriction of $Y$ ) using a win-win analysis: either we reach a small operator norm before removing all of $Q$ or we remove all of $Q$ (which implies the remaining matrix has norm $\leqslant\alpha+o(1)$ , because it is a principal minor of $X$ ). In particular, the crux of the argument is the following:

Claim 3.8

Let $v$ be the top eigenvector of $Y^{(t)}$ and suppose $\|Y^{(t)}\|_{\operatorname{\mathsf{op}}}>5\alpha$ . Then with high probability over $X$ ,

\sum_{i\in Q^{(t)}}v_{i}^{2}\geqslant\frac{1}{2}-o(1).

Note that this claim is equivalent to saying that at each iteration of the while loop there is at least a $\frac{1}{2}$ probability of removing some index from $Q$ .

Proof of Claim.

Suppose that $v^{\top}Y^{(t)}v>5\alpha$ (as opposed to $v^{\top}Y^{(t)}v<-5\alpha$ ). Let $\tilde{v}$ be $v$ such that all indices outside of $Q^{(t)}$ are set to zero. Our goal is to lower bound $\|\tilde{v}\|_{2}^{2}$ . Notice $v^{\top}E^{(t)}v=\tilde{v}^{\top}E^{(t)}\tilde{v}$ by definition. Since $v$ is the top eigenvector of $Y^{(t)}$ ,

v^{\top}E^{(t)}v=v^{\top}(Y^{(t)}-X^{(t)})v\geqslant\|Y^{(t)}\|_{\operatorname{\mathsf{op}}}-\|X^{(t)}\|_{\operatorname{\mathsf{op}}}\geqslant(\|E^{(t)}\|_{\operatorname{\mathsf{op}}}-\|X^{(t)}\|_{\operatorname{\mathsf{op}}})-\|X^{(t)}\|_{\operatorname{\mathsf{op}}}\geqslant\|E^{(t)}\|_{\operatorname{\mathsf{op}}}-2\alpha-o(1)

where in the second to last step we used that $Y^{(t)}=X^{(t)}+E^{(t)}$ and in the last step we used that, since $X^{(t)}$ is a principal minor of $X$ , $\|X^{(t)}\|_{\operatorname{\mathsf{op}}}\leqslant\|X\|_{\operatorname{\mathsf{op}}}\leqslant\alpha+o(1)$ w.h.p.

However, note that $v^{\top}E^{(t)}v\leqslant\|E^{(t)}\|_{\operatorname{\mathsf{op}}}\cdot\|\tilde{v}\|_{2}^{2}$ , which implies that $\|\tilde{v}\|_{2}^{2}\geqslant 1-\frac{2\alpha+o(1)}{\|E^{(t)}\|_{\operatorname{\mathsf{op}}}}$ . Since $\|E^{(t)}\|_{\operatorname{\mathsf{op}}}\geqslant\|Y^{(t)}\|_{\operatorname{\mathsf{op}}}-\|X^{(t)}\|_{\operatorname{\mathsf{op}}}\geqslant 5\alpha-(\alpha+o(1))$ by assumption, this implies that $\|\tilde{v}\|_{2}^{2}\geqslant\frac{1}{2}-o(1)$ . The proof of the case $v^{\top}Y^{(t)}v<-5\alpha$ is identical up to a change of sign. ∎

To prove that our loop terminates in $4\varepsilon n$ steps with high probability, define the stopping time $\tau=\min\{t\geqslant 0:\|Y^{(t)}\|_{\operatorname{\mathsf{op}}}\leqslant 5\alpha\}$ . Now, let $I_{t}$ denote the indicator of whether the index removed between $Y^{(t)}$ and $Y^{(t+1)}$ was in $Q$ , and note that by Claim 3.8 each $I_{t}$ , conditioned on the earlier iterations, stochastically dominates an independent $B_{t}\sim\mathsf{Ber}(\frac{1}{2}-o(1))$ . Suppose that $\tau\geqslant 4\varepsilon n$ . Since at most $|Q|\leqslant\varepsilon n$ corrupted indices can ever be removed, it follows that

\sum_{j=1}^{4\varepsilon n}B_{j-1}\leqslant\varepsilon n,

which happens with exponentially small probability (this is equivalent to asking for the probability that $\mathsf{Binomial}(4\varepsilon n,\frac{1}{2}-o(1))\leqslant\varepsilon n$ ). Together, this implies that $\tau\leqslant 4\varepsilon n$ with high probability.∎
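To get a feel for how small this probability is, the sketch below computes the binomial tail exactly for illustrative values $n=1000$ and $\varepsilon=0.05$ , and compares it with the Hoeffding bound. It takes the success probability to be exactly $\frac{1}{2}$ , ignoring the $o(1)$ correction; the function names are hypothetical.

```python
from math import comb, exp


def binomial_lower_tail(m: int, p: float, k: int) -> float:
    """Exact P[Binomial(m, p) <= k]."""
    return sum(comb(m, j) * p**j * (1.0 - p) ** (m - j) for j in range(k + 1))


def hoeffding_bound(m: int, p: float, k: int) -> float:
    """Hoeffding upper bound on the same tail, valid when k <= m * p."""
    return exp(-2.0 * (m * p - k) ** 2 / m)


n, eps = 1000, 0.05
m, k = round(4 * eps * n), round(eps * n)   # 200 trials, threshold 50
tail = binomial_lower_tail(m, 0.5, k)
bound = hoeffding_bound(m, 0.5, k)          # equals exp(-25)
```

In general the Hoeffding bound here is $\exp(-2(2\varepsilon n-\varepsilon n)^{2}/(4\varepsilon n))=\exp(-\varepsilon n/2)$ , which is exponentially small in $\varepsilon n$ .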

3.2 Analysis of clipped AMP on spectrally cleaned input

In this section we will prove Lemma 3.6. To begin, we examine the effect of a combination of principal minor and restriction corruptions. Suppose $Y$ is an $\varepsilon$ -principal minor corruption of $X$ , and suppose $\hat{Y}$ is a $4\varepsilon$ -restriction of $Y$ . Let $S$ denote the set of rows in the support of $Y-X$ , and let $T$ denote the set of rows in the support of $Y-\hat{Y}$ . For simplicity, let $S^{\prime}=S\setminus T$ (the set of corrupted rows which are not removed by the restriction). Then, the matrix evolves as follows:

X=\begin{bmatrix}X_{S,S}&X_{S,\overline{S}}\\ X_{\overline{S},S}&X_{\overline{S},\overline{S}}\end{bmatrix}\longrightarrow Y=\begin{bmatrix}X_{S,S}+E_{S,S}&X_{S,\overline{S}}\\ X_{\overline{S},S}&X_{\overline{S},\overline{S}}\end{bmatrix}\longrightarrow\hat{Y}=\begin{bmatrix}\mathbf{0}_{T,T}&\mathbf{0}_{T,S^{\prime}}&\mathbf{0}_{T,\overline{T\cup S}}\\ \mathbf{0}_{S^{\prime},T}&X_{S^{\prime},S^{\prime}}+E_{S^{\prime},S^{\prime}}&X_{S^{\prime},\overline{T\cup S}}\\ \mathbf{0}_{\overline{T\cup S},T}&X_{\overline{T\cup S},S^{\prime}}&X_{\overline{T\cup S},\overline{T\cup S}}\end{bmatrix}.

In particular, if we let $E$ be the portion of the error matrix $Y-X$ which survives the restriction, and then let $F$ be the remainder in $\hat{Y}=X+E+F$ , it follows that $F_{i,j}$ is either $-X_{i,j}$ or $0$ . Furthermore, we will split $F$ into two sections: $F_{1}\in\mathbb{R}^{|T|\times n}$ consisting of all entries in rows indexed by $T$ (in other words, $F_{1}=-X_{T,[n]}$ ), and $F_{2}\in\mathbb{R}^{(n-|T|)\times|T|}$ consisting of all entries in columns indexed by $T$ , except those covered by $F_{1}$ (in other words, $F_{2}=-X_{\overline{T},T})$ . Pictorially, this can be represented via

\hat{Y}-X=\begin{bmatrix}(-X)_{T,T}&(-X)_{T,S^{\prime}}&(-X)_{T,\overline{T\cup S}}\\ (-X)_{S^{\prime},T}&E_{S^{\prime},S^{\prime}}&\mathbf{0}_{S^{\prime},\overline{T\cup S}}\\ (-X)_{\overline{T\cup S},T}&\mathbf{0}_{\overline{T\cup S},S^{\prime}}&\mathbf{0}_{\overline{T\cup S},\overline{T\cup S}}\end{bmatrix}=\begin{bmatrix}F_{1}&F_{1}&F_{1}\\ F_{2}&E_{S^{\prime},S^{\prime}}&\mathbf{0}_{S^{\prime},\overline{T\cup S}}\\ F_{2}&\mathbf{0}_{\overline{T\cup S},S^{\prime}}&\mathbf{0}_{\overline{T\cup S},\overline{T\cup S}}\end{bmatrix}.

As a warm-up, we show that each of these quantities is bounded in operator norm.

Proposition 3.9.

Suppose that $\|\hat{Y}\|_{\operatorname{\mathsf{op}}}\leqslant 5\operatorname*{\mathbf{E}}[% \|X\|_{\operatorname{\mathsf{op}}}]=:5\alpha$ . For the above definitions of $E,F_{1},F_{2}$ , we have that

\|E\|_{\operatorname{\mathsf{op}}}\leqslant 6\alpha\qquad\mathrm{and}\qquad\|F% _{1}\|_{\operatorname{\mathsf{op}}},\|F_{2}\|_{\operatorname{\mathsf{op}}}% \leqslant 2\alpha

with high probability.

Proof.

We begin with $F_{1}$ , each entry $F_{ij}$ of which is an independent subgaussian random variable. Applying standard matrix concentration arguments (e.g. Theorem 4.6.1 in [Ver18]), we have with high probability that $\|F_{1}\|_{\operatorname{\mathsf{op}}}\leqslant\alpha(1+O(\sqrt{|T|/n}))\leqslant 2\alpha$ for $\varepsilon$ sufficiently small. We can apply a similar argument to see that $\|F_{2}\|_{\operatorname{\mathsf{op}}}\leqslant 2\alpha$ as well.

Now, consider $\hat{Y}_{\overline{T},\overline{T}}=X_{\overline{T},\overline{T}}+E$ . We then have that

\left\|E\right\|_{\operatorname{\mathsf{op}}}\leqslant\left\|\hat{Y}_{% \overline{T},\overline{T}}\right\|_{\operatorname{\mathsf{op}}}+\left\|X_{% \overline{T},\overline{T}}\right\|_{\operatorname{\mathsf{op}}}\leqslant\|\hat% {Y}\|_{\operatorname{\mathsf{op}}}+\|X\|_{\operatorname{\mathsf{op}}}\leqslant 6\alpha

as the operator norm of a principal minor is at most that of the original matrix. ∎
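The monotonicity facts used in this proof are easy to check numerically. The sketch below draws a random symmetric matrix with entries of variance $\frac{1}{n}$ (Gaussian entries are an assumption made for the demo) and a hypothetical removed index set `T_idx`, then compares the operator norms of the principal minor and of the blocks $F_{1},F_{2}$ against that of $X$ .

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
G = rng.standard_normal((n, n))
X = (G + G.T) / np.sqrt(2 * n)            # symmetric, off-diagonal variance 1/n


def op_norm(A: np.ndarray) -> float:
    """Operator (spectral) norm, i.e. the largest singular value."""
    return float(np.linalg.norm(A, 2))


T_idx = rng.choice(n, size=30, replace=False)   # hypothetical removed index set T
keep = np.setdiff1d(np.arange(n), T_idx)        # complement of T

minor = X[np.ix_(keep, keep)]   # principal minor of X on the complement of T
F1 = -X[T_idx, :]               # rows of -X indexed by T
F2 = -X[np.ix_(keep, T_idx)]    # columns of -X indexed by T, outside the rows in T

norms = {name: op_norm(A) for name, A in
         [("X", X), ("minor", minor), ("F1", F1), ("F2", F2)]}
```

Each of `minor`, `F1`, and `F2` is a submatrix of $\pm X$ , so its operator norm can never exceed that of $X$ .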

With this in mind, we are ready to prove Lemma 3.6, which we reprint here for clarity.

Lemma (Restatement of Lemma 3.6).

The proof follows from a few central claims. The first of these shows that clipping cannot substantially change how far we are from the true AMP iteration.

Proposition 3.10 (Clipping preserves error).

Define $\widetilde{y}^{(t)}$ to be the unclipped version of $y^{(t)}$ (that is, the inner expression passed to $\operatorname{\mathsf{clip}}^{\varepsilon}(\cdot)$ ). Then,

\|y^{(t)}-x^{(t)}\|_{2}\leqslant\|\widetilde{y}^{(t)}-x^{(t)}\|_{2}+\sqrt{% \varepsilon n}

with probability $1-o_{n}(1)$ .

The next proposition aims to show that even though $E,F_{1}$ , and $F_{2}$ have constant operator norms, their row (or column) sparsity allow for controlling their effect on AMP iterates. Here we also introduce the shorthand $f_{t}(x)\triangleq f_{t}(x^{(t-1)},x^{(t-2)},\ldots,x^{(0)})$ .

Proposition 3.11 (Block-sparse corruptions have small error).

Suppose that $f_{t}\in\operatorname{PL}(d_{t})$ and define $\overline{d}_{t}=\max_{j\leqslant t}d_{j}$ . There exists a constant $C>0$ (independent of $n$ and $\varepsilon$ but possibly dependent on $t$ ) such that each of $\|Ef_{t}(x)\|_{2}^{2},\|F_{1}f_{t}(x)\|_{2}^{2}$ , and $\|F_{2}f_{t}(x)\|_{2}^{2}$ are bounded by $C\varepsilon n\cdot\log^{\overline{d}_{t}}\frac{1}{\varepsilon}$ with probability $1-o_{n}(1)$ .

The final proposition aims to show that applying polynomials to clipped AMP iterates cannot dramatically change closeness. Note that this is not true in general and requires both boundedness and state evolution to hold in our case.

Proposition 3.12 (Pseudo-Lipschitz functions preserve closeness of AMP iterates).

Suppose that $f_{t}\in\operatorname{PL}(d_{t})$ , and let $M=\max_{0\leqslant i<t}\|y^{(i)}-x^{(i)}\|_{2}^{2}$ . Then, there exists a constant $C_{T}>0$ (independent of $n$ and $\varepsilon$ but dependent on $T$ ) such that

\left\|f_{t}(y)-f_{t}(x)\right\|_{2}^{2}\leqslant M(C_{T}t)^{d_{t}}\cdot\log^{% d_{t}-1}\left(\tfrac{1}{\varepsilon}\right)+t\cdot\varepsilon n

with probability $1-o_{n}(1)$ .

Together, these three propositions allow us to prove the lemma.

Proof of Lemma 3.6.

We prove by induction on the iteration $t$ . Certainly, $y^{(0)}=x^{(0)}=\vec{1}$ so the base case is complete. Else, suppose we have proven the statement for all $k<t$ . We prove for $t$ .

By Proposition 3.10, we have that $\|y^{(t)}-x^{(t)}\|_{2}\leqslant\|\widetilde{y}^{(t)}-x^{(t)}\|_{2}+\sqrt{\varepsilon n}$ , so let us handle this first term. To decrease verbiage, let $\mathsf{MAX}=\max\limits_{0\leqslant j\leqslant t}\|f_{j}(y)-f_{j}(x)\|_{2}$ . By the triangle inequality and the definition of the AMP iteration, we have that

\begin{aligned}
\|\widetilde{y}^{(t)}-x^{(t)}\|_{2}&=\left\|\hat{Y}f_{t}(y)-Xf_{t}(x)-\sum_{j=1}^{t}B_{t,j}\left(f_{j-1}(y)-f_{j-1}(x)\right)\right\|_{2}\\
&\leqslant\left\|\hat{Y}(f_{t}(y)-f_{t}(x)+f_{t}(x))-Xf_{t}(x)\right\|_{2}+\sum_{j=1}^{t}\|B_{t,j}\|\left\|f_{j-1}(y)-f_{j-1}(x)\right\|_{2}\\
&\leqslant\left\|\hat{Y}(f_{t}(y)-f_{t}(x))\right\|_{2}+\left\|(\hat{Y}-X)f_{t}(x)\right\|_{2}+\mathsf{MAX}\cdot\sum_{j=1}^{t}\|B_{t,j}\|\\
&\leqslant\left\|\hat{Y}\right\|_{\operatorname{\mathsf{op}}}\left\|f_{t}(y)-f_{t}(x)\right\|_{2}+\left\|(\hat{Y}-X)f_{t}(x)\right\|_{2}+\mathsf{MAX}\cdot\sum_{j=1}^{t}\|B_{t,j}\|\\
&\leqslant\mathsf{MAX}\cdot\left(10+\sum_{j=1}^{t}\|B_{t,j}\|\right)+\|Ef_{t}(x)\|_{2}+\|F_{1}f_{t}(x)\|_{2}+\|F_{2}f_{t}(x)\|_{2}\\
&\leqslant C\cdot\sqrt{M(C_{T}t)^{d_{t}}\cdot\log^{d_{t}-1}\left(\tfrac{1}{\varepsilon}\right)+t\cdot\varepsilon n}+3\sqrt{C\varepsilon n\cdot\log^{\overline{d}_{t}}\left(\tfrac{1}{\varepsilon}\right)}
\end{aligned}

where for the last inequality we applied Proposition 3.11 and Proposition 3.12. Now, the Almost-Triangle Inequality and combining with Proposition 3.10 implies that

\begin{aligned}
\|y^{(t)}-x^{(t)}\|_{2}^{2}&\leqslant 2C^{2}\left(M(C_{T}t)^{d_{t}}\cdot\log^{d_{t}-1}\left(\tfrac{1}{\varepsilon}\right)+t\cdot\varepsilon n\right)+36C\varepsilon n\cdot\log^{\overline{d}_{t}}\left(\tfrac{1}{\varepsilon}\right)+2\varepsilon n\\
&\leqslant M(Ct)^{d_{t}}\cdot\log^{d_{t}-1}\left(\tfrac{1}{\varepsilon}\right)+C\cdot\varepsilon n\log^{\overline{d}_{t}}\left(\tfrac{1}{\varepsilon}\right),
\end{aligned}

where the last line holds after enlarging the constant $C$ .

If the AMP iteration consists of Lipschitz denoisers, it follows that $d_{t}=1$ for all $t$ and thus $\|y^{(t)}-x^{(t)}\|_{2}^{2}\leqslant(Ct)^{t}\cdot\varepsilon n\log\frac{1}{\varepsilon}$ . Else, notice that the power of $\log\frac{1}{\varepsilon}$ can be at most $t\overline{d}_{t}$ , which completes the proof.

∎

We finish this section by proving the three propositions.

Proof of Proposition 3.10.

Note that $\operatorname{\mathsf{clip}}^{\varepsilon}$ is a $1$ -Lipschitz function. So, by the triangle inequality,

\begin{aligned}
\|y^{(t)}-x^{(t)}\|_{2}=\|\operatorname{\mathsf{clip}}^{\varepsilon}(\widetilde{y}^{(t)})-x^{(t)}\|_{2}&\leqslant\|\operatorname{\mathsf{clip}}^{\varepsilon}(\widetilde{y}^{(t)})-\operatorname{\mathsf{clip}}^{\varepsilon}(x^{(t)})\|_{2}+\|\operatorname{\mathsf{clip}}^{\varepsilon}(x^{(t)})-x^{(t)}\|_{2}\\
&\leqslant\|\widetilde{y}^{(t)}-x^{(t)}\|_{2}+\|\operatorname{\mathsf{clip}}^{\varepsilon}(x^{(t)})-x^{(t)}\|_{2}\\
&\leqslant\|\widetilde{y}^{(t)}-x^{(t)}\|_{2}+\left(\sum_{i=1}^{n}(x^{(t)}_{i})^{2}\cdot\bm{1}\left[(x^{(t)}_{i})^{2}>C_{T}\log\frac{1}{\varepsilon}\right]\right)^{1/2}
\end{aligned}

so it remains to bound this second quantity. We may apply Corollary 2.3 with $f(\vec{x}_{i})=g(\vec{x}_{i})=x_{i}^{(t)}$ (which is Lipschitz) and $\theta=C_{T}\log\frac{1}{\varepsilon}$ , which implies that

\frac{1}{n}\sum_{i=1}^{n}(x^{(t)}_{i})^{2}\cdot\bm{1}\left[(x^{(t)}_{i})^{2}>C% _{T}\log\frac{1}{\varepsilon}\right]\leqslant\frac{1}{(C_{T}\log\frac{1}{% \varepsilon})^{r}}\cdot(C^{\prime})^{2r}\cdot(3r)^{r}=\left(\frac{3(C^{\prime}% )^{2}\cdot r}{C_{T}\log\frac{1}{\varepsilon}}\right)^{r}.

By choosing $C_{T}\geqslant 3e(C^{\prime})^{2}$ and taking $r=\log\frac{1}{\varepsilon}$ , it follows that

\frac{1}{n}\sum_{i=1}^{n}(x^{(t)}_{i})^{2}\cdot\bm{1}\left[(x^{(t)}_{i})^{2}>C% _{T}\log\frac{1}{\varepsilon}\right]\leqslant\varepsilon

from where the conclusion follows. ∎
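To spell out the arithmetic behind the last step: with $C_{T}\geqslant 3e(C^{\prime})^{2}$ and $r=\log\frac{1}{\varepsilon}$ , the factors of $r$ and $\log\frac{1}{\varepsilon}$ cancel and the base of the power is at most $\frac{1}{e}$ , so that

```latex
\left(\frac{3(C^{\prime})^{2}\cdot r}{C_{T}\log\frac{1}{\varepsilon}}\right)^{r}
=\left(\frac{3(C^{\prime})^{2}}{C_{T}}\right)^{\log\frac{1}{\varepsilon}}
\leqslant\left(\frac{1}{e}\right)^{\log\frac{1}{\varepsilon}}
=\varepsilon .
```

Multiplying by $n$ and taking a square root gives the additive $\sqrt{\varepsilon n}$ term in the statement of Proposition 3.10.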

Proof of Proposition 3.11.

Let $S^{\prime}$ be the indices in the support of $E$ , and let $T$ be the set of indices in the row-support of $F_{1}$ (and column-support of $F_{2}$ ), as in the figure at the beginning of the section. For a given vector $v$ , we will define $v_{S^{\prime}}$ to be the restriction of $v$ to $S^{\prime}$ . Then, note that

\|Ef_{t}(x)\|_{2}^{2}=\left\|E(f_{t}(x))_{S^{\prime}}\right\|_{2}^{2}\leqslant\|E\|_{\operatorname{\mathsf{op}}}^{2}\cdot\left\|f_{t}(x)_{S^{\prime}}\right\|_{2}^{2}\leqslant 36\alpha^{2}\left\|f_{t}(x)_{S^{\prime}}\right\|_{2}^{2},

and similarly $\|F_{2}f_{t}(x)\|_{2}^{2}\leqslant\|F_{2}\|_{\operatorname{\mathsf{op}}}^{2}\left\|f_{t}(x)_{T}\right\|_{2}^{2}\leqslant 4\alpha^{2}\left\|f_{t}(x)_{T}\right\|_{2}^{2}$ , where both operator norm bounds come from Proposition 3.9. To handle each of these, notice that $|S^{\prime}|,|T|\leqslant 4\varepsilon n$ . Therefore, we may apply Corollary 2.3 to deduce that

\frac{1}{n}\left\|f_{t}(x)_{S^{\prime}}\right\|_{2}^{2}\leqslant C\varepsilon% \log^{d_{t}}\frac{1}{\varepsilon}

and similarly for $f_{t}(x)_{T}$ . This implies the boundedness of $\|Ef_{t}(x)\|_{2}^{2}$ and $\|F_{2}f_{t}(x)\|_{2}^{2}$ .

We cannot use the same argument for $\|F_{1}f_{t}(x)\|_{2}^{2}$ because $F_{1}$ is supported on all columns. Instead, let us recall that $F_{1}=-X$ on its supported rows, so $F_{1}f_{t}(x)=(-Xf_{t}(x))_{T}$ and we are trying to bound $\|F_{1}f_{t}(x)\|_{2}^{2}=\left\|(-Xf_{t}(x))_{T}\right\|_{2}^{2}$ . Using the definition of the AMP iteration, we can rewrite

x^{(t)}=Xf_{t}(x)-\sum_{j=1}^{t}B_{t,j}f_{j-1}(x)\implies-Xf_{t}(x)=-x^{(t)}-% \sum_{j=1}^{t}B_{t,j}f_{j-1}(x).

Therefore, $-Xf_{t}(x)\in\operatorname{PL}(\max_{j<t}d_{j})$ and is a function of the iterates $x^{(0)},x^{(1)},\ldots,x^{(t)}$ . Once more applying Corollary 2.3, it follows that

\frac{1}{n}\|F_{1}f_{t}(x)\|_{2}^{2}=\frac{1}{n}\left\|(-Xf_{t}(x))_{T}\right% \|_{2}^{2}\leqslant C\varepsilon\log^{\max_{j<t}d_{j}}\frac{1}{\varepsilon}

and we are done. ∎

Proof of Proposition 3.12.

We begin by applying the definition of $\operatorname{PL}(d_{t})$ . In particular, combined with the Almost-Triangle Inequality we find that

\begin{aligned}
\|f_{t}(y)-f_{t}(x)\|_{2}^{2}&=\sum_{i=1}^{n}(f_{t}(y_{i})-f_{t}(x_{i}))^{2}\\
&\leqslant L^{2}\sum_{i=1}^{n}(1+\|y_{i}\|^{d_{t}-1}+\|x_{i}\|^{d_{t}-1})^{2}\cdot\|y_{i}-x_{i}\|^{2}\\
&\leqslant 3L^{2}\sum_{i=1}^{n}\left(1+\|y_{i}\|^{2(d_{t}-1)}+\|x_{i}\|^{2(d_{t}-1)}\right)\cdot\sum_{j=0}^{t-1}(y^{(j)}_{i}-x^{(j)}_{i})^{2}\\
&\leqslant 3L^{2}\left(1+\max_{i}\|y_{i}\|^{2(d_{t}-1)}\right)\sum_{j=0}^{t-1}\left\|y^{(j)}-x^{(j)}\right\|_{2}^{2}+\sum_{j=0}^{t-1}\sum_{i=1}^{n}\|x_{i}\|^{2(d_{t}-1)}\cdot(y^{(j)}_{i}-x^{(j)}_{i})^{2}\\
&\leqslant M\cdot 6tL^{2}(C_{T}\cdot t\log\tfrac{1}{\varepsilon})^{d_{t}-1}+\sum_{j=0}^{t-1}\sum_{i=1}^{n}\|x_{i}\|^{2(d_{t}-1)}\cdot(y^{(j)}_{i}-x^{(j)}_{i})^{2}.
\end{aligned}\tag{4}

The last inequality holds because for each $i$ , $\|y_{i}\|^{2(d_{t}-1)}=\left(\sum_{j=0}^{t-1}(y^{(j)}_{i})^{2}\right)^{d_{t}-1% }\leqslant(C_{T}\cdot t\log\frac{1}{\varepsilon})^{d_{t}-1}$ . Therefore, it remains to handle the last sum.

We claim that

\|x_{i}\|^{2(d_{t}-1)}\cdot(y^{(j)}_{i}-x^{(j)}_{i})^{2}\leqslant\Biggl{[}(C_{% T}\cdot t\log\tfrac{1}{\varepsilon})^{d_{t}-1}(y^{(j)}_{i}-x^{(j)}_{i})^{2}% \Biggr{]}+\Biggl{[}\|x_{i}\|^{2(d_{t}-1)}\cdot(|x_{i}^{(j)}|+2\sqrt{C_{T}\log% \tfrac{1}{\varepsilon}})^{2}\cdot\bm{1}\left[\|x_{i}\|^{2}>C_{T}\cdot t\log% \tfrac{1}{\varepsilon}\right]\Biggr{]}.

Indeed,

•

If $\|x_{i}\|^{2}\leqslant C_{T}\cdot t\log\frac{1}{\varepsilon}$ , then certainly the left side is bounded by the first term.
•

Else, note that $(y^{(j)}_{i}-x^{(j)}_{i})^{2}\leqslant(|x_{i}^{(j)}|+2\sqrt{C_{T}\log\frac{1}{% \varepsilon}})^{2}$ , where the absolute value protects against opposite signs. Therefore, in this latter case we have that the left side is bounded by the second term.

Summing over $i\in[n]$ , it follows that

\sum_{i=1}^{n}\|x_{i}\|^{2(d_{t}-1)}\cdot(y^{(j)}_{i}-x^{(j)}_{i})^{2}% \leqslant M\cdot(C_{T}\cdot t\log\frac{1}{\varepsilon})^{d_{t}-1}+\sum_{i=1}^{% n}\|x_{i}\|^{2(d_{t}-1)}\cdot(|x_{i}^{(j)}|+2\sqrt{C_{T}\log\tfrac{1}{% \varepsilon}})^{2}\cdot\bm{1}\left[\|x_{i}\|^{2}>C_{T}\cdot t\log\tfrac{1}{% \varepsilon}\right]

Applying Corollary 2.3 to this second term with $f(\vec{x}_{i})=\|x_{i}\|^{d_{t}-1}\cdot\left(|x_{i}^{(j)}|+2\sqrt{C_{T}\log% \tfrac{1}{\varepsilon}}\right)$ , $g(\vec{x}_{i})=\|x_{i}\|$ (which is Lipschitz), and $\theta=C_{T}\cdot t\log\frac{1}{\varepsilon}$ , it follows that

\frac{1}{n}\sum_{i=1}^{n}\|x_{i}\|^{2(d_{t}-1)}\cdot(|x_{i}^{(j)}|+2\sqrt{C_{T% }\log\tfrac{1}{\varepsilon}})^{2}\cdot\bm{1}\left[\|x_{i}\|^{2}>C_{T}\cdot t% \log\tfrac{1}{\varepsilon}\right]\leqslant\left(\frac{3(C^{\prime})^{2}r}{C_{T% }\cdot t\log\frac{1}{\varepsilon}}\right)^{r}\leqslant\varepsilon

by taking $r=\log\frac{1}{\varepsilon}$ and having $C_{T}\cdot t>3e(C^{\prime})^{2}$ . Therefore, plugging this all back in to (4), we have that

\begin{aligned}
\|f_{t}(y)-f_{t}(x)\|_{2}^{2}&\leqslant M\cdot 6tL^{2}(C_{T}\cdot t\log\tfrac{1}{\varepsilon})^{d_{t}-1}+\sum_{j=0}^{t-1}\sum_{i=1}^{n}\|x_{i}\|^{2(d_{t}-1)}\cdot(y^{(j)}_{i}-x^{(j)}_{i})^{2}\\
&\leqslant M\cdot 6tL^{2}(C_{T}\cdot t\log\tfrac{1}{\varepsilon})^{d_{t}-1}+\sum_{j=0}^{t-1}\left(M\cdot(C_{T}\cdot t\log\tfrac{1}{\varepsilon})^{d_{t}-1}+\varepsilon n\right)\\
&\leqslant M\cdot(C_{T}\cdot t)^{d_{t}}\cdot\log^{d_{t}-1}\left(\tfrac{1}{\varepsilon}\right)+t\cdot\varepsilon n
\end{aligned}

as desired, assuming that $C_{T}>6L^{2}$ . ∎

4 AMP is robust to small spectral perturbations

Here we argue that AMP is robust to spectral perturbations.

Lemma 4.1.

Suppose that $X$ has independent entries of mean $0$ , variance $\frac{1}{n}$ , and subgaussian parameter $\frac{O(1)}{\sqrt{n}}$ . Let $\mathcal{F}$ be an AMP algorithm consisting of Lipschitz denoiser functions with Lipschitz constant at most $L$ , and let $v_{\mathrm{AMP}}(X)$ denote the output of the $T$ -step AMP algorithm on input $X$ , and $v_{\mathrm{AMP}}(Y)$ denote the output of the same algorithm on input $Y$ for any $Y$ satisfying $\|Y-X\|_{\operatorname{\mathsf{op}}}\leqslant\varepsilon$ . Then there exists a universal constant $C$ such that with probability $1-o(1)$ over $X$ ,

\frac{1}{n}\|v_{\mathrm{AMP}}(Y)-v_{\mathrm{AMP}}(X)\|_{2}^{2}\leqslant% \varepsilon^{2}\cdot C^{2T+2}\cdot((T+1)!)^{2}.

Since the starting iterate $x^{(0)}=\vec{1}$ and $X$ has entries of variance $\frac{1}{n}$ , the scaling $\frac{1}{n}\|v_{\mathrm{AMP}}(X)\|^{2}\sim 1$ is of the correct order for reasonable denoisers, in which case the above implies that $v_{\mathrm{AMP}}(X),v_{\mathrm{AMP}}(Y)$ are $1-O(\varepsilon^{2})$ -correlated.

Proof.

Let us denote the iterates by $y^{(t)}$ and $x^{(t)}$ for $Y$ and $X$ , respectively, and let $E=Y-X$ . When $t=0$ , $x^{(0)}=y^{(0)}=\vec{1}$ so the statement trivially holds. Now, assuming we have shown this for all $k<t$ , we show it for $t$ . We will use the shorthand $f_{k}(x)=f_{k}(x^{(k)},\ldots,x^{(0)})$ . We may expand the expression for $y^{(t)}-x^{(t)}$ :

\begin{aligned}
\left\|y^{(t)}-x^{(t)}\right\|_{2}&=\left\|Yf_{t}(y)-Xf_{t}(x)+\sum_{j=1}^{t}B_{t,j}\left(f_{j}(y)-f_{j}(x)\right)\right\|_{2}\\
&=\left\|Y\big(f_{t}(y)-f_{t}(x)\big)+(Y-X)f_{t}(x)+\sum_{j=1}^{t}B_{t,j}\cdot\left(f_{j}(y)-f_{j}(x)\right)\right\|_{2}\\
&\leqslant\|Y\|_{\operatorname{\mathsf{op}}}\|f_{t}(y)-f_{t}(x)\|_{2}+\|Y-X\|_{\operatorname{\mathsf{op}}}\|f_{t}(x)\|_{2}+\sum_{j=1}^{t}\|B_{t,j}\|\cdot\left\|f_{j}(y)-f_{j}(x)\right\|_{2}.
\end{aligned}

Since $\|Y\|_{\operatorname{\mathsf{op}}}\leqslant 2\|X\|_{\operatorname{\mathsf{op}}}=O(1)$ with high probability, $\|B_{t,j}\|=O(1)$ from the subgaussianity of $X$ , and $\|Y-X\|_{\operatorname{\mathsf{op}}}\leqslant\varepsilon$ by assumption, for a constant $C$ sufficiently large this is at most

(Ct+C/2)\max_{k\leqslant t}\left\|f_{k}(y)-f_{k}(x)\right\|_{2}+\varepsilon\left\|f_{t}(x)\right\|_{2}.

To control the first term, we invoke the Lipschitzness of $f$ ,

\|f_{k}(y)-f_{k}(x)\|_{2}^{2}\leqslant L\sum_{j=1}^{k}\|y^{(j)}-x^{(j)}\|_{2}^% {2}\leqslant Lk\|y^{(k-1)}-x^{(k-1)}\|_{2}^{2}\leqslant(C^{k}(k)!)^{2}\cdot% \varepsilon^{2}n.

For the second term, we have from Corollary 2.3 (applied with $\varepsilon=1$ ) that $\|f_{t}(x)\|_{2}\leqslant\frac{1}{2}C\sqrt{n}$ . Combining these facts, we find that

\|y^{(t)}-x^{(t)}\|_{2}\leqslant(Ct+\tfrac{1}{2}C)(C^{t}t!)\cdot\varepsilon\sqrt{n}+\tfrac{1}{2}C\varepsilon\sqrt{n}\leqslant\varepsilon\sqrt{n}\cdot C^{t+1}(t+1)!

where the last inequality uses $C^{t}t!\geqslant 1$ (which holds once $C\geqslant 1$ ), and so

\frac{1}{n}\|y^{(t)}-x^{(t)}\|_{2}^{2}\leqslant\varepsilon^{2}\cdot(C^{t+1}(t+% 1)!)^{2}.\qed
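As a numerical sanity check of this robustness statement, the sketch below runs a simplified AMP iteration on a random matrix $X$ and on a perturbed matrix $Y$ with $\|Y-X\|_{\operatorname{\mathsf{op}}}=\varepsilon$ , and records the normalized squared distance between the outputs. Several choices are assumptions made only for the demo: Gaussian entries, a memoryless tanh denoiser with the standard empirical Onsager coefficient in place of general $f_{t}$ and $B_{t,j}$ , and a rank-one perturbation along the all-ones direction.

```python
import numpy as np


def amp(M, T):
    """T steps of a memoryless tanh AMP iteration from the all-ones vector."""
    n = M.shape[0]
    x = np.ones(n)
    f_prev = np.zeros(n)
    for _ in range(T):
        f_cur = np.tanh(x)
        b = float(np.mean(1.0 - f_cur ** 2))   # empirical Onsager coefficient
        x, f_prev = M @ f_cur - b * f_prev, f_cur
    return x


def perturbation_error(X, eps, T):
    """(1/n) * ||v_AMP(Y) - v_AMP(X)||^2 for Y = X + eps * u u^T, u a unit vector."""
    n = X.shape[0]
    u = np.ones(n) / np.sqrt(n)
    Y = X + eps * np.outer(u, u)               # ||Y - X||_op equals eps
    return float(np.sum((amp(Y, T) - amp(X, T)) ** 2) / n)


rng = np.random.default_rng(0)
n = 400
G = rng.standard_normal((n, n))
X = (G + G.T) / np.sqrt(2 * n)   # symmetric, off-diagonal entries of variance 1/n
errors = {eps: perturbation_error(X, eps, T=4) for eps in (0.0, 1e-3, 1e-1)}
```

The lemma predicts that the normalized error scales like $\varepsilon^{2}$ for fixed $T$ , so the entry for $\varepsilon=10^{-3}$ should be far smaller than the entry for $\varepsilon=10^{-1}$ .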

Acknowledgments

We thank Spencer Compton, Sam Hopkins and Andrea Montanari for helpful discussions.

References

[BLM12] Mohsen Bayati, Marc Lelarge, and Andrea Montanari. Universality in polytope phase transitions and iterative algorithms. In Proceedings of the 2012 IEEE International Symposium on Information Theory, ISIT 2012, Cambridge, MA, USA, July 1-6, 2012, pages 1643–1647. IEEE, 2012.
[BM11] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.
[BMN20] Raphael Berthier, Andrea Montanari, and Phan-Minh Nguyen. State evolution for approximate message passing with non-separable functions. Information and Inference: A Journal of the IMA, 9(1):33–79, 2020.
[Bol14] Erwin Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington–Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.
[CL20] Wei-Kuo Chen and Wai-Kit Lam. Universality of approximate message passing algorithms. CoRR, abs/2003.10431, 2020.
[CZK14] Francesco Caltagirone, Lenka Zdeborová, and Florent Krzakala. On convergence of approximate message passing. In 2014 IEEE International Symposium on Information Theory, pages 1812–1816. IEEE, 2014.
[DdHS23] Jingqiu Ding, Tommaso d’Orsi, Yiding Hua, and David Steurer. Reaching Kesten-Stigum threshold in the stochastic block model under node corruptions. In The Thirty Sixth Annual Conference on Learning Theory, pages 4044–4071. PMLR, 2023.
[DdNS22] Jingqiu Ding, Tommaso d’Orsi, Rajai Nasser, and David Steurer. Robust recovery for stochastic block models. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 387–394. IEEE, 2022.
[DKK⁺19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019.
[DM14] Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse PCA. In 2014 IEEE International Symposium on Information Theory, pages 2197–2201. IEEE, 2014.
[DMM09] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.
[FVRS22] Oliver Y Feng, Ramji Venkataramanan, Cynthia Rush, and Richard J Samworth. A unifying tutorial on approximate message passing. Foundations and Trends in Machine Learning, 15(4):335–536, 2022.
[HSSS16] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 178–191, 2016.
[IS24] Misha Ivkov and Tselil Schramm. Semidefinite programs simulate approximate message passing robustly. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 348–357, 2024.
[JLST21] Arun Jambulapati, Jerry Li, Tselil Schramm, and Kevin Tian. Robust regression revisited: Acceleration and improved estimation rates. Advances in Neural Information Processing Systems, 34:4475–4488, 2021.
[JP24] Chris Jones and Lucas Pesenti. Diagram analysis of iterative algorithms. CoRR, abs/2404.07881, 2024.
[KMS⁺12] Florent Krzakala, Marc Mézard, Francois Sausset, Yifan Sun, and Lenka Zdeborová. Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08009, 2012.
[Mon12] Andrea Montanari. Graphical models concepts in compressed sensing. Compressed Sensing: Theory and Applications, page 394, 2012.
[Mon21] Andrea Montanari. Optimization of the Sherrington–Kirkpatrick Hamiltonian. SIAM Journal on Computing, (0):FOCS19–1, 2021.
[MR15] Andrea Montanari and Emile Richard. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Transactions on Information Theory, 62(3):1458–1484, 2015.
[MRW24] Sidhanth Mohanty, Prasad Raghavendra, and David X Wu. Robust recovery for stochastic block models, simplified and generalized. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 367–374, 2024.
[RSFS19] Sundeep Rangan, Philip Schniter, Alyson K Fletcher, and Subrata Sarkar. On the convergence of approximate message passing with arbitrary matrices. IEEE Transactions on Information Theory, 65(9):5339–5351, 2019.
[Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.

Appendix A Statistics of AMP iterate entries

The most important theorem for us is known as state evolution, which intuitively states that the entries of the iterates of an AMP iteration behave statistically somewhat like Gaussians. There are two versions of state evolution that will be important to us, corresponding to polynomial and Lipschitz iterations.

Theorem A.1 (Polynomial State Evolution (e.g. [BLM12, Theorem 4] or [JP24, Theorem 4.21 and Theorem 5.2])).

Suppose that $x^{0},x^{1},\ldots,x^{T}$ is an AMP iteration corresponding to polynomial denoisers, with input $X$ a symmetric matrix having i.i.d $\frac{O(1)}{\sqrt{n}}$ -subgaussian entries with mean $0$ and variance $\frac{1}{n}$ . Then, for any $\operatorname{PL}(k)$ function $\varphi:\mathbb{R}^{T+1}\rightarrow\mathbb{R}$ ,

\operatorname{\operatornamewithlimits{p-lim}}_{n\rightarrow\infty}\frac{1}{n}% \sum_{i=1}^{n}\varphi(x^{T}_{i},x^{T-1}_{i},\ldots,x^{0}_{i})=\operatorname*{% \mathbf{E}}[\varphi(U^{T},U^{T-1},\ldots,U^{0})]

where $U^{0},U^{1},\ldots,U^{T}$ form an appropriate Gaussian process with covariance independent of $n$ .

To prove a similar theorem in the Lipschitz setting, we note from a remark in [CL20] that [BLM12, Proposition 6] can be extended to handle Lipschitz denoiser functions, which implies the same universality as for polynomial denoisers.

Theorem A.2 (Lipschitz State Evolution (e.g. [FVRS22, Theorem 2.3])).

Suppose that $x^{0},x^{1},\ldots,x^{T}$ is an AMP iteration corresponding to Lipschitz denoisers, with input $X$ a symmetric matrix having i.i.d $\frac{O(1)}{\sqrt{n}}$ -subgaussian entries with mean $0$ and variance $\frac{1}{n}$ . Then, for any $\operatorname{PL}(k)$ function $\varphi:\mathbb{R}^{T+1}\rightarrow\mathbb{R}$ ,

\operatorname{\operatornamewithlimits{p-lim}}_{n\rightarrow\infty}\frac{1}{n}% \sum_{i=1}^{n}\varphi(x^{T}_{i},x^{T-1}_{i},\ldots,x^{0}_{i})=\operatorname*{% \mathbf{E}}[\varphi(U^{T},U^{T-1},\ldots,U^{0})]

where $U^{0},U^{1},\ldots,U^{T}$ form an appropriate Gaussian process with covariance independent of $n$ .

As noted by [FVRS22, Remark 2.4], state evolution holds for Lipschitz denoisers when we consider $B_{t,j}$ instead of $b_{t,j}$ in the AMP iteration. By adapting the version of this proof presented in [BMN20, Corollary 2], we may also substitute $B_{t,j}$ for $b_{t,j}$ when considering polynomial denoisers.

Consequences of state evolution for Pseudo-Lipschitz functions

We collect some facts about Pseudo-Lipschitz functions, after which we can prove Corollary 2.3, which implies the concentration we need for order statistics of the AMP iterates.

Proposition A.3.

Suppose that $f:\mathbb{R}^{t}\rightarrow\mathbb{R}$ is $\operatorname{PL}(a)$ with Lipschitz constant $L_{1}$ and $g:\mathbb{R}^{t}\rightarrow\mathbb{R}$ is $\operatorname{PL}(b)$ with Lipschitz constant $L_{2}$ . Furthermore, suppose that $f(\vec{0})=g(\vec{0})=0$ . Then, $f\cdot g\in\operatorname{PL}(a+b)$ with Lipschitz constant $12L_{1}L_{2}$ .

Proof.

We may write

$$|f(x)g(x)-f(y)g(y)|\leqslant\tfrac{1}{2}|f(x)-f(y)|\cdot|g(x)+g(y)|+\tfrac{1}{2}|g(x)-g(y)|\cdot|f(x)+f(y)|.$$

We can see that $|f(x)|\leqslant L_{1}(1+\|x\|^{a-1})\|x\|$ and $|g(x)|\leqslant L_{2}(1+\|x\|^{b-1})\|x\|$ by centeredness, that is, by comparing against $y=\vec{0}$ and using $f(\vec{0})=g(\vec{0})=0$. Therefore, for the first term we obtain that

$$\begin{aligned}
|f(x)-f(y)|\cdot|g(x)+g(y)| &\leqslant L_{1}(1+\|x\|^{a-1}+\|y\|^{a-1})\|x-y\|\cdot L_{2}(\|x\|+\|y\|+\|x\|^{b}+\|y\|^{b})\\
&=L_{1}L_{2}\|x-y\|\left[(1+\|x\|^{a-1}+\|y\|^{a-1})(\|x\|+\|y\|+\|x\|^{b}+\|y\|^{b})\right].
\end{aligned}$$

It remains to show that the bracketed product is at most $12(1+\|x\|^{a+b-1}+\|y\|^{a+b-1})$. We consider two cases. If $\max(\|x\|,\|y\|)\leqslant 1$, then the first factor is at most $3$ and the second is at most $4$, so the product is at most $12$. Otherwise, suppose without loss of generality that $\|x\|\geqslant\|y\|$ and $\|x\|>1$. Then the first factor is at most $3\|x\|^{a-1}$ and the second is at most $4\|x\|^{b}$, so the product is at most $12\|x\|^{a+b-1}$. This implies that

$$|f(x)-f(y)|\cdot|g(x)+g(y)|\leqslant 12L_{1}L_{2}(1+\|x\|^{a+b-1}+\|y\|^{a+b-1})\|x-y\|$$

and symmetrically for $|g(x)-g(y)|\cdot|f(x)+f(y)|$ . Thus, we have shown that $f\cdot g\in\operatorname{PL}(a+b)$ with Lipschitz constant $12L_{1}L_{2}$ .

∎
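As a quick numerical sanity check of the case analysis above, the following Python sketch evaluates both sides of the scalar inequality $(1+s^{a-1}+u^{a-1})(s+u+s^{b}+u^{b})\leqslant 12(1+s^{a+b-1}+u^{a+b-1})$ on a small grid, with $s$ and $u$ standing in for $\|x\|$ and $\|y\|$. The function name, the grid, and the floating-point tolerance are illustrative choices rather than part of the proof, and the check assumes integer degrees $a,b\geqslant 1$.

```python
def product_bound_holds(a, b, s, u, const=12.0):
    """Check (1 + s^(a-1) + u^(a-1)) * (s + u + s^b + u^b)
    <= const * (1 + s^(a+b-1) + u^(a+b-1)) for scalars s, u >= 0.

    Note: Python evaluates 0.0 ** 0 as 1.0, matching the convention
    that a norm raised to the power 0 equals 1.
    """
    lhs = (1 + s ** (a - 1) + u ** (a - 1)) * (s + u + s ** b + u ** b)
    rhs = const * (1 + s ** (a + b - 1) + u ** (a + b - 1))
    # Small relative tolerance to absorb floating-point rounding.
    return lhs <= rhs * (1 + 1e-12)


# Illustrative grid of norms and degrees (an assumption, not from the paper).
grid = [0.0, 0.25, 0.5, 1.0, 1.5, 2.0, 5.0, 10.0]
degrees = [1, 2, 3]

all_ok = all(
    product_bound_holds(a, b, s, u)
    for a in degrees
    for b in degrees
    for s in grid
    for u in grid
)
print("inequality holds on the grid:", all_ok)
```

A grid check of this kind cannot replace the two-case argument, but it is a cheap way to catch a mistaken constant.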

Finally, we prove Corollary 2.3.

Corollary (Restatement of Corollary 2.3).

• For any $r\gg\max(t,k)$,

$$\operatorname*{p\text{-}lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[g(\vec{x}_{i})^{2}>\theta]\leqslant\frac{1}{\theta^{r}}\cdot C^{2r}\cdot(3\ell r)^{\ell r}.$$

• For every $I\subseteq[n]$ with $|I|\leqslant\varepsilon n$,

$$\operatorname*{p\text{-}lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i\in I}f(\vec{x}_{i})^{2}\leqslant C\varepsilon\log^{k}\frac{1}{\varepsilon}.$$

Before the proof, note that the indicator $\bm{1}[g(\vec{x}_{i})^{2}>\theta]$ is discontinuous, so it is not Pseudo-Lipschitz of any degree and state evolution does not apply to it directly. However, we show that we can bound it above and below by Lipschitz (hence Pseudo-Lipschitz) functions and thus still reason about it.

Claim A.4.

There exist families of Lipschitz functions $f_{1}(x;L)$ and $f_{2}(x;L)$, each having Lipschitz constant $L$, such that $f_{1}(x;L)\leqslant\bm{1}[x>\theta]\leqslant f_{2}(x;L)$ and, as $L\rightarrow\infty$, both bounding functions converge to $\bm{1}[x>\theta]$ at every $x\neq\theta$. (We apply the claim with $g(\vec{x}_{i})^{2}$ in place of $x$.)

Proof.

Define

$$f_{1}(x;L)=\begin{cases}0&\qquad x\leqslant\theta\\ L(x-\theta)&\qquad\theta<x\leqslant\theta+\frac{1}{L}\\ 1&\qquad\text{otherwise}\end{cases}\qquad\text{and}\qquad f_{2}(x;L)=\begin{cases}0&\qquad x\leqslant\theta-\frac{1}{L}\\ L\left(x-\theta\right)+1&\qquad\theta-\frac{1}{L}<x\leqslant\theta\\ 1&\qquad\text{otherwise}\end{cases}$$

which by definition satisfy the given constraint. ∎
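The two ramp functions are simple enough to write down directly. The following Python sketch implements them and checks the sandwich $f_{1}\leqslant\bm{1}[x>\theta]\leqslant f_{2}$ on a grid; the names `lower_ramp` and `upper_ramp`, the value $\theta=2$, and the dyadic grid (chosen so that all arithmetic is exact in floating point) are assumptions made for illustration.

```python
def lower_ramp(x, theta, L):
    """f_1(x; L): an L-Lipschitz lower bound on the indicator 1[x > theta]."""
    if x <= theta:
        return 0.0
    if x <= theta + 1.0 / L:
        return L * (x - theta)
    return 1.0


def upper_ramp(x, theta, L):
    """f_2(x; L): an L-Lipschitz upper bound on the indicator 1[x > theta]."""
    if x <= theta - 1.0 / L:
        return 0.0
    if x <= theta:
        return L * (x - theta) + 1.0
    return 1.0


def indicator(x, theta):
    return 1.0 if x > theta else 0.0


theta = 2.0
xs = [i / 64.0 for i in range(0, 257)]  # dyadic grid on [0, 4]
slopes = [1.0, 2.0, 4.0, 8.0, 16.0]     # powers of two keep 1/L exact

sandwich_ok = all(
    lower_ramp(x, theta, L) <= indicator(x, theta) <= upper_ramp(x, theta, L)
    for L in slopes
    for x in xs
)
print("sandwich holds on the grid:", sandwich_ok)
```

The two ramps differ from the indicator only on an interval of width $1/L$ on either side of $\theta$, which is why the approximation tightens as $L$ grows.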

This implies that state evolution holds for indicators (and we can treat them as if they are another Lipschitz function).

Proof of Corollary 2.3.

We begin with the first bullet point. By state evolution, we have that

$$\operatorname*{p\text{-}lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[g(\vec{x}_{i})^{2}>\theta]=\operatorname*{\mathbf{E}}_{\vec{U}\sim N(0,\Sigma)}\left[f(\vec{U})^{2}\bm{1}[g(\vec{U})^{2}>\theta]\right],$$

where $\Sigma$ is the covariance matrix of $\vec{U}$ . Note that for any $r\geqslant 1$ , $\bm{1}[x^{2}>\theta]\leqslant\frac{x^{2r}}{\theta^{r}}$ (by case analysis on whether $x^{2}>\theta$ ).

Therefore, applying this bound with $x=g(\vec{U})$, we find that

$$\operatorname*{\mathbf{E}}_{\vec{U}\sim N(0,\Sigma)}\left[f(\vec{U})^{2}\bm{1}[g(\vec{U})^{2}>\theta]\right]\leqslant\frac{1}{\theta^{r}}\operatorname*{\mathbf{E}}_{\vec{U}}\left[(f(\vec{U})g(\vec{U})^{r})^{2}\right].$$

By Proposition A.3, we have that $f(\vec{U})g(\vec{U})^{r}\in\operatorname{PL}(a+br)$ .

Define $h(\vec{x})=f(\Sigma^{1/2}\vec{x})\cdot g(\Sigma^{1/2}\vec{x})^{r}$, which is also $\operatorname{PL}(a+br)$ and centered, with Lipschitz constant at most $(12L\|\Sigma\|_{\operatorname{\mathsf{op}}})^{r}$. In particular, we thus have that $|h(\vec{x})|\leqslant(12L\|\Sigma\|_{\operatorname{\mathsf{op}}})^{r}\cdot(\|\vec{x}\|+\|\vec{x}\|^{a+br})$ and

$$h(\vec{x})^{2}\leqslant 2(12L\|\Sigma\|_{\operatorname{\mathsf{op}}})^{2r}(\|\vec{x}\|^{2}+\|\vec{x}\|^{2(a+br)}).$$

Now, we may compute that

$$\begin{aligned}
\operatorname*{\mathbf{E}}_{\vec{g}\sim N(0,I)}[h(\vec{g})^{2}] &\leqslant 2(12L\|\Sigma\|_{\operatorname{\mathsf{op}}})^{2r}\cdot\left(1+\sum_{k_{0}+k_{1}+\cdots+k_{t}=a+br}\,\,\,\prod_{i=0}^{t}\operatorname*{\mathbf{E}}\left[g_{i}^{2k_{i}}\right]\right)\\
&\leqslant 2(12L\|\Sigma\|_{\operatorname{\mathsf{op}}})^{2r}\left(1+\binom{t+a+br}{t}(2(a+br))^{a+br}\right)\\
&\leqslant 2(12L\|\Sigma\|_{\operatorname{\mathsf{op}}})^{2r}\cdot(2(a+br))^{a+br+t}\\
&\leqslant(24L\|\Sigma\|_{\operatorname{\mathsf{op}}})^{2r}\cdot(3br)^{br}
\end{aligned}$$

where in the last two steps we used that $r\gg\max(a,t)$.

Now, we may use this to prove the second bullet point. We can assume without loss of generality that $f(\vec{0})=0$: otherwise, since $f(x)^{2}\leqslant 2(f(x)-f(\vec{0}))^{2}+2f(\vec{0})^{2}$, we may center $f$ at the cost of a factor of $2$. Furthermore, it suffices to prove the statement of the corollary when $I$ is the top $\varepsilon$ quantile, i.e. the $\varepsilon n$ indices $i$ with the largest values of $f(\vec{x}_{i})^{2}$. Now, for any $\theta>1$, consider writing

$$\begin{aligned}
\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[i\text{ in }\varepsilon\text{ quantile}] &=\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[i\text{ in }\varepsilon\text{ quantile}]\bm{1}[f(\vec{x}_{i})^{2}\leqslant\theta]\\
&\quad+\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[i\text{ in }\varepsilon\text{ quantile}]\bm{1}[f(\vec{x}_{i})^{2}>\theta]\\
&\leqslant\varepsilon\theta\cdot n+\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[f(\vec{x}_{i})^{2}>\theta].
\end{aligned}$$

Therefore, dividing both sides by $n$ and taking the $\operatorname{p\text{-}lim}$ implies that

$$\operatorname*{p\text{-}lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[i\text{ in }\varepsilon\text{ quantile}]\leqslant\varepsilon\theta+\operatorname*{p\text{-}lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[f(\vec{x}_{i})^{2}>\theta].$$
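The splitting step is a purely deterministic fact about non-negative numbers: the sum of the $m$ largest entries is at most $m\theta$ plus the total mass of the entries exceeding $\theta$. The Python sketch below checks this on a small hand-picked list; the data, the helper names, and the thresholds are hypothetical and serve only to illustrate the inequality, with `values` playing the role of the $f(\vec{x}_{i})^{2}$ and `m` the role of $\varepsilon n$.

```python
def top_quantile_sum(values, m):
    """Sum of the m largest entries (the 'top quantile' of size m)."""
    return sum(sorted(values, reverse=True)[:m])


def split_bound(values, m, theta):
    """m * theta plus the total mass of the entries exceeding theta."""
    return m * theta + sum(v for v in values if v > theta)


# Hypothetical non-negative data standing in for the values f(x_i)^2.
values = [0.1, 0.4, 0.9, 1.5, 2.0, 3.5, 7.0, 12.0, 0.2, 0.3]
m = 3
thetas = [1.0, 2.0, 5.0, 10.0, 20.0]

bound_holds = all(
    top_quantile_sum(values, m) <= split_bound(values, m, theta)
    for theta in thetas
)
print("split bound holds for every threshold:", bound_holds)
```

The threshold $\theta$ trades off the two terms: a larger $\theta$ inflates $m\theta$ but shrinks the tail mass, which is the trade-off the rest of the proof optimizes.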

Our goal is to show that by choosing $\theta=\Theta(\log^{k}\frac{1}{\varepsilon})$, the latter term is $O(\varepsilon\log^{k}\frac{1}{\varepsilon})$, which would complete the proof. By the first bullet point, applied with $g=f$, we have that

$$\operatorname*{p\text{-}lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[f(\vec{x}_{i})^{2}>\theta]\leqslant\frac{1}{\theta^{r}}\cdot C^{2r}\cdot(3kr)^{kr}$$

(since $f\in\operatorname{PL}(k)$). Take $r=\log\frac{1}{\varepsilon}$ and $\theta=e(3k)^{k}C^{2}\cdot r^{k}=\Theta_{\varepsilon}(\log^{k}\frac{1}{\varepsilon})$. Since $(3kr)^{kr}=\left((3k)^{k}r^{k}\right)^{r}$, we find that

$$\begin{aligned}
\operatorname*{p\text{-}lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[f(\vec{x}_{i})^{2}>\theta] &\leqslant\frac{1}{\theta^{r}}\cdot C^{2r}\cdot(3kr)^{kr}\\
&=\left(\frac{(3k)^{k}C^{2}\cdot r^{k}}{e(3k)^{k}C^{2}\cdot r^{k}}\right)^{r}\\
&=e^{-r}=\varepsilon.
\end{aligned}$$

Thus, using $\theta\geqslant 1$ in the last step, we have that

$$\operatorname*{p\text{-}lim}_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}f(\vec{x}_{i})^{2}\bm{1}[i\text{ in }\varepsilon\text{ quantile}]\leqslant\varepsilon\theta+\operatorname*{\mathbf{E}}\left[f(\vec{U})^{2}\bm{1}[f(\vec{U})^{2}>\theta]\right]\leqslant\varepsilon\theta+\varepsilon\leqslant 2e(3k)^{k}C^{2}\cdot\varepsilon\log^{k}\frac{1}{\varepsilon}$$

which is exactly as desired. ∎
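The arithmetic behind the choice of $\theta$ can be checked numerically. The Python sketch below evaluates, in log-space, the tail bound $C^{2r}(3kr)^{kr}/\theta^{r}$ at $r=\log\frac{1}{\varepsilon}$ and $\theta=c\cdot r^{k}$ for a constant $c$. Since $(3kr)^{kr}=((3k)^{k}r^{k})^{r}$, the bound equals $\varepsilon$ exactly when $c=e(3k)^{k}C^{2}$, which for $k=1$ is $3eC^{2}$. The helper names and the numerical values of $\varepsilon$, $k$, and $C$ are illustrative assumptions.

```python
import math


def log_tail_bound(eps, k, C, c):
    """Natural log of C^(2r) * (3kr)^(kr) / theta^r,
    with r = log(1/eps) and theta = c * r^k (requires 0 < eps < 1)."""
    r = math.log(1.0 / eps)
    theta = c * r ** k
    return 2 * r * math.log(C) + k * r * math.log(3 * k * r) - r * math.log(theta)


def balancing_constant(k, C):
    """The constant c for which the tail bound equals eps: e * (3k)^k * C^2."""
    return math.e * (3 * k) ** k * C ** 2


# Illustrative parameter values (assumptions, not taken from the paper).
eps, k, C = 0.01, 3, 2.0
gap = log_tail_bound(eps, k, C, balancing_constant(k, C)) - math.log(eps)
print("log(tail bound) - log(eps) =", gap)
```

Working in log-space avoids overflow, since the individual factors $(3kr)^{kr}$ and $\theta^{r}$ grow quickly as $\varepsilon\to 0$.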

