Ziv–Merhav Estimation for Hidden-Markov Processes

N. Barnfield¹, R. Grondin*, G. Pozzoli² and R. Raquépas³

¹ McGill University, Department of Mathematics and Statistics, Montréal QC, Canada
² CY Cergy Paris Université, Department of Mathematics, Cergy-Pontoise, France
³ New York University, Courant Institute of Mathematical Sciences, New York NY, United States

Abstract

We present a proof of strong consistency of a Ziv–Merhav-type estimator of the cross entropy rate for pairs of hidden-Markov processes. Our proof strategy has two novel aspects: the focus on decoupling properties of the laws and the use of tools from the thermodynamic formalism.

Presented at the IEEE International Symposium on Information Theory 2024, in Athens, Greece.

1 Introduction

We are interested in Ziv–Merhav-type estimators [ZM93] of the cross entropy rate for pairs of sources — which is a sum of a Kullback–Leibler (KL) divergence rate and an entropy rate. While such estimators are widely used in practice (see e.g. [KPK01, BCL02, CF05, B+08, CFF10, RP12, L+19, R+22]), theoretical works on the subject are scarce. The goal of this proceedings paper is to concretely present the consequences of recent findings [BCJP21, C+23, CR24, BGPR24, BGPR] for hidden-Markov processes in an accessible way.

Throughout, $\mathcal{A}$ is a finite alphabet and $\mathcal{A}^{\mathbb{N}}$ is the space of one-sided $\mathcal{A}$-valued sequences, denoted by bold lower-case letters, e.g. $\mathbf{x}=(x_{n})_{n=1}^{\infty}$. We use $x_{k}^{\ell}$ for the string (i.e. finite concatenation of elements of $\mathcal{A}$) of length $\ell-k+1$ starting at index $k$ in such a sequence $\mathbf{x}$. We also use $[x_{1}^{l}]:=\{\mathbf{z}\in\mathcal{A}^{\mathbb{N}}:z_{1}^{l}=x_{1}^{l}\}$. We use bold upper-case letters for $\mathcal{A}$-valued processes, e.g. $\mathbf{X}=(X_{n})_{n=1}^{\infty}$. We will only consider processes that are stationary, i.e. processes whose law (a measure on $\mathcal{A}^{\mathbb{N}}$) is invariant under the shift $(x_{n})_{n=1}^{\infty}\mapsto(x_{n+1})_{n=1}^{\infty}$.

1.1 Entropies

The Shannon entropy rate of a stationary process $\mathbf{X}$ — or equivalently of its law $\mathbb{P}_{\mathbf{X}}$ — is the limit

h(\mathbb{P}_{\mathbf{X}})=\lim_{n\to\infty}\frac{\sum_{a\in\mathcal{A}^{n}}\mathbb{P}_{\mathbf{X}}[a](-\ln\mathbb{P}_{\mathbf{X}}[a])}{n}.

There are at least two important related quantities for pairs of stationary processes: the cross entropy rate and the KL divergence rate (a.k.a. relative entropy rate):

h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})=\lim_{n\to\infty}\frac{\sum_{a\in\mathcal{A}^{n}}\mathbb{P}_{\mathbf{Y}}[a](-\ln\mathbb{P}_{\mathbf{X}}[a])}{n}

whenever the limit exists in $[0,\infty]$, and then $d^{\textnormal{KL}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})=h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})-h(\mathbb{P}_{\mathbf{Y}})$. As we shall see in Section 2.1, this will be the case for the pairs of measures we are interested in, but it should be noted that this is not a general fact about pairs of ergodic processes. (Footnote 4: Let us emphasize that the KL divergence of $\mathbb{P}_{\mathbf{Y}}$ with respect to $\mathbb{P}_{\mathbf{X}}$ itself, as opposed to the rate, is well defined but too coarse to be useful in many tasks. Indeed, if both measures are ergodic, then this quantity is either zero or infinite.)

The KL divergence rate plays a fundamental role in many tasks in information theory and statistics: binary hypothesis testing, universal classification, etc. Because the KL divergence rate differs from the cross entropy rate by an entropy rate which can be universally estimated following the Lempel–Ziv parsing algorithm [ZL78], we will focus on estimation of the cross entropy rate.

1.2 Ziv–Merhav-type estimators

Inspired by the dictionary-based approach of the Lempel–Ziv algorithm for estimating the entropy rate, researchers have sought to develop similar universal estimators that generalize to multiple sources to measure the cross entropy rate. Introduced in [ZM93], the Ziv–Merhav (ZM) estimator is one such cross entropy rate estimator, whose consistency was established in the case where $\mathbf{X}$ and $\mathbf{Y}$ are stationary multi-level Markov processes; we will denote it by $Q_{N}^{\text{ZM}}$ . We consider a modification $Q_{N}$ of $Q_{N}^{\text{ZM}}$ , introduced recently in [BGPR], which we refer to as the modified Ziv–Merhav (mZM) estimator.

Definition 1.

For realizations $(\mathbf{y},\mathbf{x})$ of $(\mathbf{Y},\mathbf{X})$ , the mZM parsing of $y_{1}^{N}$ with respect to $x_{1}^{N}$ begins by determining the shortest prefix $\overline{w}^{(1,N)}$ of $y_{1}^{N}$ that does not appear in $x_{1}^{N}$ . Then, $\overline{w}^{(2,N)}$ is the shortest prefix of the unparsed part of $y_{1}^{N}$ that does not appear in $x_{1}^{N}$ , and so on until we reach the end of $y_{1}^{N}$ . The parsing length $c_{N}(\mathbf{y},\mathbf{x})$ is the number of words in the parsing

y_{1}^{N}=\overline{w}^{(1,N)}\overline{w}^{(2,N)}\dotsc\overline{w}^{(c_{N},N)};

see Algorithm 1. The mZM estimator is

Q_{N}(\mathbf{y},\mathbf{x}):=\frac{c_{N}(\mathbf{y},\mathbf{x})\ln N}{N-c_{N}(\mathbf{y},\mathbf{x})}.

In essence, the original ZM algorithm parses ${y}_{1}^{N}$ according to the longest words found in ${x}_{1}^{N}$, whereas the mZM algorithm parses according to the shortest words not found. The difference between the ZM and mZM parsing lengths boils down to whether the condition “if $i=j$” is imposed (ZM) or not (mZM) for executing Line 7 in Algorithm 1. This difference is compensated in the definition of the estimator: the ZM estimator does not subtract $c_{N}$ in the denominator, whereas the mZM estimator does. As seen in Section 4, these subtleties yield no observable difference in performance.

Algorithm 1 Computation of the mZM parsing length

Input: $x_{1}^{N},y_{1}^{N}\in\mathcal{A}^{N}$
Output: $c_{N}$
1: $c_{N}\leftarrow 1$, $j\leftarrow 1$, $i\leftarrow 1$
2: while $j<N$
3:     if $y_{i}^{j}$ is in $x_{1}^{N}$ then
4:         $j\leftarrow j+1$
5:     else
6:         $c_{N}\leftarrow c_{N}+1$
7:         $j\leftarrow j+1$
8:         $i\leftarrow j$
9:     end if
10: end while
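As a complement to Algorithm 1, here is a minimal Python sketch of the mZM parsing length and of the estimator $Q_{N}$ from Definition 1. It is only an illustration: the function names are ours, sequences are assumed to be given as Python strings, indices are 0-based, and the substring test is the naive built-in one rather than anything efficient for large $N$.

```python
import math


def mzm_parsing_length(y, x):
    """Number of words in the mZM parsing of y with respect to x.

    Mirrors the loop of Algorithm 1 with 0-based indices; `y` and `x`
    are strings over the same alphabet.
    """
    n = len(y)
    c = 1  # the last (possibly incomplete) word is counted up front
    i = 0  # start of the current word
    j = 0  # last index of the current candidate word
    while j < n - 1:
        if y[i:j + 1] in x:  # candidate still appears in x: extend it
            j += 1
        else:                # shortest word not found in x: close it
            c += 1
            j += 1
            i = j
    return c


def mzm_estimator(y, x):
    """Q_N = c_N ln N / (N - c_N); returns inf when N - c_N vanishes."""
    n = len(y)
    c = mzm_parsing_length(y, x)
    return math.inf if c == n else c * math.log(n) / (n - c)


# "0110" parsed against "0101" gives the two words "011" and "0".
print(mzm_parsing_length("0110", "0101"))
```

The guard for $c_{N}=N$ is a convention chosen for this sketch; the definition of $Q_{N}$ leaves that degenerate case implicit.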

In a way, ZM-type estimators make repeated use of the notion of longest-match length

\Lambda_{N}(\mathbf{z},\mathbf{x})=\sup\{l:z_{1}^{l}=x_{k}^{k+l-1}\text{ for some }k\leq N-l+1\}

for which, under suitable assumptions on $\mathbf{Y}$ and $\mathbf{X}$ , we have

\lim_{N\to\infty}\frac{\ln N}{\Lambda_{N}(\mathbf{Y},\mathbf{X})}=h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})

almost surely; see e.g. [WZ89, OW93, Ko98, C+23]. However, these repeated uses of longest-match lengths are not independent, making it delicate to try to deduce convergence of ZM-type estimators from convergence of longest-match length estimators.
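For concreteness, the longest-match length and the associated estimator $\ln N/\Lambda_{N}$ can be sketched as follows. This is a naive illustration with hypothetical function names, assuming sequences are given as Python strings of length at least $N$; it relies on the fact that if a prefix of $\mathbf{z}$ appears in $x_{1}^{N}$, then so do all shorter prefixes.

```python
import math


def longest_match_length(z, x, n):
    """Largest l such that z[:l] appears somewhere inside x[:n]."""
    window = x[:n]  # the match must fit entirely inside x_1^N
    l = 0
    while l < len(z) and z[:l + 1] in window:
        l += 1
    return l


def longest_match_estimator(z, x, n):
    """ln N / Lambda_N, with the convention inf when no letter matches."""
    lam = longest_match_length(z, x, n)
    return math.inf if lam == 0 else math.log(n) / lam


print(longest_match_length("0110", "0101", 4))
```

In this toy example the prefix "01" of "0110" appears in "0101" but "011" does not, so the longest-match length is 2.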

1.3 Hidden-Markov processes

An $\mathcal{A}$ -valued process is called a hidden-Markov process (HMP) if all the marginals of its law can be represented as

\mathbb{P}[x_{1}^{n}]=\sum_{s\in\mathcal{S}^{n}}\pi_{s_{1}}R_{s_{1},x_{1}}P_{s_{1},s_{2}}R_{s_{2},x_{2}}\dotsb P_{s_{n-1},s_{n}}R_{s_{n},x_{n}}

for some fixed Markov chain $(\pi,P)$ on some state space $\mathcal{S}$ — this is the “hidden” chain — and some fixed $(\#\mathcal{S})$ -by- $(\#\mathcal{A})$ stochastic matrix $R$ . Such processes are known under different names, including probabilistic functions of Markov processes, and have gained immense popularity in statistical applications since the seminal papers [BP66, Pe69]; also see the review [EM02].
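The sum over hidden paths in the formula above has exponentially many terms, but it can be evaluated with the standard forward recursion. The sketch below is an illustration only: the function name and the numerical values of $\pi$, $P$ and $R$ are hypothetical, with letters and hidden states encoded as integers starting at 0.

```python
from itertools import product


def cylinder_probability(pi, P, R, word):
    """P[word] for the HMP with hidden chain (pi, P) and emission matrix R."""
    states = range(len(pi))
    # alpha[s]: total weight of hidden paths ending in s that emit the prefix read so far
    alpha = [pi[s] * R[s][word[0]] for s in states]
    for letter in word[1:]:
        alpha = [sum(alpha[s] * P[s][t] for s in states) * R[t][letter]
                 for t in states]
    return sum(alpha)


# Hypothetical two-state example; pi is the stationary distribution of P.
pi = [2 / 3, 1 / 3]
P = [[0.9, 0.1], [0.2, 0.8]]
R = [[0.8, 0.2], [0.3, 0.7]]

# The probabilities of all strings of a fixed length should sum to one.
total = sum(cylinder_probability(pi, P, R, w) for w in product(range(2), repeat=3))
print(total)
```

Since $\pi$ is stationary for $P$ in this example, one can also check shift-invariance on short strings, e.g. that $\mathbb{P}[a]=\sum_{b}\mathbb{P}[ba]$ for each letter $a$.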

Throughout this paper, we will only consider $\mathcal{A}$ -valued HMPs that can be represented with the following constraints:

i. the hidden state space is finite;

ii. the hidden chain is stationary and irreducible;

iii. for each $s_{1}\in\mathcal{S}$, there exists $n$ such that $\sum_{t\in\mathcal{S}^{n-1}}\pi_{s_{1}}R_{s_{1},x_{1}}P_{s_{1},t_{1}}R_{t_{1},x_{2}}\dotsb P_{t_{n-2},t_{n-1}}R_{t_{n-1},x_{n}}$ is positive for more than one string $x$ of length $n$.

Condition ii implies ergodicity and Condition iii is essentially the minimal condition for the process not to be eventually deterministic. As far as the class of measures on $\mathcal{A}^{\mathbb{N}}$ that can be obtained is concerned, it turns out that there is no loss of generality in considering only deterministic functions of (possibly larger) hidden chains. The Shannon entropy rate of such HMPs has been the subject of many works; see e.g. [Bl57, Bi63, JSS04, ZDKA06, HM06].

1.4 Main result

Our main result is the following statement on strong consistency of the mZM estimator of the cross entropy rate from Definition 1. To our knowledge, no analogue of this theorem is available for the ZM estimator.

Theorem 2.

Suppose that $\mathbf{X}$ and $\mathbf{Y}$ are independent HMPs with respective laws $\mathbb{P}_{\mathbf{X}}$ and $\mathbb{P}_{\mathbf{Y}}$. If they both satisfy Conditions i–iii, then

\lim_{N\to\infty}Q_{N}(\mathbf{Y},\mathbf{X})=h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})

almost surely.

Note that Theorem 2 provides an estimation of the cross entropy rate between the HMPs $\mathbf{X}$ and $\mathbf{Y}$ , not the underlying (hidden) Markov processes. In Section 2, we discuss some important preliminaries for our presentation of the proof in Section 3. The result is illustrated using numerical experiments in Section 4.

2 Decoupling properties of hidden-Markov processes

In this section, we identify the key properties of HMPs that we will use for the proof of Theorem 2. While these properties are natural from the point of view of the statistical mechanics of lattice gases, we suspect that they might not be familiar to the information theory community.

2.1 Decoupling inequalities and their first consequence

It is shown in [BCJP21] that if $\mathbb{P}$ is the law of a HMP satisfying i–ii, then there exist two natural numbers $k$ and $\tau$ with the following two properties:

• for all strings $a$ and $b$,

\mathbb{P}[ab]\leq\mathrm{e}^{k}\mathbb{P}[a]\mathbb{P}[b];

(1)

• for all strings $a$ and $b$, there exists a string $\xi$ with length $|\xi|\leq\tau$ such that

\mathbb{P}[a\xi b]\geq\mathrm{e}^{-k}\mathbb{P}[a]\mathbb{P}[b].

(2)

These properties and generalizations thereof are known as decoupling properties and have been instrumental in recent progress in large-deviation theory and its connection to statistical mechanics and information theory; see [Pf02, CJPS19, CR24]. (Footnote 5: A particularly useful generalization consists in allowing $\tau$ and $k$ to grow sublinearly with the length of the string $a$. However, we will not consider this generalization as it is unnecessary for HMPs and complicates some arguments. Still, as stated here, the inequalities (1)–(2) are weaker than the so-called “quasi-Bernoulli” property and imply no form of mixing.) A particularly convenient tool in their derivation in the context of HMPs is the so-called “positive-matrix product” representation; see e.g. [BCJP21]. The following is a straightforward consequence of Kingman’s subadditive ergodic theorem [Ki68].

Lemma 3.

If $\mathbb{P}_{\mathbf{X}}$ satisfies (1) and $\mathbb{P}_{\mathbf{Y}}$ is ergodic, then the limit $h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})$ exists and

\lim_{n\to\infty}\frac{-\ln\mathbb{P}_{\mathbf{X}}[Y_{1}^{n}]}{n}=h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})

(3)

almost surely.
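When a hidden-Markov representation of $\mathbf{X}$ is known, the left-hand side of (3) can be evaluated on a long sample of $\mathbf{Y}$, which gives a model-based reference value for the cross entropy rate. The sketch below is one possible way to do this, using a normalized forward recursion to avoid numerical underflow; the function name and encoding (integer letters and states) are assumptions made for illustration.

```python
import math


def neg_log_probability_rate(pi, P, R, word):
    """Return -ln P[word] / len(word) for the HMP (pi, P, R)."""
    states = range(len(pi))
    alpha = list(pi)  # conditional law of the hidden state, before emission
    log_prob = 0.0
    for idx, letter in enumerate(word):
        if idx > 0:  # one step of the hidden chain
            alpha = [sum(alpha[s] * P[s][t] for s in states) for t in states]
        alpha = [alpha[t] * R[t][letter] for t in states]  # emission weight
        norm = sum(alpha)  # conditional probability of this letter given the past
        if norm == 0.0:
            return math.inf  # the word has probability zero
        log_prob += math.log(norm)
        alpha = [a / norm for a in alpha]  # renormalize to keep numbers in range
    return -log_prob / len(word)


# Sanity check on an i.i.d. source, seen as an HMP with one hidden state.
rate = neg_log_probability_rate([1.0], [[1.0]], [[0.25, 0.75]], [0, 1, 1, 0])
print(rate)
```

For the one-state example, the returned value is simply the empirical average of $-\ln R_{1,y_{i}}$ over the four letters.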

It should be noted that the role of (1) for the study of the KL divergence rate for HMPs dates back at least to [Le92].

2.2 The thermodynamic formalism

In our proof, a key role is played by the following pressure:

\bar{q}(\alpha):=\limsup_{\ell\to\infty}\frac{q_{\ell}(\alpha)}{\ell},

(4)

where $\alpha$ is a real variable and

q_{\ell}(\alpha):=\ln\sum_{a\in\mathcal{A}^{\ell}}\mathrm{e}^{-\alpha\ln\mathbb{P}_{\mathbf{X}}[a]}\mathbb{P}_{\mathbf{Y}}[a].

(5)
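To get a feel for definition (5), one can evaluate $q_{\ell}(\alpha)$ by brute-force enumeration for small $\ell$. The sketch below does so for i.i.d. laws, which we use here only as the simplest special case (an HMP with a single hidden state); the function names and numerical values are hypothetical, and we assume $\mathbb{P}_{\mathbf{X}}[a]>0$ for every string $a$ so that no convention for $\ln 0$ is needed.

```python
import math
from itertools import product


def bernoulli_law(p):
    """Law of an i.i.d. source with letter probabilities p, as a function of words."""
    return lambda word: math.prod(p[a] for a in word)


def q_ell(alpha, ell, prob_x, prob_y, alphabet):
    """ln of the sum over words a of length ell of P_X[a]^(-alpha) * P_Y[a]."""
    total = 0.0
    for word in product(alphabet, repeat=ell):
        py = prob_y(word)
        if py > 0.0:  # words of zero P_Y-probability do not contribute
            total += prob_x(word) ** (-alpha) * py  # assumes P_X[word] > 0
    return math.log(total)


prob_x = bernoulli_law([0.3, 0.7])
prob_y = bernoulli_law([0.2, 0.8])
ell = 3
print(q_ell(0.0, ell, prob_x, prob_y, range(2)))   # close to 0
print(q_ell(-1.0, ell, prob_x, prob_y, range(2)))  # negative
# Left difference quotient at the origin, per symbol.
h = 1e-6
slope = (q_ell(0.0, ell, prob_x, prob_y, range(2))
         - q_ell(-h, ell, prob_x, prob_y, range(2))) / (h * ell)
print(slope)
```

In this i.i.d. example $q_{\ell}(\alpha)$ is exactly linear in $\ell$, and the left difference quotient at the origin is close to $\sum_{a}\mathbb{P}_{\mathbf{Y}}[a](-\ln\mathbb{P}_{\mathbf{X}}[a])$ over single letters, i.e. the cross entropy rate.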

The function $\bar{q}$ is a (possibly improper) monotone, convex function satisfying $\bar{q}(0)=0$ and $\bar{q}(-1)<0$ , the latter being a consequence of Condition iii. While it is not a pressure in the sense of the usual (additive) thermodynamic formalism, the bound (1) provides the requirement for the subadditive thermodynamic formalism of e.g. [CFH08]. As such, the pressure $\bar{q}$ still satisfies the variational principle

\bar{q}(\alpha)=\sup_{\mathbb{Q}}\left[\alpha\int f_{\mathbb{P}_{\mathbf{X}}}\,\mathrm{d}\mathbb{Q}-d^{\textnormal{KL}}(\mathbb{Q}|\mathbb{P}_{\mathbf{Y}})\right]

(6)

for all $\alpha\leq 0$ , where the supremum is taken over all shift-invariant laws $\mathbb{Q}$ and

f_{\mathbb{P}_{\mathbf{X}}}(\mathbf{z}):=\limsup_{n\to\infty}\frac{-\ln\mathbb{P}_{\mathbf{X}}[z_{1}^{n}]}{n}.

The role of such variational principles in the study of entropic quantities — which can be traced back to ideas of Gibbs [Gib] — has become ubiquitous in the mathematical literature on statistical mechanics, dynamical systems, and large deviations building on seminal works of Ruelle, Bowen and Walters, e.g. [Ru73, Bo74, Wa75]. While we have no guarantee that $\bar{q}$ is differentiable at the origin, the variational principle (6) allows us to identify the left derivative; see Figure 1.

Lemma 4.

If $\mathbb{P}_{\mathbf{X}}$ and $\mathbb{P}_{\mathbf{Y}}$ satisfy (1) and are ergodic, then

D_{-}\bar{q}(0)=h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}}).

Figure 1: The pressure may or may not be differentiable, but its left derivative at the origin can still be identified as the cross entropy rate under quite general assumptions. As a visual aid to the convexity argument for Lemma 10, the gray dashed line (figuratively) has slope $D_{-}\bar{q}(0)-\tfrac{\epsilon}{4}$.

The proof of this nontrivial fact relies on a uniqueness argument for the law maximizing (6) at $\alpha=0$ that can be traced back to ideas of Ruelle, and which is explained e.g. in [BJPP18]; also see [Sim]. For actual Markov chains (i.e. when $R$ is a permutation matrix), the left derivative does coincide with the right derivative as a consequence of the Perron–Frobenius theorem and standard perturbation theory arguments.
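As a heuristic for why the cross entropy rate appears in Lemma 4, one can differentiate the finite-$\ell$ quantity (5) at the origin. This is only a formal computation, assuming that $\mathbb{P}_{\mathbf{X}}[a]>0$ whenever $\mathbb{P}_{\mathbf{Y}}[a]>0$; the nontrivial content of the lemma is that the identification survives the limit $\ell\to\infty$ for the left derivative of $\bar{q}$.

```latex
\frac{\mathrm{d}}{\mathrm{d}\alpha}q_{\ell}(\alpha)\Big|_{\alpha=0}
  =\frac{\sum_{a\in\mathcal{A}^{\ell}}(-\ln\mathbb{P}_{\mathbf{X}}[a])\,
         \mathrm{e}^{-\alpha\ln\mathbb{P}_{\mathbf{X}}[a]}\mathbb{P}_{\mathbf{Y}}[a]}
        {\sum_{a\in\mathcal{A}^{\ell}}
         \mathrm{e}^{-\alpha\ln\mathbb{P}_{\mathbf{X}}[a]}\mathbb{P}_{\mathbf{Y}}[a]}
   \Bigg|_{\alpha=0}
  =\sum_{a\in\mathcal{A}^{\ell}}\mathbb{P}_{\mathbf{Y}}[a](-\ln\mathbb{P}_{\mathbf{X}}[a]),
\qquad\text{so that}\qquad
\lim_{\ell\to\infty}\frac{1}{\ell}\,
  \frac{\mathrm{d}}{\mathrm{d}\alpha}q_{\ell}(\alpha)\Big|_{\alpha=0}
  =h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})
```

The last equality holds whenever the limit defining the cross entropy rate in Section 1.1 exists.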

2.3 Waiting-time estimate

The number of words in ZM-type parsings is intimately related to the so-called Wyner–Ziv problem on waiting times introduced in [WZ89] and notably studied in [Sh93, MS95, Ko98]. The key quantity in this problem is the time

W(a,\mathbf{x})=\inf\{r:x_{r}^{r+|a|-1}=a\}

we need to wait to see the string $a$ appear in a sequence $\mathbf{x}$ . Waiting times are dual to the longest-match lengths from Section 1.2 [WZ89], but typically allow for more direct bounds. It was recently shown in [C+23] that, for every string $a$ and natural number $r$ ,

\operatorname{Prob}\{W(a,\mathbf{X})>r\}\leq(1-\mathrm{e}^{-k}\mathbb{P}_{\mathbf{X}}[a])^{\left\lfloor\frac{r-1}{|a|+\tau}\right\rfloor},

(7)

provided that the law $\mathbb{P}_{\mathbf{X}}$ of $\mathbf{X}$ under the underlying measure $\operatorname{Prob}$ satisfies (2). This progress was made by revisiting the ideas of [Ko98] from the perspective of decoupling properties. In the recent works [AACG23, CR24], a detailed analysis of the pressure $\bar{q}$ plays a crucial role in the study of large deviations of waiting times.
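The waiting time and its duality with the longest-match length can be illustrated on finite strings as follows. This is a sketch with hypothetical function names, in which a finite Python string stands in for the sequence $\mathbf{x}$ and the waiting time is reported as infinite when the string $a$ is not found in the available data.

```python
import math


def waiting_time(a, x):
    """1-based index of the first occurrence of the string a in x (inf if absent)."""
    idx = x.find(a)
    return idx + 1 if idx >= 0 else math.inf


def longest_match_via_waiting_times(z, x, n):
    """Largest l such that z[:l] occurs in x starting at some index k <= n - l + 1."""
    best = 0
    for l in range(1, min(len(z), n) + 1):
        if waiting_time(z[:l], x) <= n - l + 1:
            best = l
    return best


print(waiting_time("01", "1101"))
print(longest_match_via_waiting_times("0110", "0101", 4))
```

The second function expresses the duality used in the text: the prefix $z_{1}^{l}$ is matched inside $x_{1}^{N}$ exactly when $W(z_{1}^{l},\mathbf{x})\leq N-l+1$.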

3 Proof strategy

We are now in a position to present the proof strategy for Theorem 2. Still on $\mathcal{A}$ , the same exact strategy applies to irreducible Kusuoka processes [JOP17, BCJP21] and to $\psi$ -mixing processes that satisfy $\psi^{*}(0)<\infty$ in the notation of [Br05], as soon as one can check the nondegeneracy condition $\bar{q}(-1)<0$ . Further generalizations leveraging relaxations of the decoupling conditions (1)–(2) are discussed in our more technical paper [BGPR].

Throughout this section, the processes $\mathbf{X}$ and $\mathbf{Y}$ are fixed processes with hidden-Markov representations satisfying Conditions i–iii from Section 1.3 and have respective laws $\mathbb{P}_{\mathbf{X}}$ and $\mathbb{P}_{\mathbf{Y}}$ . We will consider them as defined on a common underlying probability space in such a way that they are independent and use “ $\operatorname{Prob}$ ” for the probability measure on this space. For each $N$ , the number $c_{N}$ of words, and the words $\overline{w}^{(1,N)}$ , $\overline{w}^{(2,N)}$ , $\dotsc$ , $\overline{w}^{(c_{N},N)}$ are functions of $X_{1}^{N}$ and $Y_{1}^{N}$ and can therefore be defined through composition as random variables on that same underlying probability space.

Definition 5.

We use $\underline{w}^{(i,N)}$ for the (possibly empty) string obtained by taking $\overline{w}^{(i,N)}$ in Definition 1 without its last letter.

We use “with high probability” to describe sequences $(E_{N})_{N=1}^{\infty}$ of events with the property that $\operatorname{Prob}(E_{N})>1-O(N^{-2})$ as $N\to\infty$ . In particular, the Borel–Cantelli (BC) lemma can be applied to the complement of finitely many such events to deduce almost sure statements.

First, one can show that if there is a string $a$ such that $\mathbb{P}_{\mathbf{Y}}[a]>0$ but $\mathbb{P}_{\mathbf{X}}[a]=0$, then ergodicity implies that the theorem holds true in the sense that both sides must be infinite. Hence, we will assume from now on that this does not happen. For simplicity, we will present the proof in the case $D_{-}\bar{q}(0)<\infty$. In the complementary infinite case, some of the bounds become vacuously true and the others are proved by superficially adapting the same sequence of arguments as in the finite case.

Lemma 6.

There exists a constant $\kappa>0$ such that

\displaystyle\max_{i=1,\dotsc,c_{N}}|\overline{w}^{(i,N)}|<\kappa\ln N

(8)

with high probability, and for every $\lambda>0$ ,

\lambda\leq\min_{i=1,\dotsc,c_{N}-1}|\underline{w}^{(i,N)}|

(9)

— and thus $c_{N}\leq\lambda^{-1}N$ — with high probability.

Proof sketch.

For every $\ell$ , by stationarity and a union bound (over the locations at which $Y_{1}^{\ell}$ can appear in $\mathbf{X}$ ), we have

\operatorname{Prob}\left\{W(Y_{1}^{\ell},\mathbf{X})<N\right\}\leq\sum_{a\in\mathcal{A}^{\ell}}\mathbb{P}_{\mathbf{Y}}[a]\cdot N\mathbb{P}_{\mathbf{X}}[a]=N\mathrm{e}^{q_{\ell}(-1)},

(10)

where $q_{\ell}$ is defined in (5) and $q_{\ell}(-1)=\ell\bar{q}(-1)+o(\ell)$ by (4). Also recall from Section 2.2 that $\bar{q}(-1)<0$ as a consequence of Condition iii. In particular, taking $\ell=\kappa\ln N$ with $\kappa$ large enough, the right-hand side of (10) can be made to decay as fast as any inverse power of $N$. By stationarity, the same is true with $Y_{1}^{\ell}$ replaced with $Y_{m}^{m+\ell-1}$. Hence, by a union bound (over $m$), we find that, with high probability, no substring of $Y_{1}^{N}$ of length $\kappa\ln N$ appears in $X_{1}^{N}$.

On the other hand, the bound (7) gives an exponentially decaying bound on the probability that some $a$ does not appear in $X_{1}^{N}$ . But that bound can be made uniform in the choice of $a$ of fixed length $\lambda$ with $\mathbb{P}_{\mathbf{Y}}[a]>0$ . Hence, by a union bound (over the choices of $a$ ), we find that, with high probability, all possible strings of length $\lambda$ appear somewhere in $X_{1}^{N}$ . ∎
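To make the choice of $\kappa$ in the first half of the argument more explicit, here is the bookkeeping under the asymptotics $q_{\ell}(-1)=\ell\bar{q}(-1)+o(\ell)$ used above, ignoring the rounding of $\kappa\ln N$ to an integer. The threshold on $\kappa$ below is our own reading of what the target $O(N^{-2})$ requires, not a constant taken from the cited works.

```latex
\ell=\kappa\ln N
\quad\Longrightarrow\quad
N\mathrm{e}^{q_{\ell}(-1)}=N^{1+\kappa\bar{q}(-1)+o(1)},
\qquad
\sum_{m=1}^{N}N\mathrm{e}^{q_{\ell}(-1)}=N^{2+\kappa\bar{q}(-1)+o(1)}=O(N^{-2})
\quad\text{as soon as}\quad
\kappa>\frac{4}{|\bar{q}(-1)|}
```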

Lemma 7.

For every $\epsilon>0$ ,

(1-\epsilon)[c_{N}-1]\ln N\leq-\sum_{i=1}^{c_{N}-1}\ln\mathbb{P}_{\mathbf{X}}[\overline{w}^{(i,N)}]

with high probability.

Proof sketch.

The two crucial observations are the following:

• If a sum of $c_{N}-1$ nonnegative terms is smaller than $(1-\epsilon)[c_{N}-1]\ln N$, then at least one term must be smaller than $\ln N^{1-\epsilon}$;

• If $\overline{w}^{(i,N)}$ is a word in a parsing of $Y_{1}^{N}$ that uses words not found in $X_{1}^{N}$, then $W(\overline{w}^{(i,N)},\mathbf{X})\geq N-|\overline{w}^{(i,N)}|$.

The inequality (7) from Section 2.3 provides a tension between those two observations: an index $i$ with small $-\ln\mathbb{P}_{\mathbf{X}}[\overline{w}^{(i,N)}]$ is rarely associated with a large waiting time. Thanks to Lemma 6, these observations can be turned into a rigorous proof using standard probabilistic techniques. ∎

It should be noted that the proof of Lemma 7 is the only step that works for the mZM estimator, but not for the ZM estimator.

Lemma 8.

Almost surely,

-\sum_{i=1}^{c_{N}-1}\ln\mathbb{P}_{\mathbf{X}}[\overline{w}^{(i,N)}]\leq-\ln\mathbb{P}_{\mathbf{X}}[Y_{1}^{N}]+kc_{N}.

Proof sketch.

Start with the string $Y_{1}^{N}$ and apply the upper decoupling inequality (1) repeatedly. ∎
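Spelled out, the repeated application of (1) to the law $\mathbb{P}_{\mathbf{X}}$ along the parsing of $Y_{1}^{N}$ can be written as follows; the last factor is discarded using $\mathbb{P}_{\mathbf{X}}[\overline{w}^{(c_{N},N)}]\leq 1$, and taking $-\ln$ on both sides yields the claimed bound with $k(c_{N}-1)\leq kc_{N}$.

```latex
\mathbb{P}_{\mathbf{X}}[Y_{1}^{N}]
  =\mathbb{P}_{\mathbf{X}}\big[\overline{w}^{(1,N)}\overline{w}^{(2,N)}\dotsc\overline{w}^{(c_{N},N)}\big]
  \leq\mathrm{e}^{k(c_{N}-1)}\prod_{i=1}^{c_{N}}\mathbb{P}_{\mathbf{X}}[\overline{w}^{(i,N)}]
  \leq\mathrm{e}^{k(c_{N}-1)}\prod_{i=1}^{c_{N}-1}\mathbb{P}_{\mathbf{X}}[\overline{w}^{(i,N)}]
```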

Lemma 9.

For every $\epsilon>0$ ,

-\sum_{i=1}^{c_{N}}\ln\mathbb{P}_{\mathbf{X}}[\underline{w}^{(i,N)}]\leq(1+\epsilon)c_{N}\ln N

(11)

with high probability.

Proof sketch.

Fix $\epsilon\in(0,1)$ and let $\lambda\geq 2$ be arbitrary for the time being. By Lemma 6, it is harmless to restrict our attention to the event that (8)–(9) hold. With the intent of using a union bound over the possible choices of values $c$ for $c_{N}$ , and then the ways of choosing the starting index $L_{i}$ of each $\underline{w}^{(i,N)}$ for $i=1,\dots,c$ , we seek an upper bound on

\operatorname{Prob}\left(\left\{-\sum_{i=1}^{c}\ln\mathbb{P}_{\mathbf{X}}[Y_{L_{i}}^{L_{i+1}-2}]\geq(1+\epsilon)c\ln N\right\}\cap\bigcap_{i=1}^{c}B_{i}\right)

where $B_{i}$ is used as the shorthand

B_{i}:=\left\{W(Y_{L_{i}}^{L_{i+1}-2},\mathbf{X})<N\right\}

with the convention that $L_{1}=1$ and $L_{c+1}=N+1$ .

Now, for any $(t_{i})_{i=1}^{c}\in\mathbb{R}^{c}$ , we can successively use the upper decoupling inequality (1) for the law $\mathbb{P}_{\mathbf{Y}}$ to obtain

\operatorname{Prob}\left(\bigcap_{i=1}^{c}\left\{-\ln\mathbb{P}_{\mathbf{X}}[Y_{L_{i}}^{L_{i+1}-2}]\geq t_{i}\right\}\cap B_{i}\right)\leq\mathrm{e}^{ck}\prod_{i=1}^{c}\operatorname{Prob}\left(\left\{-\ln\mathbb{P}_{\mathbf{X}}[Y_{L_{i}}^{L_{i+1}-2}]\geq t_{i}\right\}\cap B_{i}\right).

(12)

By a union bound and shift-invariance, we also have that

\operatorname{Prob}\left(\left\{-\ln\mathbb{P}_{\mathbf{X}}[Y_{L_{i}}^{L_{i+1}-2}]\geq t_{i}\right\}\cap B_{i}\right)\leq\sum_{\begin{subarray}{c}a\in\mathcal{A}^{L_{i+1}-L_{i}-1},\\ -\ln\mathbb{P}_{\mathbf{X}}[a]\geq t_{i}\end{subarray}}\mathbb{P}_{\mathbf{Y}}[a]\;N\mathbb{P}_{\mathbf{X}}[a]\leq N\mathrm{e}^{-t_{i}}.

This shows that the measure defined by the left-hand side of inequality (12) is bounded above by a product measure on $\mathbb{R}^{c}$ , with each marginal having a well-defined moment-generating function on $(0,1)$ . Using a Chebyshev-like bound for the function $t\mapsto\mathrm{e}^{(1+\epsilon)^{-1/2}t}$ , we have

\operatorname{Prob}\left(\left\{-\sum_{i=1}^{c}\ln\mathbb{P}_{\mathbf{X}}[Y_{L_{i}}^{L_{i+1}-2}]\geq t\right\}\cap\bigcap_{i=1}^{c}B_{i}\right)\leq\mathrm{e}^{ck-(1+\epsilon)^{-\frac{1}{2}}t+c\ln N+c(2-\ln\epsilon)}.

With $t=(1+\epsilon)c\ln N$ , we obtain an upper bound of the form $\exp\left(-\delta_{\epsilon}c\ln N\right)$ for $N$ large enough and some $\delta_{\epsilon}>0$ . But since $c\ln N\geq\kappa^{-1}N$ by (8), this in fact yields a bound of the form $\exp(-\delta^{\prime}_{\epsilon}N)$ for $N$ large enough and some $\delta^{\prime}_{\epsilon}>0$ .

To conclude, we want to use a union bound over the choices of $c$ and then the choices of $(L_{i})_{i=1}^{c}$ . Since the number of choices is bounded by $\lambda^{-1}N$ times

\displaystyle\binom{N}{\lambda^{-1}N}<\mathrm{e}^{\frac{1+\ln\lambda}{\lambda}N}

(13)

for $N$ large enough, thanks to Stirling’s formula, the union bound is indeed successful in showing that (11) occurs with high probability, provided that $\lambda$ is large enough that $1+\ln\lambda<\delta^{\prime}_{\epsilon}\lambda$. ∎
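The combinatorial bound (13) and the final condition on $\lambda$ are easy to check numerically. The sketch below compares $\ln\binom{N}{N/\lambda}$ with the exponent $\frac{1+\ln\lambda}{\lambda}N$ for a few hypothetical values of $N$ and $\lambda$ (chosen so that $\lambda$ divides $N$), and searches for the smallest integer $\lambda\geq 2$ satisfying $1+\ln\lambda<\delta\lambda$ for an illustrative value of $\delta$; the function names are ours.

```python
import math


def log_binomial(n, m):
    """Natural logarithm of the binomial coefficient C(n, m)."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)


def stirling_exponent(n, lam):
    """Exponent (1 + ln(lam)) / lam * n appearing on the right-hand side of (13)."""
    return (1 + math.log(lam)) / lam * n


def smallest_lambda(delta):
    """Smallest integer lam >= 2 with 1 + ln(lam) < delta * lam."""
    lam = 2
    while 1 + math.log(lam) >= delta * lam:
        lam += 1
    return lam


for n, lam in [(1000, 10), (10_000, 20), (100_000, 50)]:
    print(n, lam, log_binomial(n, n // lam), stirling_exponent(n, lam))

print(smallest_lambda(0.5))
```

Since $(1+\ln\lambda)/\lambda$ is decreasing for $\lambda>1$, any $\lambda$ larger than the value returned by the search also satisfies the condition.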

Lemma 10.

For every $\epsilon>0$ ,

(N-c_{N})(D_{-}\bar{q}(0)-\epsilon)\leq-\sum_{i=1}^{c_{N}}\ln\mathbb{P}_{\mathbf{X}}[\underline{w}^{(i,N)}]

with high probability.

Proof sketch.

Let $\epsilon>0$ and $\lambda\geq 2$ be arbitrary. By Lemma 6, we can assume the length bounds (8)–(9). We use the notation from Lemma 9. Note that, by a union bound over the possible choices for $c$ and $(L_{i})_{i=1}^{c}$ under the constraint that $L_{i+1}-L_{i}-1\geq\lambda$ , it suffices to provide an exponential upper bound, made uniform in the choice of $c$ , on the events

\operatorname{Prob}\left\{-\sum_{i=1}^{c}\ln\mathbb{P}_{\mathbf{X}}[Y_{L_{i}}^{L_{i+1}-2}]\leq(N-c)(D_{-}\bar{q}(0)-\epsilon)\right\}.

Note that the moment-generating function of the logarithm of the marginal on the right-hand side of (12) is precisely $\mathrm{e}^{q_{L_{i+1}-L_{i}-1}(\alpha)}$ ; cf. (5). Hence, using (1) and a Chebyshev-like inequality for the function $t\mapsto\mathrm{e}^{-\alpha t}$ with $\alpha<0$ ,

\operatorname{Prob}\left\{-\sum_{i=1}^{c}\ln\mathbb{P}_{\mathbf{X}}[Y_{L_{i}}^{L_{i+1}-2}]\leq(N-c)(D_{-}\bar{q}(0)-\epsilon)\right\}\leq\mathrm{e}^{ck-\alpha(N-c)(D_{-}\bar{q}(0)-\epsilon)+\sum_{i=1}^{c}q_{L_{i+1}-L_{i}-1}(\alpha)}.

By convexity of the pressure $\bar{q}$ , we can take $\alpha<0$ small enough that $\bar{q}(\alpha)<\alpha(D_{-}\bar{q}(0)-\tfrac{\epsilon}{4})$ ; see Figure 1. Now, pick $\lambda$ large enough that (4) and (9) guarantee

\frac{q_{L_{i+1}-L_{i}-1}(\alpha)}{L_{i+1}-L_{i}-1}<\bar{q}(\alpha)-\frac{\alpha\epsilon}{4}

for each $i=1,\dots,c$. Recalling that $\sum_{i=1}^{c}(L_{i+1}-L_{i}-1)=N-c$ and that (9) provides the bound $c\leq\lambda^{-1}N$, we can conclude as in Lemma 9 using (13) with $\lambda$ large enough. ∎

Let $\epsilon\in(0,1)$ be arbitrary. Combining the last conclusion of Lemma 6 with $\lambda>(k+1)\epsilon^{-1}$ and the BC lemma, we deduce that both $c_{N}$ and $kc_{N}$ are almost surely eventually bounded by $N\epsilon$ . Hence, by Lemmas 3, 7 and 8 and the BC lemma,

\limsup_{N\to\infty}\frac{c_{N}\ln N}{N-c_{N}}\leq\left(\frac{1}{1-\epsilon}\right)^{2}\left(h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})+\epsilon\right),

almost surely. By Lemmas 4, 9 and 10 and the BC lemma,

\liminf_{N\to\infty}\frac{c_{N}\ln N}{N-c_{N}}\geq\frac{1}{1+\epsilon}(h^{\textnormal{c}}(\mathbb{P}_{\mathbf{Y}}|\mathbb{P}_{\mathbf{X}})-\epsilon),

almost surely. To deduce Theorem 2, take $\epsilon\to 0$ along some sequence and use countable additivity of probability.

4 Numerical experiments

In Figure 2, we compute the longest-match length estimator, the original ZM estimator and the mZM estimator for some pair of HMPs on $\mathcal{A}=\{0,1\}$ . Then, to estimate the root-mean-square error (RMSE), we compare the estimator to (3) at $N=2^{20}$ — which is only possible in this case as we know the process.

While our proofs provide no rate of convergence, these numerical experiments — and others [BGPR] — suggest a relatively rapid polynomial convergence. Importantly, we see a significant performance advantage in the ZM-type estimators compared to the more routine longest-match length estimator.
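Readers wishing to run experiments of this kind need samples from a pair of HMPs. The following is a minimal sampler sketch based on the representation of Section 1.3; the function name and the parameters shown are hypothetical and are not those used for Figure 2. Hidden states and letters are encoded as integers, and the initial hidden state is drawn from $\pi$.

```python
import random


def sample_hmp(pi, P, R, n, rng):
    """Sample the first n letters of the HMP with hidden chain (pi, P) and emissions R."""
    states = range(len(pi))
    letters = range(len(R[0]))
    s = rng.choices(states, weights=pi)[0]  # initial hidden state
    out = []
    for _ in range(n):
        out.append(rng.choices(letters, weights=R[s])[0])  # emit a letter from state s
        s = rng.choices(states, weights=P[s])[0]           # move the hidden chain
    return out


rng = random.Random(0)  # fixed seed, for reproducibility of the sketch only
pi = [2 / 3, 1 / 3]     # stationary distribution of P below
P = [[0.9, 0.1], [0.2, 0.8]]
R = [[0.8, 0.2], [0.3, 0.7]]
y = sample_hmp(pi, P, R, 20, rng)
print("".join(map(str, y)))
```

Joining the sampled letters into a string, as in the last line, gives an input suitable for string-based implementations of the parsing; for sample sizes such as $N=2^{20}$, a more efficient substring search than a naive scan would be needed.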

5 Conclusion

We have proved strong consistency of a slight modification of the ZM estimator of cross entropy rates between HMPs. We emphasize that while our numerical experiments suggest little to no impact on concrete performance on HMPs, the modification of the ZM estimator allows for proofs at a level of generality that was previously inaccessible [ZM93, BGPR24].

Acknowledgements

The work of NB and RR was partially funded by the Fonds de recherche du Québec — Nature et technologies (FRQNT) and by the Natural Sciences and Engineering Research Council of Canada (NSERC). Part of the work of RG was partially funded by the Rubin Gruber Science Undergraduate Research Award and Axel W Hundemer. The work of GP was supported by the CY Initiative of Excellence through the grant Investissements d’Avenir ANR-16-IDEX-0008, and was done under the auspices of the Gruppo Nazionale di Fisica Matematica (GNFM) section of the Istituto Nazionale di Alta Matematica (INdAM) while GP was a post-doctoral researcher at University of Milano-Bicocca (Milan, Italy).

References

[AACG23] M. Abadi, V. G. de Amorim, J.-R. Chazottes, and S. Gallo, “Return-time ${L}^{q}$ -spectrum for equilibrium states with potentials of summable variation,” Ergodic Theor. Dyn. Syst., vol. 43, no. 8, pp. 2489–2515, 2023.
[B+08] C. Basile, D. Benedetto, E. Caglioti, and M. Degli Esposti, “An example of mathematical authorship attribution,” J. Math. Phys., vol. 49, no. 12, 2008.
[BCL02] D. Benedetto, E. Caglioti, and V. Loreto, “Language trees and zipping,” Phys. Rev. Lett., vol. 88, p. 048702, 2002.
[BCJP21] T. Benoist, N. Cuneo, V. Jakšić, and C.-A. Pillet, “On entropy production of repeated quantum measurements II. Examples,” J. Stat. Phys., vol. 182, no. 3, pp. 1–71, 2021.
[BJPP18] T. Benoist, V. Jakšić, Y. Pautrat, and C.-A. Pillet, “On entropy production of repeated quantum measurements I. General theory,” Commun. Math. Phys., vol. 357, no. 1, pp. 77–123, 2018.
[BGPR24] N. Barnfield, R. Grondin, G. Pozzoli, and R. Raquépas, “On the Ziv–Merhav theorem beyond Markovianity I,” Can. J. Math., Online “FirstView”, pp. 1–25, 2024.
[BGPR] N. Barnfield, R. Grondin, G. Pozzoli, and R. Raquépas, “On the Ziv–Merhav theorem beyond Markovianity II: Leveraging the thermodynamic formalism,” preprint, arXiv:2312.02098v2, 2024.
[Bi63] J. J. Birch, “Approximations for the entropy for functions of Markov chains,” Ann. Math. Stat., vol. 33, no. 3, pp. 930–938, 1962.
[Bl57] D. Blackwell, “The entropy of functions of finite-state Markov chains,” Trans. First Prague Conf. Inf. Theory, pp. 13–20, 1957.
[Bo74] R. Bowen, “Some systems with unique equilibrium states,” Math. Syst. Theor., vol. 8, no. 3, pp. 193–202, 1974.
[BP66] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state Markov chains,” Ann. Math. Stat., vol. 37, no. 6, pp. 1554–1563, 1966.
[Br05] R. C. Bradley, “Basic properties of strong mixing conditions. A survey and some open questions,” Probab. Surv., vol. 2, pp. 107–144, 2005.
[C+23] G. Cristadoro, M. Degli Esposti, V. Jakšić, and R. Raquépas, “On a waiting-time result of Kontoyiannis: mixing or decoupling?,” Stoch. Proc. Appl., vol. 166, p. 104222, 2023.
[CF05] D. P. Coutinho and M. A. Figueiredo, “Information theoretic text classification using the Ziv–Merhav method,” in Pattern Recognition and Image Analysis (J. S. Marques, N. Pérez de la Blanca, and P. Pina, eds.), vol. 3523 of Lecture Notes in Computer Science, pp. 355–362, Berlin: Springer, 2005.
[CFF10] D. P. Coutinho, A. L. Fred, and M. A. Figueiredo, “One-lead ECG-based personal identification using Ziv–Merhav cross parsing,” in 20th International Conference on Pattern Recognition, pp. 3858–3861, IEEE, 2010.
[CFH08] Y. Cao, D. Feng, and W. Huang, “The thermodynamic formalism for sub-additive potentials,” Discrete Contin. Dyn. Syst., vol. 20, no. 3, p. 639, 2008.
[CJPS19] N. Cuneo, V. Jakšić, C.-A. Pillet, and A. Shirikyan, “Large deviations and fluctuation theorem for selectively decoupled measures on shift spaces,” Rev. Math. Phys., vol. 31, no. 10, p. 1950036, 2019.
[CR24] N. Cuneo and R. Raquépas, “Large deviations of return times and related entropy estimators on shift spaces,” Commun. Math. Phys., vol. 405, art. 135, 2024.
[EM02] Y. Ephraim and N. Merhav, “Hidden Markov processes,” IEEE Trans. Inf. Theory, vol. 48, no. 6, pp. 1518–1569, 2002.
[Gib] J. W. Gibbs, Elementary principles in statistical mechanics: developed with especial reference to the rational foundations of thermodynamics. C. Scribner’s sons, 1902.
[HM06] G. Han and B. Marcus, “Analyticity of entropy rate in families of hidden Markov chains II,” 2006 IEEE Int. Symp. Inf. Theory, pp. 103–107, 2006.
[JOP17] A. Johansson, A. Öberg, and M. Pollicott, “Ergodic theory of Kusuoka measures,” J. Fractal Geom., vol. 4, no. 2, pp. 185–214, 2017.
[JSS04] P. Jacquet, G. Seroussi, and W. Szpankowski, “On the entropy of a hidden Markov process,” 2004 IEEE Int. Symp. Inf. Theory, p. 10, 2004.
[Ki68] J. F. Kingman, “The ergodic theory of subadditive stochastic processes,” J. Roy. Statist. Soc. B, vol. 30, no. 3, pp. 499–510, 1968.
[Ko98] I. Kontoyiannis, “Asymptotic recurrence and waiting times for stationary processes,” J. Theor. Probab., vol. 11, no. 3, pp. 795–811, 1998.
[KPK01] O. V. Kukushkina, A. A. Polikarpov, and D. V. Khmelev, “Using literal and grammatical statistics for authorship attribution,” Probl. Inf. Transm., vol. 37, pp. 172–184, 2001.
[Le92] B. G. Leroux, “Maximum-likelihood estimation for hidden Markov models,” Stoch. Proc. Appl., vol. 40, no. 1, pp. 127–143, 1992.
[L+19] M. Lippi, M. A. Montemurro, M. Degli Esposti, and G. Cristadoro, “Natural Language Statistical Features of LSTM-Generated Texts,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3326–3337, 2019.
[MS95] K. Marton and P. C. Shields, “Almost-sure waiting time results for weak and very weak Bernoulli processes,” Ergodic Theor. Dyn. Syst., vol. 15, no. 5, pp. 951–960, 1995.
[OW93] D. S. Ornstein and B. Weiss, “Entropy and data compression schemes,” IEEE Trans. Inf. Theory, vol. 39, no. 1, pp. 78–83, 1993.
[Pe69] T. Petrie, “Probabilistic functions of finite state Markov chains,” Ann. Math. Stat., vol. 40, no. 1, pp. 97–115, 1969.
[Pf02] C.-É. Pfister, “Thermodynamical aspects of classical lattice systems,” in In and Out of Equilibrium: Probability with a Physics Flavor (V. Sidoravicius, ed.), vol. 51 of Prog. Probab., pp. 393–472, Birkhäuser, 2002.
[R+22] S. Ro, B. Guo, A. Shih, T. V. Phan, R. H. Austin, D. Levine, P. M. Chaikin, and S. Martiniani, “Model-free measurement of local entropy production and extractable work in active matter,” Phys. Rev. Lett., vol. 129, no. 22, p. 220601, 2022.
[RP12] É. Roldán and J. M. R. Parrondo, “Entropy production and Kullback–Leibler divergence between stationary trajectories of discrete systems,” Phys. Rev. E, vol. 85, p. 031129, 2012.
[Ru73] D. Ruelle, “Statistical mechanics on a compact set with $\mathbf{Z}^{\nu}$ action satisfying expansiveness and specification,” Trans. Amer. Math. Soc., vol. 185, pp. 237–251, 1973.
[Sh93] P. C. Shields, “Waiting times: positive and negative results on the Wyner–Ziv problem,” J. Theor. Probab., vol. 6, no. 3, pp. 499–519, 1993.
[Sim] B. Simon, The Statistical Mechanics of Lattice Gases, vol. 1. Princeton University Press, 1993.
[Wa75] P. Walters, “A variational principle for the pressure of continuous transformations,” Amer. J. Math., vol. 97, no. 4, pp. 937–971, 1975.
[WZ89] A. D. Wyner and J. Ziv, “Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression,” IEEE Trans. Inf. Theory, vol. 35, no. 6, pp. 1250–1258, 1989.
[ZDKA06] O. Zuk, E. Domany, I. Kanter, and M. Aizenman, “Taylor series expansions for the entropy rate of hidden Markov processes,” 2006 IEEE Int. Conf. Commun., pp. 1598–1604, 2006.
[ZL78] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Trans. Inf. Theory, vol. 24, no. 5, pp. 530–536, 1978.
[ZM93] J. Ziv and N. Merhav, “A measure of relative entropy between individual sequences with application to universal classification,” IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1270–1279, 1993.