Ziv–Merhav Estimation for Hidden-Markov Processes
Abstract
We present a proof of strong consistency of a Ziv–Merhav-type estimator of the cross entropy rate for pairs of hidden-Markov processes. Our proof strategy has two novel aspects: the focus on decoupling properties of the laws and the use of tools from the thermodynamic formalism.
Presented at the IEEE International Symposium on Information Theory 2024, in Athens, Greece.
1 Introduction
We are interested in Ziv–Merhav-type estimators [ZM93] of the cross entropy rate for pairs of sources — which is a sum of a Kullback–Leibler (KL) divergence rate and an entropy rate. While such estimators are widely used in practice (see e.g. [KPK01, BCL02, CF05, B+08, CFF10, RP12, L+19, R+22]), theoretical works on the subject are scarce. The goal of this proceedings paper is to concretely present the consequences of recent findings [BCJP21, C+23, CR24, BGPR24, BGPR] for hidden-Markov processes in an accessible way.
Throughout, is a finite alphabet and is the space of one-sided -valued sequences, denoted by bold lower-case letters, e.g. . We use for the string (i.e. finite concatenation of elements of ) of length starting at index in such a sequence . We also use . We use bold upper-case letters for -valued processes, e.g. . We will only consider processes that are stationary, i.e. processes whose law (a measure on ) is invariant under the shift .
1.1 Entropies
The Shannon entropy rate of a stationary process — or equivalently of its law — is the limit
There are at least two important related quantities for pairs of stationary processes: the cross entropy rate and the KL divergence rate (a.k.a. relative entropy rate):
whenever the limit exists in , and then the KL divergence rate is the cross entropy rate minus the entropy rate. As we shall see in Section 2.1, this will be the case for the pairs of measures we are interested in, but it should be noted that this is not a general fact about pairs of ergodic processes. Let us emphasize that the KL divergence of with respect to itself (not the rate) is well defined but too coarse to be useful in many tasks. Indeed, if both measures are ergodic, then this quantity is either zero or infinite.
The KL divergence rate plays a fundamental role in many tasks in information theory and statistics: binary hypothesis testing, universal classification, etc. Because the KL divergence rate differs from the cross entropy rate by an entropy rate which can be universally estimated following the Lempel–Ziv parsing algorithm [ZL78], we will focus on estimation of the cross entropy rate.
1.2 Ziv–Merhav-type estimators
Inspired by the dictionary-based approach of the Lempel–Ziv algorithm for estimating the entropy rate, researchers have sought to develop similar universal estimators that generalize to multiple sources to measure the cross entropy rate. Introduced in [ZM93], the Ziv–Merhav (ZM) estimator is one such cross entropy rate estimator, whose consistency was established in the case where and are stationary multi-level Markov processes; we will denote it by . We consider a modification of , introduced recently in [BGPR], which we refer to as the modified Ziv–Merhav (mZM) estimator.
Definition 1.
For realizations of , the mZM parsing of with respect to begins by determining the shortest prefix of that does not appear in . Then, is the shortest prefix of the unparsed part of that does not appear in , and so on until we reach the end of . The parsing length is the number of words in the parsing
see Algorithm 1. The mZM estimator is
In essence, the original ZM algorithm parses according to the longest words found in whereas the mZM algorithm parses according to the shortest words not found: the difference between the ZM and mZM parsing lengths boils down to the choice of imposing or not the condition “if ” for executing Line 7 in Algorithm 1. This difference is compensated by the choice of not subtracting or subtracting at the denominator in the definition of the estimator. As seen in Section 4, these subtleties yield no observable difference in performance.
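The parsing of Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, not a transcription of Algorithm 1: the names `mzm_parse` and `mzm_estimate` are hypothetical, sequences are represented as Python strings so that `in` tests for substrings, and the normalization (number of words times ln N over N) is an assumption modelled on the usual Ziv–Merhav form.

```python
import math


def mzm_parse(y, x):
    """Parse y into successive shortest prefixes that do not occur in x.

    The last word may still occur in x if the end of y is reached first.
    """
    words = []
    start = 0
    n = len(y)
    while start < n:
        end = start + 1
        # Extend the candidate word while it still occurs somewhere in x.
        while end < n and y[start:end] in x:
            end += 1
        words.append(y[start:end])
        start = end
    return words


def mzm_estimate(y, x):
    """Assumed normalization: (number of words) * ln(N) / N with N = len(y)."""
    n = len(y)
    if n == 0:
        raise ValueError("y must be non-empty")
    return len(mzm_parse(y, x)) * math.log(n) / n


# Toy example: "a" and "ab" occur in "aab", but "aba" does not.
words = mzm_parse("abab", "aab")
```

The substring test makes this sketch quadratic or worse in the sequence length; a suffix-based index on x would be the natural choice for long sequences.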
In a way, ZM-type estimators make repeated use of the notion of longest-match length
for which, under suitable assumptions on and , we have
almost surely; see e.g. [WZ89, OW93, Ko98, C+23]. However, these repeated uses of longest-match lengths are not independent, making it delicate to try to deduce convergence of ZM-type estimators from convergence of longest-match length estimators.
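For comparison, a single longest-match length can be computed as follows. The sketch assumes that the longest-match length is the length of the longest prefix of y that occurs somewhere in x (conventions differ slightly between references), and that the associated estimator is ln N divided by that length; the function names are hypothetical.

```python
import math


def longest_match_length(y, x):
    """Length of the longest prefix of y that occurs as a substring of x."""
    ell = 0
    while ell < len(y) and y[: ell + 1] in x:
        ell += 1
    return ell


def longest_match_estimate(y, x):
    """Assumed estimator: ln(N) / (longest-match length), with N = len(x)."""
    ell = longest_match_length(y, x)
    if ell == 0:
        return math.inf  # the first letter of y never occurs in x
    return math.log(len(x)) / ell


match = longest_match_length("abab", "aab")
```

A ZM-type parsing applies this kind of search repeatedly, each time restarting from the unparsed part of y, which is why the successive matches are not independent.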
1.3 Hidden-Markov processes
An -valued process is called a hidden-Markov process (HMP) if all the marginals of its law can be represented as
for some fixed Markov chain on some state space — this is the “hidden” chain — and some fixed -by- stochastic matrix . Such processes are known under different names, including probabilistic functions of Markov processes, and have gained immense popularity in statistical applications since the seminal papers [BP66, Pe69]; also see the review [EM02].
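The marginals of an HMP can be evaluated by the standard forward recursion, which is a product of nonnegative matrices applied to the initial distribution. The sketch below assumes that each letter is emitted from the current hidden state according to a row of the emission matrix; the names `string_probability`, `pi`, `P` and `R`, as well as the numerical values, are illustrative choices.

```python
def string_probability(word, pi, P, R):
    """Probability that the observed process starts with `word`.

    pi: stationary distribution of the hidden chain
    P:  transition matrix of the hidden chain, P[r][s]
    R:  emission matrix, R[s][a] = probability of letter a in hidden state s
    """
    n_states = len(pi)
    if len(word) == 0:
        return 1.0
    # alpha[s] = probability of the letters read so far, jointly with
    # the hidden chain currently being in state s.
    alpha = [pi[s] * R[s][word[0]] for s in range(n_states)]
    for a in word[1:]:
        alpha = [
            sum(alpha[r] * P[r][s] for r in range(n_states)) * R[s][a]
            for s in range(n_states)
        ]
    return sum(alpha)


# Illustrative two-state example on the alphabet {0, 1}.
pi = [0.5, 0.5]                  # stationary for the symmetric P below
P = [[0.9, 0.1], [0.1, 0.9]]
R = [[0.8, 0.2], [0.3, 0.7]]

# The probabilities of all strings of length 3 should sum to one.
total = sum(
    string_probability((a, b, c), pi, P, R)
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
)
```

For long strings the entries of `alpha` underflow, so a practical implementation would renormalize at each step and accumulate the logarithms of the normalizing constants.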
Throughout this paper, we will only consider -valued HMPs that can be represented with the following constraints:
i. the hidden state space is finite;
ii. the hidden chain is stationary and irreducible;
iii. for each , there exists such that is positive for more than one string of length .
Condition ii implies ergodicity and Condition iii is essentially the minimal condition for the process not to be eventually deterministic. As far as the class of measures on that can be obtained is concerned, it turns out that there is no loss of generality in considering only deterministic functions of (possibly larger) hidden chains. The Shannon entropy rate of such HMPs has been the subject of many works; see e.g. [Bl57, Bi63, JSS04, ZDKA06, HM06].
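A realization of such an HMP can be simulated directly from a hidden-Markov representation. The following sketch assumes a finite hidden state space, an initial hidden state drawn from a supplied stationary distribution, and letters emitted from the current hidden state; the function name, parameter names and numerical values are hypothetical.

```python
import random


def sample_hmp(n, pi, P, R, seed=0):
    """Draw the first n letters of an HMP.

    pi: stationary distribution of the hidden chain (used for the first state)
    P:  transition matrix of the hidden chain
    R:  emission matrix, R[s][a] = probability of letter a in hidden state s
    """
    rng = random.Random(seed)
    states = range(len(pi))
    letters = range(len(R[0]))
    state = rng.choices(states, weights=pi)[0]
    output = []
    for _ in range(n):
        # Emit a letter from the current hidden state, then move the chain.
        output.append(rng.choices(letters, weights=R[state])[0])
        state = rng.choices(states, weights=P[state])[0]
    return output


# Illustrative parameters: an irreducible two-state chain with noisy emissions,
# so that more than one string of each length has positive probability.
pi = [0.5, 0.5]
P = [[0.9, 0.1], [0.1, 0.9]]
R = [[0.8, 0.2], [0.3, 0.7]]
sample = sample_hmp(1000, pi, P, R, seed=1)
```

Independent realizations of two HMPs are obtained by calling the function with two parameter sets and two different seeds.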
1.4 Main result
Our main result is the following theorem on the strong consistency of the mZM estimator of the cross entropy rate from Definition 1. To our knowledge, no analogue of this theorem is available for the ZM estimator.
Theorem 2.
Suppose that and are independent HMPs with respective laws and . If they both satisfy Conditions i–iii, then
almost surely.
Note that Theorem 2 provides an estimation of the cross entropy rate between the HMPs and , not the underlying (hidden) Markov processes. In Section 2, we discuss some important preliminaries for our presentation of the proof in Section 3. The result is illustrated using numerical experiments in Section 4.
2 Decoupling properties of hidden-Markov processes
In this section, we identify the key properties of HMPs that we will use for the proof of Theorem 2. While these properties are natural from the point of view of the statistical mechanics of lattice gases, we suspect that they might not be familiar to the information theory community.
2.1 Decoupling inequalities and their first consequence
It is shown in [BCJP21] that if is the law of an HMP satisfying Conditions i–ii, then there exist two natural numbers and with the following two properties:
• for all strings and ,
(1)
• for all strings and , there exists a string with length such that
(2)
These properties and generalizations thereof are known as decoupling properties and have been instrumental in recent progress in large-deviation theory and its connection to statistical mechanics and information theory; see [Pf02, CJPS19, CR24]. (A particularly useful generalization consists in allowing and to grow sublinearly with the length of the string . However, we will not consider this generalization as it is unnecessary for HMPs and complicates some arguments. Still, as stated here, the inequalities (1)–(2) are weaker than the so-called “quasi-Bernoulli” property and imply no form of mixing.) A particularly convenient tool in their derivation in the context of HMPs is the so-called “positive-matrix product” representation; see e.g. [BCJP21]. The following is a straightforward consequence of Kingman’s subadditive ergodic theorem [Ki68].
Lemma 3.
2.2 The thermodynamic formalism
In our proof, a key role is played by the following pressure:
(4)
where is a real variable and
(5)
The function is a (possibly improper) monotone, convex function satisfying and , the latter being a consequence of Condition iii. While it is not a pressure in the sense of the usual (additive) thermodynamic formalism, the bound (1) provides the requirement for the subadditive thermodynamic formalism of e.g. [CFH08]. As such, the pressure still satisfies the variational principle
(6)
for all , where the supremum is taken over all shift-invariant laws and
The role of such variational principles in the study of entropic quantities — which can be traced back to ideas of Gibbs [Gib] — has become ubiquitous in the mathematical literature on statistical mechanics, dynamical systems, and large deviations building on seminal works of Ruelle, Bowen and Walters, e.g. [Ru73, Bo74, Wa75]. While we have no guarantee that is differentiable at the origin, the variational principle (6) allows us to identify the left derivative; see Figure 1.
Lemma 4.
If and satisfy (1) and are ergodic, then
The proof of this nontrivial fact relies on a uniqueness argument for the law maximizing (6) in that can be traced back to ideas of Ruelle, and which is explained e.g. in [BJPP18]; also see [Sim]. For actual Markov chains (i.e. when is a permutation matrix), the left derivative does coincide with the right derivative as a consequence of the Perron–Frobenius theorem and standard perturbation theory arguments.
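The content of Lemma 4 can be checked by hand for memoryless (i.i.d.) laws, the simplest special case. The sketch below assumes the sign convention in which the finite-length quantity in (5) is the sum, over strings a of positive Q-probability, of Q[a] times P[a] raised to the power minus alpha; under that assumed convention the pressure of a pair of i.i.d. laws reduces to a single-letter formula. All names and numerical values are illustrative.

```python
import math


def pressure_iid(alpha, q, p):
    """Pressure for i.i.d. laws with letter probabilities q and p.

    Assumed convention: ln of the sum over letters a of q[a] * p[a] ** (-alpha).
    """
    return math.log(sum(qa * pa ** (-alpha) for qa, pa in zip(q, p) if qa > 0))


def cross_entropy_iid(q, p):
    """Cross entropy rate of i.i.d. laws: minus the sum of q[a] * ln p[a]."""
    return -sum(qa * math.log(pa) for qa, pa in zip(q, p) if qa > 0)


p = [0.7, 0.3]
q = [0.8, 0.2]

# One-sided (left) finite difference of the pressure at the origin.
h = 1e-6
left_derivative = (pressure_iid(0.0, q, p) - pressure_iid(-h, q, p)) / h
reference = cross_entropy_iid(q, p)
```

In this memoryless case the pressure is differentiable, so the left and right derivatives at the origin agree; the value at minus one is the logarithm of the single-letter coincidence probability, which is negative as soon as neither law is a point mass on a common letter.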
2.3 Waiting-time estimate
The number of words in ZM-type parsings is intimately related to the so-called Wyner–Ziv problem on waiting times introduced in [WZ89] and notably studied in [Sh93, MS95, Ko98]. The key quantity in this problem is the time
we need to wait to see the string appear in a sequence . Waiting times are dual to the longest-match lengths from Section 1.2 [WZ89], but typically allow for more direct bounds. It was recently shown in [C+23] that, for every string and natural number ,
(7)
provided that the law of under the underlying measure satisfies (2). This progress was made by revisiting the ideas of [Ko98] from the perspective of decoupling properties. In the recent works [AACG23, CR24], a detailed analysis of the pressure plays a crucial role in the study of large deviations of waiting times.
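On a finite sample, the waiting time can be computed with a plain substring search. The sketch assumes 1-based indexing of positions and represents sequences as Python strings; the function name is hypothetical, and a string that never occurs in the available sample is reported as an infinite waiting time.

```python
import math


def waiting_time(a, x):
    """First (1-based) position at which the string a occurs in x.

    Returns math.inf if a does not occur in the available portion of x.
    """
    index = x.find(a)  # str.find returns -1 when there is no occurrence
    return index + 1 if index >= 0 else math.inf


first = waiting_time("ab", "aab")
```

The duality with longest-match lengths is visible here: a prefix of y of a given length is matched somewhere in the sample of x exactly when its waiting time in that sample is finite.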
3 Proof strategy
We are now in a position to present the proof strategy for Theorem 2. Still on , the same exact strategy applies to irreducible Kusuoka processes [JOP17, BCJP21] and to -mixing processes that satisfy in the notation of [Br05], as soon as one can check the nondegeneracy condition . Further generalizations leveraging relaxations of the decoupling conditions (1)–(2) are discussed in our more technical paper [BGPR].
Throughout this section, the processes and are fixed processes with hidden-Markov representations satisfying Conditions i–iii from Section 1.3 and have respective laws and . We will consider them as defined on a common underlying probability space in such a way that they are independent and use “” for the probability measure on this space. For each , the number of words, and the words , , , are functions of and and can therefore be defined through composition as random variables on that same underlying probability space.
Definition 5.
We use for the (possibly empty) string obtained by taking in Definition 1 without its last letter.
We use “with high probability” to describe sequences of events with the property that as . In particular, the Borel–Cantelli (BC) lemma can be applied to the complement of finitely many such events to deduce almost sure statements.
First, one can show that if there is a string such that but , then ergodicity implies that the theorem holds true in the sense that both sides must be infinite. Hence, we will assume from now on that this does not happen. For simplicity, we will present the proof in the case . In the complementary infinite case, some of the bounds become vacuously true and the others are proved by superficially adapting the same sequence of arguments as in the finite case.
Lemma 6.
There exists a constant such that
(8)
with high probability, and for every ,
(9)
— and thus — with high probability.
Proof sketch.
For every , by stationarity and a union bound (over the locations at which can appear in ), we have
(10)
where is defined in (5) and by (4). Also recall from Section 2.2 that as a consequence of Condition iii. In particular, taking with large enough, the right-hand side of (10) can be made to decay as fast as any inverse power of . By stationarity, the same is true with replaced with . Hence, by a union bound (over ), we find that, with high probability, no substring of of length appears in .
On the other hand, the bound (7) gives an exponentially decaying bound on the probability that some does not appear in . But that bound can be made uniform in the choice of of fixed length with . Hence, by a union bound (over the choices of ), we find that, with high probability, all possible strings of length appear somewhere in . ∎
Lemma 7.
For every ,
with high probability.
Proof sketch.
The two crucial observations are the following:
• If a sum of nonnegative terms is smaller than , then at least one term must be smaller than ;
• If is a word in a parsing of that uses words not found in , then .
The inequality (7) from Section 2.3 provides a tension between those two observations: an index with small is rarely associated with a large waiting time. Thanks to Lemma 6, these observations can be turned into a rigorous proof using standard probabilistic techniques. ∎
It should be noted that the proof of Lemma 7 is the only step that works for the mZM estimator, but not for the ZM estimator.
Lemma 8.
Almost surely,
Proof sketch.
Start with the string and apply the upper decoupling inequality (1) repeatedly. ∎
Lemma 9.
For every ,
(11)
with high probability.
Proof sketch.
Fix and let be arbitrary for the time being. By Lemma 6, it is harmless to restrict our attention to the event that (8)–(9) hold. With the intent of using a union bound over the possible choices of values for , and then the ways of choosing the starting index of each for , we seek an upper bound on
where is used as the shorthand
with the convention that and .
Now, for any , we can successively use the upper decoupling inequality (1) for the law to obtain
(12)
By a union bound and shift-invariance, we also have that
This shows that the measure defined by the left-hand side of inequality (12) is bounded above by a product measure on , with each marginal having a well-defined moment-generating function on . Using a Chebyshev-like bound for the function , we have
With , we obtain an upper bound of the form for large enough and some . But since by (8), this in fact yields a bound of the form for large enough and some .
To conclude, we want to use a union bound over the choices of and then the choices of . Since the number of choices is bounded by times
(13)
for large enough (thanks to Stirling’s formula), the union bound is indeed successful in showing that (11) occurs with high probability, provided that is large enough that . ∎
Lemma 10.
For every ,
with high probability.
Proof sketch.
Let and be arbitrary. By Lemma 6, we can assume the length bounds (8)–(9). We use the notation from Lemma 9. Note that, by a union bound over the possible choices for and under the constraint that , it suffices to provide an exponential upper bound, made uniform in the choice of , on the events
Note that the moment-generating function of the logarithm of the marginal on the right-hand side of (12) is precisely ; cf. (5). Hence, using (1) and a Chebyshev-like inequality for the function with ,
By convexity of the pressure , we can take small enough that ; see Figure 1. Now, pick large enough that (4) and (9) guarantee
for each . Recalling that and that (9) provides the bound , we can conclude as in Lemma 9 using (13) with large enough. ∎
Let be arbitrary. Combining the last conclusion of Lemma 6 with and the BC lemma, we deduce that both and are almost surely eventually bounded by . Hence, by Lemmas 3, 7 and 8 and the BC lemma,
almost surely. By Lemmas 4, 9 and 10 and the BC lemma,
almost surely. To deduce Theorem 2, take along some sequence and use countable additivity of probability.
4 Numerical experiments
In Figure 2, we compute the longest-match length estimator, the original ZM estimator and the mZM estimator for some pair of HMPs on . Then, to estimate the root-mean-square error (RMSE), we compare the estimator to (3) at — which is only possible in this case as we know the process.
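The error measure used in such experiments is straightforward to compute once estimator values from repeated independent runs and a reference value are available. The helper below is a generic sketch; the name `rmse` and the toy numbers are illustrative and are not taken from the experiments reported here.

```python
import math


def rmse(estimates, reference):
    """Root-mean-square error of a list of estimates against one reference value."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    mean_square = sum((e - reference) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mean_square)


# Toy numbers only: two estimates at distance one from the reference.
error = rmse([1.0, 3.0], 2.0)
```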
While our proofs provide no rate of convergence, these numerical experiments — and others [BGPR] — suggest a relatively rapid polynomial convergence. Importantly, we see a significant performance advantage in the ZM-type estimators compared to the more routine longest-match length estimator.
5 Conclusion
We have proved strong consistency of a slight modification of the ZM estimator of cross entropy rates between HMPs. We emphasize that while our numerical experiments suggest little to no impact on concrete performance on HMPs, the modification of the ZM estimator allows for proofs at a level of generality that was previously inaccessible [ZM93, BGPR24].
Acknowledgements
The work of NB and RR was partially funded by the Fonds de recherche du Québec — Nature et technologies (FRQNT) and by the Natural Sciences and Engineering Research Council of Canada (NSERC). Part of the work of RG was partially funded by the Rubin Gruber Science Undergraduate Research Award and Axel W Hundemer. The work of GP was supported by the CY Initiative of Excellence through the grant Investissements d’Avenir ANR-16-IDEX-0008, and was done under the auspices of the Gruppo Nazionale di Fisica Matematica (GNFM) section of the Istituto Nazionale di Alta Matematica (INdAM) while GP was a post-doctoral researcher at University of Milano-Bicocca (Milan, Italy).
References
- [AACG23] M. Abadi, V. G. de Amorim, J.-R. Chazottes, and S. Gallo, “Return-time $L^q$-spectrum for equilibrium states with potentials of summable variation,” Ergodic Theor. Dyn. Syst., vol. 43, no. 8, pp. 2489–2515, 2023.
- [B+08] C. Basile, D. Benedetto, E. Caglioti, and M. Degli Esposti, “An example of mathematical authorship attribution,” J. Math. Phys., vol. 49, no. 12, 2008.
- [BCL02] D. Benedetto, E. Caglioti, and V. Loreto, “Language trees and zipping,” Phys. Rev. Lett., vol. 88, p. 048702, 2002.
- [BCJP21] T. Benoist, N. Cuneo, V. Jakšić, and C.-A. Pillet, “On entropy production of repeated quantum measurements II. Examples,” J. Stat. Phys., vol. 182, no. 3, pp. 1–71, 2021.
- [BJPP18] T. Benoist, V. Jakšić, Y. Pautrat, and C.-A. Pillet, “On entropy production of repeated quantum measurements I. General theory,” Commun. Math. Phys., vol. 357, no. 1, pp. 77–123, 2018.
- [BGPR24] N. Barnfield, R. Grondin, G. Pozzoli, and R. Raquépas, “On the Ziv–Merhav theorem beyond Markovianity I,” Can. J. Math., Online “FirstView”, pp. 1–25, 2024.
- [BGPR] N. Barnfield, R. Grondin, G. Pozzoli, and R. Raquépas, “On the Ziv–Merhav theorem beyond Markovianity II: Leveraging the thermodynamic formalism,” preprint, arXiv:2312.02098v2, 2024.
- [Bi63] J. J. Birch, “Approximations for the entropy for functions of Markov chains,” Ann. Math. Stat., vol. 33, no. 3, pp. 930–938, 1962.
- [Bl57] D. Blackwell, “The entropy of functions of finite-state Markov chains,” Trans. First Prague Conf. Inf. Theory, pp. 13–20, 1957.
- [Bo74] R. Bowen, “Some systems with unique equilibrium states,” Math. Syst. Theor., vol. 8, no. 3, pp. 193–202, 1974.
- [BP66] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state Markov chains,” Ann. Math. Stat., vol. 37, no. 6, pp. 1554–1563, 1966.
- [Br05] R. C. Bradley, “Basic properties of strong mixing conditions. A survey and some open questions,” Probab. Surv., vol. 2, pp. 107–144, 2005.
- [C+23] G. Cristadoro, M. Degli Esposti, V. Jakšić, and R. Raquépas, “On a waiting-time result of Kontoyiannis: mixing or decoupling?,” Stoch. Proc. Appl., vol. 166, p. 104222, 2023.
- [CF05] D. P. Coutinho and M. A. Figueiredo, “Information theoretic text classification using the Ziv–Merhav method,” in Pattern Recognition and Image Analysis (J. S. Marques, N. Pérez de la Blanca, and P. Pina, eds.), vol. 3523 of Lecture Notes in Computer Science, pp. 355–362, Berlin: Springer, 2005.
- [CFF10] D. P. Coutinho, A. L. Fred, and M. A. Figueiredo, “One-lead ECG-based personal identification using Ziv–Merhav cross parsing,” in 20th International Conference on Pattern Recognition, pp. 3858–3861, IEEE, 2010.
- [CFH08] Y. Cao, D. Feng, and W. Huang, “The thermodynamic formalism for sub-additive potentials,” Discrete Contin. Dyn. Syst., vol. 20, no. 3, p. 639, 2008.
- [CJPS19] N. Cuneo, V. Jakšić, C.-A. Pillet, and A. Shirikyan, “Large deviations and fluctuation theorem for selectively decoupled measures on shift spaces,” Rev. Math. Phys., vol. 31, no. 10, p. 1950036, 2019.
- [CR24] N. Cuneo and R. Raquépas, “Large deviations of return times and related entropy estimators on shift spaces,” Commun. Math. Phys., vol. 405, art. 135, 2024.
- [EM02] Y. Ephraim and N. Merhav, “Hidden Markov processes,” IEEE Trans. Inf. Theory, vol. 48, no. 6, pp. 1518–1569, 2002.
- [Gib] J. W. Gibbs, Elementary principles in statistical mechanics: developed with especial reference to the rational foundations of thermodynamics. C. Scribner’s sons, 1902.
- [HM06] G. Han and B. Marcus, “Analyticity of entropy rate in families of hidden Markov chains II,” 2006 IEEE Int. Symp. Inf. Theory, pp. 103–107, 2006.
- [JOP17] A. Johansson, A. Öberg, and M. Pollicott, “Ergodic theory of Kusuoka measures,” J. Fractal Geom., vol. 4, no. 2, pp. 185–214, 2017.
- [JSS04] P. Jacquet, G. Seroussi, and W. Szpankowski, “On the entropy of a hidden Markov process,” 2004 IEEE Int. Symp. Inf. Theory, p. 10, 2004.
- [Ki68] J. F. Kingman, “The ergodic theory of subadditive stochastic processes,” J. Roy. Statist. Soc. B, vol. 30, no. 3, pp. 499–510, 1968.
- [Ko98] I. Kontoyiannis, “Asymptotic recurrence and waiting times for stationary processes,” J. Theor. Probab., vol. 11, no. 3, pp. 795–811, 1998.
- [KPK01] O. V. Kukushkina, A. A. Polikarpov, and D. V. Khmelev, “Using literal and grammatical statistics for authorship attribution,” Probl. Inf. Transm., vol. 37, pp. 172–184, 2001.
- [Le92] B. G. Leroux, “Maximum-likelihood estimation for hidden Markov models,” Stoch. Proc. Appl., vol. 40, no. 1, pp. 127–143, 1992.
- [L+19] M. Lippi, M. A. Montemurro, M. Degli Esposti, and G. Cristadoro, “Natural Language Statistical Features of LSTM-Generated Texts,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3326–3337, 2019.
- [MS95] K. Marton and P. C. Shields, “Almost-sure waiting time results for weak and very weak Bernoulli processes,” Ergodic Theor. Dyn. Syst., vol. 15, no. 5, pp. 951–960, 1995.
- [OW93] D. S. Ornstein and B. Weiss, “Entropy and data compression schemes,” IEEE Trans. Inf. Theory, vol. 39, no. 1, pp. 78–83, 1993.
- [Pe69] T. Petrie, “Probabilistic functions of finite state Markov chains,” Ann. Math. Stat., vol. 40, no. 1, pp. 97–115, 1969.
- [Pf02] C.-É. Pfister, “Thermodynamical aspects of classical lattice systems,” in In and Out of Equilibrium: Probability with a Physics Flavor (V. Sidoravicius, ed.), vol. 51 of Prog. Probab., pp. 393–472, Birkhäuser, 2002.
- [R+22] S. Ro, B. Guo, A. Shih, T. V. Phan, R. H. Austin, D. Levine, P. M. Chaikin, and S. Martiniani, “Model-free measurement of local entropy production and extractable work in active matter,” Phys. Rev. Lett., vol. 129, no. 22, p. 220601, 2022.
- [RP12] É. Roldán and J. M. R. Parrondo, “Entropy production and Kullback–Leibler divergence between stationary trajectories of discrete systems,” Phys. Rev. E, vol. 85, p. 031129, 2012.
- [Ru73] D. Ruelle, “Statistical mechanics on a compact set with action satisfying expansiveness and specification,” Trans. Amer. Math. Soc., vol. 185, pp. 237–251, 1973.
- [Sh93] P. C. Shields, “Waiting times: positive and negative results on the Wyner–Ziv problem,” J. Theor. Probab., vol. 6, no. 3, pp. 499–519, 1993.
- [Sim] B. Simon, The Statistical Mechanics of Lattice Gases, vol. 1. Princeton University Press, 1993.
- [Wa75] P. Walters, “A variational principle for the pressure of continuous transformations,” Amer. J. Math., vol. 97, no. 4, pp. 937–971, 1975.
- [WZ89] A. D. Wyner and J. Ziv, “Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression,” IEEE Trans. Inf. Theory, vol. 35, no. 6, pp. 1250–1258, 1989.
- [ZDKA06] O. Zuk, E. Domany, I. Kanter, and M. Aizenman, “Taylor series expansions for the entropy rate of hidden Markov processes,” 2006 IEEE Int. Conf. Commun., pp. 1598–1604, 2006.
- [ZL78] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding,” IEEE Trans. Inf. Theory, vol. 24, no. 5, pp. 530–536, 1978.
- [ZM93] J. Ziv and N. Merhav, “A measure of relative entropy between individual sequences with application to universal classification,” IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1270–1279, 1993.