Approximating Klee’s Measure Problem and a Lower Bound for Union Volume Estimation
Abstract
Union volume estimation is a classical algorithmic problem. Given a family of objects $O_1, \ldots, O_n \subseteq \mathbb{R}^d$, we want to approximate the volume of their union. In the special case where all objects are boxes (also known as hyperrectangles) this is known as Klee’s measure problem. The state-of-the-art algorithm [Karp, Luby, Madras ’89] for union volume estimation as well as Klee’s measure problem in constant dimension computes a $(1+\varepsilon)$-approximation with constant success probability by using a total of $O(n/\varepsilon^2)$ queries of the form (i) ask for the volume of $O_i$, (ii) sample a point uniformly at random from $O_i$, and (iii) query whether a given point is contained in $O_i$.
First, we show that if one can only interact with the objects via the aforementioned three queries, the query complexity of [Karp, Luby, Madras ’89] is indeed optimal, i.e., $\Omega(n/\varepsilon^2)$ queries are necessary. Our lower bound already holds for estimating the union of equiponderous axis-aligned polygons in $\mathbb{R}^2$, even if the algorithm is allowed to inspect the coordinates of the points sampled from the polygons, and even if a containment query can ask about an arbitrary (not necessarily sampled) point.
Second, guided by the insights of the lower bound, we provide a more efficient approximation algorithm for Klee’s measure problem, improving the $O(n/\varepsilon^2)$ time to $O((n + 1/\varepsilon^2) \cdot \log^{O(d)} n)$. We achieve this improvement by exploiting the geometry of Klee’s measure problem in various ways: (1) Since we have access to the boxes’ coordinates, we can split the boxes into classes of boxes of similar shape. (2) Within each class, we show how to sample from the union of all boxes by using orthogonal range searching. (3) We exploit that boxes of different classes have small intersection, for most pairs of classes.
1 Introduction
We revisit the classical problem of union volume estimation: given objects $O_1, \ldots, O_n \subseteq \mathbb{R}^d$, we want to estimate the volume of $O_1 \cup \ldots \cup O_n$ (see Footnote 1). This problem has several important applications such as DNF Counting and Network Reliability; see the discussion in Section 1.2.

Footnote 1: Technically, the objects need to be measurable. In fact, a generalization of this problem allows $O_1, \ldots, O_n$ to be any measurable subsets of a measure space, and we want to estimate the measure of their union. However, throughout this paper the objects will always be boxes in $\mathbb{R}^d$ (in our algorithm) or polygons in the plane (in our lower bound construction), and thus these technicalities are irrelevant in our context.
The state-of-the-art solution [19] works in a model where one has access to each input object by three types of queries: (i) determine the volume of the object, (ii) sample a point uniformly at random from the object, and (iii) ask whether a point is contained in the object. Apart from these types of queries, the model allows arbitrary computations. The complexity of algorithms is thus measured by the number of queries to the input objects.
After Karp and Luby [19] introduced this model, Karp, Luby and Madras [20] showed that one can $(1+\varepsilon)$-approximate the volume of $n$ objects in this model using $O(n/\varepsilon^2)$ queries with constant success probability (see Footnote 2), by an algorithm that uses $O(n/\varepsilon^2)$ additional time (and their solution only asks containment queries of previously sampled points). This improved earlier related algorithms by Karp and Luby [19] and Luby [23]. In the last 35 years this problem has seen no improvement of the upper bound. Hence, it is natural to ask whether this classical upper bound is best possible and whether one can give a matching lower bound. We resolve this question in this work by providing a matching lower bound.

Footnote 2: The success probability can be boosted to $1-\delta$ at the cost of a factor $O(\log(1/\delta))$ in the number of queries and running time.
The union volume estimation problem was also studied very recently in the streaming setting [26, 24]. Here, the objects come in a stream , and when we are at position in the stream, we can only query object . Assuming the objects are subsets of a universe , this line of work gives a streaming algorithm using queries per object (the same bound holds for the space usage and update time additional to the queries). Summed over boxes this yields the same total running time as the general tool, apart from the factor. So, interestingly, even in the streaming setting the same running time can be achieved (see Footnote 3).

Footnote 3: See also [31] for earlier work studying Klee’s measure problem in the streaming setting.
Perhaps the most famous application of the algorithm by Karp, Luby, and Madras [20] is Klee’s measure problem [22]: This is a fundamental problem in computational geometry in which we are given axis-aligned boxes in and want to compute the volume of their union. Here an axis-aligned box is any set of the form , and the input consists of the coordinates of each box. A long line of research on this problem and various special cases (e.g., for fixed dimensions or for cubes) [32, 27, 11, 2, 1, 12, 33, 3] led to an exact algorithm running in time for constant [13]. A conditional lower bound suggests that any faster algorithm would require fast matrix multiplication techniques [12], but it is unclear how to apply fast matrix multiplication to this problem. On the approximation side, note that for a $d$-dimensional axis-aligned box, the three queries can be implemented in time . Thus, the union volume estimation algorithm can be applied, and it computes a $(1+\varepsilon)$-approximation of Klee’s measure problem in time , as has been observed in [4]. This direct application of union volume estimation was the state of the art for approximate solutions for Klee’s measure problem until our work. See Section 1.2 for interesting applications of Klee’s measure problem.
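To make the per-box query cost concrete, here is a minimal Python sketch of the three queries for an axis-aligned box. The class name `Box` and its method names are illustrative assumptions, not notation from the paper; each method is a single pass over the coordinates.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its lower and upper corner (hypothetical helper)."""
    lo: Tuple[float, ...]
    hi: Tuple[float, ...]

    def volume(self) -> float:
        # Query (i): product of the side lengths.
        v = 1.0
        for a, b in zip(self.lo, self.hi):
            v *= b - a
        return v

    def sample(self, rng: random.Random) -> Tuple[float, ...]:
        # Query (ii): one independent uniform coordinate per dimension.
        return tuple(rng.uniform(a, b) for a, b in zip(self.lo, self.hi))

    def contains(self, x) -> bool:
        # Query (iii): coordinate-wise interval test.
        return all(a <= c <= b for a, b, c in zip(self.lo, self.hi, x))
```

For example, `Box((0.0, 0.0), (2.0, 3.0)).volume()` evaluates to `6.0`, and every point returned by `sample` passes `contains`.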
1.1 Our Contribution
Our contribution is twofold.
Lower bound for union volume estimation
Given the state of the art, a natural question is to ask whether the query complexity of the general union volume estimation algorithm of [20] can be further improved. Any such improvement would speed up several important applications, cf. Section 1.2. On the other hand, any lower bound showing that the algorithm of [20] is optimal also implies tightness of the known streaming algorithms (up to logarithmic factors), as the streaming algorithms match the static running time bound.
We answer this question negatively in the aforementioned query model. Note that the model allows unbounded computational power, examining the numerical coordinates of sampled points, and asking containment queries on arbitrary points. In contrast, these powers are not exploited by [20]. So our lower bound encompasses a much wider paradigm of algorithms. We show a query complexity lower bound of $\Omega(n/\varepsilon^2)$ for this model, which matches the upper bound of [20]:
Theorem 1.
Any algorithm for computing a -approximation to the cardinality of the union of objects via volume, sampling and containment queries with success probability at least must make queries.
We want to particularly highlight that our lower bound even holds for subsets of , and for equiponderous, axis-aligned polygons in the plane.
Upper bound for Klee’s measure problem
Our lower bound for union volume estimation implies that we can only achieve an improvement of the current upper bound of Klee’s measure problem if we exploit the geometric structure of boxes. Specifically, we exploit that we can split the input boxes into classes of similar boxes, since we have access to the boxes’ coordinates, and we make use of orthogonal range searching. This allows us to break the barrier that is possible within the query model and provide an algorithm that improves Klee’s measure problem from time to in constant dimension.
Theorem 2.
There is an algorithm that runs in time and with probability at least computes a -approximation for Klee’s measure problem.
The success probability can be boosted to any using standard techniques and incurring an additional factor in the running time. We also want to highlight that the core of our algorithm is an efficient method to sample uniformly and independently with a given density from the union of the input objects. While this allows us to -approximate the volume of the union, we believe that our efficient sampling method is also of independent interest.
Throughout this work, for simplicity and readability we assume the dimension to be constant. We remark that our running time bounds hide factors of the form .
1.2 Related Work
A major application of union volume estimation is DNF Counting, in which we are given a formula in disjunctive normal form and want to count its number of satisfying assignments. Computing the exact number of satisfying assignments is #P-complete, therefore it likely requires exponential time. Approximating the number of satisfying assignments can be achieved by an easy application of union volume estimation, as described in [20]. Their algorithm remains the state of the art for this problem to this day, see, e.g., [25]. In particular, a direct application of the union volume estimation algorithm of [20] gives the best known complexity for approximate DNF Counting. This has been extended to more general model counting [28, 9, 25], probabilistic databases [21, 15, 29], and probabilistic queries on databases [6].
We also want to mention Network Reliability as another application for union volume estimation, which was already discussed in [20]. Additionally, Karger’s famous paper on the problem [18] uses the algorithm of [20] as a subroutine. However, the current state-of-the-art algorithms no longer use union volume estimation as a tool [7].
Finally, we want to draw a connection to the following well-known query sampling bound. Canetti, Even, and Goldreich [5] showed that approximating the mean of a random variable whose codomain is the unit interval requires queries, thus obtaining tight bounds for the sampling complexity of the mean estimation problem. Their bound generalises to on the number of queries needed to estimate the mean of a random variable in general. Before our work it was thus natural to expect that the dependence in the number of queries for union volume estimation is optimal. However, whether the factor is necessary, or the number of queries could be improved to, say, , was open to the best of our knowledge.
Klee’s measure problem is an important problem in computational geometry. One reason for its importance is that techniques that have been developed for Klee’s measure problem can often be adapted to solve various related problems, such as the depth problem (given a set of boxes, what is the largest number of boxes that can be stabbed by a single point?) [13] or Hausdorff distance under translation in $L_\infty$ [14]. Moreover, various other problems can be reduced to Klee’s measure problem or to its related problems, e.g., deciding whether a set of boxes covers its bounding box can be reduced to Klee’s measure problem [13], the continuous $k$-Center problem on graphs (i.e., finding $k$ centers that can lie on the edges of a graph that cover the vertices of a graph) can also be reduced to Klee’s measure problem [30], and finding the smallest hypercube containing at least $k$ points among $n$ given points can be reduced to the depth problem [17, 10, 13]. In light of this, it would be interesting to see whether our approximation techniques generalize to any of these related problems.
1.3 Technical Overview
We now give an overview of our results, starting with our upper bound result for Klee’s measure problem. We keep the statements on an intuitive level and hide many technical details. For the formal statements and proofs, see Section 2 for the upper bound and Section 3 for the lower bound.
Upper bound for Klee’s measure problem
We first remark that due to our lower bound result, we know that we have to exploit the structure of the input to obtain a running time of the form . Following a common algorithmic approach, we use sampling to approximate the volume of the union. Specifically, we want to draw a sample from the union of boxes with density , such that in the end is a good estimate of the volume of the union of input boxes. We defer how to set to the end of this overview and first focus on the main difficulty, i.e., how to create a sample for a given .
We start with a simple classification of the input boxes into classes of similar shape. Two boxes are in the same class if the side lengths of both boxes in each dimension lie in the same interval for some . We call two classes similar if their side lengths are polynomially related (e.g., within a factor of ) in each dimension.
We use the following three crucial insights to obtain an efficient algorithm:
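1. Within each class, we can efficiently sample from the union of its boxes by using orthogonal range searching, see Lemma 4.
2. Each class has only few (i.e., a polylogarithmic number of) classes that are similar to it, see Observation 1.
3. Boxes of dissimilar classes have a small intersection, see Observation 2.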
In the remainder we give some more details on these insights and how they lead us to an efficient algorithm. The rough idea of our algorithm is as follows. We go through the classes in arbitrary order. For each class we sample with density from the union of the boxes of this class, but we only keep a point if it is not contained in any class that comes later in the order. To efficiently check for containment in a later class, we use an orthogonal range searching data structure (with an additional dimension for the index of the class).
To understand why our algorithm is efficient, we have to look at two different parts:
- Sampling from a class: One of our main technical ingredients is to sample from the union of boxes of similar shape. Note that efficient sampling implies efficient volume estimation, so to break our lower bound we must exploit more input structure than what is offered in the query model. Our main approach here is simple but powerful: We can sample points from the union of similarly shaped boxes uniformly by (1) gridding the space into cells of side lengths comparable to these boxes, (2) sampling points from the relevant cells, and (3) discarding points not in the union by querying an orthogonal range searching data structure. As the grid size is similar to the shape of the boxes in the class, we ensure that a significant fraction of the points sampled in (2) are contained in the union, i.e., not discarded. The orthogonal range searching data structure allows us to quickly check for containment.
- Bounding the number of drawn samples: As we discard samples that appear in later classes, this is a potential source of inefficiency. Therefore, we need to bound the number of samples that we discard using the second and third insight from above. The second insight states that there are only few, i.e., polylogarithmically many, similar classes. Hence, a point might be discarded because it is contained in one of these similar classes, but as there are only few, this will only happen a polylogarithmic number of times. On the other hand, the third insight states that the intersection of dissimilar classes is small. Thus, the probability that we discard a sampled point because of a dissimilar class is small, and such events will not have a significant impact on the running time.
Finally, to set the sampling probability , we need a crude estimate of the volume. To obtain a constant factor approximation, one can use the classical algorithms (by Karp and Luby [19] or Karp, Luby, Madras [20]) with a constant error parameter (say, ), to obtain a constant approximation factor in near-linear time. To keep our work self-contained, we provide a brief description and a simplified correctness proof of this case for union volume estimation, based on Karp and Luby [19], in Section 2.1.
Lower bound
We now give an overview of our lower bound result. The lower bound is proven by a reduction from a variant of the Gap-Hamming problem, defined as follows: Given two vectors , distinguish whether their inner product is greater than or less than . It is known that any algorithm distinguishing these two cases with success probability at least must perform queries into and .
We first give the intuition why samples are necessary to -approximate the union of two sets with constant probability in the query model. Given a Gap-Hamming instance , we construct two sets and , see Figure 3 for an example. Note that for all , we have
Hence, if we have an algorithm that computes a -approximation of with probability , then we can distinguish between and . Setting , we therefore distinguish and . Hence, our algorithm solves the Gap-Hamming instance.
Note that the volumes of and are fixed (depending only on the length of the vectors and but not their entries), and thus a volume query does not disclose any information about and . Each sample or containment query concerns at most one entry of or . Consequently, any union volume estimation algorithm has to use queries to or .
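The two-set argument can be checked on a toy instance. The sketch below assumes vectors with entries in {-1, +1} and a hypothetical encoding in which coordinate i with value v becomes the point (i, v); the paper's construction (Figure 3) uses rectangles, so this is a discrete stand-in for the same counting identity, not the authors' exact sets.

```python
def encode(v):
    """Hypothetical encoding: coordinate i with value v[i] becomes the point (i, v[i])."""
    return {(i, vi) for i, vi in enumerate(v)}


def inner_product_from_union(union_size, n):
    """Recover <x, y> from |A ∪ B| for x, y in {-1, +1}^n under the encoding above."""
    # A coordinate with x_i == y_i contributes one point to the union, any other
    # coordinate contributes two. Hence |A ∪ B| = n + (#disagreements), while
    # <x, y> = n - 2 * (#disagreements), which gives <x, y> = 3n - 2|A ∪ B|.
    return 3 * n - 2 * union_size


x = [1, -1, 1, 1, -1, 1]
y = [1, 1, 1, -1, -1, 1]
union_size = len(encode(x) | encode(y))
recovered = inner_product_from_union(union_size, len(x))
```

Because the inner product is an affine function of the union size, a sufficiently accurate estimate of the union size separates the two Gap-Hamming cases, and both sets have size n regardless of the entries, matching the remark that volume queries reveal nothing.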
In order to generalize this lower bound for estimating the union of two sets to an lower bound for estimating the union of sets, we need to ensure that the sampled points do not give away too much information about the entries of and . We apply two obfuscations that jointly ensure a lower bound on the number of queries; see Figure 4. Firstly, we introduce sets whose union is and sets whose union is . Imagine cutting each rectangle in Figure 3 into side-by-side pieces and distributing them randomly among ; similarly for . The idea is that one needs to make containment queries on a set in order to hit the correct piece. Hence, the effort for revealing one bit in or is . Secondly, we introduce a large set shared by all and for . In Figure 4, this is the long dark-blue rectangle that spans from left to right. This large set intuitively enforces samples to even obtain a single point that contains any information about and .
2 Approximation Algorithm for Klee’s Measure Problem
In this section we give our upper bound for Klee’s measure problem.
See Theorem 2.
2.1 Preliminaries
In Klee’s measure problem we are given boxes in . Here, a box is an object of the form , and as input we are given the coordinates of each input box. Throughout this section we assume to be constant. Note that given the coordinates of a box, it is easy to compute its side lengths and volume. Throughout, we write for the volume of the union of boxes. We want to approximate up to a factor of . Our approach is based on sampling, so now let us introduce the relevant notions.
Recall that $\mathrm{Pois}(\lambda)$ is the Poisson distribution with mean and variance $\lambda$. It captures the number of active points in a space, under the assumption that active points occur uniformly and independently at random across the space, and that $\lambda$ points are active on average.
The following definition is usually referred to as a homogeneous Poisson point process at rate $p$. Intuitively, we activate each point in space independently with “probability density” $p$, thus the number of activated points follows the Poisson distribution with mean $p$ times the volume of the space.
Definition 1 ($p$-sample).
Let $U \subseteq \mathbb{R}^d$ be a measurable set, and let $p \ge 0$. We say that a random subset $S \subseteq U$ is a $p$-sample of $U$ if for any measurable $U' \subseteq U$ we have that $|S \cap U'| \sim \mathrm{Pois}(p \cdot \mathrm{Vol}(U'))$.
In particular, if $S$ is a $p$-sample of $U$, then $\mathbb{E}[|S|] = p \cdot \mathrm{Vol}(U)$. Two more useful properties follow from the definition:
(i) For any measurable subset $U' \subseteq U$, the restriction $S \cap U'$ is a $p$-sample of $U'$.
(ii) The union of $p$-samples of two disjoint sets $U_1, U_2$ is a $p$-sample of $U_1 \cup U_2$.
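As an illustration of the definition, the following sketch generates a sample of a single box at a given density and restricts it to a sub-box, as in property (i). The names `poisson`, `p_sample_of_box` and `restrict`, and the symbol `p` for the density, are assumptions made for this example; the Poisson draw uses only the standard library by counting unit-rate exponential arrivals.

```python
import random


def poisson(mean: float, rng: random.Random) -> int:
    """Draw from Pois(mean) by counting unit-rate arrivals before time `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def p_sample_of_box(lo, hi, p, rng):
    """Sample a box with corners lo, hi at density p (a homogeneous Poisson process)."""
    vol = 1.0
    for a, b in zip(lo, hi):
        vol *= b - a
    k = poisson(p * vol, rng)  # number of points ~ Pois(p * volume)
    return [tuple(rng.uniform(a, b) for a, b in zip(lo, hi)) for _ in range(k)]


def restrict(points, lo, hi):
    """Property (i): keeping only the points inside a sub-box preserves the density."""
    return [x for x in points if all(a <= c <= b for a, b, c in zip(lo, hi, x))]
```

Dividing the number of returned points by `p` gives an unbiased estimate of the box volume, which is the mechanism the main algorithm relies on.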
We will make use of orthogonal range searching. Specifically, we need the query , which upon receiving and returns true if and false otherwise.
Lemma 1.
We can build a data structure in time that answers queries in time.
Proof.
For each , map the box to a higher-dimensional box
We then apply orthogonal range searching, specifically we build a multi-level segment tree over , which takes time; see [16, Section 10.4]. To answer the request where and , we query the segment tree whether there exists a box that contains the point ; or phrased differently, whether for some . The query takes only time. ∎
For our main algorithm to work, we need a constant-factor approximation of the volume . It is known that this can be computed in time [20]. In order to stay simple and self-contained, we prove a weaker result by implementing an algorithm of Karp and Luby [19] using appears queries.
Lemma 2 (Adapted from Karp and Luby [19]).
Given the data structure from Lemma 1, there exists an algorithm that computes in time a 2-approximation to with probability at least .
Proof.
We claim that Algorithm 1 has the desired properties. The time bound is easy to see: The computation of the prefix sums takes time. In each iteration, binary searching for costs time, sampling of costs time, and calling appears takes time. So in total we spend time.
For the correctness argument, we define two sets
Consider an iteration in step 3. For any fixed value , we have
With this we can calculate the probability that the counter increments in this iteration:
Since all iterations are independent, at the end of the algorithm we have . Hence is an unbiased estimator for .
To analyse deviation, we observe that . Therefore, . By Chebyshev and as , we have
That is, with probability at least the output is a 2-approximation to . ∎
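Algorithm 1 is not reproduced in this text, so the following is only a sketch of a Karp–Luby style estimator assembled from the steps named in the proof (prefix sums, binary search for a box, sampling a point, a containment test, a counter). It assumes that a sample is counted when no box with a smaller index contains it, and it replaces the data structure of Lemma 1 by a brute-force scan; the function name is hypothetical.

```python
import bisect
import itertools
import random


def karp_luby_estimate(boxes, num_samples, rng):
    """boxes: list of (lo, hi) corner tuples. Returns an estimate of the union volume."""
    def vol(box):
        v = 1.0
        for a, b in zip(*box):
            v *= b - a
        return v

    def contains(box, x):
        return all(a <= c <= b for a, b, c in zip(box[0], box[1], x))

    prefix = list(itertools.accumulate(vol(b) for b in boxes))  # prefix sums of volumes
    total = prefix[-1]
    hits = 0
    for _ in range(num_samples):
        # Pick box i with probability proportional to its volume (binary search).
        i = min(bisect.bisect_right(prefix, rng.random() * total), len(boxes) - 1)
        lo, hi = boxes[i]
        x = tuple(rng.uniform(a, b) for a, b in zip(lo, hi))
        # Count x only for the first box containing it, so that every point of
        # the union is charged to exactly one box (brute force, for clarity).
        if not any(contains(boxes[j], x) for j in range(i)):
            hits += 1
    # The hit probability equals (union volume) / (sum of volumes).
    return total * hits / num_samples
```

With two unit squares that overlap in half of their area, the estimate concentrates around 1.5 as the number of samples grows.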
2.2 Classifying Boxes by Shapes
As our first step in the algorithm, we classify boxes by their shapes.
Definition 2.
Let $L = (L_1, \ldots, L_d) \in \mathbb{Z}^d$. We say that a box is of type $L$ if its side length in dimension $i$ is contained in $[2^{L_i}, 2^{L_i+1})$, for each $i \in [d]$.
Using this definition, we partition the input boxes into classes such that each class corresponds to one type of boxes. We will fix this notation throughout. For each , let us also define , namely the union of boxes in class .
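The classification step can be sketched as follows, assuming that types are defined by powers of two, i.e., that the type of a box records the floor of the base-2 logarithm of its side length in each dimension. The helper names `box_type` and `classify` are hypothetical, and boxes are assumed to have positive side lengths.

```python
import math
from collections import defaultdict


def box_type(lo, hi):
    """Type vector: floor(log2(side length)) in every dimension."""
    # math.frexp(s) returns (m, e) with s = m * 2**e and 0.5 <= m < 1,
    # so e - 1 equals floor(log2(s)) without floating-point rounding issues.
    return tuple(math.frexp(b - a)[1] - 1 for a, b in zip(lo, hi))


def classify(boxes):
    """Group boxes, given as (lo, hi) corner tuples, by their type."""
    classes = defaultdict(list)
    for lo, hi in boxes:
        classes[box_type(lo, hi)].append((lo, hi))
    return dict(classes)
```

For instance, a box with side lengths 1.5 and 3.9 gets type `(0, 1)`, and a box with side lengths 0.25 and 8 gets type `(-2, 3)`.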
Similar to appears, we can answer queries of the form: Is a given point contained in ? We call this an inClass query.
Lemma 3.
We can build a data structure in time that answers queries in time.
Proof.
Similar to the proof of Lemma 1, we transform each to a higher-dimensional box
and build a multi-level segment tree on top. The query is thus implemented by querying the point in the segment tree. ∎
Sampling from a class
The next lemma shows that we can obtain a -sample of any efficiently by rejection sampling.
Lemma 4.
Given , and the data structure from Lemma 3, one can generate a -sample of in expected time .
Proof.
Write for the type corresponding to class . We subdivide into the grid
We call each element of a cell. Let be the set of cells that have a non-empty intersection with . Write .
First we create a -sample of as follows. Generate , which determines the number of points we are going to sample. Then sample points uniformly at random from by repeating the following step times: Select a cell uniformly at random and then sample a point from uniformly at random. The sampled points constitute our set .
Next we compute : For each , we query ; if the answer is true then we keep , otherwise we discard it. The resulting set is a -sample of , since restricting to a fixed subset preserves the -sample property.
Before we analyze the running time, we show that makes up a decent proportion of . Recall that every box in class is of type . In any dimension , one projected box from can intersect at most three projected cells from . So each box from intersects at most cells from , implying that . Moreover, since the volume of any cell is at most the volume of a box in , we have .
Regarding the running time, recall that we assume to be constant and hence drop factors only depending on . The computation of takes time. The remaining time is dominated by the inClass queries. The expected size of is . As we query the data structure from Lemma 3 once for each point of , the expected time of the inClass queries is . ∎
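The rejection-sampling procedure from this proof can be sketched as below. The sketch assumes grid cells of side length 2 to the power of the class type in each dimension, uses a brute-force containment test in place of the inClass data structure of Lemma 3, and uses the hypothetical names `poisson` and `sample_class_union`; it illustrates the sampling logic only, not the stated running time.

```python
import itertools
import math
import random


def poisson(mean, rng):
    """Draw from Pois(mean) by counting unit-rate arrivals before time `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def sample_class_union(boxes, type_vec, p, rng):
    """Sample the union of same-type boxes at density p via grid rejection sampling."""
    side = [2.0 ** l for l in type_vec]
    # Collect the grid cells touched by some box (at most 3 per dimension per box).
    # Cells touched only on their boundary are harmless: their points get rejected.
    cells = set()
    for lo, hi in boxes:
        ranges = [range(math.floor(a / s), math.floor(b / s) + 1)
                  for a, b, s in zip(lo, hi, side)]
        cells.update(itertools.product(*ranges))
    cells = sorted(cells)
    cell_vol = math.prod(side)
    kept = []
    for _ in range(poisson(p * cell_vol * len(cells), rng)):
        cell = rng.choice(cells)                      # uniform cell, all of equal volume
        x = tuple((c + rng.random()) * s for c, s in zip(cell, side))
        # Keep x only if it lies in the union of the class (brute-force test).
        if any(all(a <= xi <= b for a, b, xi in zip(lo, hi, x)) for lo, hi in boxes):
            kept.append(x)
    return kept
```

Since every relevant cell is at most as large as a box of the class and each box touches a bounded number of cells, a constant fraction of the candidate points survives the rejection step when the dimension is constant.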
Classes do not overlap much
We show the following interesting property of classes, that the sum of their volumes is within a polylogarithmic factor of the total volume .
Lemma 5.
We have that .
We later use this property to draw -samples from efficiently. To show this property, we first need some simple definitions and observations.
Definition 3.
We call classes of type and similar if for all we have . Otherwise we call them dissimilar.
Observation 1.
Every class is similar to at most classes.
Proof.
Fix a type . For each , there are at most many integers such that . ∎
Observation 2.
Let and be boxes in dissimilar classes, then .
Proof.
Let be the type of , and be the type of . Since the boxes belong to dissimilar classes, there is a dimension such that . Without loss of generality, assume ; the other case is symmetric. Let and be the intervals resulting from projecting the boxes and onto dimension , respectively. Note that and . So we have . In other words, at most a fraction of the interval intersects the interval . Hence,
We are now ready to prove Lemma 5.
Proof of Lemma 5.
Without loss of generality assume . We construct a set of indices by the following procedure:
- Initially .
- For , if and are dissimilar for all , then add to .
We have for some only if there exists an such that are similar and ; we thus call a witness of . If multiple witnesses exist, then we pick an arbitrary one. Conversely, every can be a witness at most times by Observation 1. Hence
(1) |
2.3 Joining the Classes
Recall that are the classes of the input boxes and their respective unions. Assume without loss of generality that the boxes are ordered in accordance with the class ordering, that is, form the first class, form the second class, and so on. More formally, we ensure that for .
Let be the points in that are not contained in later classes. Note that is a partition of . Hence, to generate a -sample of , it suffices to draw -samples from each and then take their union (see Footnote 4). To this end, we draw a -sample from via Lemma 4. Then we remove all for which ; these are exactly the points that appear in a later class. What remains is a -sample of . The union of these sets thus is a -sample of , and we can use the size of this -sample to estimate the volume of . The complete algorithm is summarized in Algorithm 2.

Footnote 4: This idea has previously been used on objects, by considering the difference [19, 26], while we use this idea on classes.
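The joining step can be sketched end to end as follows. Algorithm 2 is not reproduced in this text, so this is an illustration under simplifying assumptions: the per-class sampler is a naive stand-in for Lemma 4 (sample the bounding box of the class and restrict to the union), containment in later classes is tested by brute force instead of an appears query, and all function names are hypothetical.

```python
import random


def poisson(mean, rng):
    """Draw from Pois(mean) by counting unit-rate arrivals before time `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def in_union(x, boxes):
    return any(all(a <= c <= b for a, b, c in zip(lo, hi, x)) for lo, hi in boxes)


def sample_union(boxes, p, rng):
    """Stand-in for Lemma 4: sample the bounding box at density p, restrict to the union."""
    d = len(boxes[0][0])
    lo = [min(b[0][i] for b in boxes) for i in range(d)]
    hi = [max(b[1][i] for b in boxes) for i in range(d)]
    vol = 1.0
    for a, b in zip(lo, hi):
        vol *= b - a
    pts = (tuple(rng.uniform(a, b) for a, b in zip(lo, hi))
           for _ in range(poisson(p * vol, rng)))
    return [x for x in pts if in_union(x, boxes)]


def estimate_union_volume(classes, p, rng):
    """classes: list of lists of (lo, hi) boxes. Returns (#kept points) / p."""
    kept = 0
    for i, cls in enumerate(classes):
        later = [box for c in classes[i + 1:] for box in c]
        # Keep a point only if no later class contains it, so that every point
        # of the total union is sampled through exactly one class.
        kept += sum(1 for x in sample_union(cls, p, rng) if not in_union(x, later))
    return kept / p
```

The kept points from different classes live in disjoint regions, so their total count is Poisson distributed with mean equal to the density times the union volume, and dividing by the density yields the estimate.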
Lemma 6.
Conditioned on , Algorithm 2 outputs a -approximation to with probability at least .
Proof.
Note that for all , the set is a -sample of . Since partition , their union is a -sample of . It follows that .
The expectation and variance of are . So by Chebyshev,
In other words, with probability at least , the output is a approximation to . ∎
Lemma 7.
Conditioned on , Algorithm 2 runs in expected time .
Proof.
Proof of Theorem 2.
We run Algorithm 2 with a time budget tenfold the bound in Lemma 7; if step 5 spends excessive time then we immediately abort the algorithm. So the stated time bound is clearly satisfied.
Now consider three bad events:
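- The crude volume estimate obtained via Lemma 2 is not a 2-approximation.
- The crude estimate is a 2-approximation, but the algorithm is aborted.
- The crude estimate is a 2-approximation and the algorithm is not aborted, but it does not output a $(1+\varepsilon)$-approximation to the volume of the union.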
By Lemma 2, the first event happens with probability at most . By Markov’s inequality, the second event happens with probability at most . Lastly, by Lemma 6, the third event happens with probability at most . So the total error probability is at most . If none of the bad events happen, then the algorithm correctly outputs a -approximation to . The success probability of can be boosted to, say, by returning the median of a sufficiently large constant number of repetitions of the algorithm. ∎
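The median trick mentioned at the end of the proof is generic and can be sketched in a few lines. The function name `boosted` and the calling convention (an estimator that takes a random generator and returns a number) are assumptions for this example.

```python
import random
import statistics


def boosted(estimator, repetitions, rng):
    """Run `estimator` independently `repetitions` times and return the median."""
    # If each run is accurate with probability well above 1/2, the median is
    # inaccurate only if at least half of the runs fail, which is exponentially
    # unlikely in the number of repetitions (Chernoff bound).
    return statistics.median(estimator(rng) for _ in range(repetitions))


# Toy usage: two of the five runs are wildly off, yet the median is unaffected.
runs = iter([10.0, 1000.0, 10.2, 9.9, -5.0])
result = boosted(lambda rng: next(runs), 5, random.Random(0))
```

The median is used instead of the mean because a single outlier run can move the mean arbitrarily far, whereas it cannot move the median.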
2.4 Handling Discrete Boxes
We now argue that our algorithm for boxes in also solves the following discrete variant of Klee’s measure problem: Given boxes in , count the number of points in the union . To solve this problem, we employ the following embedding of into :
Note that transforms discrete boxes into continuous boxes, and that the cardinality of any is equal to the volume of its image . Hence the discrete variant of Klee’s measure problem reduces to the continuous counterpart.
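The reduction can be sanity-checked on a small instance. The sketch assumes the embedding sends each integer point to the half-open unit cube anchored at it, so a discrete box with integer corners `lo`, `hi` (inclusive) becomes the continuous box from `lo` to `hi + 1`; this choice and the helper names are assumptions. The exact union volume is computed by coordinate compression, which is only meant for checking tiny inputs.

```python
import itertools
import math


def lift(box):
    """Assumed embedding: the discrete box lo..hi (inclusive) becomes [lo, hi + 1)."""
    lo, hi = box
    return tuple(lo), tuple(h + 1 for h in hi)


def count_lattice_points(boxes):
    """Number of integer points in the union of discrete boxes (brute force)."""
    pts = set()
    for lo, hi in boxes:
        pts.update(itertools.product(*[range(a, b + 1) for a, b in zip(lo, hi)]))
    return len(pts)


def union_volume(boxes):
    """Exact volume of a union of continuous boxes via coordinate compression."""
    d = len(boxes[0][0])
    cuts = [sorted({c for lo, hi in boxes for c in (lo[i], hi[i])}) for i in range(d)]
    total = 0
    for idx in itertools.product(*[range(len(c) - 1) for c in cuts]):
        cell_lo = [cuts[i][k] for i, k in enumerate(idx)]
        cell_hi = [cuts[i][k + 1] for i, k in enumerate(idx)]
        # An elementary cell is either fully inside a box or disjoint from its interior.
        if any(all(lo[i] <= cell_lo[i] and cell_hi[i] <= hi[i] for i in range(d))
               for lo, hi in boxes):
            total += math.prod(h - l for l, h in zip(cell_lo, cell_hi))
    return total


discrete = [((0, 0), (2, 1)), ((1, 1), (3, 3))]
points = count_lattice_points(discrete)
volume = union_volume([lift(b) for b in discrete])
```

On this instance both quantities equal 13: the boxes contain 6 and 9 lattice points and share 2 of them.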
3 Lower Bound for Union Volume Estimation
We consider estimating the volume of the union of (measurable) objects . These objects are only accessible through the following three queries:
- Volume query: Return the volume of object $O_i$.
- Sample query: Draw a uniform random point from $O_i$.
- Containment query: Given a point $x$, return whether $x \in O_i$ or not.
It is known that queries suffice to return with constant probability a -approximation to the volume of the union . Here we prove a matching lower bound.
For convenience, we also consider a discrete version of the problem in which each object is instead a finite subset of the integer lattice . The queries are then
- Volume query: Return the cardinality $|O_i|$.
- Sample query: Draw a uniform random point from $O_i$.
- Containment query: Given a point $x$, return whether $x \in O_i$ or not.
The goal is to give a -approximation to the cardinality of the union.
In Section 3.1 we show a lower bound for the discrete version, and then in Section 3.2 we show that a lower bound for the discrete version implies a similar lower bound for the continuous version.
3.1 Lower Bound for Discrete Union
In the remainder, we write . The starting point is what we call the Query-Gap-Hamming problem: The input is two (hidden) vectors and we can access an arbitrary bit of or at a time. The goal is to distinguish the cases and using as few accesses as possible. Query-Gap-Hamming has linear query complexity:
Lemma 8.
Any randomized algorithm solving Query-Gap-Hamming with probability at least requires accesses to and , regardless of the computational resources it uses.
Proof.
This follows by a folklore argument from the fact that the Gap-Hamming problem has linear randomized communication complexity [8]. We next describe the details.
We reduce from the communication complexity of the Gap-Hamming problem, where Alice holds a vector , Bob holds a vector , and their goal is to distinguish from while communicating as few bits as possible. It is known that the two-way, public-coin randomized communication complexity of Gap-Hamming is [8]. Now suppose that a randomized algorithm can solve Query-Gap-Hamming with probability at least , while making only accesses to and . We construct a protocol between Alice and Bob: They simulate the algorithm synchronously, using a shared random tape. Whenever the algorithm tries to access , Alice sends the bit to Bob. Whenever it tries to access , Bob sends the bit to Alice. Clearly both parties can simulate the algorithm till the end, and output the answer of the algorithm. The communication cost is bits, which contradicts the aforementioned communication complexity. ∎
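The simulation argument can be made concrete with a small harness that runs a query algorithm and charges one communicated bit per access. The function name `run_as_protocol` and the convention that the algorithm receives two reader callbacks are assumptions for this example; shared randomness is omitted because the toy algorithm below is deterministic.

```python
def run_as_protocol(algorithm, x, y):
    """Run a query algorithm on hidden vectors x, y and count the bits exchanged."""
    bits = 0

    def read_x(i):
        nonlocal bits
        bits += 1          # Alice sends the bit x[i] to Bob
        return x[i]

    def read_y(i):
        nonlocal bits
        bits += 1          # Bob sends the bit y[i] to Alice
        return y[i]

    # Both parties run the same algorithm in lockstep, so both learn its answer.
    return algorithm(read_x, read_y), bits


def toy_algorithm(read_x, read_y):
    """Toy stand-in: inner product of the first three coordinates."""
    return sum(read_x(i) * read_y(i) for i in range(3))


answer, bits = run_as_protocol(toy_algorithm, [1, -1, 1, 1], [1, 1, 1, -1])
```

The communication cost equals the number of accesses, so any algorithm with few accesses would yield a protocol with little communication.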
Next we give a reduction from Query-Gap-Hamming to estimating the cardinality of a union of objects. In more detail, from the hidden input vectors we (implicitly) define objects . Write . Given permutations of , we define
for every . Analogously, given a different set of permutations , we define
for every . Note that is a subset of all and .
Consider an arbitrary index . If then the point sets and are equal, so they together contribute to the cardinality of the union. On the other hand, if then they are disjoint and thus contribute . Furthermore, the point set is contained in all objects and contributes . Hence, the cardinality of the union equals
Let be a -approximation to the cardinality of the union, i.e., . Since
and , by computing we obtain a value in , namely an additive approximation to with probability at least 4/5. For this allows us to decide or .
Let be a (possibly randomized) algorithm that -approximates the volume of union of any objects with probability at least , using queries. We assume that ; otherwise we modify to ask dummy queries.
We now simulate as if the input were the objects . It remains to argue that we can answer all queries by while accessing few bits in and . Specifically, the number of accesses would be only . The details of the simulation algorithm are as follows:
Algorithm :
1. Sample random permutations and of uniformly and independently.
2. Simulate algorithm and answer its queries as follows.
   • Volume query: Answer .
   • Sample query: In the case :
     (S1) With probability , answer with a uniform random point .
     (S2) With the remaining probability, pick a uniform random . If we have not accessed yet, access it and keep it in memory. Then answer with the point .
     In the case , do the same with replaced by and replaced by .
   • Containment query: Let . In the case :
     (C1) If then answer true.
     (C2) Else, if or then answer false.
     (C3) Else, we have . If we have not accessed yet, access it and keep it in memory. If then answer true, otherwise answer false.
     In the case , do the same with replaced by and replaced by .
3. Let be the output of and return .
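The rule "if we have not accessed it yet, access it and keep it in memory" used in (S2) and (C3) can be sketched as a memoizing wrapper. The class name `LazyBits` and its interface are assumptions made for illustration; the point is only that repeated reads of the same position count as a single access.

```python
class LazyBits:
    """Hypothetical wrapper around a hidden bit vector that counts distinct accesses."""

    def __init__(self, bits):
        self._bits = bits   # the hidden input vector
        self._cache = {}    # positions already accessed, with their values

    def get(self, j):
        # Only the first read of position j touches the hidden vector.
        if j not in self._cache:
            self._cache[j] = self._bits[j]
        return self._cache[j]

    @property
    def accesses(self):
        """Number of distinct positions read so far."""
        return len(self._cache)
```

With this bookkeeping, bounding the number of accesses reduces to bounding the number of distinct positions that the simulation ever needs.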
This finishes the description of algorithm . It is immediate from the algorithm that the execution of is the same as if it were actually run on the objects and for . What remains is to bound the number of accesses to and by during the simulation.
To this end, observe that an access to (respectively ) occurs only when the query enters (S2) or (C3). In both (S2) and (C3), a permutation entry (respectively ) is involved, and we say that the entry is hit by the query.
By definition, the number of accesses to and is exactly the number of entries and hit by some query. In light of this, we can move on to upper bound the latter.
We consider two bad events. Let be the event that more than entries are hit by (S2). Let be the event that at most entries are hit by (S2), but more than entries are freshly hit by (C3). Here “freshly” means that the entry was not hit by any query before it is hit by (C3).
Entries hit by (S2).
We first consider the number of entries hit by (S2). For , define an indicator random variable taking the value iff the -th query of enters case (S2). Since every query may hit at most one entry, the total number of entries hit by (S2) is at most . Note that for all and hence . So by Markov’s inequality, .
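The Markov step has the following generic shape. The symbols here are placeholders chosen for illustration and need not match the notation above: $I_t$ is the indicator that the $t$-th query enters (S2), $T$ is the number of queries, $p > 0$ is an upper bound on $\Pr[I_t = 1]$, and $c > 1$ is a slack factor.

```latex
\mathbb{E}\Big[\sum_{t=1}^{T} I_t\Big]
  = \sum_{t=1}^{T} \Pr[I_t = 1]
  \le pT,
\qquad
\Pr\Big[\sum_{t=1}^{T} I_t > c \cdot pT\Big]
  \le \frac{\mathbb{E}\big[\sum_{t=1}^{T} I_t\big]}{c \cdot pT}
  \le \frac{1}{c}.
```

Under these assumptions, the bad event for (S2) has probability at most $1/c$.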
Entries freshly hit by (C3).
The tricky query to analyze is the containment query. We will show that . Roughly, we need to argue that if was not hit previously then is unlikely to ask a query with . The intuition is that is unaware of the permutations and , and thus to get a fresh hit it has to “guess” an entry of a permutation.
For the proof, assume for the sake of contradiction that . Under this assumption, we give an algorithm for encoding the random permutations in less than bits in expectation. This is an information theoretic contradiction. More formally, our proof considers a game between an encoder and a decoder. The encoder receives as well as the random tape used by and in simulation step 2. The decoder receives . The encoder must send a message to the decoder which allows the decoder to reconstruct . Since the Shannon entropy is , it follows by Shannon’s source coding theorem that the expected length of the message must be at least bits.
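To make the information-theoretic baseline concrete: a uniformly random permutation of an n-element set has entropy log2(n!) bits, and k independent ones have k times that. The sketch below computes this quantity and shows one possible naive encoding, a Lehmer-code rank in the range [0, n!). The names `n`, `k`, `rank` and `unrank` are illustrative assumptions, not notation from the text.

```python
import math


def entropy_bits(n, k):
    """Entropy in bits of k independent uniform permutations of n elements."""
    return k * math.lgamma(n + 1) / math.log(2)  # k * log2(n!)


def rank(perm):
    """Map a permutation of 0..n-1 to an integer in [0, n!) (Lehmer code)."""
    n = len(perm)
    remaining = list(range(n))
    r = 0
    for i, v in enumerate(perm):
        idx = remaining.index(v)  # position of v among the unused values
        r = r * (n - i) + idx     # mixed-radix digit with base n - i
        remaining.pop(idx)
    return r


def unrank(r, n):
    """Inverse of rank: recover the permutation from its integer code."""
    digits = []
    for base in range(1, n + 1):
        r, d = divmod(r, base)
        digits.append(d)
    digits.reverse()              # digits[i] now has base n - i
    remaining = list(range(n))
    return [remaining.pop(d) for d in digits]
```

Writing the rank in binary uses at most ceil(log2(n!)) bits per permutation, which matches the entropy up to rounding; any encoding that is shorter in expectation would contradict the source coding theorem.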
The way we use the assumption is that the encoder will send the indices of the queries among which freshly hit an entry in (C3). The encoder will further send information that allows the decoder to simulate for the remaining queries. Whenever the decoder reaches one of the specified queries, she knows that the point given by the query satisfies . This allows her to recover , i.e., roughly bits of information. But sending such indices costs bits, or bits per index. Since , we use fewer bits than the information-theoretic lower bound, which is a contradiction. We now proceed to give the formal details.
Encoding procedure.
The encoder receives random permutations and also , and proceeds as follows:
1. Initialize algorithm with the given permutations. Run it from step 2 onward, using the given tape to make random choices for and .
2. If the event does not happen, send a 0-bit followed by a naive encoding of all permutations.
3. Otherwise happens. Signal this by sending a 1-bit. Then send the indices of the first queries that freshly hit some entry in (C3). Next, denote . For in that order, if the -th query hits an entry in (S2) then send the value of that entry. Finally, for each permutation and , send the induced permutation on its entries not hit by queries .
Decoding procedure.
We next argue that we can recover the permutations and after receiving and the above encoding.
1. If the leading bit of the encoding is a 0, then we immediately recover all permutations from the rest of the encoding.
2. If the leading bit is a 1, we start by recovering and . Then we simulate algorithm up to the -th query, as if we knew the permutations. In the meantime we gradually recover all entries and that are hit. More precisely, for we answer the -th query by as follows.
   • Volume query: Answer .
   • Sample query: In the case :
     – If the tape decides to give a point , then answer with this point .
     – Else, the tape decides to give a . Since is hit by this query in (S2), its value is readily available in the encoding. We answer .
     In the case , do the same with replaced by and replaced by .
   • Containment query: Let . In the case :
     – If then answer true.
     – Else, if then answer false.
     – Else, if then the current query freshly hits , so it must be the case that . We have thus recovered . Then we answer true if ; otherwise we answer false.
     – Finally, if then was hit before, or it is not hit by the current query. In the former case we know its value, so we answer true if , and false otherwise. In the latter case we know that , so we simply answer false.
     In the case , do the same with replaced by and replaced by .
3. Having recovered all entries of and that are hit by queries , we finally recover the remaining entries from the rest of the encoding.
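The last step of both procedures relies on the "induced permutation" on the entries that were not hit. One way to read this, stated here as an assumption, is to relabel the unused values by their rank, so that the leftover part of a permutation with m free positions becomes a permutation of 0..m-1. The function names below are hypothetical.

```python
def induced_permutation(perm, hit_positions):
    """Encoder side: compress the un-hit part of perm to a permutation of 0..m-1."""
    hit = set(hit_positions)
    free_positions = [j for j in range(len(perm)) if j not in hit]
    free_values = sorted(perm[j] for j in free_positions)
    value_rank = {v: r for r, v in enumerate(free_values)}
    return [value_rank[perm[j]] for j in free_positions]


def reconstruct(n, known, induced):
    """Decoder side: combine recovered entries with the induced permutation.

    known maps each hit position to its already recovered value.
    """
    free_positions = [j for j in range(n) if j not in known]
    free_values = sorted(set(range(n)) - set(known.values()))
    perm = [None] * n
    for j, v in known.items():
        perm[j] = v
    for j, r in zip(free_positions, induced):
        perm[j] = free_values[r]
    return perm
```

Since the induced permutation lives on m elements, it can be sent with about log2(m!) bits instead of log2(n!), which is where the savings in the encoding length come from.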
Encoding length.
We finally analyze the expected encoding length to derive a contradiction to the assumption that .
If does not happen then the encoding length is bits. If happens then we can save a significant number of bits. To this end, let us focus on the queries . Let be the number of entries hit by (S2); note that under the event . For let be the number of entries in not hit by any query. Similarly, for let be the number of entries in not hit by any query. Then the encoding length is
By Stirling’s approximation, we have . Hence, the product of the largest terms in the factorial (namely ) is at least . Thus
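A quick numeric sanity check of this kind of bound, assuming it takes the standard form n!/(n-k)! >= (n/e)^k for the product of the k largest factors of n!. The exact statement used above may differ, and the symbols `n` and `k` are placeholders.

```python
import math


def log2_top_terms(n, k):
    """log2 of n * (n-1) * ... * (n-k+1), the k largest factors of n!."""
    return (math.lgamma(n + 1) - math.lgamma(n - k + 1)) / math.log(2)


def log2_stirling_lower(n, k):
    """log2 of (n/e)^k, the assumed Stirling-type lower bound."""
    return k * math.log2(n / math.e)


def bound_holds(n):
    """Check the inequality for every k between 1 and n (small float tolerance)."""
    return all(
        log2_top_terms(n, k) >= log2_stirling_lower(n, k) - 1e-6
        for k in range(1, n + 1)
    )
```

The inequality follows because the geometric mean of the k largest factors is at least the geometric mean of all n factors, which is at least n/e.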
Since is exactly the number of entries hit by queries , the encoding length is at most
where we used by event .
Recalling our choice of and the assumption that , the above is at most
Therefore, the expected encoding length is no more than
where the last line follows from the assumption that . This contradicts the information-theoretic lower bound.
Conclusion.
We have now shown that and . By a union bound, neither of the bad events happens, and hence computes a additive approximation to , with probability at least . In this case, the number of hit entries is at most , and so is the number of accesses to . If performs more than queries, we may simply abort and return an arbitrary answer; this does not affect the probability bound.
Recall that we made the simplifying assumption . If the algorithm that we began with asks fewer than queries, then we added dummy queries to ensure , and the number of accesses to becomes . In any case, the number of accesses is . We thus have an algorithm that makes only accesses and returns a additive approximation with probability at least . We may set to obtain an additive approximation. This is enough to solve the Query-Gap-Hamming problem and hence the number of accesses must be by Lemma 8. We thus have , or . This proves Theorem 1, in the discrete setting with objects in .
3.2 Continuous to Discrete
To prove a lower bound for estimating the volume of the union of objects in , we give a simple reduction from estimating the cardinality of the union of objects in . Let be an algorithm for estimating the volume of the union of objects in using and queries.
We use to estimate the cardinality of the union of sets in as follows. Let be the objects. We think of them as objects in by replacing each point in an object by the unit square that has in its lower left corner, i.e. . Denote the resulting objects in by . (Note that applying this transformation to our objects from the previous reduction gives connected axis-aligned polygons.)
The volume of the union is the same as the cardinality of the union . We thus merely need to simulate as if the input were . For this, on every volume query made by to the object , we ask the same query to . For a sample query made by , we run on and receive an integer point . We then draw and independently and uniformly at random and feed the point as the result of the query. Finally, when asks a containment query, we simply round the coordinates down to the nearest integers to obtain a point . We then query on . Correctness follows immediately and we conclude:
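The reduction can be sketched as a thin wrapper around a discrete object. The class names and method signatures below are assumptions made for illustration; `DiscreteObject` stands in for any finite set of integer points in the plane that supports the three queries.

```python
import math
import random


class DiscreteObject:
    """Hypothetical finite set of integer points supporting the three queries."""

    def __init__(self, points):
        self._points = sorted(set(points))
        self._lookup = set(self._points)

    def volume(self):
        return len(self._points)          # cardinality of the set

    def sample(self, rng):
        return rng.choice(self._points)   # uniform random point of the set

    def contains(self, p):
        return p in self._lookup


class ContinuousObject:
    """Each integer point p is replaced by the unit square with p as its lower left corner."""

    def __init__(self, discrete):
        self._discrete = discrete

    def volume(self):
        # Every point contributes a unit square, so volume equals cardinality.
        return self._discrete.volume()

    def sample(self, rng):
        # Sample an integer point, then a uniform offset inside its unit square.
        px, py = self._discrete.sample(rng)
        return (px + rng.random(), py + rng.random())

    def contains(self, q):
        # Round both coordinates down and ask the discrete object.
        return self._discrete.contains((math.floor(q[0]), math.floor(q[1])))
```

Each continuous query triggers exactly one discrete query of the same type, so a lower bound on the number of discrete queries carries over unchanged.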
Theorem 3.
Any algorithm for computing a -approximation to the volume of the union of objects in with probability at least via and queries, must use queries.
References
- [1] Pankaj K. Agarwal. An improved algorithm for computing the volume of the union of cubes. In David G. Kirkpatrick and Joseph S. B. Mitchell, editors, Proceedings of the 26th ACM Symposium on Computational Geometry, Snowbird, Utah, USA, June 13-16, 2010, pages 230–239. ACM, 2010.
- [2] Pankaj K. Agarwal, Haim Kaplan, and Micha Sharir. Computing the volume of the union of cubes. In Jeff Erickson, editor, Proceedings of the 23rd ACM Symposium on Computational Geometry, Gyeongju, South Korea, June 6-8, 2007, pages 294–301. ACM, 2007.
- [3] Karl Bringmann. An improved algorithm for Klee’s measure problem on fat boxes. Comput. Geom., 45(5-6):225–233, 2012.
- [4] Karl Bringmann and Tobias Friedrich. Approximating the volume of unions and intersections of high-dimensional geometric objects. Comput. Geom., 43(6-7):601–610, 2010.
- [5] Ran Canetti, Guy Even, and Oded Goldreich. Lower bounds for sampling algorithms for estimating the average. Inf. Process. Lett., 53(1):17–25, 1995.
- [6] Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, and Nicole Schweikardt. Answering (unions of) conjunctive queries using random access and random-order enumeration. ACM Trans. Database Syst., 47(3):9:1–9:49, 2022.
- [7] Ruoxu Cen, William He, Jason Li, and Debmalya Panigrahi. Beyond the quadratic time barrier for network unreliability. CoRR, abs/2304.06552, 2023.
- [8] Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of Gap-Hamming-Distance. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 51–60, 2011.
- [9] Supratik Chakraborty, Kuldeep S. Meel, and Moshe Y. Vardi. A scalable approximate model counter. In Christian Schulte, editor, Principles and Practice of Constraint Programming - 19th International Conference, CP 2013, Uppsala, Sweden, September 16-20, 2013. Proceedings, volume 8124 of Lecture Notes in Computer Science, pages 200–216. Springer, 2013.
- [10] Timothy M. Chan. Geometric applications of a randomized optimization technique. Discret. Comput. Geom., 22(4):547–567, 1999.
- [11] Timothy M. Chan. Semi-online maintenance of geometric optima and measures. SIAM J. Comput., 32(3):700–716, 2003.
- [12] Timothy M. Chan. A (slightly) faster algorithm for Klee’s measure problem. Comput. Geom., 43(3):243–250, 2010.
- [13] Timothy M. Chan. Klee’s measure problem made easy. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 410–419. IEEE Computer Society, 2013.
- [14] Timothy M. Chan. Minimum L_∞ Hausdorff distance of point sets under translation: Generalizing Klee’s measure problem. In Erin W. Chambers and Joachim Gudmundsson, editors, 39th International Symposium on Computational Geometry, SoCG 2023, June 12-15, 2023, Dallas, Texas, USA, volume 258 of LIPIcs, pages 24:1–24:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023.
- [15] Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. VLDB J., 16(4):523–544, 2007.
- [16] Mark de Berg, Otfried Cheong, Marc J. van Kreveld, and Mark H. Overmars. Computational geometry: algorithms and applications, 3rd Edition. Springer, 2008.
- [17] David Eppstein and Jeff Erickson. Iterated nearest neighbors and finding minimal polytopes. Discret. Comput. Geom., 11:321–350, 1994.
- [18] David R. Karger. A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. SIAM J. Comput., 29(2):492–514, 1999.
- [19] Richard M. Karp and Michael Luby. Monte-Carlo algorithms for the planar multiterminal network reliability problem. J. Complex., 1(1):45–64, 1985.
- [20] Richard M. Karp, Michael Luby, and Neal Madras. Monte-Carlo approximation algorithms for enumeration problems. J. Algorithms, 10(3):429–448, 1989.
- [21] Benny Kimelfeld, Yuri Kosharovsky, and Yehoshua Sagiv. Query efficiency in probabilistic XML models. In Jason Tsong-Li Wang, editor, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 701–714. ACM, 2008.
- [22] Victor Klee. Can the measure of $\bigcup_{1}^{n} [a_i, b_i]$ be computed in less than $O(n \log n)$ steps? The American Mathematical Monthly, 84(4):284–285, 1977.
- [23] Michael G. Luby. Monte-Carlo methods for estimating system reliability. Technical report, Report UCB/CSD 84/168, Computer Science Division, University of California, Berkeley, 1983.
- [24] Kuldeep S. Meel, Sourav Chakraborty, and N. V. Vinodchandran. Estimation of the size of union of delphic sets: Achieving independence from stream size. In Leonid Libkin and Pablo Barceló, editors, PODS ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pages 41–52. ACM, 2022.
- [25] Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. Not all FPRASs are equal: demystifying FPRASs for DNF-counting. Constraints An Int. J., 24(3-4):211–233, 2019.
- [26] Kuldeep S. Meel, N. V. Vinodchandran, and Sourav Chakraborty. Estimating the size of union of sets in streaming models. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 126–137. ACM, 2021.
- [27] Mark H. Overmars and Chee-Keng Yap. New upper bounds in Klee’s measure problem. SIAM J. Comput., 20(6):1034–1045, 1991.
- [28] Aduri Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, and Kuldeep S. Meel. Model counting meets estimation. In Leonid Libkin, Reinhard Pichler, and Paolo Guagliardo, editors, PODS’21: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Virtual Event, China, June 20-25, 2021, pages 299–311. ACM, 2021.
- [29] Christopher Ré, Nilesh N. Dalvi, and Dan Suciu. Efficient top-k query evaluation on probabilistic data. In Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis, editors, Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 886–895. IEEE Computer Society, 2007.
- [30] Qiaosheng Shi and Binay K. Bhattacharya. Application of computational geometry to network p-center location problems. In Proceedings of the 20th Annual Canadian Conference on Computational Geometry, Montréal, Canada, August 13-15, 2008, 2008.
- [31] Srikanta Tirthapura and David P. Woodruff. Rectangle-efficient aggregation in spatial data streams. In Michael Benedikt, Markus Krötzsch, and Maurizio Lenzerini, editors, Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, pages 283–294. ACM, 2012.
- [32] Jan van Leeuwen and Derick Wood. The measure problem for rectangular ranges in d-space. J. Algorithms, 2(3):282–300, 1981.
- [33] Hakan Yildiz and Subhash Suri. On Klee’s measure problem for grounded boxes. In Tamal K. Dey and Sue Whitesides, editors, Proceedings of the 28th ACM Symposium on Computational Geometry, Chapel Hill, NC, USA, June 17-20, 2012, pages 111–120. ACM, 2012.