Maintaining Bernoulli samples over evolving multisets
Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review
Contributors
Abstract
Random sampling has become a crucial component of modern data management systems. Although the literature on database sampling is large, there has been relatively little work on the problem of maintaining a sample in the presence of arbitrary insertions and deletions to the underlying dataset. Most existing maintenance techniques apply either to the insert-only case or to datasets that do not contain duplicates. In this paper, we provide a scheme that maintains a Bernoulli sample of an underlying multiset in the presence of an arbitrary stream of updates, deletions, and insertions. Importantly, the scheme never needs to access the underlying multiset. Such Bernoulli samples are easy to manipulate, and are well suited to parallel processing environments. Our method can be viewed as an enhancement of the "counting sample" scheme developed by Gibbons and Matias for estimating the frequency of highly frequent items. We show how the "tracking counters" used by our maintenance scheme can be exploited to estimate population frequencies, sums, and averages in an unbiased manner, with lower variance than the usual estimators based on a Bernoulli sample. The number of distinct items in the multiset can also be estimated without bias. Finally, we discuss certain problems of subsampling and merging that a rise in systems with limited memory resources or distributed processing, respectively.
Details
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007 |
| Pages | 93-102 |
| Number of pages | 10 |
| Publication status | Published - 11 Jun 2007 |
| Peer-reviewed | Yes |
Conference
| Title | 26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007 |
|---|---|
| Duration | 11 - 13 June 2007 |
| City | Beijing |
| Country | China |
External IDs
| ORCID | /0000-0001-8107-2775/work/200630411 |
|---|
Keywords
Research priority areas of TU Dresden
DFG Classification of Subject Areas according to Review Boards
Subject groups, research areas, subject areas according to Destatis
ASJC Scopus subject areas
Keywords
- Bernoulli multiset sampling, Incremental sample maintenance