Maintaining Bernoulli samples over evolving multisets

Research output: Contribution to book/Conference proceedings/Anthology/ReportConference contributionContributedpeer-review

Contributors

  • Rainer Gemulla - , TUD Dresden University of Technology (Author)
  • Wolfgang Lehner - , Chair of Databases (Author)
  • Peter J. Haas - , IBM (Author)

Abstract

Random sampling has become a crucial component of modern data management systems. Although the literature on database sampling is large, there has been relatively little work on the problem of maintaining a sample in the presence of arbitrary insertions and deletions to the underlying dataset. Most existing maintenance techniques apply either to the insert-only case or to datasets that do not contain duplicates. In this paper, we provide a scheme that maintains a Bernoulli sample of an underlying multiset in the presence of an arbitrary stream of updates, deletions, and insertions. Importantly, the scheme never needs to access the underlying multiset. Such Bernoulli samples are easy to manipulate, and are well suited to parallel processing environments. Our method can be viewed as an enhancement of the "counting sample" scheme developed by Gibbons and Matias for estimating the frequency of highly frequent items. We show how the "tracking counters" used by our maintenance scheme can be exploited to estimate population frequencies, sums, and averages in an unbiased manner, with lower variance than the usual estimators based on a Bernoulli sample. The number of distinct items in the multiset can also be estimated without bias. Finally, we discuss certain problems of subsampling and merging that a rise in systems with limited memory resources or distributed processing, respectively.

Details

Original languageEnglish
Title of host publicationProceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
Pages93-102
Number of pages10
Publication statusPublished - 11 Jun 2007
Peer-reviewedYes

Conference

Title26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
Duration11 - 13 June 2007
CityBeijing
CountryChina

External IDs

ORCID /0000-0001-8107-2775/work/200630411

Keywords

Research priority areas of TU Dresden

DFG Classification of Subject Areas according to Review Boards

Subject groups, research areas, subject areas according to Destatis

ASJC Scopus subject areas

Keywords

  • Bernoulli multiset sampling, Incremental sample maintenance