Deferred maintenance of disk-based random samples

Research output: Contribution to book/Conference proceedings/Anthology/ReportConference contributionContributedpeer-review

Contributors

Abstract

Random sampling is a well-known technique for approximate processing of large datasets. We introduce a set of algorithms for incremental maintenance of large random samples on secondary storage. We show that the sample maintenance cost can be reduced by refreshing the sample in a deferred manner. We introduce a novel type of log file which follows the intuition that only a "sample" of the operations on the base data has to be considered to maintain a random sample in a statistically correct way. Additionally, we develop a deferred refresh algorithm which updates the sample by using fast sequential disk access only, and which does not require any main memory. We conducted an extensive set of experiments and found, that our algorithms reduce maintenance cost by several orders of magnitude.

Details

Original languageEnglish
Title of host publicationAdvances in Database Technology - EDBT 2006 - 10th International Conference on Extending Database Technology, Proceedings
Pages423-441
Number of pages19
Publication statusPublished - 2006
Peer-reviewedYes

Publication series

SeriesLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume3896 LNCS
ISSN0302-9743

Conference

Title10th International Conference on Extending Database Technology, EDBT 2006
Duration26 - 31 March 2006
CityMunich
CountryGermany

External IDs

ORCID /0000-0001-8107-2775/work/200630407

Keywords

Research priority areas of TU Dresden

DFG Classification of Subject Areas according to Review Boards

Subject groups, research areas, subject areas according to Destatis