Deferred maintenance of disk-based random samples
Research output: Contribution to book/Conference proceedings/Anthology/Report › Conference contribution › Contributed › peer-review
Abstract
Random sampling is a well-known technique for approximate processing of large datasets. We introduce a set of algorithms for incremental maintenance of large random samples on secondary storage. We show that the sample maintenance cost can be reduced by refreshing the sample in a deferred manner. We introduce a novel type of log file which follows the intuition that only a "sample" of the operations on the base data has to be considered to maintain a random sample in a statistically correct way. Additionally, we develop a deferred refresh algorithm which updates the sample by using fast sequential disk access only, and which does not require any main memory. We conducted an extensive set of experiments and found that our algorithms reduce maintenance cost by several orders of magnitude.
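The intuition behind logging only a "sample" of the operations can be illustrated with classic reservoir sampling. The sketch below is not the paper's algorithm: it assumes an insert-only workload, keeps both the sample and the log in Python lists as stand-ins for disk files, and applies the log with random slot replacement rather than the sequential, memory-free refresh the abstract describes. The class name `DeferredReservoir` and its methods are hypothetical.

```python
import random


class DeferredReservoir:
    """Hypothetical sketch: reservoir sample refreshed from a candidate log."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity      # fixed sample size M
        self.sample = []              # stand-in for the on-disk sample
        self.log = []                 # stand-in for the candidate log file
        self.seen = 0                 # insertions into the base data so far
        self.rng = random.Random(seed)

    def insert(self, item):
        """Record an insertion; log it only if reservoir sampling accepts it."""
        self.seen += 1
        if self.seen <= self.capacity:
            # Initial fill: the first M items always enter the sample.
            self.sample.append(item)
        elif self.rng.random() < self.capacity / self.seen:
            # The t-th item is accepted with probability M/t. Only accepted
            # items are logged; rejected insertions leave no trace.
            self.log.append(item)

    def refresh(self):
        """Apply logged candidates in order; return how many were applied."""
        for candidate in self.log:
            # Each candidate overwrites a uniformly chosen slot, as it would
            # have done under immediate maintenance.
            slot = self.rng.randrange(self.capacity)
            self.sample[slot] = candidate
        applied = len(self.log)
        self.log.clear()
        return applied
```

With a sample of 10 items and 1,000 insertions, the log is expected to hold only a few dozen candidates, so a deferred refresh touches far fewer records than the base data received. Because acceptance is decided at insertion time and replacement slots are chosen independently, postponing the replacements yields the same sample distribution as applying them immediately.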
Details
| Original language | English |
|---|---|
| Title of host publication | Advances in Database Technology - EDBT 2006 - 10th International Conference on Extending Database Technology, Proceedings |
| Pages | 423-441 |
| Number of pages | 19 |
| Publication status | Published - 2006 |
| Peer-reviewed | Yes |
Publication series
| Series | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 3896 LNCS |
| ISSN | 0302-9743 |
Conference
| Title | 10th International Conference on Extending Database Technology, EDBT 2006 |
|---|---|
| Duration | 26 - 31 March 2006 |
| City | Munich |
| Country | Germany |
External IDs
| ORCID | /0000-0001-8107-2775/work/200630407 |
|---|---|