UniCrawl: A Practical Geographically Distributed Web Crawler

Do Le Quoc; Christof Fetzer; Pascal Felber; Étienne Rivière; Valerio Schiavoni; Pierre Sutra

doi:10.1109/CLOUD.2015.59

Publication: Contribution to book/conference proceedings/anthology/report › Contribution to conference proceedings › Contributed › Peer-reviewed

Contributors

Do Le Quoc - , Chair of Systems Engineering (SE) (Author)
Christof Fetzer - , Chair of Systems Engineering (SE) (Author)
Pascal Felber - (Author)
Étienne Rivière - (Author)
Valerio Schiavoni - (Author)
Pierre Sutra - (Author)

Abstract

As the wealth of information available on the web keeps growing, being able to harvest massive amounts of data has become a major challenge. Web crawlers are the core components to retrieve such vast collections of publicly available data. The key limiting factor of any crawler architecture is, however, its large infrastructure cost. To reduce this cost, and in particular the high upfront investments, we present in this paper a geo-distributed crawler solution, UniCrawl. UniCrawl orchestrates several geographically distributed sites. Each site operates an independent crawler and relies on well-established techniques for fetching and parsing the content of the web. UniCrawl splits the crawled domain space across the sites and federates their storage and computing resources, while minimizing the inter-site communication cost. To assess our design choices, we evaluate UniCrawl in a controlled environment using the ClueWeb12 dataset, and in the wild when deployed over several remote locations. We conducted several experiments over 3 sites spread across Germany. When compared to a centralized architecture with a crawler simply stretched over several locations, UniCrawl shows a performance improvement of 93.6% in terms of network bandwidth consumption, and a speedup factor of 1.75.
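
The abstract does not say how the domain space is split across sites. As a rough illustration only, the sketch below shows one common way to do such a split: hashing each URL's host name so that all pages of a domain are assigned to the same site. The site names, the `site_for_url` and `partition` helpers, and the choice of SHA-256 are assumptions made for this example and are not taken from the paper.

```python
import hashlib
from urllib.parse import urlsplit

# Hypothetical site identifiers; the abstract only states that three sites were used.
SITES = ["site-a", "site-b", "site-c"]


def site_for_url(url, sites=SITES):
    """Return the site responsible for the URL's host name (illustrative hash split)."""
    # urlsplit().hostname is already lower-cased; fall back to "" if the URL has no host.
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer, then pick a site.
    return sites[int.from_bytes(digest[:8], "big") % len(sites)]


def partition(urls, sites=SITES):
    """Group URLs by responsible site so each site only crawls its own domains."""
    buckets = {site: [] for site in sites}
    for url in urls:
        buckets[site_for_url(url, sites)].append(url)
    return buckets
```

Under this assumed scheme, every URL of a given host lands in the same bucket, so links within a domain never need to be forwarded to another site; only links that point to a host owned by a different site would generate inter-site traffic.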

Details

Original language	English
Title	8th IEEE International Conference on Cloud Computing (CLOUD'15)
Publisher	IEEE Computer Society, Washington
Publication status	Published - 1 July 2015
Peer-review status	Yes

External IDs

Scopus	84960145187

Keywords

Research priority areas of TU Dresden

Information Technology and Microelectronics

DFG subject classification by review board

Security and Dependability

Keywords

web crawler, geo-distributed system, cloud federation, storage, map-reduce, Uniform resource locators, Computer architecture, Distributed databases, Web pages
