NEAR-Miner

Mining evolution associations of web site directories for efficient maintenance of web archives

Authored by
Ling Chen, Sourav S. Bhowmick, Wolfgang Nejdl
Abstract

Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading Web pages located in hierarchical directory structures that are believed to have changed significantly (e.g., a substantial percentage of pages have been inserted into or removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can be easily retrieved from the archive. In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of the Web directory structures of the original Web site stored in the archive and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that optimally skips, during the next crawl, the subdirectories that are negatively correlated with an ancestor directory in undergoing significant changes. Our experimental results with real data show that our approach significantly improves the efficiency of the archive maintenance process while only slightly sacrificing the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to discover NEARs frequently, as the mined rules can be utilized effectively for archive maintenance over multiple versions.
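
The idea of mining negative correlations between ancestor and descendant directories, and then using them to skip subdirectories during a crawl, can be illustrated with a small sketch. It is not the NEAR-Miner or WARM algorithm from the paper: the abstract does not specify the correlation measure, the pruning strategy, or how the skipping is made optimal. The sketch assumes that each archived version is summarized as the set of directories that changed significantly, and it uses the phi coefficient with an arbitrary threshold as a stand-in correlation measure. All function names, the threshold, and the toy data are hypothetical.

```python
from math import sqrt


def is_ancestor(a, d):
    """Return True if directory path `a` is a proper ancestor of path `d`."""
    return a != d and d.startswith(a.rstrip("/") + "/")


def phi(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 contingency table (0.0 when undefined).

    Assumption: phi stands in for whatever correlation measure the paper uses.
    """
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0 else (n11 * n00 - n10 * n01) / denom


def mine_negative_rules(history, directories, threshold=-0.5):
    """Find ancestor-descendant pairs whose significant changes are
    negatively correlated across the archived versions.

    `history` is a list of sets; each set holds the directories that
    changed significantly in one version (hypothetical representation).
    """
    rules = {}
    for a in directories:
        for d in directories:
            if not is_ancestor(a, d):
                continue
            # Count the four co-occurrence cases over all versions.
            n11 = sum(1 for v in history if a in v and d in v)
            n10 = sum(1 for v in history if a in v and d not in v)
            n01 = sum(1 for v in history if a not in v and d in v)
            n00 = sum(1 for v in history if a not in v and d not in v)
            score = phi(n11, n10, n01, n00)
            if score <= threshold:
                rules[(a, d)] = score
    return rules


def directories_to_skip(rules, changed_ancestors):
    """Descendants that a crawler might skip once the given ancestors
    are observed to have changed significantly."""
    return {d for (a, d) in rules if a in changed_ancestors}


# Toy history: six archived versions of a hypothetical site.
history = [
    {"/news"},
    {"/news"},
    {"/news/archive"},
    {"/news"},
    {"/news/archive"},
    {"/news", "/news/sports"},
]
directories = ["/news", "/news/archive", "/news/sports"]

rules = mine_negative_rules(history, directories)
skip = directories_to_skip(rules, {"/news"})
print(rules)
print(skip)
```

In this toy history, `/news` and `/news/archive` never change significantly in the same version, so the pair is flagged as negatively correlated, while `/news/sports` is not. Mining runs off-line over the stored history, and the crawler only consults the resulting rule table, which mirrors the split between mining and maintenance described in the abstract.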

Organisational unit(s)
Forschungszentrum L3S
External organisation(s)
Nanyang Technological University (NTU)
Type
Article
Journal
Proceedings of the VLDB Endowment
Volume
2
Pages
1150-1161
Number of pages
12
Publication date
1 August 2009
Publication status
Published
Peer-reviewed
Yes
ASJC Scopus subject areas
Computer Science (miscellaneous), General Computer Science
Electronic version(s)
https://doi.org/10.14778/1687627.1687757 (Access: Closed)