NEAR-Miner

Mining evolution associations of web site directories for efficient maintenance of web archives

authored by
Ling Chen, Sourav S. Bhowmick, Wolfgang Nejdl

Abstract

Web archives preserve the history of autonomous Web sites and are potential gold mines for all kinds of media and business analysts. The most common Web archiving technique uses crawlers to automate the process of collecting Web pages. However, periodically (re)downloading the entire collection of pages from a large Web site is infeasible. In this paper, we take a step towards addressing this problem. We devise a data mining-driven policy for selectively (re)downloading only those Web pages located in hierarchical directory structures that are believed to have changed significantly (e.g., a substantial percentage of pages have been inserted into or removed from the directory). Consequently, there is no need to download and maintain pages that have not changed since the last crawl, as they can be easily retrieved from the archive. In our approach, we propose an off-line data mining algorithm called NEAR-Miner that analyzes the evolution history of the Web directory structures of the original Web site, as stored in the archive, and mines negatively correlated association rules (NEARs) between ancestor-descendant Web directories. These rules indicate the evolution correlations between Web directories. Using the discovered rules, we propose an efficient Web archive maintenance algorithm called WARM that, during the next crawl, optimally skips the subdirectories that are negatively correlated with their ancestor directory in undergoing significant changes. Our experimental results with real data show that our approach significantly improves the efficiency of the archive maintenance process while only slightly sacrificing the "freshness" of the archives. Furthermore, our experiments demonstrate that it is not necessary to mine NEARs frequently, as the mined rules can be used effectively for archive maintenance over multiple versions.
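The idea in the abstract can be illustrated with a minimal sketch. It is not the NEAR-Miner or WARM algorithm from the paper: the abstract does not specify the correlation measure, thresholds, or data layout, so everything below is an assumption. The sketch uses the phi coefficient as a stand-in correlation measure, a hypothetical `history` mapping from directory path to one boolean per archived version ("changed significantly" or not), and hypothetical helper names (`phi`, `mine_negative_rules`, `directories_to_skip`) and a hypothetical threshold of -0.5.

```python
from math import sqrt


def phi(xs, ys):
    """Phi coefficient between two equal-length boolean sequences.

    Assumption: used here only as a stand-in correlation measure;
    the abstract does not say which measure the paper uses.
    """
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    n10 = sum(1 for x, y in zip(xs, ys) if x and not y)
    n01 = sum(1 for x, y in zip(xs, ys) if not x and y)
    n00 = sum(1 for x, y in zip(xs, ys) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # one sequence is constant; treat as uncorrelated
    return (n11 * n00 - n10 * n01) / denom


def is_ancestor(ancestor, descendant):
    """True if `ancestor` is a proper path prefix of `descendant`."""
    prefix = ancestor.rstrip("/") + "/"
    return descendant != ancestor and descendant.startswith(prefix)


def mine_negative_rules(history, threshold=-0.5):
    """Return (ancestor, descendant) pairs whose change histories are
    negatively correlated at or below the (hypothetical) threshold."""
    rules = []
    for anc in history:
        for desc in history:
            if is_ancestor(anc, desc) and phi(history[anc], history[desc]) <= threshold:
                rules.append((anc, desc))
    return rules


def directories_to_skip(changed_dirs, rules):
    """Descendants that can be skipped in the next crawl because a
    negatively correlated ancestor has changed significantly."""
    return {desc for anc, desc in rules if anc in changed_dirs}


# Hypothetical example data: one flag per archived version,
# True meaning "this directory changed significantly".
history = {
    "/news": [True, True, False, True, False, True],
    "/news/archive": [False, False, True, False, True, False],
    "/news/today": [True, True, False, True, False, True],
}
rules = mine_negative_rules(history)
skip = directories_to_skip({"/news"}, rules)
```

In this toy data, `/news/archive` changes only in versions where `/news` does not, so the pair is mined as a rule and `/news/archive` is skipped once `/news` is observed to have changed. A real system would mine such rules off-line from the archive and, as the abstract reports, could reuse them across several crawls.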

Organisation(s)
L3S Research Centre
External Organisation(s)
Nanyang Technological University (NTU)
Type
Article
Journal
Proceedings of the VLDB Endowment
Volume
2
Pages
1150-1161
No. of pages
12
Publication date
01.08.2009
Publication status
Published
Peer reviewed
Yes
ASJC Scopus subject areas
Computer Science (miscellaneous), General Computer Science
Electronic version(s)
https://doi.org/10.14778/1687627.1687757 (Access: Closed)