Friday, 1:30–3:00 PM
Chair: Flavio Junqueira

A Pattern Tree-based Approach to Learning URL Normalization Rules

Rui Cai, Lei Zhang

Duplicate URLs cause serious problems throughout the pipeline of a search engine, from crawling and indexing to result serving. URL normalization transforms duplicate URLs into a canonical form using a set of rewrite rules. Nowadays, URL normalization is attracting significant attention because it is lightweight and can be flexibly integrated into both the online (e.g., crawling) and offline (e.g., index compression) parts of a search engine. To handle a large number of websites, automatic approaches are highly desirable for learning rewrite rules for various kinds of duplicate URLs. In this paper, we rethink the problem of URL normalization from a global perspective and propose a pattern tree-based approach. This is markedly different from existing approaches, which normalize URLs by iteratively inducing local duplicate pairs to a more general form and consequently suffer from noisy training URLs and low efficiency in practical systems. Given a training set of URLs partitioned into duplicate clusters for a target website, we develop a simple yet efficient algorithm to automatically construct a URL pattern tree. With the constructed pattern tree, we can leverage statistical information from all the training samples and make the learning process more robust and reliable. The learning process is also accelerated, as rules are summarized directly from pattern tree nodes. In addition, on the engineering side, the pattern tree helps select deployable rules by removing conflicts and redundancies. A large-scale evaluation on more than 70 million duplicate URLs from 200 websites showed that the proposed approach achieves very promising performance in terms of both de-duplication effectiveness and computational efficiency.
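The kind of rewrite rule the paper learns can be illustrated with a minimal sketch: a rule that canonicalizes duplicates by dropping content-irrelevant query parameters and sorting the rest. The parameter names and the rule itself are illustrative assumptions, not rules from the paper.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical rule: these query parameters are assumed not to affect
# page content (names are illustrative, not learned from real data).
IRRELEVANT_PARAMS = {"sessionid", "utm_source", "ref"}

def normalize(url: str) -> str:
    """Rewrite a URL to a canonical form by dropping irrelevant
    query parameters and sorting the remaining ones."""
    parts = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k.lower() not in IRRELEVANT_PARAMS)
    return urlunparse(parts._replace(query=urlencode(kept)))

# Both duplicate URLs map to http://example.com/item?id=42
print(normalize("http://example.com/item?id=42&sessionid=abc"))
print(normalize("http://example.com/item?sessionid=xyz&id=42"))
```

A learned rule set would contain many such site-specific rewrites; the paper's contribution is deriving them automatically from clustered duplicates rather than writing them by hand.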

0-Cost Semisupervised Bot Detection for Search Engines

Hongwen Kang, Kuansan Wang, David Soukal, Fritz Behr, Zijian Zheng

In this paper, we propose a semi-supervised learning approach for distinguishing program- (bot-) generated web search traffic from that of genuine human users. This is a crucial problem for web search engines because bot traffic significantly affects both real-time search engine performance and the quality of offline data-mining results on the search logs. However, the enormous volume of search data poses a challenge for traditional approaches that rely on fully annotated training samples. To this end, we propose a novel semi-supervised framework that addresses the problem on multiple fronts. First, we use the CAPTCHA technique and simple heuristics to extract from the data logs search sessions that are likely to have been generated by genuine human users or by bots. This step outputs a large set of training samples with initial labels, though directly using these training data is problematic because they are biased samples of the whole data space and can therefore lead a machine learning system to acquire a skewed decision boundary that does not generalize well to unseen data. To tackle this problem, we further develop a semi-supervised learning algorithm that takes advantage of unlabeled data to improve classification performance. The two proposed algorithms are seamlessly combined and offer the following advantages. First, it becomes very cost-efficient to generate a large number of labeled samples to initialize the training process. Second, our semi-supervised learning approach is effective and resilient against the bias introduced in the data generation process. In our experiments, the proposed approach showed significant (i.e., 2–3:1) improvement over the traditional supervised approach.
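The semi-supervised step can be pictured as a generic self-training loop: seed labels come from heuristics, and confidently classified unlabeled sessions are folded back into the training set. This is a sketch of the general technique only, not the authors' algorithm; the single "queries per minute" feature and the midpoint classifier are illustrative assumptions.

```python
# Self-training sketch. Seed rates stand in for the CAPTCHA/heuristic
# labeling described above; real systems use many features, not one.

def train_threshold(human_rates, bot_rates):
    # Trivial 1-D classifier: midpoint between the two class means.
    return (sum(human_rates) / len(human_rates) +
            sum(bot_rates) / len(bot_rates)) / 2

def self_train(human_rates, bot_rates, unlabeled, margin=5.0, rounds=3):
    human, bot = list(human_rates), list(bot_rates)
    for _ in range(rounds):
        t = train_threshold(human, bot)
        remaining = []
        for r in unlabeled:
            if r < t - margin:       # confidently human-like
                human.append(r)
            elif r > t + margin:     # confidently bot-like
                bot.append(r)
            else:
                remaining.append(r)  # too close to the boundary
        unlabeled = remaining
    return train_threshold(human, bot)

# Seed labels (queries/minute) plus unlabeled sessions shift the boundary.
threshold = self_train([2, 3, 4], [40, 50, 60], [5, 6, 45, 55, 20])
```

Folding in the unlabeled sessions moves the decision boundary toward where the bulk of the traffic actually lies, which is how the approach counteracts the bias of the heuristically labeled seed set.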

CETR – Content Extraction via Tag Ratios

Tim Weninger, William Hsu, Jiawei Han

We present Content Extraction via Tag Ratios (CETR) — a method to extract content text from diverse Web pages by using the HTML document’s tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
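The line-level tag ratio at the core of CETR is the amount of non-tag text on an HTML line divided by the number of tags on that line. A simplified sketch of that computation (the regex-based tag counting is an assumption, not the authors' implementation):

```python
import re

TAG = re.compile(r"<[^>]*>")

def tag_ratio(line: str) -> float:
    """Non-tag text length divided by tag count for one HTML line.
    Content-heavy lines yield high ratios; boilerplate yields low ones."""
    tags = TAG.findall(line)
    text_len = len(TAG.sub("", line))
    return text_len / max(len(tags), 1)

html = [
    '<div><a href="/">Home</a><a href="/news">News</a></div>',
    '<p>This line carries the actual article text of the page.</p>',
]
# The navigation line gets a low ratio, the article line a high one.
ratios = [tag_ratio(line) for line in html]
```

Clustering these per-line ratios (or, in the extended method, a two-dimensional smoothed version of them) separates content regions from navigation and boilerplate.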
