Spamming White Papers
The Adversarial Information Retrieval on the Web conference, held in Seattle on August 10th, 2006, produced several interesting papers on how to better classify website spam.
Of particular interest is the paper Tracking Web Spam with Hidden Style Similarity, which discusses fingerprinting. One of the methods described involves removing the alphanumeric characters from a document, leaving only its HTML-based properties, and then comparing the resulting footprints to find documents with similar HTML. For their experiment, the authors used three crawls: 3 million documents from 1,300 hosts in the DMOZ directory, a flat crawl of 1 million documents from a French search engine blacklist, and a deep breadth crawl of 10 non-adult spam URLs and 10 trustworthy URLs from French universities.
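To make the fingerprinting idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names are hypothetical, stripping whitespace along with the alphanumerics is my own assumption, and the comparison (Jaccard overlap of character 4-grams) is a simple stand-in for whatever similarity measure the authors actually use.

```python
import re


def style_footprint(html: str) -> str:
    """Remove alphanumeric characters, keeping only markup punctuation.

    Assumption: whitespace is stripped as well, so that indentation
    differences do not change the footprint.
    """
    return re.sub(r"[A-Za-z0-9\s]+", "", html)


def shingles(text: str, n: int = 4) -> set:
    """Return the set of overlapping character n-grams in text."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def footprint_similarity(page_a: str, page_b: str, n: int = 4) -> float:
    """Jaccard similarity between the footprints of two pages (0.0 to 1.0)."""
    set_a = shingles(style_footprint(page_a), n)
    set_b = shingles(style_footprint(page_b), n)
    if not set_a and not set_b:
        return 1.0  # two empty footprints are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Two pages built from the same template with different text,
# and a third page with unrelated markup.
page_a = '<div class="post"><h1>Cheap pills</h1><p>Buy now!</p></div>'
page_b = '<div class="post"><h1>Free casino</h1><p>Win big!</p></div>'
page_c = '<table><tr><td>Hello</td></tr></table>'

same_template = footprint_similarity(page_a, page_b)
different_template = footprint_similarity(page_a, page_c)
```

Pages generated from the same template keep the same footprint even when every word of visible text changes, which is why this kind of comparison can group automatically generated spam pages together.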
In their estimation, roughly two-fifths of these documents can be classified as spam.
In future posts, I will discuss several techniques used by web spammers, including automatic content generation.
Fighting Spam Blogs, a Hypothesis
10 August 2006
