Search Results Auto-Classified by Webpage Genre

Researchers have developed an accurate and robust model that classifies any given webpage as a blog, newspaper site, forum, FAQ, e-shop, listing, personal home page and so on. This would extend the limited classification of search results (e.g., Google News) available now.

Search results classified by genre are nothing new; however, most of the classification is manual. For instance, the sites and pages listed in Google News are manually submitted and approved links rather than automatically classified ones. Manually classifying every single page on the web into genres is impossible. Automatic genre classification would help users easily access information based on very specific needs.

A group of researchers from Greece has implemented and tested a model that uses both textual and structural information. This combination led to very high accuracy (96.5%) in the tests conducted. Previous attempts had limited success due to the lack of a pre-classified list of websites for testing.
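To make the idea of combining textual and structural information concrete, here is a minimal Python sketch. The post does not say which features the researchers used, so everything below is an assumption for illustration: the function name `extract_features`, the choice of character 3-grams for the textual side, and plain HTML tag counts for the structural side.

```python
from collections import Counter
from html.parser import HTMLParser


class _PageParser(HTMLParser):
    """Collects text nodes and the sequence of opening tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract_features(html, n=3):
    """Return one Counter mixing textual and structural features.

    Hypothetical feature choice (not taken from the research):
    character n-grams of the page text, plus counts of HTML tags.
    """
    parser = _PageParser()
    parser.feed(html)
    parser.close()

    # Normalise whitespace and case before slicing n-grams.
    text = " ".join(" ".join(parser.text_parts).split()).lower()

    features = Counter()
    # Textual features: overlapping character n-grams.
    for i in range(len(text) - n + 1):
        features["txt:" + text[i:i + n]] += 1
    # Structural features: how often each tag opens.
    for tag in parser.tags:
        features["tag:" + tag] += 1
    return features
```

Prefixing the keys with `txt:` and `tag:` keeps the two kinds of information in one vector without collisions. Note that this simple parser also treats the contents of script and style elements as text, which a real system would probably filter out.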

The research used two different sets of pre-classified pages (corpora), named 7Genre and KI-04. 7Genre has seven genres: blog, e-shop, FAQs, online newspaper front page, listing, personal home page and search page, with 200 pages per genre for training and testing the model. KI-04 has eight genres: article, download, link collection, portrayal-private, discussion, help, portrayal-non private and shop. Both corpora were used to train the model automatically and then to test its success.

To test the robustness of the method, cross testing was performed as well, wherein the model was trained on one corpus and tested on the other. This setup helped gauge how well such a system could work in a real-life situation. This testing also produced accurate results.
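The cross-testing setup can be sketched in a few lines of Python. This is not the researchers' model: the classifier below is a simple nearest-centroid stand-in using cosine similarity, and all function names are hypothetical. It also assumes that the genres of the two corpora have already been mapped onto a shared set of labels, a step the post does not describe.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse feature mappings."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(samples):
    """Sum the feature counts of each genre.

    samples: iterable of (feature_counter, genre_label) pairs.
    Cosine similarity ignores scale, so a sum works as a centroid.
    """
    sums = defaultdict(Counter)
    for features, label in samples:
        sums[label].update(features)
    return dict(sums)


def predict(centroids, features):
    """Pick the genre whose centroid is most similar to the page."""
    return max(centroids, key=lambda label: cosine(centroids[label], features))


def cross_corpus_accuracy(train_corpus, test_corpus):
    """Train on one corpus, then score on a different one.

    Both arguments are non-empty lists of (features, label) pairs
    that share the same label names (an assumption of this sketch).
    """
    centroids = train_centroids(train_corpus)
    hits = sum(predict(centroids, f) == y for f, y in test_corpus)
    return hits / len(test_corpus)
```

Calling `cross_corpus_accuracy(corpus_a, corpus_b)` and then swapping the arguments gives the two directions of the cross test.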

The model does not need any manual selection of the features that best capture the stylistic properties of the text or the structure of webpages. This lets it adapt to the specific properties of a given collection of genres, whether general or focused. Training time is proportional to the size of the corpus; however, the cross-check experiments suggest that training on a smaller, different corpus is sufficient. Automatic training lets the model evolve with the ever-changing web, and it can be used for any language.
