Oh Google, Why so Candid? (Screenshots)

“Google Suggest” is a feature that suggests search queries as you type, based on the spelling and the search frequency of potentially matching search terms. A second feature, “Searches Related to”, is displayed at the bottom of the search results page. It is not based on spelling and is usually referred to as query refinement, much like Amazon’s “people who bought X also bought Y & Z”. Because suggestions are generated from spelling and search frequency while related searches are generated from other metrics, the related searches let Google expose quite an array of other searches the user might be making.
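
To make the “spelling plus search frequency” idea concrete, here is a minimal sketch. It assumes that spelling means simple prefix matching and that frequency comes from a plain log of past queries. The `suggest` function and the `QUERY_LOG` data are hypothetical and invented for illustration; this is not how Google actually implements the feature.

```python
from collections import Counter

def suggest(prefix, query_log, limit=3):
    """Return the most frequent logged queries that start with the prefix."""
    counts = Counter(q.strip().lower() for q in query_log)
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically so ties are stable.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]

# Hypothetical query log, invented for illustration only.
QUERY_LOG = [
    "world cup schedule", "world cup schedule", "world cup schedule",
    "world cup 2010", "world cup 2010",
    "world map",
    "weather today",
]

if __name__ == "__main__":
    print(suggest("world c", QUERY_LOG))
```

Related searches would need a different signal, such as which queries tend to appear together in the same session, which is why the two features can surface very different terms.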

Now, here is what Google thinks is related to a couple of interesting search terms.

Google vs Bing – Best Explained (pic)

So how long will Bing take to catch up? Bing, you are not new; you are just a renamed search engine.
UPDATE: To people denying this, visit www.bing.com/?cc=de and enter “world cup” as your query.

Search Results Auto Classified By Webpage Genres

Researchers have developed a very accurate and robust model that classifies any given webpage as a blog, newspaper site, forum, FAQ, e-shop, listing, personal home page, etc. This would extend the limited classification of search results (e.g., Google News) available now.

Search results classified by genre are nothing new; however, most of the classification is manual. For instance, the sites and pages listed in Google News are manually submitted and approved links rather than automatically classified ones. Manually classifying every single page on the web into several genres is impossible, yet genre classification would help users reach information suited to very specific needs.

A group of researchers from Greece has implemented and tested a model that uses both textual and structural information. This combination led to a very high accuracy (96.5%) in the tests conducted. Previous attempts had limited success due to the lack of a pre-classified list of websites for testing.
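
As a rough illustration of what combining textual and structural information can look like, here is a toy sketch. It assumes word counts as the textual features, opening HTML tag counts as the structural features, and a cosine-similarity match against one profile per genre. These choices, the function names and the two training pages are all assumptions made for illustration; they are not the researchers' actual features or algorithm.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)")  # opening tag names only
WORD_RE = re.compile(r"[a-z]+")

def features(html):
    """Count opening tags (structural) and words (textual) in one page."""
    tags = Counter("tag:" + t.lower() for t in TAG_RE.findall(html))
    text = re.sub(r"<[^>]*>", " ", html).lower()  # strip the markup
    words = Counter("word:" + w for w in WORD_RE.findall(text))
    return tags + words

def train(labelled_pages):
    """Sum the feature counts of each genre into one profile per genre."""
    profiles = {}
    for html, genre in labelled_pages:
        profiles.setdefault(genre, Counter()).update(features(html))
    return profiles

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0

def classify(html, profiles):
    """Return the genre whose profile is most similar to the page."""
    page = features(html)
    return max(profiles, key=lambda genre: cosine(page, profiles[genre]))

# Hypothetical training pages, invented for illustration only.
TRAINING = [
    ("<h1>My blog</h1><p>Posted by admin</p><p>Leave a comment</p>", "blog"),
    ("<form><input><button>Add to cart</button></form><p>Price and shipping</p>", "e-shop"),
]

if __name__ == "__main__":
    profiles = train(TRAINING)
    print(classify("<p>Posted yesterday</p><p>Comment on this post</p>", profiles))
```

Even in this toy version, the tags and the words both contribute to the decision: a page built around a form with inputs and buttons leans towards the e-shop profile before its wording is considered.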

The research used two different sets of classified pages (corpora), named 7Genre and KI-04. 7Genre has seven genres: blog, e-shop, FAQs, online newspaper front page, listing, personal home page and search page, with 200 pages per genre for training and testing the model. KI-04 has eight genres: article, download, link collection, portrayal-private, discussion, help, portrayal-non private and shop. Both corpora were used to train the model automatically and then to test how well it performed.

To test the robustness of the method, cross-testing was performed as well, wherein the model was trained on one corpus and tested on a different one. This setup helped gauge whether such a system could be used in a real-life situation. The cross-testing also produced significantly accurate results.
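
The cross-testing setup can be sketched as follows. Because the two corpora use different genre names, this sketch assumes a hand-written label mapping (for example, treating “shop” as “e-shop”) and scores only the test pages whose genre has a counterpart in the training corpus. The mapping, the toy word-overlap classifier and the sample pages are hypothetical; the source does not say how the researchers reconciled the two label sets.

```python
from collections import Counter

def train(pages):
    """Build one bag-of-words profile per genre from (text, genre) pairs."""
    profiles = {}
    for text, genre in pages:
        profiles.setdefault(genre, Counter()).update(text.lower().split())
    return profiles

def predict(text, profiles):
    """Pick the genre whose profile shares the most distinct words with the text."""
    words = set(text.lower().split())
    return max(profiles, key=lambda genre: len(words & set(profiles[genre])))

def cross_corpus_accuracy(train_corpus, test_corpus, label_map):
    """Train on one corpus, test on another, comparing mapped labels."""
    profiles = train(train_corpus)
    # Score only test pages whose genre maps to a genre in the training corpus.
    scored = [(text, label_map[genre]) for text, genre in test_corpus
              if genre in label_map]
    hits = sum(predict(text, profiles) == genre for text, genre in scored)
    return hits / len(scored) if scored else 0.0

# Hypothetical miniature corpora and label mapping, invented for illustration.
CORPUS_A = [
    ("add to cart price shipping checkout", "e-shop"),
    ("posted by admin leave a comment", "blog"),
]
CORPUS_B = [
    ("price list and checkout", "shop"),
    ("download the latest version", "download"),
]
LABEL_MAP = {"shop": "e-shop"}

if __name__ == "__main__":
    print(cross_corpus_accuracy(CORPUS_A, CORPUS_B, LABEL_MAP))
```

The point of the protocol is that the test pages come from a collection the model never saw during training, which is closer to how a classifier would meet pages on the open web.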

The model does not need any manual selection of the features that best capture the stylistic properties of the text or the structure of webpages. This lets it adapt to the specific properties of any given collection of genres, whether general or focused. Training time grows with the size of the corpus; however, the cross-check experiments show that training on smaller and different corpora is sufficient. Automatic training lets the model evolve with the ever-changing web, and it can be used for any language.
