Paper Cuts: Fetch the Information Ubu, Fetch!...Good Dog!

This week’s session was on information retrieval. Although it was far more manageable and intelligible than SQL, it raised some questions all of its own.

Normally I don’t phrase search engine queries in natural language. I always, and essentially subconsciously, break the question down into keywords and search that way. ‘Who was the Luftwaffe commander at the Battle of El Alamein?’ for me looks like ‘Luftwaffe commander el alamein’. This will not always return the best results – or even accurate ones – but I then build the question back up from the keywords. Not sure why I do it this way, but I always have.

When we turned to the exercises after the lecture, issues quickly popped up, the obvious one being: how do you define ‘relevant’? Is the OX Student Union relevant when searching for OX Uni? What about OUP? OU Alumni? Some engines add in maps, ‘covert’ ads, tweets and images. Do these count as results for the relevance calculation (relevant documents retrieved divided by total documents retrieved)? The ads can answer some queries (‘where do I buy X?’), so do they count as results? What about when one answer contains a set of other answers, as often seen on Google, where it gives a banner answer and then has results and links to other subsets of answers?

As search engines generally show ten results per page, there is an inherent restriction on the relevance calculation: everything is done on a basis of ten, and if there are no matches in the top ten the result has to be a fail. Or should the total number of results (sometimes in the hundreds of millions) be part of the calculation? Or perhaps we are only looking for a relevance score between 0 and 1?
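To make the arithmetic concrete, here is a minimal Python sketch of the calculation as I understand it: relevant results divided by results retrieved, limited to the first page of ten. The function name, the result ‘kinds’ (organic, ad, map) and the sample page are all my own assumptions for illustration, not anything from the exercise sheet. It also shows how the score shifts depending on whether ads and maps are counted as results.

```python
from typing import Iterable, Sequence, Tuple

# Each result is (kind, relevant), e.g. ("organic", True) or ("ad", False).
# The kinds and the judgements are hypothetical, chosen by hand.
Result = Tuple[str, bool]


def precision_at_k(
    results: Sequence[Result],
    k: int = 10,
    count_kinds: Iterable[str] = ("organic",),
) -> float:
    """Relevant results divided by results retrieved, over the first k
    results whose kind is in count_kinds."""
    kinds = set(count_kinds)
    counted = [relevant for kind, relevant in results if kind in kinds][:k]
    if not counted:
        # Nothing retrieved that we are willing to count: treat as a fail.
        return 0.0
    return sum(counted) / len(counted)


# A made-up results page: one ad, one map and four ordinary hits.
page = [
    ("ad", True),
    ("organic", True),
    ("organic", False),
    ("map", True),
    ("organic", True),
    ("organic", False),
]

organic_only = precision_at_k(page)
everything = precision_at_k(page, count_kinds=("organic", "ad", "map"))
print(organic_only)          # 2 relevant out of 4 organic hits
print(round(everything, 2))  # 4 relevant out of 6 results of any kind
```

On this made-up page the score is 0.5 if only ordinary hits count and about 0.67 if the ad and the map count too, which is exactly the kind of variation that makes comparing engines slippery.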

Looking for car insurance – do comparison / screen scraper sites count as good results? They are sometimes not a place where you can actually make a purchase but rather gateways to other providers. They are not the end of the search: you have to use them and then perhaps go on to another result that they generate, depending on what you have asked for.

‘Car insurance’ can return results you didn’t ask for, with the top hits, most hits or even all hits having an additional word added such as ‘cheap’. On Bing, ‘car insurance NOT cheap’ relegated cheap car insurance to the ads. On Google, cheap car insurance was still a regular hit.

‘President uruguay biography’ had some interesting results. As it is not a dated request, some results gave the wrong answer – previous presidents from throughout Uruguayan history, for example. Adding AND made little difference.

‘Who is the current president of uruguay?’ was less successful again, as the searches seem to focus on the post and not the post holder. The engines would also suggest links to sites like Yahoo Answers, where anyone can give an answer.

Natural language did not work as well as basic keyword queries such as ‘uruguay president biography’, but a large reason for that is Wikipedia. Remove the Wikipedia result and it becomes harder, and more descriptors and words are needed.

The Captain Swing query was entered as posed in the exercises. Bing came out on top here, as the two or three lines under each search result contained more relevant information, so that on Bing I knew almost at once who the good Captain was and what role he played. Google, however, had the best top hit and the best information in that hit.

‘Captain Swing English history’ returned far better results. The 19th century element of the request seems less important, but Captain Swing isn’t a very common name. If it were the history of a Smith, Jones or similar then there might have been far more results. The same goes for someone like a Roosevelt or a Churchill: those names and families have been involved in historical events for hundreds of years, so a more specific request would be appropriate.

With cooking the artichokes, the keyword ‘cook’ or ‘cooking’ would take the focus away from what the vegetables (which I believe are inedible) actually are. More so than Swing and the other questions, it was a two-part question that would need two results. Some results gave both, but many gave one or the other, or both but unsatisfactorily.

Korean War and Putney Debates were straightforward enough, though the impact of the latter was a little trickier to get than the origins of the former. Here I think the popularity (ultimately measured in £s) of the Korean War plays a part in how much is written, therefore making it easier or harder to find really relevant results right off the bat. Although this may be the historian with prejudices and favourite periods coming out in me, as opposed to something I’d put in an essay.

The Levellers and the reasons behind Operation Barbarossa were pretty much the same.

And who was best, Bing or Google? Well, it has to be said that it was Bing, based on the numbers. Now, part of this may be to do with slight changes in the calculation each time based on the variables listed above, but it was certainly a surprise. I loathe Bing and find it of little to no use in the searches I perform, and any query on it is usually one word: “google”. I’m not going to switch to Bing despite this, as I simply do not like it and don’t find what I’m after other than in this kind of experiment. Every man has a right to his prejudices after all.

Clearly my standard way of searching – as little natural language as possible – is workable but not a great system for all occasions. Without Wikipedia the results of a lot of this would be very different as well. I will think about the best way to search and to retrieve information but as this clock is not broken I’m not going to take it apart just yet.


Tuesday, 18 October 2011
