Sunday 30 October 2011

DITA Assignment

No matter how extensive a library collection is, if the required information cannot be retrieved then its value plummets, or it becomes useless altogether. Any librarian has to evaluate technologies and manage data to get the best results for their users.

Evaluation

Vannevar Bush’s comment in 1945 that “...The investigator is staggered by the findings and conclusions of thousands of other workers—conclusions which he cannot find time to grasp, much less to remember, as they appear...”[1] has become even truer as the World Wide Web (WWW) first became popular and then an unstoppable force on the Internet.

For libraries, the Internet – and more so the WWW – has been a deeply disruptive technology (first changing and then overturning how they previously held, shared and presented data), and while it makes more information available to more people, the way the WWW has done this often makes relevant information difficult to find.

Even though the WWW was intended to have a clear structure, its 11.45 billion web pages (as of 25 October 2011[2]) present a baffling amount of data. The URL of a page has a clear structure and can sometimes reveal information about it – which country hosts the page, whether it belongs to a university, a brand or company name, etc – but it is imperative to use appropriate search methods and techniques rather than rely on decoding protocols.

Via search engines, Broder’s taxonomy[3] gives us assistance with Navigational (I want to go to this one place), Transactional (I want to buy this) and Informational (what do you have on this) queries, but we also have the option of using keywords (‘attack 911 New York’), natural language (‘the attack on the trade towers’) and the Boolean operators AND / OR / NOT. The more specific a library user can be the better, and information retrieval queries such as Known Item (I want this one thing), Fact (what is this one thing), Subject (all examples of this thing) and Exploratory (what things do you have) all help narrow the field.

In a database system such as SQL many of the same concepts apply. A library user on a catalogue database may require resources about Subject A but be open to whether they arrive in book or journal form. Any effective search system must be able to check all the available resources in all formats to see if the search keywords match, and then display the matches effectively. A relational database queried with SQL is a solution to this, as it allows users to search across a variety of linked topics or keywords.

Within the database there will be tables of information, each representing a different ‘thing’ such as author name, ISBN, etc. By matching a Primary Key against a Foreign Key, the database (hidden behind a Graphical User Interface, or GUI) searches through the different tables and presents the results which match the search query. As each piece of data is kept in a separate table, inconsistencies should be kept to an absolute minimum: the same data (an author’s name, say) exists in only one table instead of several. In this way, complex searches can be completed and the right book found in a catalogue containing millions of volumes.

For example, in a (fictional) SQL database for a library the command
select * from author where name = 'kumar'
would return every row in the author table where the name is Kumar.
For a more complex search the command
select title, author from titles where title like '%medicine%'
would show the title and author for all books with ‘medicine’ anywhere in the title.
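The two commands above can be tried out against a toy catalogue. The sketch below uses Python’s sqlite3 module with a hypothetical two-table schema – the table names, columns and data are my own invention for illustration, not a real library catalogue:

```python
import sqlite3

# Build a tiny, fictional catalogue in memory.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE titles (
    title_id INTEGER PRIMARY KEY,
    title TEXT,
    author_id INTEGER REFERENCES author(author_id))""")
cur.executemany("INSERT INTO author VALUES (?, ?)",
                [(1, "Kumar"), (2, "Jones")])
cur.executemany("INSERT INTO titles VALUES (?, ?, ?)",
                [(1, "Tropical Medicine", 1),
                 (2, "A History of Medicine", 2),
                 (3, "Welsh Castles", 2)])

# First example: every row in the author table where the name is Kumar.
kumar = cur.execute("SELECT * FROM author WHERE name = 'Kumar'").fetchall()

# Second example: title and author for anything with 'medicine' in the
# title, matching the Primary Key against the Foreign Key with a JOIN.
medicine = cur.execute("""
    SELECT t.title, a.name
    FROM titles t JOIN author a ON a.author_id = t.author_id
    WHERE t.title LIKE '%medicine%'""").fetchall()

print(kumar)     # [(1, 'Kumar')]
print(medicine)
```

Note that the second query has to join the two tables, because keeping each ‘thing’ in its own table means the author’s name does not live in the titles table at all.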
Tim Berners-Lee said: “One of the things computers have not done for an organization is to be able to store random associations between disparate things...”[4] Information retrieval systems such as SQL databases allow complex and multifaceted searches between related things (publisher, author, year of publication), but they do require effective interaction, instructions and strictures to perform these tasks. Hence the need for the searcher to use the most effective search techniques.

Managing

Despite the advantages and opportunities identified above, search results must always be assessed rather than taken at face value. Similarly, the techniques and technology available need to be managed correctly to get the best results.

Within an SQL database, for example, robust entity relationships are vital. If the keys are not unique identifiers then a search will fail or contain unnecessary results. If a table contains more than one ‘thing’ (e.g. author, title and publisher) then searches will fail or return invalid results – all books published by Elsevier regardless of author, say, as opposed to all Elsevier books by Kumar.

Similarly, an SQL database must be asked the right questions in the right places. If the system is being accessed through a basic GUI then, unless the grammar and syntax are perfect, unreadable, incorrect or irrelevant results will occur.

On the WWW the question of relevance is even more pronounced and the value of what is returned is critical, as what the user is searching for may not have a Yes / No answer. A simple calculation[5]

Relevance = relevant documents retrieved ÷ total documents retrieved

can provide a quantitative evaluation of how successful the search was and allows us to compare search techniques and engines.
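That ratio is simple enough to write out as code; the sketch below is just the calculation above as a small Python function, with an illustrative (made-up) figure of six relevant hits out of ten retrieved:

```python
def precision(relevant_retrieved, total_retrieved):
    """Relevance score: relevant documents retrieved / total retrieved."""
    if total_retrieved == 0:
        return 0.0  # avoid dividing by zero when nothing came back
    return relevant_retrieved / total_retrieved

# e.g. six of the ten hits on the first page were judged relevant
print(precision(6, 10))  # 0.6
```

The same function can then be run over results from two different engines or techniques to compare them on equal terms.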

The web pages themselves must also be carefully managed if they are to be accessible and searchable, with the protocol (http), the domain name (city.ac.uk) and a path to the server folder (/library) making a clear and logical structure. Within the web pages, Hyper Text Markup Language (HTML) allows the user to move between pages and out onto the WWW. Efficient use of HTML is how a librarian can give web pages a more relational feel, by linking in a more ‘human’ fashion.
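That protocol / domain / path structure can be pulled apart programmatically. A minimal sketch using Python’s standard urllib.parse module, with a hypothetical City library URL (the exact address is an assumption for illustration):

```python
from urllib.parse import urlparse

# Split a hypothetical library URL into the parts described above.
parts = urlparse("http://www.city.ac.uk/library/catalogue.html")

print(parts.scheme)  # 'http'                    - the protocol
print(parts.netloc)  # 'www.city.ac.uk'          - the domain name
print(parts.path)    # '/library/catalogue.html' - path on the server
```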

The library user must always evaluate what has been returned and not just accept anything. There can be a binary answer for relevance, but a graded return is more likely, as results will be objectively or subjectively relevant.

The WWW and HTML have limitations, however: despite all the efforts made, a true universal standard does not yet exist, and because of differences in HTML different browsers can interpret things differently – problems viewing a page designed for Internet Explorer in Firefox, for example.

As WWW search techniques and engines become more complex and subtle there may continue to be a place for Boolean and other techniques, but perhaps this will shrink as search engines become more sophisticated. Although it will often depend on what field the library user is working in, natural language and keyword searches can and do return highly relevant information.





[1] Bush, V. (1945) As We May Think, Atlantic Magazine. (Online) Available at: http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/3881/?single_page=true (Accessed 24 October 2011).
[2] The size of the World Wide Web (The Internet). (Online) Available at: http://www.worldwidewebsize.com/ (Accessed 24 October 2011).
[3] Broder, A. (2002) A taxonomy of web search, SIGIR Forum. (Online) Available at: http://www.sigir.org/forum/F2002/broder.pdf (Accessed 23 October 2011).
[4] Berners-Lee, T. (1998) The World Wide Web: A very short personal history. (Online) Available at: http://www.w3.org/People/Berners-Lee/ShortHistory.html (Accessed 28 October 2011).
[5] MacFarlane, A. (2011) Session 04 – Information Retrieval, INM348: Digital Information Technologies and Architectures. (Online) (Accessed 28 October 2011).

Tuesday 18 October 2011

Fetch the Information Ubu, Fetch!...Good Dog!

This week’s session was on information retrieval. Although it was far more manageable and intelligible than SQL, it raised some questions all of its own.
Normally I don’t phrase queries to search engines in natural language. I always, essentially subconsciously, break them down into keywords and search that way. ‘Who was the Luftwaffe commander at the Battle of El Alamein?’ for me looks like ‘Luftwaffe commander el alamein’. This will not always return the best results – or even accurate ones – but I then build the question back up from the keywords. Not sure why I do it this way, but I always have.
When we turned to the exercises after the lecture, issues quickly popped up, the obvious one being: how do you define ‘relevant’? Is the OX Student Union relevant when searching for OX Uni? What about OUP? OU Alumni? Some engines add in maps, ‘covert’ ads, tweets and images. Do these count as results, or for the relevance calculation (relevant documents retrieved divided by total documents retrieved)? The ads can answer some queries (‘where do I buy X?’), so do they count as results? What about when one answer has a set of other answers, often seen on Google, whereby it gives a banner correct answer and then has results and links to other subsets of answers?
As search engines generally restrict each page to ten results, there is an inherent restriction on the relevance calculation: everything is done on a basis of ten, and if there are no matches in the top ten the result has to be a fail. Or should the total number of results (sometimes in the hundreds of millions) be part of the calculation? Or perhaps we are only looking for a relevance score between 0 and 1?
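For what it’s worth, the ‘basis of ten’ version of the calculation is easy to sketch in Python. The judgements below are made-up, assuming relevance has already been decided by eye for each of the ten first-page hits:

```python
# Hypothetical by-hand relevance judgements for the first page of ten
# results: True = relevant, False = not (ads, maps and tweets counted
# as misses here, which is itself a judgement call).
first_page = [True, True, False, True, False,
              False, True, False, False, False]

# Relevance on a basis of ten: relevant hits / hits examined.
precision_at_10 = sum(first_page) / len(first_page)
print(precision_at_10)  # 0.4
```

The score always lands between 0 and 1, which at least makes two engines comparable, even if it dodges every definitional question above.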
Looking for car insurance – do comparison / screen scraper sites count as good results? They are sometimes not a place where you can actually make a purchase but rather gateways to other providers. They are not the end of the search, you have to use them and then perhaps go to another search result that they generate dependent on what you have asked for.
‘Car insurance’ results can give you results you didn’t ask for with top / most / all hits having an additional word added such as ‘cheap’. On Bing, ‘car insurance NOT cheap’ relegated cheap car insurance to the ads. On Google, cheap car insurance was still a regular hit.
‘President uruguay biography’ had some interesting results. As it is not a dated request, some results gave the wrong answer – previous presidents from throughout Uruguayan history, for example. Adding AND made little difference.
‘Who is the current president of uruguay?’ was less successful again, as the searches seem to focus on the post and not the post holder. It would also suggest links to sites like Yahoo Answers, where anyone can give an answer.
Natural language did not work as well as basic commands such as ‘uruguay president biography’, but a large reason for that is Wikipedia. Remove the Wikipedia result and it becomes harder and more descriptors and words are needed.
The Captain Swing query was entered as posed in the exercises. Bing came out on top here, as the two or three lines under each search result contained more relevant information, so on Bing I knew almost at once who the good Captain was and what role he played. Google had the best top hit, however, and the best information in that hit.
‘Captain Swing English history’ returned far better results. The 19th-century element of the request seems less important, but Captain Swing isn’t a very common name. If it were the history of a Smith, Jones or similar then there might have been far more results. Or if it had been someone like a Roosevelt or a Churchill, as those names and families have been involved in historical events for hundreds of years, so a more specific request would be appropriate.
With cooking the artichokes, the keyword cook or cooking would take the focus away from what the (I believe inedible) vegetables are. More so than Swing and the other questions, it was a two-part question that would need two results. Some results gave both, but many gave one or the other, or both but unsatisfactorily.
The Korean War and the Putney Debates were straightforward enough, though the impact of the latter was a little trickier to get at than the origins of the former. Here I think the popularity (ultimately measured in £s) of the Korean War plays a part in how much is written, making it easier or harder to find really relevant results right off the bat. Although this may be the historian in me, with his prejudices and favourite periods, coming out, as opposed to something I’d put in an essay.
The Levellers and the reasons behind Operation Barbarossa were pretty much the same.
And who was best, Bing or Google? Well it has to be said that it was Bing based on the numbers. Now, part of this may be to do with slight changes in the calculation each time based on variables listed above, but it was certainly a surprise. I loathe Bing and find it little to no use in the searches I perform and any query on it is usually one word “google”. I’m not going to switch to Bing despite this as I simply do not like it and don’t find what I’m after other than in this kind of experiment.  Every man has a right to his prejudices after all.
Clearly my standard way of searching – as little natural language as possible – is workable but not a great system for all occasions. Without Wikipedia the results of a lot of this would be very different as well. I will think about the best way to search and to retrieve information but as this clock is not broken I’m not going to take it apart just yet.

Tuesday 11 October 2011

SQL makes me SQueaL

During the DITA session it all seemed to make sense.  The theory behind SQL, how to interrogate a database, tables, structure, etc – all makes sense and seemed logical. I use databases a lot, like it when they work and understood the theory behind the interface. Surely I could use my logical approach and understand this?
No. No, no and no. In the labs I just could not understand it. My paper understanding of it melted away and generating a list of the ‘PubID, Name, Company Name and City for all publishers based in the city of New York’ became a madman’s dream.
I looked up resources online and went over the notes and lecture, but nothing seemed to work. I could see nice, neat tables marching around other screens while I could barely manage to get the thing to respond to any request with a hint of politeness. It felt like I was back in school and it was long division all over again, when everyone else got it and I was marooned on an uncomprehending island of ignorance as they sailed away on a tortured metaphor.
After a quick session with one of the tutors and a kind fellow student it was clear that I did not grasp the grammar and syntax on an almost fundamental level. The construction of the commands made no sense to me and that became a barrier to the next step, of knowing what the commands were and how to ask for what I wanted. Added to that, I could not ‘see’ the data or the commands which meant I couldn’t work backwards or forwards on interpreting successful commands and how they interface with SQL.
I managed to get a few of the tasks done but even that became guesswork based on mutilation of successful commands as opposed to creating a command that would work from scratch. Deeply frustrating and the successful and elegant tables I could see on other machines by my fellow classmates made it worse as I couldn’t really see why I didn’t get SQL.
I will have another bash at this but I have to admit that my enthusiasm for another session is low. The chances of ever having to interrogate an SQL database like this are very low and it was so difficult I don’t want to waste any more time. I would like to understand it and master it at a basic level but will have to convince myself it's worth it first.
On a brighter note my new suit was fantastic.

Monday 10 October 2011

HTM Hell

There has been a small hiatus, my loyal readership, as I had a very busy week and could not quite crack the HTML task in DITA. I managed to get all the pages up and viewable after following some advice from a tutor, and I also crowdsourced some answers from others on the message boards.
But I still could not get the images to show up. Once again the answers were on the message boards, and I had to copy the picture files over to the W drive to make them appear. A clash between relative and absolute links had seemed to be the issue and, lo and behold, it was.
A quick copy across and there I was. It looks like a personal website from when I was first on the information superhighway back in the early nineties. Sparse, plenty of bone white screen, not very well adjusted and pretty dull. But it is still up there and the links between all sites work with the pictures and that can only be called a victory.
As well as managing to get that page up, this has shown me the value of the course message boards. There’s currently talk of arranging a visit to Kings and some sort of social gathering which is all good and the assistance available via tutors and other students  has already proved its worth beyond doubt. Hopefully I’ll be able to help out some others as the course goes on.
Unfortunately I have to go to my tailors to pick up a new suit tonight so will miss any first post lecture drinks, but hope to make any future meet ups. Just who are most of these people I am surrounded by on a Monday...?
Now we have moved on to databases and SQL. Familiar terms from my time at Nokia, but I never cracked those books open. I think I understand the basics and have an intellectual grasp of database construction and interrogation, but the proof of the pudding is in the eating.
Just looking at the end tasks, this is going to take some more skull power to solve, so in I go to SQL Land, population millions of bits of data.