Sunday 30 October 2011

DITA Assignment

No matter how extensive a library collection is, if the required information cannot be retrieved its value falls sharply, or disappears altogether. Any librarian has to evaluate technologies and manage data to get the best results for their users.

Evaluation

Vannevar Bush’s comment in 1945 that “...The investigator is staggered by the findings and conclusions of thousands of other workers—conclusions which he cannot find time to grasp, much less to remember, as they appear...”[1] has only become truer since the World Wide Web (WWW) first became popular and then grew into an unstoppable force on the Internet.

For libraries, the Internet – and the WWW in particular – has been a deeply disruptive technology, first changing and then overturning how they held, shared and presented data. While it makes more information available to more people, the way the WWW has done this often makes relevant information difficult to find.

Even though the WWW was intended to have a clear structure, the 11.45 billion web pages it contained as of 25 October 2011[2] present a baffling amount of data. The URL of a page has a clear structure and can sometimes reveal information about the page – which country it belongs to, whether it comes from a university, a brand or company name, and so on – but it is imperative to use appropriate search methods and techniques rather than relying on decoding URLs.

Broder’s taxonomy of web search[3] helps us frame queries to search engines as Navigational (I want to go to this one place), Transactional (I want to buy this) or Informational (what do you have on this), and we also have the option of using keywords (‘attack 911 New York’), natural language (‘the attack on the trade towers’) or the Boolean operators AND / OR / NOT. The more specific a library user can be the better, and information retrieval query types such as Known Item (I want this one thing), Fact (what is this one thing), Subject (all examples of this thing) and Exploratory (what things do you have) all help narrow the field.

In a database system queried with SQL, many of the same concepts apply. A library user searching a catalogue database may require resources about Subject A but be open to whether they come in book or journal form. Any effective search system must be able to check all the available resources in all formats to see if the search keywords match, and then display the results effectively. A relational database using SQL is a solution to this, as it allows users to search across a variety of linked topics or keywords.
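
As a rough sketch of such a search (the resources table and its columns are invented here purely for illustration), a catalogue that records each item’s subject and format in one place could answer the query in a single pass:
select title, resource_type from resources where subject = 'Subject A' and resource_type in ('book', 'journal')
returning the matching books and journals together in one result set.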

Within the database there will be tables of information, each representing a different ‘thing’ such as author names, ISBNs, etc. By matching primary keys against foreign keys, the database (hidden behind a Graphical User Interface or GUI) searches through the different tables and presents the results which match the search query. As each piece of data is kept in its own table, inconsistencies should be kept to an absolute minimum, because the same data (an author’s name, for instance) exists in only one table instead of several. In this way complex searches can be completed and the right book found from a catalogue containing millions of volumes.

For example, in a (fictional) SQL database for a library the command
select * from titles where author = 'Kumar'
would return all the books with the author Kumar.
For a more complex search, the command
select title, author from titles where title like '%medicine%'
would show the title and author of every book with ‘medicine’ anywhere in the title.
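
To sketch the primary key and foreign key matching described above, a more normalised version of this fictional catalogue might store each author’s name once, in an author table with an author_id primary key, and have the titles table carry that value as a foreign key (the author_id column and table layout here are assumptions for illustration):
select titles.title, author.name from titles join author on titles.author_id = author.author_id where author.name = 'Kumar'
Because the name ‘Kumar’ then lives in a single row of the author table, correcting a misspelling there corrects it for every book at once.
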
Tim Berners-Lee said “One of the things computers have not done for an organization is to be able to store random associations between disparate things...”[4] Information retrieval systems like SQL databases allow complex and multifaceted searches between related things (publisher, author, year of publication), but they do require effective interaction, instructions and strictures to perform these tasks. Hence the need for the searcher to use the most effective search techniques.

Managing

Despite the advantages and opportunities identified above, search results must always be assessed rather than taken at face value. Similarly, the techniques and technology available need to be managed correctly to get the best results.

Within an SQL database, for example, a robust entity-relationship design is vital. If the keys are not unique identifiers then a search will fail or contain unnecessary results. If a table contains more than one ‘thing’ (e.g. author, title and publisher) then searches will fail or return invalid results – all books published by Elsevier regardless of author, for example, as opposed to all Elsevier books by Kumar.
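
As a sketch of the kind of query that depends on this design (the publisher table and the author_id / publisher_id key columns are invented here for illustration), answering ‘all Elsevier books by Kumar’ rather than simply ‘all Elsevier books’ relies on authors and publishers each living in their own table and being joined through unique keys:
select titles.title from titles join author on titles.author_id = author.author_id join publisher on titles.publisher_id = publisher.publisher_id where publisher.name = 'Elsevier' and author.name = 'Kumar'
If those keys were missing, or the tables muddled together, the database could not reliably tell the two questions apart.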

Similarly, an SQL database must be asked the right questions in the right places. If the system is being accessed through a basic GUI then, unless the grammar and syntax of the query are exactly right, the results will be unreadable, incorrect or irrelevant.
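
A small illustration of how unforgiving the syntax can be, reusing the fictional titles table from earlier: leaving the wildcard characters out of a LIKE pattern quietly turns a broad search into an exact match, so
select title from titles where title like 'medicine'
would only find a book whose title is exactly ‘medicine’, whereas
select title from titles where title like '%medicine%'
finds ‘medicine’ anywhere in the title, as intended.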

On the WWW the question of relevance is even more pronounced and the value of what is returned is critical, as what the user is searching for may not have a Yes / No answer. A simple calculation[5],

relevant documents retrieved / total documents retrieved,

can provide a quantitative evaluation of the relevance of the results and of how successful the search was, and allows us to compare search techniques and engines. For example, if a search returns ten documents and only five of them are relevant, it scores 0.5 on this measure.

The web pages themselves must also be carefully managed if they are to be accessible and searchable, with the protocol (http://), the domain name resolved through the DNS (city.ac.uk) and the path to the server folder (/library) making a clear and logical structure. Within the web pages, Hypertext Markup Language (HTML) allows the user to move between pages and out onto the wider WWW. Efficient use of HTML is how a librarian can give web pages a more relational feel by linking them in a more ‘human’ fashion.

The library user must always evaluate what has been returned and not just accept anything. Relevance can sometimes have a binary answer, but a graded judgement is more likely, as results may be objectively or subjectively relevant.

The WWW and HTML have their limitations, however. Despite all the efforts made, a truly universal standard does not yet exist, and because of differences in how they handle HTML, different browsers can interpret the same page differently; a page designed for Internet Explorer, for example, may display badly in Firefox.

As WWW search techniques and engines become more complex and subtle there may continue to be a place for Boolean and other formal techniques, but that place will perhaps shrink as the engines grow more sophisticated. Although much will depend on what field the library user is working in, natural language and keyword searches can and do return highly relevant information.





[1] Bush, V. (1945) As We May Think. Atlantic Magazine. (Online) Available at: http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/3881/?single_page=true (Accessed 24 October 2011).
[2] The size of the World Wide Web (The Internet). (Online) Available at: http://www.worldwidewebsize.com/ (Accessed 24 October 2011).
[3] Broder, A. (2002) A taxonomy of web search. SIGIR Forum. (Online) Available at: http://www.sigir.org/forum/F2002/broder.pdf (Accessed 23 October 2011).
[4] Berners-Lee, T. (1998) The World Wide Web: A very short personal history. (Online) Available at: http://www.w3.org/People/Berners-Lee/ShortHistory.html (Accessed 28 October 2011).
[5] MacFarlane, A. (2011) Session 04: Information Retrieval. INM348: Digital Information Technologies and Architectures. (Online) (Accessed 28 October 2011).
