Sunday 8 January 2012

DITA Assignment Two

The Semantic Web in Information Science

As the evolution of the World Wide Web (WWW) continues, the impact of ‘Web 2.0’ and its associated technologies and paradigms has yet to be fully measured. In the field of information science, the development towards what has been called Web 3.0 or ‘the semantic web’ raises challenging yet promising questions about how a modern library or information service will evolve.

The Evolving Internet

The internet currently ‘in use’, as it were – Webs 1.0 and 2.0 – was built on protocols which allowed the widespread dissemination of information under agreed standards, and then allowed that information to be linked to create networks. Both of these iterations of the WWW were designed to be read by humans rather than machines, which, as Berners-Lee (1998) noted, leads to inefficiencies because the processing power of the machines goes underused.

The HTML-based WWW has made it possible for information to be published from a wide range of devices, and has generally meant that the same information is displayed in the same format to all. Because this information was created by, and designed to be read by, humans rather than machines, there has been no attempt to enforce a controlled vocabulary (Morville and Rosenfeld 2007). Hence a search of the WWW for ‘biohazard’ will return the band of that name as well as data on biological hazards – results which could be considered ‘correct but not accurate’. As there is no control over what terms mean, there can be, and is, great interchangeability of terminology, which degrades the results received. The semantic web seeks to solve this issue by controlling the vocabulary so that a machine can read the data and interpret it in much the way a human would.

Information published on the WWW has the most value when it can be seen, shared and utilised widely; therefore what is published must be accessible to the widest group of interested parties (barring restrictions such as commercially sensitive information or paid-for services which sit behind password-protected walls). The semantic web allows greater accessibility and more efficient use of information, as it links it in a way that is understandable and based on a common set of parameters.

The World Wide Web Consortium (W3C) describes the semantic web as being “...about common formats for integration and combination of data drawn from diverse sources...” (Herman 2011). This is a step change, as it involves machine-readable data and relational links rather than a web of disparate documents and artefacts linked by human interaction. This allows more automatic processes to run on the web and to return better, deeper and richer results.

This is done using a number of technological standards and tools. Extensible Markup Language (XML) allows a vocabulary to be controlled, as it defines terms much as HTML does (Butterworth 2010). In XML a word has to have one meaning – in a library, for example, ‘year’ cannot simultaneously mean the year of publication, the current year, a title that happens to be a year (such as 1984 by George Orwell) and the year in which the book is set. However, XML is limited in what it can achieve, precisely because only one meaning can be applied to each term. If we define ‘year’ as meaning the year of publication, a search will fail when someone uses ‘year’ to mean the current year.
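
As a minimal sketch of what such a controlled record might look like (the element names here are invented for illustration), each sense of ‘year’ gets its own unambiguous element:

    <book>
      <title>1984</title>
      <author>George Orwell</author>
      <publication_year>1949</publication_year>
      <setting_year>1984</setting_year>
    </book>

A machine reading <publication_year> can never confuse it with the year in which the story is set – though, as noted above, the scheme still depends on searchers using the elements as defined.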

As well as XML there is RDF, or Resource Description Framework. This works by creating a triple of information (subject, predicate, object) which identifies a piece of data in one unique way only (Butterworth 2010). Thus, when mapped in RDF, ‘title’ could always mean the title of the book and never a person’s title such as Mr or Miss – the word has been mapped so that the machine knows to look only in specific areas for matches. RDF also allows a controlled vocabulary to be created (Joint 2007). This vocabulary is mapped in such a way that when different words are used to describe the same (pre-defined) ‘thing’, the other meanings which have been prescribed are identified and a positive match can be made.
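
As an illustrative sketch – the ex: namespace is invented, while dc:title is a real Dublin Core property – the statement ‘book number 765 has the title 1984’ becomes a single subject–predicate–object triple:

    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix ex: <http://example.org/books/> .

    ex:book765  dc:title  "1984" .

Because dc:title comes from a shared, published vocabulary, any application that understands Dublin Core knows this is the title of a work and never a person’s title.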

For example, although they are both describing the same occurrence, one researcher may use the term ‘infected’ while another uses the term ‘contamination’. If both have been mapped to RDF correctly, the semantic web ‘knows’ that although the words are different they mean the same thing in the way they are used. Hence a search by the first researcher will return the work of the second even though the terminology differs. While a human could determine this by reading the articles, it is neither as fast nor as efficient as the search engine doing it – which is where the value of the semantic web being machine-readable bears fruit.
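
As a small sketch of how such a match might be made in practice – using Python’s rdflib library and the SKOS vocabulary, with an invented concept and labels – the two terms can be tied to one controlled concept:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import SKOS

    EX = Namespace("http://example.org/vocab/")

    g = Graph()
    # One shared concept, two labels: one researcher says 'contamination',
    # another says 'infected', but both map to the same controlled term.
    g.add((EX.contamination, SKOS.prefLabel, Literal("contamination")))
    g.add((EX.contamination, SKOS.altLabel, Literal("infected")))

    # A search for 'infected' resolves to the shared concept, so it will
    # also match anything indexed under 'contamination'.
    results = g.query(
        """SELECT ?concept WHERE {
               ?concept (skos:prefLabel|skos:altLabel) "infected" .
           }""",
        initNs={"skos": SKOS},
    )
    for row in results:
        print(row.concept)  # http://example.org/vocab/contamination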

Digital Data Representation

Semantic web organisation via innovative methodologies such as RDF creates enormous potential in the information sciences and for the organisation and representation of digital data. Even small and apparently insignificant pieces of information, once mapped to RDF, can later have great value as they become part of a larger whole and are interrelated in new ways (Joint 2007). If all databases / repositories of information can not only talk to each other but also understand each other without significant human interaction, enormous cross-use and cross-pollination of ideas becomes possible.

The organisational potential of the semantic web comes not from reordering what already exists but from better organisation through more accurate definitions of terms and words. Currently, vagueness in search terminology means that returned results can be very broad, depending on what is being looked for. Semantic web techniques alone will not solve this, as vagueness in terminology remains an issue – which is why the adoption of controlled vocabularies is so important.

This has been an issue with the ‘vascoda’ general science portal (Mayr, Mutschke and Petras 2007), where a fundamental problem has arisen: the method of searching its different (and highly complex) databases, which contain a great deal of metadata, is a general free-text search. This lacks the precision searching functionality that was supposed to be used during development, so despite the efforts made to create a semantic experience, the linkage between the different databases is not of sufficient quality to achieve the results that are possible (Mayr, Mutschke and Petras 2007).

As well as RDF and XML, OWL – the Web Ontology Language – is part of the semantic web (Butterworth 2010). It contains a taxonomy but layers logical rules on top of it, allowing inferences, and therefore links, to be made. The connection with RDF triples is obvious, as both create defined words within a taxonomy that can be understood not just by humans but also by machines.
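
As a sketch of the kind of rule OWL layers on top of a plain taxonomy (the ex: class and property names are invented for illustration):

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/library#> .

    ex:Monograph  rdfs:subClassOf  ex:Book .
    ex:authorOf   owl:inverseOf    ex:hasAuthor .

Given only the statement ex:orwell ex:authorOf ex:book765, a reasoner can infer that ex:book765 ex:hasAuthor ex:orwell, and that anything catalogued as a Monograph is also a Book – links a human would make instinctively, now available to the machine.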

Online Digital Environment

As the online digital environment grows larger and more information is born digital, information-related tasks increase in importance. Information needs to be accessible, referenceable, verifiable, findable and shareable, and the semantic web will drive this forward by harnessing the power of machine-readable metadata (Mayr, Mutschke and Petras 2007).

If each library or information repository had a perfectly worked-out internal classification and information architecture system that did not talk to any other in a meaningful way, a huge amount of possible collaboration and linkage would be (and doubtless currently is) lost (Joint 2007). Many such systems are incompatible with each other, as even two similar libraries may use different access methods (Athens versus user ID) or classification schemes (the Dewey Decimal System versus the National Library of Medicine scheme) even if their collections are identical.

The semantic web can expose a library or information collection and its resources to a far greater audience (Joint 2007). Greater usage means higher usage statistics, which is a very real practical concern in the information industry. Likewise, driving up usage of a collection can spur greater academic success and output reliant on that collection, increasing the library’s reputation and impact.

Open access institutional repositories, like the one at City University London (City 2011), are particularly suited to the advantages of the semantic web because they make peer-reviewed and high-quality data freely available for others to use and – crucially – make it more likely to be found. The more it is used, the greater the chance of it being cited in other publications, which increases its impact factor and therefore its standing and rank in the academic community. If such a repository is mapped in RDF it becomes available for semantic web ‘discovery’, further increasing the likelihood of it being found and used.

Conclusion

The impact of the semantic web or ‘Web 3.0’ is impossible to calculate at present, but if the fundamentals are followed through – controlled vocabularies, machine-readable metadata and so on – the possibilities for the representation and sharing of information are enormous. Previous iterations of the web have put huge amounts of data and information on the internet, whereas the semantic web will, as Butterworth (2010) notes, work towards placing structures of knowledge itself there.

When two separate and independent databases which adhere to semantic principles are placed on the web and linked semantically, the user will be able to get matching data from both of them even when the terms used in each are different. A search for ‘injection’ in one would also return records labelled ‘inoculation’ in the other, eliminating the need for more complex human-driven searches. The metadata in the semantic web will power more collaborative working, as machines and automated processes make linkages faster and more effectively than a human could in the same time, and across a wider range of disciplines.

Resources and References

Berners-Lee, T. 1998. Semantic Web Road Map. (online) Available at: http://www.w3.org/DesignIssues/Semantic.html (Accessed 19 December 2011)

Butterworth, R. 2010. Lecture 08: Web 3.0 -- The Semantic Web. (online) Available at: http://moodle.city.ac.uk/mod/resource/view.php?id=267342 (Accessed 17 December 2011)

City University London. 2011. City Research Online. (online) Available at: http://openaccess.city.ac.uk/ (Accessed 3 December 2011)

Feigenbaum, L., Herman, I., Hongsermeier, T., Neumann, E. and Stephens, S. 2007. The Semantic Web in Action. (online) Available at: http://thefigtrees.net/lee/sw/sciam/semantic-web-in-action#single-page (Accessed 3 January 2012)

Herman, I. 2011. W3C Semantic Web Activity. (online) Available at: http://www.w3.org/2001/sw/ (Accessed 3 December 2011)

Joint, N. 2007. The Practitioner Librarian and the Semantic Web. Library Review. (e-journal) 57 (3). Available through Emerald Insight. (Accessed 13 December 2011)

May, C. 2002. The Information Society: A Sceptical View. Cambridge: Polity Press.

Mayr, P., Mutschke, P. and Petras, V. 2007. Reducing Semantic Complexity in Distributed Digital Libraries. Library Review. (e-journal) 57 (3). Available through Emerald Insight. (Accessed 13 December 2011)

Morville, P. and Rosenfeld, L. 2007. Information Architecture for the World Wide Web. Sebastopol: O’Reilly.

Van Harmelen, F. and Fensel, D. 1999. Practical Knowledge Representation for the Web. (online) Available at: http://www.cs.vu.nl/~frankh/postscript/IJCAI99-III.html (Accessed 19 December 2011)

Monday 5 December 2011

The End of DITA

So the last DITA session was today and it’s farewell to this part of the programme.
We were looking at Information Architectures today and it brought back a lot of memories in a strange way. Websites that were permanently Under Construction, dozens of broken links on one page, no way to find what you were looking for even though you knew it was there. Ah, the internet of the mid-to-late 1990s, we shall never see your like again.
With a few exceptions, back then no one seemed to have any idea how to build a website, but everyone was building one. I remember that to build your free one on AOL you were invited to post them pictures via Royal Mail; they would scan and upload them and post them back. Remarkable.
These days, thanks to a better, maybe wider, understanding of architecture born from a trillion rubbish pages, a visitor is less likely to have the same issues. Yes, things can still be hard to find, and yes, there can be too many bells and whistles (MySpace being a good example in its early iterations), but on the whole things are much better and more uniform in a good way. Drop someone onto a random website and they will probably be able to navigate it successfully, and with some nuance, eight times out of ten.
Websites like SourceForge, Tesco and Amazon all have the ability to sign in and – presumably – personalise your experience to a greater or lesser degree. An interesting element is where, and how prominent, the ‘log in / sign in’ link is.
Amazon and SourceForge are seemingly designed for the casual browser. They’re not asking you to sign in right away as they encourage browsing: stroll around, see if you see anything you like. Tesco, on the other hand, puts sign-in front and centre. They want you to sign in, as every purchase builds up a greater and more in-depth picture of who you are and what you want. Amazon has a similar function, but as Tesco has such a wide range of products, including day-to-day products, the data it has on you is far richer. And to them that means better adverts, which keep the cash rolling in.
And what of the mystery vegetable? It looks very familiar, like something that I would be forced to eat by a wife. I name it Pak Choi or Chinese Cabbage. To find this I looked at what the picture was called – mysteryvegetable. By searching for that I found a website which dealt in answering questions about which veg was which. In fact the more I looked, the more websites dedicated to veg-identification I found. Is this really what the WWW is here to do?
Websites for groceries seemed to have trouble helping me identify it as I didn’t really know what I was after. But I suppose that the need for veg identification is already taken care of by the crackpot community, so no need for the corporations to spend money on it.
When it comes to designing information architectures then, a clear plan is necessary...
You need to know what you need to know (do you need to know where your visitors are coming from? Do you need to know their age? Do you need to know anything which would be considered personal?).
You need to know what they need to know (what will they be looking for? Do they need a quick contact page? Or an FAQ? If they need an FAQ is it because your site is too complicated?).
You need to know what you need to know about them (should they be able to set up an account? How much detail will you need? How much of the site can they personalise? How individual will their experience be?).
This is really just the first ten feet of the iceberg as well; the subtle things and the nuances will all be changed by who and what you are selling (or what you are discussing / presenting) and to whom. As a collective experience, humanity (not all of it of course, just the long-term wired parts) seems to have evolved a generally better innate sense of information architecture, and now the complexity lies not in a website that works for all, but in one that works for you as an individual surfer.

Monday 28 November 2011

Open This


Searching on the Royal Society website was a little trickier than I thought it would be. It was like being given a key to an extensive and ancient archive without really knowing what the archive held, where the shelves were, in what order things were kept, and so on.

As they have a biological component I looked for any articles on GRIDS, or Gay-Related Immune Deficiency Syndrome / Symptoms, a forerunner of HIV/AIDS in that, before we knew what the latter was, epidemiology suggested there was a specific set of symptoms associated with a particular set of people, pointing to an underlying medical condition.

The RS seem to have nothing, though. Maybe I’m not searching right, or perhaps they don’t cover it, but either way I couldn’t find anything. So I went back to their website to see what kind of material they do have and looked up Darwin instead. Plenty of hits there.

For the Open Source Software part of the session I looked at Freecode and GitHub. GitHub sounds like a dating website for self-confessed gits, or a nastier version of mysinglefriend.com, and lo and behold there is a picture of a smug, self-satisfied bloke on the main page. He even describes himself as a ‘git instructor’. Remarkable.

Anyhoo, I preferred the older Freecode site as it wasn’t full of meaningless pictures. I understand they are trying to humanise technology, especially anything social-media related, but Freecode felt more professional and listed itself well. Plus the gits seem to be trying to flog you something straight away instead of letting you see whether you need training, etc.

The names of the software – vifm, burp, dbeaver, kwave, sunflower et al, all peppered with 3, 4 or 5 digit version numbers of course – are meaningless. They could be nuclear release codes for all I know, or ways to hack into my bank and add £45,000 to my account risk free, but it all appears to be so much frippery.

In the interests of fairness I went over to SourceForge as well and that was much nicer. Set up like an ‘older’ system, it presented options in far easier ways. So out with the new, in with the old, I cry!

For the open data mashup I looked at borough data and the rate of male hospital admissions attributable to alcohol per hundred thousand population (2005-06). Tower Hamlets came in at a respectable 1,130 (rounded up). Less than Islington (1,218) and Hounslow (1,194) but more than Richmond Upon Thames (785) and Sutton (692).

If this data was mapped over to cover male homelessness and demographics as well we could perhaps see two things. First of all, we could see if boroughs with high homelessness had higher rates of male hospital admissions attributable to alcohol.

Secondly, if we took the demographic data, sorted by age so we could see the 16–22 age group, and then looked at male hospital admissions attributable to alcohol, we might get an idea of whether it is ‘students’ (used as a catch-all term here) in the population who account for high levels of admission due to binge drinking.

By looking at Alcohol Non-Consumption Zones we might then be able to see whether areas with these zones had lower rates of alcohol abuse resulting in hospital admission and homelessness than areas without.

With this data we might then be able to see whether ANCZs would be useful in cutting down male admissions due to alcohol in two key groups: one vulnerable and the other, often, just a bit silly.
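
As a sketch of how the mashup might be wired together – the file names, column names and figures are all invented, since I haven’t actually got these data sets in one place – a few lines of Python would do the first comparison:

    import pandas as pd

    # Hypothetical borough-level extracts of the open data sets described above
    admissions = pd.read_csv("male_alcohol_admissions_2005_06.csv")  # borough, rate_per_100k
    homeless = pd.read_csv("male_homelessness.csv")                  # borough, homeless_per_100k

    # Join the two data sets on the shared borough column
    merged = admissions.merge(homeless, on="borough")

    # First look: do boroughs with more homelessness see more admissions?
    print(merged[["rate_per_100k", "homeless_per_100k"]].corr())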

Just put the MBE in the post, I don’t have time to come to the Palace to pick it up at the moment.


Monday 21 November 2011

Some Antics, Some Antics, they’re up to Semantics!


The semantic web clearly has great uses in areas like epidemiology. Being able to have database-type information provided through a more intelligent system can vastly speed up the process of identifying correlations (though not causes) and associated information.

Against that, though, the time and money costs of implementing semantic ideals may outweigh the benefits. When I was in the Quality Team at Nokia, one issue that came up was – no surprises – quality. One reason the company was continually late in delivering software to schedule was that the engineers wanted it to be perfect. All well and good, but unacceptable if that dedication caused manufacturing deadlines to slip and product releases to be delayed.

So a VP came up with a campaign based around ‘Good Enough is Good Enough’. The simple message was to finish and ship the product. On time. Not when it was ‘ready’, but according to the schedule. Clients, partners and customers wanted a product by the contracted date, not a perfect product at some future point. The former can be altered, the latter cannot. 

And I think this is a major argument, and force, against the semantic web having the momentum to really pick up outside of limited areas. Who’s going to want to spend the time creating a far ‘better’ WWW when the one we have now works so well? Only those with a very clear and present need for what it offers.

As a quick aside, perhaps if the adult entertainment / pornography business got behind the idea it would take off. That multi-billion-dollar industry has arguably shaped the internet as a place where you can securely buy products, watch high-quality video and stream live feeds, and it has driven broadband access, just as it led the charge in the move from film to VHS format in the years before.

Back to DITA and away from the filth merchants: the below is what some of us came up with in relation to RDF triples, taxonomy and ontology...

If X does Y then Z.

In relation to a library the formula could be expressed as...
IF library patron BORROWS library book #765 THEN book status changed to issued.
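
Or, sketched as RDF-style triples (the URIs are invented, obviously):

    @prefix ex: <http://example.org/library#> .

    ex:patron42  ex:borrows  ex:book765 .
    ex:book765   ex:status   "issued" .

The first triple records the event; the IF/THEN rule is what lets the system derive the second from it.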

Clear? Jolly good.