Paper Cuts: January 2012

The Semantic Web in Information Science

As the evolution of the World Wide Web (WWW) continues, the impact of ‘Web 2.0’ and its associated technologies and paradigms is yet to be fully measured. In the field of information science, the development towards what has been called Web 3.0 or ‘the semantic web’ raises some challenging yet promising issues for how a modern library / information service will evolve.

The Evolving Internet

The web currently in use, Web 1.0 and Web 2.0, was built on protocols that allowed information to be disseminated widely under agreed standards and then linked to form networks. Both of these iterations of the WWW were designed to be read by humans rather than machines and, as Berners-Lee (1998) noted, this leads to inefficiencies because the power of the machines goes underused.

The HTML-based WWW has made it possible for information to be placed on it from a wide range of devices, and has generally meant that the same information is displayed in the same format to all. As this information was created by, and designed to be read by, humans rather than machines, there has been no attempt to ensure a controlled vocabulary (Morville and Rosenfeld 2007). Hence a search of the WWW for ‘Biohazard’ will return the band of the same name as well as material on biological hazards, results which could be considered ‘correct but not accurate’. As there is no control over what terms mean, terms can be, and are, used interchangeably, which degrades the results received. The semantic web seeks to solve this issue by controlling the vocabulary so that a machine can read the data and interpret it in a similar way to a human.

Information published on the WWW has the most value when it can be seen, shared and utilised widely; what is published must therefore be accessible to the widest group of interested parties (barring restrictions such as commercially sensitive information or paid-for services which sit behind password-protected walls). The semantic web allows greater accessibility and efficiency of information as it links it in a way that is understandable and based on a common set of parameters.

The World Wide Web Consortium (W3C) describes the semantic web as being “...about common formats for integration and combination of data drawn from diverse sources...” (Herman 2011). This is a step change as it involves machine readable data and relational links, rather than a web of different documents and artefacts that are linked by human interaction. This allows more automatic processes to run on the web and to return better, deeper and richer results.

This is done by using a number of technological standards and tools. Extensible Markup Language (XML) allows control of vocabulary as it defines terms much as HTML does (Butterworth 2010). In XML a word has to have one meaning: in a library, for example, ‘year’ cannot simultaneously mean the year of publication, the current year, a book title (such as 1984 by George Orwell) and the year that a book covers. However, XML is limited in what it can achieve, because only one meaning can be applied to each term. If we define ‘year’ as meaning year of publication, a search will fail when someone uses ‘year’ to mean the current year.
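
The point about ‘year’ can be illustrated with a minimal sketch in Python, using the standard library's xml.etree.ElementTree module. The element names (publicationYear, periodCovered) and the sample record are hypothetical, chosen only for illustration; they are not taken from any real library schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue record. Each element name carries exactly one meaning,
# so the year of publication cannot be confused with the title or the period covered.
RECORD = """
<book>
  <title>1984</title>
  <author>George Orwell</author>
  <publicationYear>1949</publicationYear>
  <periodCovered>1984</periodCovered>
</book>
"""


def field(record_xml, name):
    """Return the text of the first child element called `name`."""
    root = ET.fromstring(record_xml.strip())
    element = root.find(name)
    if element is None:
        # The vocabulary is fixed: an undefined term simply finds nothing.
        raise KeyError(f"record has no <{name}> element")
    return element.text or ""


print(field(RECORD, "publicationYear"))
print(field(RECORD, "title"))
```

Asking this record for a plain ‘year’ element raises an error, which mirrors the limitation described above: the vocabulary is controlled, but a searcher who uses a term in any other sense gets no match.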

As well as XML there is RDF, or Resource Description Framework. This works by creating a triple of information (subject, predicate, object) which identifies a piece of data in one unique way only (Butterworth 2010). Thus, when mapped in RDF, ‘title’ could always mean the ‘title of the book’ and never a person’s title such as Mr or Miss: the word has been mapped into RDF so that the machine will know to look only in specific areas for matches. As well as this, RDF allows for a controlled vocabulary to be created (Joint 2007). This vocabulary is mapped in such a way that when different words are used to describe the same (predefined) ‘thing’, the other meanings which have been prescribed are identified and a positive match can be made.
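
As a rough sketch of the triple idea, the following Python models each statement as a triple, written here in the conventional subject, predicate, object order. It uses plain tuples rather than an RDF library, and every identifier (the example.org addresses and the bookTitle and honorific terms) is hypothetical and for illustration only.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers, not taken from any real vocabulary.
BOOK = "http://example.org/book/1984"
AUTHOR = "http://example.org/person/orwell"
BOOK_TITLE = "http://example.org/terms/bookTitle"
HONORIFIC = "http://example.org/terms/honorific"

GRAPH = [
    Triple(BOOK, BOOK_TITLE, "1984"),
    Triple(AUTHOR, HONORIFIC, "Mr"),
]


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Only book titles are returned; a person's title such as 'Mr' never matches.
book_titles = [t.obj for t in match(GRAPH, predicate=BOOK_TITLE)]
print(book_titles)
```

Because the two senses of ‘title’ are separate predicates, a query for one can never return the other, which is the disambiguation described above.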

For example, although both are describing the same occurrence, one researcher may use the term ‘infected’ while another uses the term ‘contamination’. If both have been mapped to RDF correctly, the semantic web ‘knows’ that although the words are different they mean the same thing in the way they are used. Hence a search by the first researcher will return the work of the second even though the terminology is different. While a human could determine this by reading the articles, this is not as fast or efficient as the search engine doing so, which is where the value of the semantic web and its being machine readable comes to fruition.
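
The matching described in this example might be sketched as follows. This is a simplified illustration in plain Python, not an RDF implementation: the concept identifier, the term mappings and the article records are all hypothetical.

```python
# Hypothetical controlled vocabulary: several surface terms map to one concept.
CONCEPT_FOR_TERM = {
    "infected": "concept:contamination",
    "infection": "concept:contamination",
    "contamination": "concept:contamination",
}

# Hypothetical articles, each indexed under the term its author chose.
ARTICLES = [
    {"title": "Article by researcher one", "term": "infected"},
    {"title": "Article by researcher two", "term": "contamination"},
    {"title": "Unrelated article", "term": "cataloguing"},
]


def concept_of(term):
    """Resolve a term to its shared concept; unmapped terms stand for themselves."""
    key = term.lower()
    return CONCEPT_FOR_TERM.get(key, key)


def search(term, articles=ARTICLES):
    """Return titles of articles indexed under the same concept as `term`."""
    wanted = concept_of(term)
    return [a["title"] for a in articles if concept_of(a["term"]) == wanted]


print(search("infected"))
```

Here a search on ‘infected’ returns both researchers' articles, because the comparison is made on the shared concept rather than on the words themselves.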

Digital Data Representation

Semantic web organisation via innovative methodologies such as RDF creates enormous potential in the information sciences and for the organisation and representation of digital data. Even small and apparently insignificant pieces of information that are mapped to RDF can later have great value as they become part of a larger whole and are interrelated in new ways (Joint 2007). If all databases and repositories of information can not only talk to each other but also understand each other without significant human interaction, enormous cross use and pollination of ideas is possible.

The organisational potential of the semantic web comes not from reordering what already exists but from better organisation through more accurate definitions of terms and words. Currently, vagueness in search terminology means that returned results can be very broad, depending on what is looked for. Semantic web techniques will not solve this alone, as vagueness in terminology can still be an issue, which is why the adoption of controlled vocabularies is so important.

This has been an issue in the ‘vascoda’ general science portal (Mayr, Mutschke and Petras 2007), where a fundamental problem has arisen: the different (and highly complex) databases, which contain a lot of metadata, are searched by means of a general free text search. This lacks the precision searching functionality that was supposed to be used during development, so despite the efforts made to create a semantic experience, the linkage between the different databases is not of sufficient quality to achieve the possible results (Mayr, Mutschke and Petras 2007).

As well as RDF and XML, the Web Ontology Language (OWL) is a part of the semantic web, as it is an ontology language for it (Butterworth 2010). It contains a taxonomy but layers this with logical rules that allow inferences, and therefore links, to be made. The link here with RDF triples is clear, as both create defined words within a taxonomy that can be understood not just by humans but also by machines.
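
The kind of inference a taxonomy with logical rules allows can be sketched in a few lines of Python. This is not OWL and does not use a reasoner; it only illustrates one simple rule (if A is a kind of B, and B is a kind of C, then A is a kind of C). The class names and facts are hypothetical.

```python
# Hypothetical taxonomy: each class maps to its direct parent class.
SUBCLASS_OF = {
    "Novel": "Book",
    "Book": "Publication",
    "Journal": "Publication",
}

# Hypothetical facts: each item is asserted to belong to one class only.
INSTANCE_OF = {
    "1984": "Novel",
    "Library Review": "Journal",
}


def ancestors(cls):
    """All classes reachable by following subclass links upwards."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def is_a(item, cls):
    """True if the item belongs to `cls` by assertion or by inference."""
    asserted = INSTANCE_OF[item]
    return cls == asserted or cls in ancestors(asserted)


# Nobody stated that '1984' is a Publication; it follows from the taxonomy.
print(is_a("1984", "Publication"))
print(is_a("Library Review", "Book"))
```

The first check succeeds although that fact was never recorded directly, while the second fails because no chain of subclass links connects a journal to a book.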

Online Digital Environment

As the online digital environment becomes larger and more information is born digital, information related tasks increase in importance. Information needs to be accessible, reference-able, verifiable, findable and shareable and the semantic web will drive this forward by harnessing the power of machine readable metadata (Mayr, Mutschke and Petras 2007).

If each library or information repository had a perfectly worked out internal classification and information architecture system that did not talk to any other in a meaningful way, then a huge amount of possible collaboration and linkage would be (and doubtless currently is) lost (Joint 2007). Many such systems are incompatible with each other, as even two similar libraries may use different access methods (Athens versus user ID) or classification schemes (the Dewey Decimal System versus the National Library of Medicine classification) even if their collections are identical.

The semantic web can expose a library or information collection and resources to a far greater audience (Joint 2007). Greater usage means higher usage statistics which is a very real practical concern in the information industry. Likewise, driving up usage of a collection can spur greater academic success and output which will have been reliant on the library, thus increasing the reputation and impact of the library.

Open access institutional repositories, like the one at City University London (City 2011), are particularly suited to the advantages of the semantic web because they make peer-reviewed, high-quality data freely available for others to use and, crucially, make it more likely to be found and used. If it is used more, then there is a greater chance of it being cited in other publications, which increases its impact factor and therefore its standing and rank in the academic community. If such a repository is mapped in RDF it is available for semantic web ‘discovery’, which further increases the possibility of it being discovered and used.

Conclusion

The impact of the semantic web or ‘Web 3.0’ is impossible to calculate at present, but if the fundamentals are followed through (controlled vocabularies, machine readable metadata and so on) the possibilities for the representation and sharing of information are enormous. Previous iterations of the web have put huge amounts of data and information on the internet, whereas the semantic web will, as Butterworth (2010) notes, work towards placing structures of knowledge itself there.

When two separate and independent databases which adhere to the semantic principles are placed on the web and linked semantically, the user will be able to get matching data from both of them even when the terms used in each are different. A search for ‘injection’ on database A would return ‘inoculation’ as well as ‘injection’, eliminating the need for more complex human-driven searches. The metadata in the semantic web will power more collaborative working, as the machines and automated processes make linkages faster and more effectively than a human could in the same time, across a wider range of disciplines.

Blog address: http://atthebookface.blogspot.com/

Resources and References

Berners-Lee, T. 1998. Semantic Web Road Map. (online) Available at: http://www.w3.org/DesignIssues/Semantic.html (Accessed 19 December 2011)

Butterworth, R. 2010. Lecture 08: Web 3.0 -- The Semantic Web. (online) Available at: http://moodle.city.ac.uk/mod/resource/view.php?id=267342 (Accessed 17 December 2011)

City University London. 2011. City Research Online. (online) Available at: http://openaccess.city.ac.uk/. (Accessed 3 December 2011)

Feigenbaum, L. Herman, I. Hongsermeier, T. Neumann, E. Stephens, S. 2007. The Semantic Web in Action. (online) Available at: http://thefigtrees.net/lee/sw/sciam/semantic-web-in-action#single-page (Accessed 3 January 2012)

Herman, I. 2011. W3C Semantic Web Activity. (online) Available at: http://www.w3.org/2001/sw/ (Accessed 3 December 2011)

Joint, N. 2007. The Practitioner Librarian and the Semantic Web. Library Review. (e-journal) 57 (3). Available through EmeraldInsight. (Accessed 13 December 2011).

May, C. 2002. The Information Society : A Sceptical View. Cambridge: Polity Press.

Mayr, P. Mutschke, P. Petras, V. 2007. Reducing Semantic Complexity in Distributed Digital Libraries. Library Review. (e-journal) 57 (3). Available through EmeraldInsight. (Accessed 13 December 2011).

Morville, P. Rosenfeld, L. 2007. Information Architecture for the World Wide Web. Sebastopol: O’Reilly.

Van Harmelen, F. Fensel, D. 1999. Practical Knowledge Representation for the Web. (online) Available at: http://www.cs.vu.nl/~frankh/postscript/IJCAI99-III.html (Accessed 19 December 2011)

Sunday, 8 January 2012

DITA Assignment Two