Chapter 5: Building Linked Data Applications

5.1 Introduction

In this chapter we describe how a Linked Data application is built. This draws on what we have covered in the previous chapters. In chapters 2 and 3 we looked at different methods for consuming Linked Data. Chapter 2 focussed on how to use SPARQL queries to extract data from an RDF dataset. In chapter 3 we looked at other ways in which Linked Data can be made available in order to be consumed by an application. As well as providing a SPARQL endpoint, RDF data can be accessed by dereferencing HTTP URIs, parsing RDFa or reading an RDF dump. In chapter 4 we described ways in which Linked Data can be output for human consumption and the broad range of visualisation tools and techniques that can be used. In this chapter we describe how to bring all of these together in a Linked Data application. Once again we make use of the music application introduced in chapter 1 as our motivating scenario (see Figure 1). This chapter looks at the overall composition of the Linked Data architecture and some of the frameworks that can be used when building a Linked Data application.

Figure 1: The Music Application Scenario.

5.2 Learning outcomes

On completing this chapter you should understand the following:

  • Examples of Linked Data applications and how they can be used.
  • Different architectural patterns that can be used to build a Linked Data application.
  • How development frameworks can be used to implement a Linked Data application.
  • How Web APIs can be used to access data from, or for use in, a Linked Data application.


Part 1: Characterization of Linked Data Applications

5.3 Categories of Linked Data applications

Any Linked Data application can be expected to have three parts: a Linked Data consumer, a Linked Data manager and a (Web) user interface. First, the Linked Data consumer is in charge of retrieving Linked Data from data sources. In cases where the retrieved data is not in the RDF format, wrappers can be used to translate the data into Linked Data. Systems that only consume Linked Data are usually called mashups. Second, the Linked Data manager is responsible for manipulating the consumed Linked Data in order to produce new Linked Data. Third, the user interface provides a way of interacting with the application. This will often, but not necessarily, be a Web interface. The application may include a user interface supporting, for example, visualization of the data and also an API for programmable interaction with the system.

Linked Data applications themselves can be classified into three main types [1, 2]. First, Linked Data browsers consume Linked Data and present it in a way that allows users to navigate it. Examples of Linked Data browsers such as Sig.ma and Sindice were introduced in Section 10 of Chapter 4. A second category is Linked Data search engines. Unlike conventional search engines that are primarily seen as a means for locating human-readable content, a semantic search engine is used to search for ontologies, vocabularies and RDF documents. Semantic search engines such as Swoogle and Watson were described in Section 22 of Chapter 4. The third category is domain-specific Linked Data applications. The Linked Data music application discussed throughout the chapters is one example of a domain-specific application. These applications are built to address a particular range of problems within a specified domain. The vast majority of Linked Data applications fall into this third category. Later in the chapter we will see some examples from different domains.

Linked Data applications can also be categorised on dimensions that describe various technical aspects of how Linked Data is represented and used (see Figure 2). Linked Data applications can use Semantic Web technologies in a way that is extrinsic or intrinsic. If Semantic Web technologies are used extrinsically, then Linked Data is consumed and processed using APIs, while traditional technologies, such as Relational Database Management Systems (RDBMS), are used for internal storage and processing. An application may also make intrinsic use of Semantic Web technologies, for example storing the internal state of the application in a triplestore rather than an RDBMS. A single application may also combine components that make intrinsic and extrinsic use of Semantic Web technologies.

A Linked Data application can also be classified as to whether it consumes Linked Data, produces Linked Data or both. As described earlier, a Linked Data application that only consumes Linked Data can be more appropriately described as a mashup. Applications can also vary in terms of their semantic richness. A relatively shallow representation of semantics may be used, incorporating for example simple taxonomies. For shallow semantics, the RDF and RDFS vocabularies would probably suffice, as they enable the expression of class hierarchies, class membership and properties. Strong semantic richness, expressing more complex relationships between resources, would require a representational formalism such as one of the variants of OWL (Web Ontology Language).

Linked Data applications can also be classified as to whether they are isolated from, or integrated with, external vocabularies. A Linked Data application that uses its own vocabulary, distinct from other available datasets, would be described as isolated. As described in Chapter 3, Linked Data should wherever possible reuse existing vocabularies or express relationships (for example using owl:sameAs) to other published vocabularies. An application that extensively reuses and interlinks vocabularies would be described as an integrated application. Even an isolated application could become integrated if the vocabulary used within the application is published, and this vocabulary is used to interlink its dataset with others.

Figure 2: Categorisation of Semantic Web Applications [2].

5.4 Examples of Linked Data applications

The following sections outline examples of Linked Data applications in a range of domains.

5.4.1 Governmental Open Data

The website data.gov.uk provides a data catalog of UK government information. Over 9000 datasets are available on the site, covering themes such as transport, government spending, health, crime and the economy. All of the data is openly available but only a small proportion of it is available via Linked Data technologies. The majority of the datasets are published in tabular formats such as CSV. However, a number of Linked Data applications have been built on top of the published datasets [3].

Figure 3: The data.gov.uk catalog of UK governmental information [4].

The data.gov portal is the equivalent website in the US. This contains a much larger number of datasets. However, currently none of the datasets are published following Linked Data standards. Some Web and mobile applications have been developed on top of these datasets.

Figure 4: The data.gov catalog of US governmental information [5].

5.4.2 BBC Dynamic Semantic Publishing

The BBC Dynamic Semantic Publishing (DSP) [6] architecture aims at automating the aggregation and publishing of interrelated content within the BBC portal. This initiative started with sports content before moving to other areas of the BBC. This functionality allows the user of BBC content to navigate from an article to semantically related content. For example, an article about a football match might mention the teams, key players, the managers and the ground at which the match was played. From this article the reader could follow links to further articles related to any of those entities, and then navigate further from there. The reader could therefore initially find out about the game, move on to a biographical article about one of the managers and then start reading about a previous football team that they managed.

The links between the articles, rather than being manually specified by a journalist, are generated based on semantic annotations associated with each article. The Graffiti tool is used to add these semantic annotations, associating the articles with Linked Data concepts such as people and locations (see Figure 5). An OWLIM triplestore is used to store and reason over the RDF data representing the articles and associated data.

As a triplestore is used to drive the data presented on the website, this application can be classified as intrinsic in terms of its use of Semantic Web technologies. However, its use of Semantic Web technologies is not particularly visible to the user, who interacts with it as a conventional website. This can be contrasted with Linked Data browsers that directly expose the RDF data to the user.

Figure 5. The Graffiti tool.

5.4.3 ResearchSpace

ResearchSpace [7] is an environment for conducting cultural-historical research. It provides RDF datasets and tools for investigating cultural objects such as paintings. Associated concepts such as artists and locations are semantically represented. A number of tools are provided allowing users both to access the data and also to contribute new data in the form of RDF annotations. Figure 6 shows the Image Annotation tool. This can be used to describe different features of an artwork such as its subject and location.

Figure 6: Image annotation.

Movie 1: ResearchSpace as an example Linked Data application.

The architecture of ResearchSpace is shown below. An SQL database is used to store documents and media. A triplestore is used to store annotations and associated vocabularies. The Nuxeo content management platform delivers the user interface and has a series of plugins to support annotation and communication with the triplestore. 

Figure 7: ResearchSpace architecture [8].

ResearchSpace also offers a faceted search interface to the cultural resources. Facets are driven by the semantic description of the artefacts according to the CIDOC CRM ontology [9]. This allows filtering of resources according to a number of associated concepts such as object type (e.g. sketch, painting), location and creator. See Section 19 of Chapter 4 for more examples of faceted search over RDF data.

Figure 8: The ResearchSpace CRM Search System [10].

ResearchSpace can be classified as a hybrid tool that combines intrinsic and extrinsic use of Semantic Web technologies. For example, it uses a combination of a triplestore and a relational database to store content. It also both consumes and produces Linked Data, combining existing data about museum objects with additional annotations. Semantic descriptions exploit the structure of the CIDOC CRM ontology to express the different features of the described objects.

5.4.4 Open Pharmacology Space

Open Pharmacology Space is a Linked Data application that aims to provide a semantic research environment for pharmacology, comparable to the support offered by ResearchSpace for cultural research. Open Pharmacology Space integrates different data sources and provides an API to access the aggregated data. Three tools that have been built on top of the Open Pharmacology Space are Open PHACTS Explorer, ChemBioNavigator and PharmaTrek. Open PHACTS Explorer is essentially a Linked Data browser over the RDF data. ChemBioNavigator is a tool for visualising the chemical and biological space of a molecule group. PharmaTrek is a visualisation tool specifically designed for use with the ChEMBL biochemical database.

Figure 9: Open PHACTS Explorer, ChemBioNavigator and PharmaTrek [11].

The architecture of Open Pharmacology Space is shown below. A number of data sources are consumed into a triplestore referred to as an RDF data cache. Some of these sources have to be harvested and transformed by the application. The semantic data workflow engine drives the logic of the application. Data produced by Open Pharmacology Space can be accessed via an API or a SPARQL endpoint. This data is made available according to a unified Open Pharmacology Space data model.

Figure 10: Open Pharmacology Space architecture [12].

The Open Pharmacology Space makes both intrinsic and extrinsic use of Semantic Web technologies. The storage of data in RDF format and the use of semantics to represent workflow are intrinsic aspects of the application. Open Pharmacology Space is also both a consumer and producer of RDF data. It consumes vocabularies of varied richness and produces data according to its own unified data model.

Movie 2: Screencast of the Linked Data API (LDA).

5.4.5 eCloudManager

eCloudManager is a data centre management application. A data centre typically comprises a large number of hardware components from a number of manufacturers. Each of these components will have a reporting system, custom-made by its manufacturer. Together these components provide metadata on all of the hardware running in the data centre, its location and status. The physical hardware itself will generally be used to run a number of virtual machines that may be migrated across different hardware platforms. The hardware and configuration of virtual machines will deliver a number of software applications, each with its own licensing arrangements and separate data stores. A data centre therefore has a number of tools that each provide partial information about its overall state.


Figure 11: A typical data center comprising hardware, virtual machines and software applications [13].

The aim of eCloudManager is to integrate different hardware and software components into a single semantic view. This view brings together the hardware components of various manufacturers, the virtualisation layer delivered by the hardware and the supported range of software applications. The eCloudManager also provides a business view indicating which departments are responsible for different services or hardware, and which customers may be affected by the failure of those components.

The overall integrated view provided by eCloudManager can be filtered depending on the user’s interest. The user could create a view on the system dedicated to the storage infrastructure, comprising hardware components from multiple manufacturers. Similar views can also be created at the virtualisation, application or project levels.

Figure 12: The integrated view on a data centre provided by eCloudManager [13].


Part 2: Architecture of Linked Data Applications

5.5 Architecture of Linked Data applications

In the previous section we saw a number of example Linked Data applications. Each application not only had a different purpose but also made different architectural decisions in terms of how the data was accessed, processed and stored. In this section we present generic architectural patterns of Linked Data applications and different design decisions associated with these patterns.

The term software architecture describes the components of a software system and the relationships between those components. For a web-based system the components could include software modules, databases and web servers. Some parts of the architecture may be legacy components brought forward from previous systems. The relationships between the components of a system indicate which components communicate during operation of the system and the mechanisms by which that communication takes place. As well as specifying the structure of the system, the software architecture also stipulates a set of design practices to be followed in order to create and maintain the architecture.

5.6 Multitier architecture

An important architectural pattern used in system development is the multitier architecture. A multitier architecture separates functionality into a number of layers from low-level data storage through to user interaction components. This architecture is commonly used for many kinds of web application. As many Linked Data applications are also web applications, they tend to conform to this architectural approach.

An important advantage of the tiered architecture is that it logically separates the functionality of the system into a series of layers and specifies the communication between those layers. This separation makes it far easier to replace a layer of the architecture or reuse a layer of an existing architecture in a new application. For example, an application may have a layer or tier dedicated to data storage. This functionality may be provided by a particular database or triplestore. The storage layer could be reused in an alternative application or replaced by another storage layer providing the same functionality and communication with other layers.

The most commonly used multitier architecture is the three-tier architecture. First, a presentation tier provides a user interface that can accept user input and render results in a human-readable form. Second, a logic tier implements the business logic of the application. This takes the available data and analyses and transforms it to meet the needs of the user. Third, a data tier stores the underlying data in a form independent from the business logic applied to it in the application. Figure 13 illustrates our music example as a three-tier architecture. The presentation tier handles user queries and the returned outputs, including textual results and visualisations. The logic tier transforms user queries into SPARQL queries and aggregates the RDF results. The data tier is responsible for storage and in this case uses a triplestore.

One important aspect to note in the case of Linked Data applications is that the dividing line between the data tier and the logic tier may not be so clear-cut. If a relational database is used as a storage layer, then all processing of the data beyond the returning of results to a database query is done in the logic layer. However, triplestores are capable of performing various types of reasoning, and therefore in some cases a significant part of the business logic can be carried out within the triplestore. For reasons of performance it is advantageous to perform reasoning at as low a level as possible.

Figure 13: Music example of the three-tier architecture.
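
To make the separation of tiers concrete, the following minimal Python sketch organises the music example along these lines. The endpoint URL is a placeholder and the SPARQL query reuses the foaf:made and dc:title properties seen later in this chapter; treat it as an illustration of the pattern, not as part of the actual application.

# Three-tier sketch: data tier (query the triplestore), logic tier
# (translate a user request into SPARQL), presentation tier (render).
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/music/sparql"  # hypothetical endpoint

def data_tier(sparql_query):
    # Data tier: run the query against the triplestore.
    client = SPARQLWrapper(ENDPOINT)
    client.setQuery(sparql_query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]

def logic_tier(artist_name):
    # Logic tier: build the SPARQL query and extract the release titles.
    query = """
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        SELECT ?title WHERE {
          ?artist foaf:name "%s" ; foaf:made ?release .
          ?release dc:title ?title .
        }""" % artist_name  # naive substitution, for brevity only
    return [row["title"]["value"] for row in data_tier(query)]

def presentation_tier(artist_name):
    # Presentation tier: render the results in human-readable form.
    for title in logic_tier(artist_name):
        print("-", title)

presentation_tier("The Beatles")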

Movie 3: The three tier architectural pattern and how it is used in Linked Data applications.

Figure 14 shows the architecture of our music application with a particular emphasis on the components within the data tier. The presentation layer produces the kinds of visualisations we saw in Chapter 4. The logic layer processes data for presentation. The data tier, as well as implementing some of the logic, does much more than just store data. Data stored within the data tier may be consumed from a number of sources, such as SPARQL endpoints or RDF dumps. Wrappers may be required to convert the data into an appropriate format. As described in Chapter 3, R2RML may be used to transform data from a relational database into RDF.

Also, as covered in Chapter 3, the retrieved data may use different vocabularies at the schema level and may also, at the instance level, have different individuals referring to the same thing. As described in Chapter 3, languages such as SKOS can be used to express relationships across vocabularies and the owl:sameAs property can be used to relate resources. Chapter 3 also describes how tools such as the SILK framework can be used to identify and express relationships across datasets.

Some data cleansing may also be required, for example to identify and fix ambiguities between the names of resources within the datasets. Therefore, vocabulary mapping, interlinking and cleansing need to be carried out to produce a consistent dataset. As well as the data being used by the logic layer, a facility to republish it directly may be provided.

Figure 14: General architecture of Linked Data applications.

5.7 Architectural patterns

The data consumed and integrated within the application may be accessed in a number of ways. Three main architectural patterns can be identified. First, there is the crawling pattern, in which data is loaded in advance. The data may also need to be transformed as described above. The data is managed in one triplestore so that it can be accessed efficiently. The disadvantage of this pattern is that the data might not be up to date. Second, there is the on-the-fly dereferencing pattern. Here, URIs are dereferenced at the moment that the application requires the data. This pattern retrieves up-to-date data, but performance suffers when the application must dereference many URIs. Third, there is the (federated) query pattern, in which complex queries are submitted to a fixed set of data sources. This approach enables applications to work with current data directly retrieved from the sources. However, finding optimal query execution plans over a large number of sources is a complex problem. In specific situations this third pattern can offer a way of accessing up-to-date data with adequate response times.
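
The on-the-fly dereferencing pattern is particularly easy to demonstrate, as RDF libraries typically handle the underlying content negotiation for us. The sketch below uses the Python rdflib library, with a DBpedia URI purely as an example; it illustrates the pattern rather than being production code.

# On-the-fly dereferencing sketch: fetch a resource's RDF description at
# the moment it is needed. rdflib negotiates an RDF format and follows
# the 303 redirect described later in this chapter.
from rdflib import Graph

g = Graph()
g.parse("http://dbpedia.org/resource/The_Beatles")  # dereferenced on demand
print(len(g), "triples retrieved")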

5.8 Data layer

In the data tier, new Linked Data may be consumed from a SPARQL endpoint in RDF. As discussed above, if the data is in another form such as CSV (Comma Separated Values) then a wrapper would be used to translate the data to RDF. Linked Data applications may implement a Mediator-Wrapper Architecture to access heterogeneous sources, in which wrappers are built around each data source in order to provide a unified view of the retrieved data. The Mediator-Wrapper Architecture could be used with any of the three architectural patterns (i.e. crawling pattern, on-the-fly dereferencing pattern, federated query pattern). The most appropriate pattern will depend on a number of factors such as the number of sources that need to be accessed, how up-to-date the data is required to be and the speed of response required of the data layer.
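
A wrapper of this kind can be very simple. The sketch below reads a CSV file of music artists and emits RDF using rdflib; the file name, the column layout and the example.org namespace are all assumptions made for the purpose of illustration.

# CSV-to-RDF wrapper sketch, assuming a hypothetical file artists.csv
# with columns "id,name,country".
import csv
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/music/")     # hypothetical namespace
MO = Namespace("http://purl.org/ontology/mo/")  # the Music Ontology
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
with open("artists.csv") as f:
    for row in csv.DictReader(f):
        artist = URIRef(EX["artist/" + row["id"]])
        g.add((artist, RDF.type, MO.MusicArtist))
        g.add((artist, FOAF.name, Literal(row["name"])))
        g.add((artist, EX.country, Literal(row["country"])))

print(g.serialize(format="turtle"))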

A number of tools are available that can be used when implementing a data access component. Linked Data Crawlers are web crawlers designed to harvest RDF data (a toy example is sketched below). Linked Data client libraries support the access and traversal of Linked Data. SPARQL Client Libraries provide an API for accessing SPARQL endpoints. Federated SPARQL Engines provide a single access point for multiple heterogeneous data sources. Finally, Search Engine APIs such as Sindice (see chapter 4, section 10.2) support semantic searches that return RDF documents.

Figure 15: Linked Data access mechanisms.
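
As a flavour of what such a crawler does, the following toy sketch dereferences a seed URI and then follows object URIs breadth-first up to a small limit. It is heavily simplified: a real crawler would respect robots.txt, throttle its requests and handle failures more carefully.

# Toy Linked Data crawler sketch: harvest RDF by following links from a
# seed resource. Simplified for illustration only.
from collections import deque
from rdflib import Graph, URIRef

def crawl(seed, max_resources=3):
    data, queue, seen = Graph(), deque([URIRef(seed)]), set()
    while queue and len(seen) < max_resources:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        try:
            data.parse(uri)  # content negotiation for an RDF format
        except Exception:
            continue  # skip URIs that do not dereference to RDF
        for obj in data.objects(subject=uri):
            if isinstance(obj, URIRef):
                queue.append(obj)  # follow links to further resources
    return data

g = crawl("http://dbpedia.org/resource/The_Beatles")
print(len(g), "triples harvested")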

The integrated dataset can then be kept in a local triplestore. A triplestore is required unless data is retrieved on-the-fly and discarded just after building a response to the user request. A number of commercial or free RDF triplestores are available, including OWLIM [14], Jena TDB [15], CumulusRDF [16], AllegroGraph [17], Virtuoso Universal Server [18] and RDF3x [19]. As described in Chapter 3, SPARQL endpoints and RDF dumps can be used to make RDF data available. As we will see later, data can also be made available via APIs or using functionality provided by your chosen application framework.

5.9 Logic and presentation layers

Once the integrated data is available on-the-fly or in a triplestore, it can be used and accessed by the logic and presentation layers. As mentioned above, some of the logic may be implemented in the data layer by reasoning over the triplestore. Other forms of processing that cannot be implemented in the data layer are carried out in the logic layer. As described in Section 23 of Chapter 4, this may involve the application of statistical or machine learning processes to make inferences from the data. Other forms of business logic such as workflow are also implemented within this layer. Finally, the presentation layer displays the information to the user in various formats, including text, diagrams or other types of visualization techniques. A range of tools and formalisms for the display of information were outlined in Chapter 4.


Part 3: Linked Data Application Development Frameworks: Information Workbench

5.10 Information Workbench

The Information Workbench is a Linked Data application development platform that corresponds approximately to the three-tier architectural model described earlier. The Data Integration and Storage layer is used to access, process and integrate multiple datasets. The Data Management layer supports querying and processing of the integrated dataset. The Presentation, Interaction and UI Customisation layer provides various visualization, navigation and authoring tools.

Figure 16: The Information Workbench tiered architecture [20].

The component tools of the Information Workbench that can be used in building a Linked Data application are shown below. Starting from the Resource level at the bottom of the figure, data can be accessed from different locations, including social networking systems via their APIs. Tools within the platform layer can then use this data. These perform various workflow, analysis and processing tasks that we would expect to find in the logic layer of a three-tier architecture. Above that, the SDK (Software Development Kit) layer is a toolset for building rich Linked Data applications. This includes an API for interfacing with external programs and a range of ways of configuring the application. As well as the accessed data, the platform also stores any ontologies to be used by the application. In many cases this will be a pre-authored ontology with which the accessed data will be aligned in order to deliver the application. As well as workflows produced by the workflow engine, the SDK layer also contains widgets that can be used to generate user interface components. These were introduced in Chapter 4 on visualisation.

Figure 17: The Information Workbench extensible architecture.

All of these components can then be used to produce applications (examples are shown on the solution layer). One such example is the MusicBrainz Explorer introduced in chapter 4. Another example would be a data centre manager such as the eCloudManager described earlier in this chapter.

Movie 4: Screencast of the Information Workbench.

5.11 Data storage, access and integration

For data access and storage the Information Workbench uses an open source RDF processing framework called OpenRDF Sesame. This is widely used for data management in a number of Linked Data applications. Low-level data storage within the Sesame architecture is replaceable. Above this is the SAIL (Storage And Inference Layer) that interfaces with the stored RDF data. Above that is the API that can be used to query or modify data in the triplestore.

Figure 18: Data storage and access.

The Information Workbench offers three different ways of configuring back-end storage. The simplest is to use a local repository, either the built-in native Sesame store or an alternative local store such as OWLIM. A second option is to use a remote Sesame repository, in which a Sesame client communicates over HTTP with a Sesame server, which in turn accesses a Sesame native or other store using the SAIL API. A third, less commonly used, option is to run the Information Workbench directly against a remote SPARQL endpoint.

Figure 19: Back-End configuration options.

A series of data providers can be used to extract data from external sources and load it into a central repository. These may be run just once to load the external data or, if the external data source changes over time, run periodically. The Information Workbench has data providers for dealing with a number of formats. For example, a data provider can be used to run a SPARQL CONSTRUCT query (described in Chapter 2) against a triplestore and load the returned triples into the repository. Other data providers can access RDF in other forms such as data dumps. A set of data providers can translate data from an alternative format into RDF. XML2RDF can be used to translate data from XML format to RDF. R2RML can be used to extract data from relational databases. This process of transforming relational database content to RDF was described in Section 3.9.2 of Chapter 3. Additionally, the Groovy scripting language can be used to write new providers, drawing on existing libraries, to transform any other required data format.
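
The CONSTRUCT-based provider is simple to approximate outside the platform. In the sketch below, the endpoint URL is a placeholder and a local rdflib graph stands in for the central repository; with the SPARQLWrapper library, the result of a CONSTRUCT query converts directly to an rdflib Graph.

# Sketch of a CONSTRUCT data provider: copy selected triples from a
# remote endpoint into a local graph standing in for the repository.
from SPARQLWrapper import SPARQLWrapper
from rdflib import Graph

client = SPARQLWrapper("http://example.org/source/sparql")  # placeholder
client.setQuery("""
    PREFIX mo: <http://purl.org/ontology/mo/>
    CONSTRUCT { ?artist ?p ?o }
    WHERE { ?artist a mo:MusicArtist ; ?p ?o . }
""")

repository = Graph()
repository += client.query().convert()  # CONSTRUCT results parse to a Graph
print(len(repository), "triples loaded into the repository")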

The approach described above essentially adopts a data warehousing approach, in which data is copied from multiple sources into a single store. Queries can then be run over the content of a single data warehouse. An alternative approach is to use a federation pattern, in which a single query is transformed into a set of queries against multiple SPARQL endpoints. The process of breaking down and distributing queries across SPARQL endpoints is often complex. The federation layer may already have knowledge of which queries or parts of queries can be answered by different SPARQL endpoints. The federation component is then responsible for decomposing the input query into sub-queries that can be answered by the federated SPARQL endpoints, posing them to the endpoints and gathering the received answers. In some cases the federation component may not know which endpoints can satisfy the query and will therefore need to replicate the query across multiple endpoints and merge the returned results.
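
The last case, replicating a query across several endpoints and merging the results, is the easiest to sketch. The endpoint URLs below are placeholders; a real federation engine would additionally decompose queries and optimise the execution plan.

# Naive federation sketch: pose the same SELECT query to each endpoint
# and merge the returned bindings.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINTS = [  # placeholder endpoint URLs
    "http://example.org/musicbrainz/sparql",
    "http://example.org/dbpedia/sparql",
]

def federated_select(query):
    merged = []
    for url in ENDPOINTS:
        client = SPARQLWrapper(url)
        client.setQuery(query)
        client.setReturnFormat(JSON)
        try:
            merged.extend(client.query().convert()["results"]["bindings"])
        except Exception:
            pass  # one unreachable endpoint should not fail the whole query
    return merged

rows = federated_select(
    "SELECT ?name WHERE { ?a <http://xmlns.com/foaf/0.1/name> ?name } LIMIT 5")
print(len(rows), "bindings merged")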

The Information Workbench features a tool called FedX that can support the management of federated search. FedX provides assistance in coordinating queries across the SPARQL endpoints and also offers some degree of optimisation in order to minimise the load on any individual endpoint and minimise traffic through the network.

Figure 20: Data integration with FedX.

5.12 User interaction

RDF data, whether stored by the Information Workbench or accessed in real-time, can be made available through a customisable user interface. A demo user interface of the music data can be accessed at [21]. An example snapshot of the interface is shown below. The main view area is shown in the middle. In this case it is presenting a list of music artists including The Beatles. On the left is a series of tabs that can be used to select different views. The view currently shown is the Wiki View. This can be used to change, create and interlink data. Each tab offers a particular type of view such as a tabular or graph view on the data.

Movie 5: Introduction to how Information Workbench Views support user interaction with resources.

Figure 21: The customisable user interface [21].

The concept behind the Information Workbench user interface is that there is a one-page view for each resource in the dataset. For example, there are distinct one-page views for The Beatles, Depeche Mode and Rihanna. Any page will probably have hypertext links to other pages, reflecting relationships between those resources in the RDF graph.

Figure 22: The one-page URI concept.

If those resources have rdf:type relations to concepts in an associated ontology, then custom templates can be defined for the display of resources depending on their type. For example, resources of type MusicArtist in the music ontology may use a template that displays their discography on a map or timeline and provides links to members should that MusicArtist be a band. All resources of type MusicArtist would be presented using the same template. For each type, the template therefore defines which links in the RDF are shown and how they are visually presented within the page. Templates may be defined for other types of resource such as music releases.

Figure 23: Resource pages rendered using the template for mo:MusicArtist.

As well as the wiki view, other views can be used to inspect a resource. The Table View shows all triples in the dataset that have the resource as their subject. This is called the immediate description (see Section 3.6.2 of Chapter 3). The Table View may show additional information that was not selected for inclusion in the template. The Graph View provides a visual overview of the neighbourhood of the resource. Similar to the hierarchical visualisations in Section 4.6.2 of Chapter 4, the size of different sectors within the graph gives an indication of the number of relations of different types within the graph. The Pivot View gives a statistical display of the frequency of different types of individuals associated with the inspected resource.

Figure 24: Different views on a resource.

User interfaces are constructed around a series of widgets that provide different views on the data. A number of standard widgets are provided with the platform, and further widgets can be authored using the API. Standard visualisation widgets are available for displaying maps and timelines. Analytics and reporting widgets can be used to construct charts of frequency data. Mashup widgets can be used to bring in associated data from social media streams; for example, tweets mentioning a music artist can be streamed into the music artist template. Authoring and content creation templates can also be used to modify the underlying dataset.

All widgets can be configured using the wiki syntax. The SPARQL query embedded within the wiki syntax defines the part of the dataset to be rendered by the widget. In the example below, the SPARQL query returns the labels of musical releases and the number of times they have been released. As the results are ordered in descending order of release frequency and limited to ten, the query returns the ten most released musical works. Above the SPARQL query the widget type is specified (i.e. BarChart). Below the SPARQL query the parameters input and output are used to map the values returned from the SPARQL query to the x and y axes of the bar chart.

{{#widget: BarChart
 | query = 'SELECT DISTINCT (COUNT(?Release) AS ?COUNT) ?label WHERE {
     ?? foaf:made ?Release .
     ?Release rdf:type mo:Release .
     ?Release dc:title ?label .
   }
   GROUP BY ?label
   ORDER BY DESC(?COUNT)
   LIMIT 10'
 | input = 'label'
 | output = 'COUNT'
}}

Figure 25: Using the wiki syntax to specify a barchart.

As we described above, each individual in the dataset can be rendered as a single page by instantiating the template associated with its type with the data associated with that individual. The figure below illustrates how the resource page for Barack Obama is built by rendering data about Barack Obama according to the template devised for any instance of the class foaf:Person. Any configuration of the specific view for the Barack Obama resource takes precedence over template definitions at the class (i.e. foaf:Person) level.

Figure 26: Producing a resource page for a person.

One of the main types of resource in our music example is MusicArtist. Pages can be constructed for displaying a class as well as individuals. The example below shows a page rendered for the MusicArtist class, listing each individual and their country of origin. This is a custom page generated using the wiki syntax for displaying individuals of the MusicArtist class.

Figure 27: A list of MusicArtist individuals [22].

Each music artist listed in this overview page hyperlinks to information about that artist. This page is rendered using the template of the associated class, which in this case is the MusicArtist class. We can see that the page rendered for The Beatles contains information about the band, its members, country of origin and also recent tweets related to the band.

Figure 28: Information about the Beatles rendered by the MusicArtist template [21].

Mashups with external sources such as Twitter and YouTube are created using mashup widgets. The query below searches YouTube for resources that contain the string that is the foaf:name of the music artist. In the case of The Beatles, the YouTube mashup would contain resources returned by a string search of “The Beatles” on YouTube.

{{#widget: Youtube
 | searchString = $SELECT ?x WHERE { ?? foaf:name ?x . }$
 | asynch = 'true' }}

Figure 29: Using the wiki syntax to search for relevant YouTube resources.

The Triple Editor, shown below, can be used to add or edit triples that have the selected resource as their subject. The Triple Editor is an edit mode of the Table View described earlier and is available to users with sufficient permissions. If the subject of the triples belongs to a class, then properties defined for, or inherited by, that class are suggested when adding new triples through the Triple Editor. For example, a music artist may have a fan page or discography.

Figure 30: Suggested properties for the class MusicGroup.

Data can also be validated using the type restrictions on the domain or range of a property. For example, the range of a property may be restricted to integer or Boolean values. In the snapshot below, the second value produces a validation error as “abc” is not of type integer.

Figure 31: Input validation using property restrictions.

5.13 Further information

The following resources provide further information about the Information Workbench:

  • Information Workbench product page [20]
  • Demo system [21]
  • Free download Community Edition version of Information Workbench [23]
  • Online documentation [24]


Part 4: Linked Data Application Development Frameworks: Callimachus, LMF and Synth

5.14 Callimachus

Other frameworks are available for building Linked Data applications. Callimachus is a scalable platform for creating and running data-driven websites [25]. Callimachus can be deployed on a server and used as a web-based tool for building Linked Data applications. A visual web interface can be used for constructing the set of components that make up a Linked Data application.

Figure 32: The Callimachus framework.

5.15 LMF

The Linked Media Framework (LMF) offers a number of advanced services for linked media management [26]. These are built on top of three key Apache components. First, Apache Marmotta is a Linked Data platform in its own right. Marmotta provides a series of capabilities for consuming and republishing Linked Data and comes with its own triplestore.

Apache Stanbol is a tool for the extraction of Linked Data from non-Linked Data sources such as plain text. This could be used, for example, to identify terms from a SKOS thesaurus in a collection of text documents. Apache Solr is a tool for indexing content, i.e. associating content with Linked Data resources. This can be used to support the efficient search or faceted browsing of content.

5.16 Synth

Synth [27] is a Linked Data application development environment built on Ruby on Rails. A unique aspect of Synth is that applications are built using a set of principles termed the Semantic Hypermedia Design Method (SHDM). These principles describe how the application should be built and run. Using SHDM, an application is described using a series of models that define particular aspects of the application, such as requirements, navigation, interface, behaviour and access rights.

Figure 33: The Synth framework.


Part 5: Using Web APIs

5.17 HTTP communication

If a Linked Data application is built on the web, then it may use Web APIs to either provide data or consume data from other sources. The HTTP protocol is the fundamental technology on which Web APIs are built; it is the same protocol used for serving documents on the Web, for example the serving of an HTML page to a web browser.

HTTP communication is based on an interaction that involves a series of requests and responses. A client sends a request to a server. The server sends back a response to the client.

Figure 34: Request-Response interaction.

5.17.1 HTTP Request

Each HTTP request contains a method, a URI, headers and optionally a body. The method indicates the type of action that the server should perform. The most familiar types of HTTP request are GET and POST. A GET request means the client wants to retrieve content. The URI sent with the GET request identifies the resource from which the content should be retrieved. A POST request is used to send data. The accompanying URI indicates where the data should be sent. Web forms generally use POST requests to send data to the server.

The other types of HTTP request are used more broadly in web-based client-server communication but are less important when using a web browser to retrieve and send content. A PUT request is used to store data at the specified URI. A DELETE request is used to delete the resource identified by the specified URI.

Other types of HTTP request include HEAD, TRACE, CONNECT, OPTIONS and PATCH. For example, a HEAD request is used to retrieve header information. A TRACE request is used to allow the client to see what the server is receiving. This is generally used for diagnostics.

As well as the method and URI, a HTTP request contains header information that gives additional detail about the request. A body is required with POST and PUT methods and carries the data that is to be sent to, or stored at, the specified URI. For example, if you submit a web form, then the data entered into the form can be represented in the body of the HTTP request.
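
The anatomy of a request can be seen by constructing one by hand. The sketch below uses Python's standard http.client module; the host and paths are examples only and the requests are not expected to succeed against a real server.

# Building HTTP requests by hand: method, URI (path), headers and, for
# POST, a body carrying the data to be sent.
import http.client

conn = http.client.HTTPConnection("www.example.org")

# A GET request: method + URI + headers, no body.
conn.request("GET", "/index.html", headers={"Accept": "text/html"})
response = conn.getresponse()
print(response.status, response.reason)
response.read()  # finish reading before reusing the connection

# A POST request additionally carries a body, here form-encoded data.
conn.request("POST", "/form", body="name=The+Beatles",
             headers={"Content-Type": "application/x-www-form-urlencoded"})
response = conn.getresponse()
print(response.status, response.reason)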

5.17.2 HTTP Response

A response to a HTTP request contains a numerical response code, a header and optionally a body. The response code gives overall status information on how the request is being handled. The response code is a three-digit number beginning with a 1, 2, 3, 4 or 5. Response codes beginning with 1 are provisional responses indicating that the request has been received and is being acted upon.

Codes beginning with 2 indicate that the request has been successfully received, understood and accepted by the server. Codes beginning with 3 indicate that further action needs to be taken by the client that issued the initial request. Codes beginning with 4 indicate that the request is erroneous and cannot be met by the server. The most commonly seen of these is the 404 response code, which informs the client that the request was erroneous because it asked for a resource that does not exist on the server. Finally, codes beginning with 5 indicate that there is an error, but in this case the problem is with the server and it is unable to fulfil the valid request.

5.17.3 HTTP Request-Response pattern

We can now put the request and response together to illustrate the HTTP request-response pattern, in which a client requests and then receives a web page. The client issues a GET request with the URI for the Wikipedia page about The Beatles. This tells the server that the client wants to retrieve information from the provided URI. The final response from the server has a response code of 200 (indicating that the request has succeeded) and returns the HTML page about The Beatles.

Figure 35: The HTTP request-response pattern.

5.17.4 HTTP content negotiation

When requesting data in a particular format from DBpedia rather than Wikipedia, the communication pattern between client and server is more complex. In this case the URI refers to the concept of The Beatles rather than any particular document describing The Beatles. As shown below, the request uses the method GET and declares that a response is required in HTML format. The URI refers to the concept of The Beatles, which is not a resource in HTML format. The server responds with a code of 303 and another URI. This tells the client to instead make a request for HTML from this alternative URI. The client then makes a second request (not shown) for HTML using the new URI and receives an HTML page about The Beatles with a response code of 200.

Figure 36: Content negotiation when requesting HTML using a Linked Data URI.

In the above HTTP conversation the client requested information in a certain format using a particular URI. The server responded with an alternative URI. The client then made a second request with the new URI. The server responded with the requested information. This conversation between client and server to determine the correct resource is called content negotiation and is often abbreviated to conneg.

The reason this conversation to determine the appropriate content is required is that different types of content can potentially be returned for the same resource. If information is being accessed via a web browser, then HTML is likely to be the preferred format. If the client is a Linked Data application consuming data, then the information about The Beatles will be preferred in an RDF format such as Turtle (see Chapter 1 for more information about the Turtle format). In the figure below we have a GET request using the same URI as in the HTML example above. However, this time the client requests text/turtle rather than text/html. The server responds with the status code 303 (i.e. see other) and a URI where the information can be accessed in Turtle format.

Figure 37: Requesting data in Turtle format.

The client then issues a second request and this time retrieves the data in Turtle format along with a status code of 200. This approach is routinely used to publish Linked Data, in which a series of URIs name concepts used in the Linked Data application, such as The Beatles and Paul McCartney. Requests for data to a Linked Data URI are directed to another URI depending on the data format requested. Multiple RDF formats of the same data may be made available such as RDF/XML and Turtle.

Figure 38: Retrieving data in Turtle format.
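
This conversation can also be observed programmatically. The sketch below uses the Python requests library and disables automatic redirect handling so that the intermediate 303 response is visible; the DBpedia URI is used as an example.

# Observing content negotiation: ask for Turtle, receive a 303 redirect,
# then follow it to retrieve the data document itself.
import requests

first = requests.get("http://dbpedia.org/resource/The_Beatles",
                     headers={"Accept": "text/turtle"},
                     allow_redirects=False)
print(first.status_code)              # expected: 303
data_uri = first.headers["Location"]  # the URI that serves Turtle

second = requests.get(data_uri, headers={"Accept": "text/turtle"})
print(second.status_code)             # expected: 200
print(second.text[:300])              # the first few lines of Turtle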

5.18 Web APIs

The HTTP conversations we have seen above provide the foundation for Web APIs. Web APIs are particularly important when a Linked Data application needs access to data that is being dynamically created. If a Linked Data application is using data about music artists and releases from the 1960s, then few updates to the stored data will be required. If dynamically changing data is being used, such as current weather conditions or traffic levels, then Web APIs can be used to provide access to new data. A Web API can provide a range of functionalities, giving access to views on that data and also transforming the data in ways useful to other applications.

Over the past few years there has been a huge growth in the number of Web APIs, though most of these are not Linked Data Web APIs. The most common form of API is the REST (Representational State Transfer) API. REST is an architectural style that uses the HTTP protocol for communication between client and server. The Programmable Web is a general directory of Web APIs. This allows providers to register their API and other application developers to search for available APIs. The vast majority of Web APIs registered with the Programmable Web use the REST model.

Figure 39: Growth in Web APIs [28].

5.19 Richardson Maturity Model for Web Services

The Richardson Maturity Model provides a way of thinking about the main elements that make up a REST architecture. The model is divided into a number of layers, each layer being a precondition of the one above. All of these layers can be seen as a necessary requirement of a REST architecture. Starting from the lowest level, we have resources and their URIs. For example, as we saw above, The Beatles resource is identified by the URI http://dbpedia.org/resource/The_Beatles. The second level is HTTP verbs. These are essentially the HTTP methods (such as GET and POST) we saw earlier. These define actions to be carried out such as sending or retrieving data. The third level is known by the acronym HATEOAS (Hypermedia As The Engine Of Application State). This describes the hypermedia controls, in other words the higher-level functions provided by the Web API, such as inspecting and modifying the music releases associated with a music artist.

Figure 40: The Richardson Maturity Model [29].

On the lowest level, the Resource level, the Web API just makes available URIs that identify resources. As we saw in the previous section, a Linked Data URI may direct the client to alternative representations of that resource in, for example, Turtle or RDF/XML. On the second level, the HTTP verbs are methods that can act on those resources. Different methods can be used such as GET, POST and DELETE. The OPTIONS method can be used by a client to get information about the types of request currently available for a resource. The server responds to requests with an appropriate code such as 200 (OK), 303 (see other) or 404 (not found). The methods and their associated response codes give us a standardized form of communication in terms of the HTTP protocol. This therefore defines the different types of request that a client can make and the ways in which a server can respond to those requests.

HTTP verbs can be characterised as to whether they are safe and whether they are idempotent. A verb is characterised as safe if it cannot change the resource addressed on the server. For example, GET is safe because it merely retrieves information from the resource. It does not attempt to modify the resource. A HTTP verb is characterised as idempotent if the effect of sending one request is identical to sending multiple identical requests. The method DELETE is idempotent. Sending a single request to DELETE a resource will remove it from the server. Sending the DELETE request multiple times results in the same state. The resource stays deleted no matter how many times the request is sent. The only difference when sending multiple DELETE requests is that once the resource has been deleted the server will respond with the code 404 (not found) rather than 200 (OK), as the resource is no longer available for deletion.

Figure 41: HTTP verbs.

On the third level, HATEOAS describes how we use resources to drive the application through a series of dynamic states. For example, a client may send an order for a music album to a Web API. In this case the request will need to identify the album to be purchased, such as Revolver by The Beatles. The Web API creates a new resource to identify the order and sends a response to the client. The response indicates the resource identifying the order and the options available to the client. In this case, the server indicates that the order is awaiting payment and the price to be paid. The client could then respond by sending payment details, such as a credit card number, to the resource created for the order. If payment is accepted, then the Web API may trigger a physical shipment of the album or send the customer details of where a digital copy of the music can be accessed.

Figure 42: Transitions through states in a music ordering process.
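
A hypothetical version of this conversation is sketched below using the Python requests library. All of the URLs, JSON fields and links are invented for illustration: the point is that the client is driven by the links and state the server returns, rather than by hard-coded knowledge of the ordering workflow.

# Hypothetical HATEOAS-style ordering conversation (invented API).
import requests

API = "http://shop.example.org"  # hypothetical music shop API

# 1. Create an order; the server responds with the new order resource,
#    its state and the links (controls) available to the client.
order = requests.post(API + "/orders", json={"album": "Revolver"}).json()
print(order["status"], order["price"])    # e.g. "awaiting-payment", 9.99

# 2. Follow the payment link supplied by the server to pay for the order.
payment_link = order["links"]["payment"]  # supplied by the server
receipt = requests.put(payment_link, json={"card": "4111 ..."}).json()
print(receipt["status"])                  # e.g. "paid"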

5.20 Freebase API

Freebase offers an API for retrieving RDF that is commonly used by Linked Data applications [30]. Freebase is a service, now owned by Google, which collects semantic data on a variety of topics. As described in section 3.8.2 of chapter 3, Freebase is used to enrich Google search results with disambiguation panes. Freebase offers a number of APIs to application developers including an API to access an RDF representation for a resource. The URI for the Web API is the URL for the Freebase RDF service followed by the Freebase identifier of the topic for which RDF is requested.

https://www.googleapis.com/freebase/v1/rdf/<id>

As the Freebase id for The Beatles is m/07c0j, the Web API request for RDF data on The Beatles would be as follows.

GET https://www.googleapis.com/freebase/v1/rdf/m/07c0j

Internally, Freebase maps its stored facts about The Beatles into RDF triples. Mappings are also made to RDF Schema. For example, the Freebase property /type/object/name that associates a resource with its name is mapped to rdfs:label. The response contains up to 100 triples found for each predicate of the resource.
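
Calling this service from a program is straightforward; a sketch with the Python requests library is shown below. Note that Google may require an API key for anything beyond light use, which is omitted here.

# Sketch: retrieve the Freebase RDF description of The Beatles (m/07c0j).
import requests

resp = requests.get("https://www.googleapis.com/freebase/v1/rdf/m/07c0j")
print(resp.status_code)
print(resp.text[:300])  # the first few of the returned triples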

5.21 Non-RDF APIs

A number of other commonly used APIs can be used to retrieve data from online services such as Twitter, LastFM and Foursquare. None of these Web APIs returns RDF, but each can be used in conjunction with a wrapper to translate the data into RDF for storage within a Linked Data application, as sketched below. Twitter can provide data on, among other things, timelines, tweets, direct messages, followers, users and places [31]. The LastFM Web API can be used to access music-related data such as albums, artists, events and venues [32]. Foursquare can be used to access check-ins at locations, tips and reviews [33].
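
The wrapper pattern for such APIs is simple to sketch. The JSON shape and field names below are invented for illustration (standing in for, say, a LastFM artist lookup), and the example.org namespace is a placeholder; a real wrapper would first fetch the JSON from the Web API.

# Wrapper sketch: translate a (hypothetical) JSON API response into RDF.
from rdflib import Graph, Literal, Namespace, URIRef

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EX = Namespace("http://example.org/music/")  # placeholder namespace

api_response = {  # invented response, standing in for a real API call
    "name": "The Beatles",
    "listeners": 3000000,
}

g = Graph()
artist = URIRef(EX["artist/the-beatles"])
g.add((artist, FOAF.name, Literal(api_response["name"])))
g.add((artist, EX.listeners, Literal(api_response["listeners"])))
print(g.serialize(format="turtle"))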

5.22 Summary

After studying this chapter you should have achieved the following outcomes:

  • An understanding of Linked Data applications as systems that both consume and also transform and/or produce Linked Data.
  • An appreciation of some examples of Linked Data applications from different domains.
  • Knowledge of the dimensions on which Linked Data applications can be classified: Semantic Web technology depth, information flow direction, semantic richness and semantic integration.
  • An understanding of multitier architectures and how Linked Data applications can consume data using crawling, on-the-fly de-referencing and federated search patterns.
  • An understanding of the main components of Linked Data applications such as the triplestore, logic components, user interaction components, data access and integration components and republishing components.
  • Knowledge of how the Information Workbench can be used as a framework for building Linked Data applications. An awareness of alternative frameworks such as Callimachus.
  • An understanding of the request-response pattern of HTTP communication, the methods used in a HTTP request and the response codes that are returned. An understanding of how these methods are used in content negotiation.
  • Knowledge of REST APIs and the Richardson Maturity Model.

5.23 Further reading

[1] Hausenblas, M. (2009). Linked Data Applications. Technical Report, DERI, Galway.

[2] Martin, M. and Auer, S. (2010). Categorisation of Semantic Web Applications. International Conference on Advances in Semantic Processing (SEMAPRO 2010), Florence, Italy.

[3] http://data.gov.uk/apps

[4] http://data.gov.uk

[5] http://catalog.data.gov/dataset

[6] http://www.bbc.co.uk/blogs/bbcinternet/2012/04/sports_dynamic_semantic.html

[7] https://sites.google.com/a/researchspace.org/researchspace

[8] https://confluence.ontotext.com/display/ResearchSpace/RS+Infrastructure

[9] http://www.cidoc-crm.org

[10] https://www.youtube.com/watch?v=HCnwgq6ebAs

[11] http://www.openphacts.org/open-phacts-discovery-platform

[12] Williams A., Harland L., Groth P. et al. (2012). Open PHACTS: Semantic interoperability for drug discovery. Drug Discovery Today, 17 (21-22), 1188-1198.

[13] http://www.fluidops.com/ecloudmanager

[14] http://www.ontotext.com/owlim

[15] http://jena.apache.org/documentation/tdb

[16] https://code.google.com/p/cumulusrdf

[17] http://www.franz.com/agraph/allegrograph

[18] http://virtuoso.openlinksw.com

[19] https://code.google.com/p/rdf3x

[20] http://www.fluidops.com/information-workbench

[21] http://musicbrainz.fluidops.net

[22] http://musicbrainz.fluidops.net/resource/mo:MusicArtist

[23] http://www.fluidops.com/information-workbench/iwb-download

[24] http://help.fluidops.com/help/topic/iwb.help-2.5/help.html

[25] http://callimachusproject.org

[26] https://code.google.com/p/lmf

[27] http://www.tecweb.inf.puc-rio.br/synth

[28] http://programmableweb.com

[29] Richardson, L. and Ruby, S. (2007). RESTful Web Services. O'Reilly.

[30] https://developers.google.com/freebase

[31] http://dev.twitter.com

[32] http://www.last.fm/api

[33] http://developer.foursquare.com