Chapter 3: Providing Linked Data

3.1 Introduction

In the previous chapter we described how SPARQL is used to formulate queries and retrieve information from a dataset. In this chapter we look at how such datasets can be created and made available in the first place. We will describe how data can be extracted from different sources such as texts and databases and then represented as linked data using the RDF data model. We also show how relationships can be used to express connections between datasets, as well as to define concepts used in a dataset in terms of a different vocabulary. We will also describe the steps involved in making the dataset accessible and discoverable.

In this chapter we once again use MusicBrainz as a motivating example. MusicBrainz is used to show how different music data sources, stored, for example, in databases and texts, can be prepared, matched to the MusicBrainz schema and made available as a linked data source.

The final sections provide a practical introduction to a number of tools that assist in the process of providing linked data.

3.2 Learning outcomes

On completing this chapter you should understand the following:

  • The main stages in the Linked Data lifecycle from its creation through to its publication and use.
  • How to design new URIs and link to existing vocabularies when creating data.
  • How the Simple Knowledge Organisation System (SKOS) can be used to specify links between the created data and existing datasets.
  • How to publish a dataset, providing access to both metadata about the dataset as well as the dataset itself.
  • How to manage datasets and make them available in linked data repositories.
  • How linked data is used to enhance search engines.
  • How a range of tools can be used to create linked data from different formats such as spreadsheets, databases and texts.
  • How the SILK framework can be used to automatically discover relationships between datasets.


Part I: Creating, Interlinking and Publishing Linked Data

3.3 The Linked Data Lifecycle

The process of providing linked data is often characterised as a Linked Data lifecycle in which data is created (often reusing other data sources), prepared and then made accessible for use. The new data can then become one of a number of data sources used in the preparation of further datasets. This illustrates the cyclic nature of linked data creation and publishing. Before looking in more detail at different perspectives on the stages of the linked data lifecycle, we will recap the four Linked Data Principles introduced in Chapter 1. In summary, these principles are as follows:

1) Use URIs as names for things.

2) Use HTTP URIs so that users can look up those names.

3) When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).

4) Include links to other URIs, so that users can discover more things.

As described in Chapter 1, a clear motivation for these principles is to publish “things” as URIs in a way that is accessible (principle 2), useful (principle 3) and connected to other published things (principle 4). Essentially this means publishing data in a way that enhances its practical value and facilitates the use of that dataset as input to other datasets. This therefore creates a cycle in which data is created, published and then used in the creation of new data.

Movie 1: Linked Data lifecycle. In this webinar clip, Dr Barry Norton describes the four Linked Data principles and ways of characterizing the Linked Data lifecycle.

Different ways of understanding the linked data lifecycle and its constituent steps have been proposed. Sören Auer [1] proposes an eight-stage linked data lifecycle. This involves searching or finding sources from which data can be extracted and stored, interconnecting these sources of data, establishing the quality of the new dataset, maintaining the data as it evolves and making it available as a data source for further browsing, search or exploration.

Figure 1: Sören Auer (2011) “The Semantic Data Web” [1].

The linked data lifecycle proposed by José M. Alvarez compresses this to four cyclical stages: Produce, Publish, Consume and Feedback/Update. A process of validation may be applied to any or each of these four stages.

Figure 2: José M. Alvarez. (2012) “My Linked Data Lifecycle” [2]

Finally, Michael Hausenblas presents a linked data life cycle in which the four central stages cover similar processes in which data is modelled, published, discovered and then integrated with other sources of data.  

Figure 3: Michael Hausenblas (2011) “Linked Data lifecycle” [3]

In the following sections we will focus on three main stages that can be found throughout the various formulations of the linked data lifecycle. These three stages can also be mapped to the linked data principles.

Creating Linked Data: This involves data extraction, the creation of HTTP URIs and vocabulary selection. This relates to linked data principles 1 and 2 as it involves finding or creating HTTP URI names for things.

Interlinking Linked Data: This involves finding and expressing associations across datasets. This relates to linked data principle 3 as it involves providing links to other things.

Publishing Linked Data: This involves creating metadata about a dataset and making the dataset available for use. It therefore relates to linked data principle 3 in that it ensures useful information is returned about the dataset.

3.4 Creation of Linked Data

New data may be initially authored in linked data format. However, often data of interest may already be stored in some alternative format. The most common formats are:

  • Spreadsheets or tabular data
  • Databases
  • Text

Several tools have been developed to support data extraction from these sources. We will look at a number of these in the final sections of this chapter. OpenRefine is a tool for translating spreadsheet or tabular data into linked data. R2RML [4] is a W3C recommendation for specifying mappings between relational databases and linked data. A number of tools exist to support extraction of data from free text including GATE, Zemanta and DBpedia Spotlight.

The aim of all these tools is to allow disparate, possibly messy data sources to be viewed in terms of the RDF data model. As described in chapter 1, RDF represents knowledge in the form of subject-predicate-object triples. The subject and object are the nodes of the triple. Nodes represent the concepts or entities within the data. A node is labelled with a URI, a blank node identifier or a literal. Relations between the concepts or entities are modelled as arcs, which correspond to predicates within the data model. Predicates are expressed as URIs.

Figure 4: An RDF Subject Predicate Object triple shown graphically.
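The triple structure can be sketched in Turtle. In the example below the `ex:` namespace and its terms are purely hypothetical; `foaf:name` is a real FOAF property. One triple has a resource as its object, the other a literal:

```turtle
@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# subject    predicate     object (another resource)
ex:artist1   ex:memberOf   ex:band1 .

# subject    predicate     object (a literal)
ex:artist1   foaf:name     "John Lennon" .
```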

As described above, Linked Data principle number 1 states that URIs should be used as names for things. URIs can be used to name both nodes and arcs within RDF triples. More specific guidance can be found on how to design effective URIs. These are referred to as Cool URIs. The guidelines for the design of Cool URIs [5] can be summarised as follows:

  • Leave out information about the data relating to: author, technologies, status, access mechanisms, etc.
  • Simplicity: short, mnemonic URIs
  • Stability: maintain the URIs as long as possible
  • Manageability: issue the URIs in a way that you can manage

Vocabularies model concepts and relationships between them in a knowledge domain. These can be used to classify the instance-level nodes expressed in the RDF triples. You should avoid defining your own vocabulary from scratch unless absolutely necessary. Try to make use of (and therefore link to) well-known vocabularies that are already available. A large number of vocabularies are available as Linked Open Data. Many can be discovered through the Linked Open Vocabularies (LOV) dataset [6]. Using LOV you can free-text search for vocabularies (for example, vocabularies related to the term “music”) and filter the results according to a number of facets such as the associated domain (e.g. science, media) of the vocabulary.

LOV also allows a number of vocabularies to be visualized in terms of how interconnected they are with other vocabularies. For example, we can see that the Music Ontology (referred to as “mo” in the figure) references 25 other vocabularies, the principal ones including Friend Of A Friend (FOAF) [7] and DCMI Metadata Terms (dcterms) [8]. FOAF is useful in this context as it provides a way of describing the people who feature as music artists. For example, FOAF can be used to represent their name, gender and contact details. The namespace dcterms can be used to describe core features of the products of music artists (i.e. albums and tracks) such as their title and description.
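As a sketch of how these vocabularies might be combined (the `ex:` resources and literal values below are illustrative, not taken from MusicBrainz), FOAF describes the person and dcterms describes the album:

```turtle
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

# The artist, described with FOAF
ex:artist1 a foaf:Person ;
    foaf:name   "John Lennon" ;
    foaf:gender "male" .

# One of the artist's albums, described with dcterms
ex:album1 dcterms:title       "Imagine" ;
    dcterms:description "A studio album released in 1971." .
```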

The Music Ontology is itself referenced by seven other vocabularies in the LOV dataset. These visualizations help in gauging how widely used a vocabulary is and how it aligns with other available vocabularies.

Other lists of well-known vocabularies are available. The W3C SWEO Linking Open Data community project maintains a list of the most commonly used vocabularies [9]. The Library Linked Data Incubator Group maintains a list of vocabularies related to the linked open library data [10].

Figure 5: Analysis of the Music Ontology [11]

Movie 2: Linked Open Vocabularies. In this webinar clip, Dr Barry Norton discusses how Linked Open Vocabularies can be used to find existing vocabularies for reuse in a new dataset.

3.5 Interlinking Linked Data

Besides making use of existing vocabularies, the author or maintainer of a dataset should investigate how entities in the dataset can be linked out to entities in other datasets. This follows Linked Data principle number 4 by linking to other URIs so that the user can discover more things. RDF links between entities in different datasets can be specified on two levels: the instance level and the schema level. On the instance level links can be made between individual entities (e.g. people, places, objects) using the properties rdfs:seeAlso and owl:sameAs. The property owl:sameAs is used to express that two URI references actually refer to the same thing. The property rdfs:seeAlso indicates that more relevant information can be found by following the link.

In MusicBrainz the property owl:sameAs is used to connect resources referring to the same music artist. The property rdfs:seeAlso is used to connect albums produced by artists. The reason rdfs:seeAlso is used instead of owl:sameAs is that a URI may refer to a particular release of an album (such as the US rather than UK release). It would be incorrect to express an owl:sameAs relationship between them as they may differ in terms of release date and geographical market. They may also differ in other ways such as their track listings and album covers. An approach to modelling different album releases and their characteristics will be described in section 3.7.2.
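These two instance-level linking patterns can be sketched as follows (the example.org URIs are hypothetical stand-ins for URIs in two real datasets):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Two URIs in different datasets that denote the very same artist
<http://example.org/dataset1/artist/42>
    owl:sameAs <http://example.org/dataset2/musician/42> .

# Two releases of the same album: related, but not identical, resources
<http://example.org/dataset1/release/uk-edition>
    rdfs:seeAlso <http://example.org/dataset2/release/us-edition> .
```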

On the schema level, which contains the vocabulary used to classify the instance-level items, a number of relationships can be expressed using RDFS, OWL and the SKOS Mapping vocabulary. The RDFS properties rdfs:subPropertyOf and rdfs:subClassOf can be used to declare relationships between two properties or two classes from different vocabularies.
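For example (the `ex:leadSinger` property below is hypothetical; mo:MusicArtist and foaf:Agent are real terms, and the Music Ontology does declare mo:MusicArtist as a subclass of foaf:Agent):

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mo:   <http://purl.org/ontology/mo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/vocab#> .

# Class-level link: every music artist is also a FOAF agent
mo:MusicArtist rdfs:subClassOf foaf:Agent .

# Property-level link: a group's lead singer is, in particular,
# one of its members (foaf:member relates a group to an agent)
ex:leadSinger rdfs:subPropertyOf foaf:member .
```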

As described in chapter 2, OWL also provides predicates for stating that two classes, or two properties, have the same meaning, as follows:

mo:MusicArtist owl:equivalentClass ex:musician .

foaf:made owl:equivalentProperty ex:creatorOf .

The first of these triples means that all instances of one of these classes are also instances of the other. The second of these triples means that if two resources are connected by one of the properties then they are also connected by the other property. SKOS mapping properties can also be used to express alignment between concepts from different vocabularies. These will be discussed later in the section.

The process of detecting links between datasets is known as link discovery. Datasets are heterogeneous in terms of their vocabularies, format and data representation. This makes the process of link discovery far more complex. Determining whether two entities from different datasets refer to the same thing is an example of what is known as the entity resolution problem. Two types of ambiguity can make the process more challenging. Name ambiguities can result from typos, or the use of different languages or homonyms to describe a thing. Structural ambiguities result from entities having possibly inconsistent relationships to other entities in their respective datasets. These are resolved using ontology and schema matching techniques.

Mappings between datasets (either on the instance or schema level) can be discovered and expressed both manually and automatically. The manual comparison of pairs of entities from different datasets is impractical for larger datasets. SILK is a tool that can be used to discover and express relationships between datasets. This will be described in section 3.7.4.

SKOS (Simple Knowledge Organisation System) is a data model for expressing and linking Knowledge Organisation Systems such as thesauri, taxonomies, and classification schemes [12]. SKOS is expressed as RDF triples. Here we will consider three SKOS mapping properties in particular:

  • skos:closeMatch expresses that two concepts are sufficiently similar that they could possibly be used interchangeably 
  • skos:exactMatch expresses that two concepts can be used interchangeably. This property is transitive.
  • skos:relatedMatch expresses that there is an associative mapping between the two concepts

Movie 3: Dataset mappings with SKOS. In this webinar clip, Dr Barry Norton discusses how SKOS can be used to define mappings between concepts from different datasets.

We will now give some examples that make use of the following prefixes.

@prefix mo: <http://purl.org/ontology/mo/> .

@prefix dbpedia-ont: <http://dbpedia.org/ontology/> .

@prefix schema: <http://schema.org/> .

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .


The example below expresses that the concept MusicArtist from the Music Ontology is an exact match for MusicalArtist from the DBpedia ontology.

mo:MusicArtist skos:exactMatch dbpedia-ont:MusicalArtist .


Here are some other examples:

mo:MusicGroup skos:exactMatch dbpedia-ont:Band .

mo:MusicGroup skos:exactMatch schema:MusicGroup .


The following triples express that the concept SignalGroup (meaning the group of musical or audio signals captured in a recording session) from the Music Ontology is a close match for both the MusicAlbum concept from schema.org and the Album concept from the DBpedia ontology.

mo:SignalGroup skos:closeMatch schema:MusicAlbum .

mo:SignalGroup skos:closeMatch dbpedia-ont:Album .


Certain integrity conditions apply in order to ensure a consistent mapping between two vocabularies. Two concepts cannot be both a related and exact match. The properties skos:closeMatch, skos:relatedMatch and skos:exactMatch are all symmetric. Only skos:exactMatch is transitive.

Figure 6: Partial Mapping Relation diagram with integrity conditions.

3.6 Publishing Linked Data

Once the RDF dataset has been created and interlinked, the publishing process involves the following tasks:

  1. Metadata creation for describing the dataset 
  2. Making the dataset accessible
  3. Exposing the dataset in linked data repositories
  4. Validating the dataset

These will be described in the following four subsections.

3.6.1 Providing metadata about the dataset

A published RDF dataset should have metadata about itself that can be processed by search engines. This metadata allows for:

  • Efficient and effective search of datasets.
  • Selection of appropriate datasets (for consumption or interlinking).
  • Acquiring general statistics about the dataset such as its size.

The most frequently used vocabulary for describing RDF datasets is VoID (Vocabulary of Interlinked Datasets) [13]. An RDF dataset is expressed as being of the type void:Dataset.

The VoID schema covers four types of metadata:

  1. General metadata
  2. Structural metadata
  3. Descriptions of linksets
  4. Access metadata

General metadata

General metadata is intended to help users identify appropriate datasets. This contains general information such as the title, description and publication date. It also identifies contributors, creators and authors of the dataset. The VoID schema makes use of both Dublin Core and FOAF predicates. A list of general VoID properties is shown in Figure 7.

Figure 7: VoID General metadata.

The VoID general metadata also describes the licensing terms of the dataset using the dcterms:license property (see [14] for a discussion of licensing issues). The topics and domains of the data are expressed using the dcterms:subject property. The property void:feature can be used to express technical features of the dataset such as its serialisation formats (e.g. RDF/XML, Turtle).
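Putting these properties together, a VoID description for MusicBrainz might begin as follows. This is a sketch: the description text and subject URI are illustrative, not the official dataset metadata.

```turtle
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

:MusicBrainz a void:Dataset ;
    dcterms:title       "MusicBrainz" ;
    dcterms:description "An open encyclopedia of music metadata, published as linked data." ;
    dcterms:subject     <http://dbpedia.org/resource/Music> ;
    void:feature        <http://www.w3.org/ns/formats/Turtle> .
```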

Structural metadata

This provides high-level information about the internal structure of the dataset. This metadata is useful when exploring or querying the dataset and includes information about resources, vocabularies used in the dataset, statistics and examples of resources in the dataset.

In the example below a URI (which happens to represent The Beatles) is identified, using the void:exampleResource property, as an example resource in the MusicBrainz dataset.

:MusicBrainz a void:Dataset;

   void:exampleResource <http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d> .


It is also possible to specify the string that prefixes all entity URIs created in the dataset. Below, all entities in the MusicBrainz dataset are specified as beginning with the string http://musicbrainz.org/.

:MusicBrainz a void:Dataset;

   void:uriSpace "http://musicbrainz.org/" .

The property void:vocabulary identifies the most relevant vocabularies used in the dataset. It is not intended to be an exhaustive list. The example below states that the Music Ontology is a vocabulary used by the MusicBrainz dataset. This property can only be used for entire vocabularies. It cannot be used to state that a subset of the vocabulary occurs in the dataset.

:MusicBrainz a void:Dataset;

   void:vocabulary <http://purl.org/ontology/mo/> .


A further set of properties are used to express statistics about the dataset such as the number of classes, properties and triples. These statistics can also be expressed for any subset of the dataset.

Figure 8: VoID statistics about a dataset.
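For instance, the following sketch records some basic statistics for the dataset; the numbers are invented, purely for illustration:

```turtle
:MusicBrainz a void:Dataset ;
    void:triples          100000000 ;   # total number of triples
    void:classes          50 ;          # number of distinct classes used
    void:properties       300 ;         # number of distinct properties used
    void:distinctSubjects 20000000 .    # number of distinct subject nodes
```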

The void:subset property defines parts of a dataset. The example below states that MusicBrainzArtists is a subset of the MusicBrainz dataset.

:MusicBrainz a void:Dataset;

  void:subset :MusicBrainzArtists .


The properties void:classPartition and void:propertyPartition are subproperties of void:subset. A subset that is the void:classPartition of another dataset contains only triples that describe entities that are individuals of this class. A subset that is the void:propertyPartition of another dataset contains only triples using that property as the predicate. A class partition has exactly one void:class property. Similarly, a property partition has exactly one void:property property. The example below asserts that there is a class partition of MusicBrainz containing triples describing individuals of mo:Release. It also asserts that there is a property partition that contains triples using mo:member as the predicate.

:MusicBrainz a void:Dataset;

  void:classPartition [ void:class mo:Release ] ;

  void:propertyPartition [ void:property mo:member ] .


Descriptions of linksets

A linkset is a set of RDF triples in which the subject and object are described in different datasets. A linkset is therefore a collection of links between two datasets. The RDF links in a linkset often use the owl:sameAs predicate to link the two datasets. In the example below, LS1 is declared as a subset of the DS1 dataset. LS1 is a linkset using the owl:sameAs predicate. The linkset declares sameAs relations to entities in another dataset (DS2).

Figure 9: A collection of links between two datasets. Based on [15].

In the MusicBrainz example below a class partition named MBArtists is defined. This is a linkset that has skos:exactMatch links between MusicBrainz and DBpedia.

@prefix void: <http://rdfs.org/ns/void#> .

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .


:MusicBrainz a void:Dataset .

:DBpedia a void:Dataset .


:MusicBrainz void:classPartition :MBArtists .

:MBArtists void:class mo:MusicArtist .


:MBArtists a void:Linkset;

     void:linkPredicate skos:exactMatch;

     void:target :MusicBrainz, :DBpedia .


Access metadata

The VoID schema can also be used to describe methods for accessing the dataset, for example the location of a URI where entities in the dataset can be inspected, a SPARQL endpoint or file containing the data. The predicate void:rootResource can be used to express the top terms in a hierarchically structured dataset.

Figure 10: Methods for accessing metadata.
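A sketch of such access metadata follows; the SPARQL endpoint and data dump locations use hypothetical example.org URLs, while the homepage is the real MusicBrainz site:

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:MusicBrainz a void:Dataset ;
    foaf:homepage       <http://musicbrainz.org/> ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:dataDump       <http://example.org/dumps/musicbrainz.nt.gz> .
```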

3.6.2 Providing access to the dataset

A dataset can be accessed via four different mechanisms:

  • Dereferencing HTTP URIs
  • RDFa
  • SPARQL endpoint
  • RDF dump

These will be described below.

As we saw earlier, the first two linked data principles state that URIs should be used as names for things and that HTTP URIs should be used so that users can look up those names. Dereferencing is the process of looking up the definition of a HTTP URI. 

In MusicBrainz, the URI http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d is used to name The Beatles. It is not possible to send The Beatles over HTTP. However, if you access this URI you will be forwarded to a document at some other location that can provide you with information about The Beatles. The HTTP conversation goes as follows:

  1. You request data from a URI used to name a thing such as The Beatles. You may request data in a particular format such as HTML or RDF/XML.
  2. The server responds with a 303 status (meaning redirect) and another location from which the data in the preferred format can be accessed.
  3. You request data from the location to which you were redirected.
  4. The server responds with a 200 status (meaning your request has been successful) and a document in the preferred format.
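The four steps above can be sketched as an HTTP trace. The URIs are hypothetical example.org stand-ins, not real dataset locations:

```
# Step 1: request the URI that names the thing, asking for RDF/XML
GET /resource/TheBeatles HTTP/1.1
Host: example.org
Accept: application/rdf+xml

# Step 2: the server redirects to a document about the thing
HTTP/1.1 303 See Other
Location: http://example.org/data/TheBeatles.rdf

# Step 3: request the document at the new location
GET /data/TheBeatles.rdf HTTP/1.1
Host: example.org
Accept: application/rdf+xml

# Step 4: the server returns the document in the requested format
HTTP/1.1 200 OK
Content-Type: application/rdf+xml
```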

If you request data about The Beatles in HTML format you will be redirected to a web page about the band. If you request the data in RDF/XML format then you will be redirected to an alternative document containing the RDF. If you are providing rather than requesting the data then you need to decide which RDF triples should be returned from your dataset in response to dereferencing an HTTP URI about an entity (such as The Beatles) which cannot itself be returned. Guidance on what to return can be found in [16]. This can be summarised as follows:

  • Immediate description: All of the triples in the dataset in which the originally requested URI was the subject. 
  • Backlinks: All triples in which the URI is the object. This allows browsers or crawlers to traverse the data in two directions. 
  • Related descriptions: Triples not directly linked to the resource but likely to be of interest. For example, information about the author could be sent with information about a book, as this is likely to be of interest.
  • Metadata: Information about the data, along the lines described in section 3.6.1, such as the author of the data and licensing information.
  • Syntax: There are a number of ways of serializing RDF triples. The data source may be able to provide RDF in more than one format, for example as Turtle as well as RDF/XML.

RDFa stands for “RDF in attributes”. RDFa is an extension to HTML5 for embedding RDF within HTML documents. The advantage of RDFa is that a single document can be used for both human and machine consumption of the data. A human accessing the data via a web browser need not be aware that an alternative RDF representation is embedded within the page. RDFa can be thought of as a bridge between the Web of Data and the Web of (human readable) Documents.

Figure 11 lists the main attributes of RDFa. The about attribute specifies the subject that the metadata is about. The typeof attribute specifies the rdf:type of the subject. The property attribute specifies the type of relationship between the subject and another resource. The vocab and prefix attributes specify the default vocabulary and prefix mappings.

Figure 11: RDFa attributes.

Below we can see a portion of HTML+RDFa contained in an HTML <div> element. The subject of this fragment of RDFa is specified using the about attribute. Here the subject is the MusicBrainz URI for The Beatles. On the following line we see the typeof attribute, which is used to specify the rdf:type of the subject. The type of The Beatles is specified as the MusicGroup concept from the Music Ontology.

<div class="artistheader"
     about="http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d"
     typeof="http://purl.org/ontology/mo/MusicGroup">


Below we can see this RDF triple extracted from the HTML+RDFa.

<http://musicbrainz.org/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/mo/MusicGroup> .
Figure 12 shows an example of a page in HTML+RDFa format: the MusicBrainz page about The Beatles. As mentioned earlier, the human reader need not be aware of the RDF embedded within the page.

Figure 12: MusicBrainz page in HTML+RDFa format.

An RDFa distiller and parser can be used to extract the RDF representation. In Figure 13 the URL of the MusicBrainz page about The Beatles has been entered into the distiller's form.

Figure 13: Extracting RDF from a MusicBrainz page in HTML+RDFa format.

Figure 14 shows a fragment of the RDF contained in the page, represented in the N-Triples format. 

Figure 14: Extracted RDF in N-Triples format.

An RDF dump is a file that contains the whole or some subset of an RDF dataset. A dataset may be split over several data dumps. An RDF dump may use one of a number of formats. RDF/XML encodes RDF in XML syntax. N-Triples is a subset of the Turtle format in which the RDF is represented as a list of dot-separated triples. The format N-Quads is an extension of N-Triples in which a fourth element specifies the context or named graph of each triple. A site that maintains a list of available RDF data dumps can be found in [17].
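The difference between the two formats can be seen by expressing the same statement in each (the example.org URIs are hypothetical):

```
# N-Triples: one complete, dot-terminated triple per line
<http://example.org/artist/1> <http://xmlns.com/foaf/0.1/name> "The Beatles" .

# N-Quads: the same triple with a fourth element naming its graph
<http://example.org/artist/1> <http://xmlns.com/foaf/0.1/name> "The Beatles" <http://example.org/graphs/artists> .
```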

SPARQL is a language that can be used to query an RDF dataset. A SPARQL endpoint is a service that processes SPARQL queries and returns results. SPARQL queries can be used to retrieve particular subsets of the dataset. See chapter 2 for more information on SPARQL. Lists of publicly available SPARQL endpoints are maintained at [18] and [19].
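For example, a query such as the following could be sent to an endpoint to retrieve a small sample of the dataset. This assumes the endpoint's data uses the Music Ontology and FOAF vocabularies:

```sparql
PREFIX mo:   <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?artist ?name
WHERE {
  ?artist a mo:MusicArtist ;
          foaf:name ?name .
}
LIMIT 10
```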

3.6.3 Exposing the dataset in linked data repositories

Data catalogs, markets or repositories are platforms that provide access to a wide range of contributed datasets. They assist data consumers in finding and accessing new datasets. Catalogs generally offer relevant metadata about the dataset. The open source platform CKAN can be used for managing and providing access to a large number of datasets. CKAN would be recommended for a large institution that wanted to manage access to a number of datasets. The Data Hub is a public linked data catalog to which datasets can be contributed. CKAN and Data Hub will be described in more detail in section 3.7.

3.6.4 Validating the dataset

There are three different ways in which an RDF dataset can be validated. The first set of services (labelled accessibility in Figure 15) checks that URIs are dereferenced correctly, using the HTTP client-server dialogue described in section 3.6.2. A second set (labelled parsing and syntax) is used to validate the syntax of the RDF that is returned. Separate services are available for validating RDF/XML and RDFa markup. Finally, RDF:Alerts is a general-purpose validation service for checking syntax and other problems such as undefined properties and classes and datatype errors.

Figure 15: Ways of validating an RDF dataset.

3.7 Providing Linked Data: Checklist

Creating Linked Data

  • Have all the relevant entities/concepts been effectively extracted from the raw data?
  • Are all of the URIs you have created dereferenceable?
  • Are you reusing terms from widely accepted vocabularies, inventing terms only when a suitable one does not already exist?

Interlinking Linked Data

  • Is the dataset linked to other RDF datasets?
  • Are the created vocabulary terms linked to other vocabularies?

Publishing Linked Data

  • Do you provide dataset metadata?
  • Do you provide information about licensing?
  • Do you provide additional access methods such as data dump or SPARQL endpoint?
  • Is the dataset registered in LD catalogs, using CKAN within an organisation or publicly via the Data Hub?


Part II: Linked Data Catalogs and Tools for Providing Linked Data

3.8 The Web and Linked Data

In this section we will look in more detail at linked data catalogs and also how linked data can be used to support search.

3.8.1 Linked data catalogs

As we saw in the previous section, data catalogs are platforms that provide access to a wide range of datasets from different domains. Below we describe CKAN that can be used to build data catalogs and The Data Hub, which is a public catalog of datasets.

CKAN [20] is an open source platform for developing a catalog for a number of datasets. CKAN may be used by an organisation for internally managing their datasets. These datasets need not be publicly available as part of the Linking Open Data cloud. CKAN features a number of tools for data publishers to support:

  • Data harvesting
  • Creation of metadata
  • Access mechanisms to the dataset
  • Updating the dataset
  • Monitoring the access to the dataset

Figure 16. CKAN.

Movie 4: The CKAN platform. In this webinar clip, Dr Barry Norton presents the CKAN platform, the Data Hub and publishing to the Linked Open Data cloud.

CKAN has a schema for describing contributed datasets. This is similar to the VoID schema described in section 3.6.1.

Figure 17: Overview of the CKAN portal (from [20]).

The Data Hub [21] is a community-run data catalog that contains more than 5,000 datasets. The Data Hub is implemented using the CKAN platform and can be used to find public datasets. In The Data Hub, datasets can be organised into groups, each having their own user permissions. Groups may be topic based (e.g. archaeological datasets), or may collect datasets in a particular language or originating from a certain country.

Figure 18: The Data Hub.

The group “Linking Open Data Cloud” catalogs datasets that are available on the Web as Linked Data.

Figure 19: The group Linking Open Data Cloud on The Data Hub.

Every bubble in the Linking Open Data cloud (shown in Figure 20) is registered with the Data Hub. For a dataset to be included in this cloud it must satisfy the following criteria:

  • The dataset must follow the Linked Data principles (see section 3.3)
  • The dataset must contain at least 1,000 RDF triples
  • The dataset must contain at least 50 RDF links to a dataset that is already in the diagram
  • Access to the dataset must be provided

Once these criteria are met, the data publisher must add the dataset to the Data Hub catalog, and contact the administrators of the Linking Open Data Cloud group.

Figure 20: Linking Open Data Cloud.

3.8.2 Linked data and commercial search engines

Search engines collect information about web resources in order to enrich how search results are displayed. Snippets are the few lines of text that appear underneath a search result link to give the user a better sense of what can be found on that page and how it relates to the query. Rich snippets provide more detailed information, by understanding something about the content of the page featured in the search results. For example, a rich snippet of a restaurant might show a customer review or menu information. A rich snippet for a music album might provide a track listing with a link to each individual track.

Figure 21: Example of a Rich Snippet.

Rich snippets are created from the structured data detected in a web page. The structured data found in a web page may be represented using RDFa as the markup format and schema.org as the vocabulary. Schema.org is a collaboration between Google, Microsoft and Yahoo! to develop a markup schema that can be used by search engines to provide richer results. Schema.org provides a collection of schemas for describing different types of resource such as:

  • Creative works: Book, movie, music recording, … 
  • Embedded non-text objects
  • Event
  • Health and medical types
  • Organization
  • Place, local business, restaurant
  • Product, offer, aggregate offer 
  • Review, aggregate rating

Data represented using schema.org is recognized by a number of search engines such as Bing, Google, Yahoo! and Yandex. schema.org also offers an extension mechanism that a publisher can use to add more specialised concepts to the vocabularies. The aim of schema.org is not to provide a top-level ontology; rather it puts in place core schemas appropriate for many common situations, which can also be extended to describe things in more detail.
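While this chapter shows structured data embedded as RDFa, schema.org descriptions can also be serialised as JSON-LD. As a sketch, the Python below builds a JSON-LD description of a music album using only the standard json module. MusicAlbum, MusicGroup, MusicRecording, byArtist and track are genuine schema.org terms; the album details themselves are invented for illustration.

```python
import json

# A minimal schema.org description of a music album as JSON-LD.
# The album and track details are illustrative values only.
album = {
    "@context": "https://schema.org",
    "@type": "MusicAlbum",
    "name": "Abbey Road",
    "byArtist": {"@type": "MusicGroup", "name": "The Beatles"},
    "track": [
        {"@type": "MusicRecording", "name": "Come Together", "position": 1},
        {"@type": "MusicRecording", "name": "Something", "position": 2},
    ],
}

# Serialise to the JSON-LD string a publisher would embed in a page,
# typically inside a <script type="application/ld+json"> element.
markup = json.dumps(album, indent=2)
print(markup)
```

A search engine that recognises this markup could then render a rich snippet with the artist and track listing.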

Google Knowledge Graph uses structured data from Freebase to enrich search results. For example, a search for the Beatles could include data about the band and its membership. In the snapshot of Figure 3.22 this can be seen in what is called a disambiguation pane to the right of the search results. This additional information can help the user to disambiguate between alternative meanings of their search terms. Google Knowledge Graph can also allow users to directly access related web pages that would otherwise be one or more navigation steps away from their search results. For example, a search for a Beatles album could provide links giving direct access to the tracks contained on the album.

Figure 22: Google search for The Beatles showing a disambiguation pane.

Bing is now providing similar functionality to Google Knowledge Graph but built on the Trinity graph engine. A Bing search for “leopard” would produce structured data and disambiguation as shown in Figure 23.

Figure 23: Bing search for Leopard showing disambiguation.

The above examples use data graphs that connect entities to enrich search results. The Open Graph Protocol, originally developed by Facebook, can be used to define a social graph between people, and between people and objects. The Open Graph Protocol can be used to express friend relationships between people and also relationships between people and the things that they like: music they listen to, books they have read, films they have watched. These links between an object and a person are created by clicking Facebook “like” buttons that publishers can add to websites outside Facebook's domain. RDFa embedded in the page provides a formal description of the “liked” item. The Open Graph Protocol supports the description of several domains including music, video, articles, books, websites and user profiles.
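Open Graph Protocol data takes the form of RDFa-style meta elements in a page's head. As a sketch, the Python below generates such tags: og:title, og:type, og:url and og:image are the protocol's core properties, and music.album is one of its music object types, while the page and image URLs are invented for illustration.

```python
from html import escape

# Open Graph properties for a hypothetical album page.
# og:title, og:type, og:url and og:image are the required core
# properties of the protocol; "music.album" is one of its music types.
properties = {
    "og:title": "Abbey Road",
    "og:type": "music.album",
    "og:url": "http://example.org/albums/abbey-road",
    "og:image": "http://example.org/covers/abbey-road.jpg",
}

# Render the <meta> elements a publisher would place in the page head.
meta_tags = "\n".join(
    '<meta property="{}" content="{}" />'.format(escape(p), escape(v))
    for p, v in properties.items()
)
print(meta_tags)
```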

Figure 24: Example Open Graph relationships (from [22]).

The Open Graph Protocol can be used to express different types of actions for different types of content. For example, a user can express that they want to watch, have watched or give a rating for a movie or TV programme. For a game, a user may record an achievement or high score.

This social graph of people and objects can then be used in Facebook Graph Search. This allows searching not only for objects, but also for objects liked by friends that have other properties, such as living in a particular location.

Figure 25: Facebook Graph Search.

3.9 Tools for Providing Linked Data

In this section we will look at some of the tools that can assist with the creation and interlinking of linked data, introduced in sections 3.3 and 3.4.

  • Extracting data from spreadsheets: OpenRefine.
  • Extracting data from RDBMS: R2RML.
  • Extracting data from text: OpenCalais, DBpedia Spotlight, Zemanta, GATE.
  • Interlinking datasets: Silk.

3.9.1 Extracting data from Spreadsheets: OpenRefine

First, we will look at how RDF data can be created from tabular data such as that found in spreadsheets. This relates to the part of the architecture shown in Figure 3.26. Tabular data can be represented in a number of common formats. CSV (Comma Separated Values) and TSV (Tab Separated Values) are two common plain text formats for representing tables. Tables can also be represented in HTML and in spreadsheet applications. Tabular data can also be represented in JSON (JavaScript Object Notation), originally developed for use with the JavaScript language and now a common data interchange format on the Web.

The transformation of tabular data to an RDF dataset involves mapping items mentioned in tables to existing vocabularies, interlinking to entities in other datasets and, to some extent, data cleansing, where alternative or mistyped names for items mentioned in the tables need to be handled.
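As a small illustration of this kind of cleansing, the Python sketch below (standard library only) parses sales rows like those shown later in this section and normalises values such as "250 million" into plain integers suitable for typed RDF literals.

```python
import csv
import io

# Sample rows in the same shape as the CSV sales data in this section.
csv_data = """The Beatles, 250 million
Elvis Presley, 203.3 million
Michael Jackson, 157.4 million"""

def to_sales(cell):
    # "250 million" -> 250000000; round() avoids floating-point
    # error on decimal values such as "203.3"
    number = cell.strip().split()[0]
    return round(float(number) * 1_000_000)

rows = [(artist.strip(), to_sales(sales))
        for artist, sales in csv.reader(io.StringIO(csv_data))]
print(rows)
```

The resulting integers can then be emitted as xsd:integer literals, as OpenRefine does in the transformation described below.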

Figure 26: Integrating chart data.

Below we can see an example of data represented in CSV format. This shows sales data for a number of music artists. Each row is divided into two cells by the comma. So, for example, the first row tells us that The Beatles have sold 250 million records. Elvis Presley is not far behind with 203.3 million records.

The Beatles, 250 million

Elvis Presley, 203.3 million

Michael Jackson, 157.4 million

Madonna, 160.1 million

Led Zeppelin, 135.5 million

Queen, 90.5 million


Below we can see the first row of sales data represented in JSON format. Here the rank order of the artists based on sales is made explicit as an additional field in the data. The cells in the data also have labels, reflecting the column labels you might find in an HTML table, and which are left implicit in the CSV format.

{
  "artist": {
    "class": "artist",
    "name": "The Beatles"
  },
  "rank": 1,
  "value": "250 million"
}

Finally, in Figure 27 we see sales data represented as an HTML table. Here we have explicit column labels and also additional information on the active period and first release of each artist.

Figure 27: List of best-selling music artists (from [23]).

We will now look at how OpenRefine [24] can be used to translate tabular formats to RDF. The OpenRefine tool was originally developed by the company Metaweb as a way of extracting data for Freebase, a vast collaborative knowledge base of structured data. The tool later became Google Refine and was renamed OpenRefine when released as an open source project. When using OpenRefine, the first step is to create a project and import tabular data in a format such as Microsoft Excel or CSV.

Figure 28: Importing data to OpenRefine.

Movie 5: Screencast of OpenRefine.

As illustrated in Figure 3.29, OpenRefine assists us in transforming a tabular format such as CSV to RDF data. A number of processes are involved in this transformation. First, we can see that the serialisation has changed from comma-separated rows of data to RDF data in Turtle format. Second, the artists listed in the first column have been transformed from text strings to MusicBrainz URIs. Third, the sales figures have been transformed to a number with an integer data type. Finally, we have a relation (totalSales) linking the artist to the number. We will now look step-by-step at how this transformation is carried out.

Figure 29: Translating CSV data to RDF.

The first step involves defining the rows and columns of the dataset. This can involve deleting columns not required in the RDF data and splitting columns based on specified conditions. To help with this process OpenRefine provides a powerful expression language, the OpenRefine Expression Language, with a number of functions for dealing with different types of data such as Booleans, strings and mathematical expressions. The language is still often known by the acronym GREL, dating back to when it was called the Google Refine Expression Language.

In the example of Figure 3.30, we used GREL to split the word “million” from the number in Column 2 to create Columns 2 2 and 2 3. We then multiply Column 2 2 by the number 1,000,000 to create the Total Sales column. 

Figure 30: Data transformation in OpenRefine.

We then need to map the artists listed in Column 1 to MusicBrainz URIs. For this we can use the RDF Refine plugin developed by DERI. The process of mapping between multiple representations of the same thing (in this case artists represented by a string and by a MusicBrainz URI) is known as entity reconciliation. Entity reconciliation can be carried out against the SPARQL endpoint of an RDF dataset. Textual names for things (such as The Beatles) are matched against the text labels associated with the entities in the dataset. In the figure the artist names have been reconciled with the MusicBrainz URIs listed in a column headed musicbrainz-id.

Figure 31: Entity reconciliation in OpenRefine.
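At its core, entity reconciliation is fuzzy matching of a textual name against the labels attached to candidate entities. The sketch below illustrates the idea with Python's difflib; the label-to-URI table stands in for a SPARQL endpoint, and the URIs are placeholders rather than real MusicBrainz identifiers.

```python
import difflib

# A toy label -> URI lookup standing in for a SPARQL endpoint's labels.
# The URIs are placeholders, not real MusicBrainz identifiers.
labels = {
    "The Beatles": "http://example.org/artist/beatles",
    "Elvis Presley": "http://example.org/artist/presley",
    "Madonna": "http://example.org/artist/madonna",
}

def reconcile(name, threshold=0.8):
    """Return the URI whose label best matches `name`, or None."""
    best_uri, best_score = None, 0.0
    for label, uri in labels.items():
        # Compare case-insensitively, scoring similarity from 0 to 1.
        score = difflib.SequenceMatcher(None, name.lower(), label.lower()).ratio()
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri if best_score >= threshold else None

print(reconcile("the beatles"))   # exact match after lowercasing
print(reconcile("Elvis Presly"))  # tolerates a small typo
```

A real reconciliation service also uses type constraints (for example, matching only against mo:MusicArtist entities) to narrow the candidate set.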

The data is then transformed into RDF triples. In the example of Figure 3.32, we have specified that a triple should be generated for each row in which the MusicBrainz URI is connected to the Total Sales data by the totalSales property. The RDF Preview tab shows what the first 10 rows of data will look like as RDF triples.

Figure 32: Previewing RDF triples in OpenRefine.

3.9.2 Extracting data from relational databases: R2RML

For data stored in multiple tables in a relational database we need a more expressive way of defining mappings to an RDF dataset. R2RML (Relational Database to RDF Mapping Language) [4] can be used to express mappings between a relational database and RDF that can then be handled by an R2RML engine. R2RML can be used to publish RDF from relational databases in two ways. First, the data could be transformed in batch as an RDF dump. This dump could then be loaded into an RDF triplestore, whose SPARQL endpoint could then be used to run queries against the RDF dataset. Second, the R2RML engine could translate SPARQL queries on-the-fly into SQL queries to be run against the relational database.

Figure 33: Integrating relational databases and linked data.

In 2012, the W3C made two recommendations for mapping between relational databases and RDF [25]. The first defines a direct mapping between the database and RDF. This does not allow for vocabulary mapping or interlinking, just the publishing of database content in an RDF format without any additional transformation. The direct mapping is not relevant here, as we wish to map items in the database (for example music artists) to existing URIs even though those URIs are not included in the database itself.

The second recommendation is R2RML, which provides a means of assigning entities in the database to classes and mapping those entities into subject-predicate-object triples. It also allows the construction of new URIs for entities, interlinking them with the rest of the RDF graph.

Figure 3.34 shows the core database tables and relationships in the MusicBrainz Next Generation Schema (NGS) that was released in 2011. In the diagram, the Primary Key (PK) of a table indicates that each entry in that column can be used to uniquely reference a row. A Foreign Key (FK) uniquely identifies a row in another table. MusicBrainz NGS provides a more complex way of modelling musical releases. For example, before NGS it was not possible to relate together multiple releases of the same album at different times and in different territories.

The NGS defines an Artist Credit that can be used to model variations in artist name. This can describe multiple names for an individual and different names for various groups of artists. For example, the song “Get Back” is credited to “The Beatles with Billy Preston” rather than “The Beatles”.  This would be difficult to represent in MusicBrainz without NGS.

Another major change in MusicBrainz NGS is how musical releases are modelled. A Release Group is an abstract “album” entity. A Release is a particular product you can purchase. A Release has a release date in a particular country and on a particular label. Many releases can belong to the same Release Group. A Release may have more than one Medium (such as MP3, CD or vinyl). On each Medium, the Release will have a tracklist comprising a number of tracks. Each track has a Recording that is a unique audio version of the track. This could be used, for example, to distinguish between the single and album versions of a track. Artist Credit can be assigned to each individual track as well as the Recording, Release and Release Group. Artist Credit can also be assigned to a Work, which represents the composed piece of music as distinct from its many recordings.
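The release model just described can be sketched as a toy object model. The Python dataclasses below follow the NGS entity names (Release Group, Release, Medium, Track); the field choices and sample values are illustrative, not the actual NGS table definitions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Track:
    title: str
    artist_credit: str  # per-track credit, e.g. "The Beatles with Billy Preston"

@dataclass
class Medium:
    format: str                                   # e.g. "CD", "vinyl"
    tracklist: List[Track] = field(default_factory=list)

@dataclass
class Release:
    country: str                                  # release territory
    label: str
    media: List[Medium] = field(default_factory=list)

@dataclass
class ReleaseGroup:
    title: str                                    # the abstract "album"
    artist_credit: str
    releases: List[Release] = field(default_factory=list)

# A release group with one purchasable release on one medium.
album = ReleaseGroup("Let It Be", "The Beatles")
uk_cd = Release("GB", "Apple",
                [Medium("CD", [Track("Get Back",
                                     "The Beatles with Billy Preston")])])
album.releases.append(uk_cd)
```

Note how the per-track Artist Credit can differ from the Release Group's credit, which is exactly the “Get Back” case described above.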

Figure 34: Relational Schema for the Music Database.

In Figure 3.35 we can see a few core classes in the Music Ontology to which we can map when generating RDF data from data represented in the MusicBrainz NGS relational database. The Music Ontology models a MusicArtist as composing a Composition, which is then produced as a MusicalWork. A Performance of a MusicalWork can be recorded as a Signal that can be produced from Recordings of that Performance.

Figure 35: The Music Ontology (from [26]).

Mapping a database table to a class in an ontology is relatively straightforward. Here we will map the Artist table in MusicBrainz NGS to the MusicArtist class in the Music Ontology. Mappings in R2RML are mostly specified as instances of what are referred to as TriplesMaps. The TriplesMap specified below has the identifier lb:Artist. The logicalTable is the source data table from which the triples are derived; in this case it is the table named “artist” in the relational database. As we shall see in the next example, the logicalTable can also be the result of an SQL query across a number of tables rather than a single table in the database.

The subjectMap defines the subject of the triple. The subject is constructed as a MusicBrainz URI, with the entry from the gid column of the Artist table inserted before the # symbol. The specified predicate is mo:musicbrainz_guid, which links a MusicBrainz URI to its ID in the form of a string. The object of the triple is also the entry from the gid column, but represented as a string.

lb:Artist a rr:TriplesMap ;
  rr:logicalTable [rr:tableName "artist"] ;
  rr:subjectMap
    [rr:class mo:MusicArtist ;
     rr:template "{gid}#_"] ;
  rr:predicateObjectMap
    [rr:predicate mo:musicbrainz_guid ;
     rr:objectMap [rr:column "gid" ;
                   rr:datatype xsd:string]] .


Database columns can also be mapped to properties. In the example below we supply a name property to each of the MusicBrainz URIs generated in the previous example. In this case the logicalTable used in the TriplesMap is an SQL query that returns a table of results. This query joins two tables to link the gid of an artist to the artist’s name. The subject of the triple is the same as the subject specified using the TriplesMap from the previous example. The predicate is foaf:name. The object is the name column from the logicalTable.

lb:artist_name a rr:TriplesMap ;
  rr:logicalTable [rr:sqlQuery
    """SELECT artist.gid, artist_name.name
       FROM artist
         INNER JOIN artist_name ON = artist_name.id"""] ;
  rr:subjectMap lb:sm_artist ;
  rr:predicateObjectMap
    [rr:predicate foaf:name ;
     rr:objectMap [rr:column "name"]] .


MusicBrainz Next Generation Schema (NGS) also provides Advanced Relationships as a way of representing various inter-relationships between key MusicBrainz entities such as Artist, Release Group and Track. The table l_artist_artist is used to specify relationships between artists; each pairing of artists is represented as a row in this table and refers to a Link of a particular type. One link type is member_of, which specifies a relation between an artist and a band of which they are a member.

Figure 36: NGS Advanced Relations.

The R2RML TriplesMap below shows how we would specify that an artist is a member of a band. Here the logicalTable is the result of a more complex query that associates artists with a band. The Music Ontology member_of predicate is used to associate an artist with the MusicBrainz URI identifying the band.

lb:artist_member a rr:TriplesMap ;
  rr:logicalTable [rr:sqlQuery
    """SELECT a1.gid, a2.gid AS band
       FROM artist a1
         INNER JOIN l_artist_artist ON = l_artist_artist.entity0
         INNER JOIN link ON = l_artist_artist.link
         INNER JOIN link_type ON link.link_type = link_type.id
         INNER JOIN artist a2 ON l_artist_artist.entity1 =
       WHERE link_type.gid='5be4c609-9afa-4ea0-910b-12ffb71e3821'
         AND link.ended=FALSE"""] ;
  rr:subjectMap lb:sm_artist ;
  rr:predicateObjectMap
    [rr:predicate mo:member_of ;
     rr:objectMap [rr:template "{band}#_" ;
                   rr:termType rr:IRI]] .

Movie 6: Screencast of R2RML.

3.9.3 Extracting data from text: DBpedia Spotlight, Zemanta and OBIE 

The previous tools worked on data that was already in some tabular or relational structure. Work carried out by the tools largely involved transforming this existing structure to a triple structure as well as some mapping and interlinking. Text is more open and ambiguous and involves more than transformation. As we shall see later, text extraction is only correct to some level of precision and recall.

OpenCalais [27] can be used to automatically identify entities from text. OpenCalais uses natural language processing and machine learning techniques to identify entities, facts and events. OpenCalais is difficult to customise and has variable domain-specific coverage.

Figure 37: OpenCalais.

DBpedia Spotlight [28] can be used to identify named entities in text and associate these with DBpedia URIs. In the snapshot of Figure 3.38, recognised entities in the submitted text have been hyperlinked to their DBpedia URIs. DBpedia Spotlight is not easy to customise or extend and is currently only available in English.

Figure 38: DBpedia Spotlight.

Zemanta [29] is another general-purpose semantic annotation tool. Zemanta is used by bloggers and other content publishers to find links to relevant articles and media. Best results require bespoke customization.

Figure 39: Zemanta.

GATE (General Architecture for Text Engineering) [30] is an open-source framework for text engineering. GATE started in 1996 and has a large developer community. GATE can be more readily customized for text annotation in different domains and for different purposes, and is used worldwide to build bespoke solutions by organisations including the Press Association and the National Archives. Information extraction is supported in many languages. GATE can also parse text as well as recognise entities and can therefore be used to identify entities depending on their function in the sentence. For example, GATE could be used to extract an entity only when it is used as the noun phrase, rather than the verb phrase, of the sentence.

LODIE [31] is an application built with GATE using DBpedia. LODIE uses Wikipedia anchor texts, disambiguation pages and redirect pages to help find alternative versions of things.

Figure 40: LODIE.

Precision and recall measures can be used to compare text annotation tools. Precision and recall are both values from 0 to 1. Precision indicates what proportion of annotations are correct. Recall indicates what proportion of possible correct annotations in the text are identified by the tool.
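Treating the gold-standard annotations and a tool's output as sets, precision and recall can be computed as follows (the entity names are invented examples):

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted annotations against a gold set."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Toy example: 4 entities really in the text, 3 proposed by the tool.
gold = {"The Beatles", "Abbey Road", "London", "EMI"}
predicted = {"The Beatles", "Abbey Road", "Apple"}

p, r = precision_recall(predicted, gold)
print(p, r)  # 2 of 3 predictions are correct; 2 of 4 gold entities found
```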

Figure 3.41 shows precision and recall figures for these tools. Precision and recall figures are shown in pairs, precision on the left, recall on the right separated by a slash. Precision and recall are shown for types of entity (person, location and organisation) as well as in total. We see that DBpedia Spotlight has relatively good precision but relatively low recall, identifying 39% of entities in the text. Zemanta has similar precision but higher recall. LODIE has the highest recall but lower precision. 

Good results can also be achieved by combining the methods. Only annotating entities suggested by both Zemanta and LODIE (i.e. the intersection of Zemanta and LODIE) gives very high precision. The union of Zemanta and LODIE gives high recall but lower precision.

Figure 41: Comparison of DBpedia Spotlight, Zemanta and LODIE.

An alternative to the generic services provided by DBpedia Spotlight and Zemanta is to build a GATE processing pipeline specifically for your domain. For this we take an RDF dataset and use it to produce what is called a GATE Gazetteer: a list of entities in a domain together with the text labels used to refer to those entities. We can produce a gazetteer using the RDF data produced from the R2RML transformation of the MusicBrainz NGS relational database (see section 3.9.2). A SPARQL endpoint to this data can be used to populate a custom gazetteer for the music domain. For example, the query in Figure 3.42 returns solo artists and music groups and their foaf:name. It also returns albums (represented by the SignalGroup class in the Music Ontology) and their dc:title. This provides a vocabulary of artists and albums with associated labels that can be used in the gazetteer.
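A gazetteer is essentially a mapping from surface labels to the entities they denote. The sketch below builds one from mocked SPARQL SELECT results and uses it to annotate a sentence; both the URIs and the rows standing in for query results are placeholders, not real endpoint output.

```python
# Mocked rows as they might come back from a SPARQL SELECT over the
# music dataset: (entity URI, label). Placeholders for illustration.
results = [
    ("http://example.org/artist/beatles", "The Beatles"),
    ("http://example.org/artist/beatles", "Beatles"),      # alternative label
    ("http://example.org/album/abbeyroad", "Abbey Road"),
]

# The gazetteer maps each surface label to the entity it refers to.
gazetteer = {}
for uri, label in results:
    gazetteer[label.lower()] = uri

def annotate(text):
    """Return (label, uri) pairs for gazetteer labels found in the text."""
    found = []
    lowered = text.lower()
    for label, uri in gazetteer.items():
        if label in lowered:
            found.append((label, uri))
    return found

print(annotate("Abbey Road was recorded by the Beatles in 1969."))
```

A real GATE gazetteer additionally handles tokenisation and overlapping matches, but the lookup principle is the same.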

Figure 42: Producing a GATE Gazetteer.

A GATE pipeline can be run locally or uploaded to the GATE cloud. Once set up, text can be submitted and then annotated using the MusicBrainz data. The annotated text can then be output in a format such as RDFa.

Figure 43: GATE Cloud.

3.9.4 Interlinking datasets: SILK

As mentioned in section 3.4, manually interlinking large datasets is not feasible. SILK [32] is a tool that has been developed to support the interlinking of datasets. In our case we may wish to define links between artists mentioned in MusicBrainz and the same entities in DBpedia. This process fits specifically in the interlinking phase of the diagram of Figure 44.

Figure 44: Interlinking datasets with SILK.

SILK is an open source tool for discovering RDF links between data items within different Linked Data sources. The Silk Link Specification Language (Silk-LSL) is used to define rules for linking entities from two different datasets. For example, a rule may express that if two entities belong to specified classes and have matching labels then they should be linked by a certain property. This property could be owl:sameAs or some other property such as skos:closeMatch (see section 3.4). 

SILK can be run in different configurations: locally on a single machine, on a server, or distributed across a cluster. The SILK workflow is shown in Figure 3.45. The first step is to select the two datasets to be linked; generally, one would be your own and the other a dataset that includes some of the same entities under different URIs. In the second step, we provide access to the two datasets, either by loading an RDF dump or by pointing to a SPARQL endpoint, and specify the types of entities to be linked. In the third step, we express the linkage rules in Silk-LSL. The discovered links can then be published as a linkset (see section 3.5.1) with your dataset.

Figure 45: SILK workflow based on [33].

The rules for comparing two entities can consider not only the two entities themselves but also additional data items found in the graph around each of them. For example, a rule may compare the rdfs:label of one entity with the foaf:name of another. The paths from the compared entities to these additional data items are specified as RDF paths. Different transformations can also be performed on the compared data items. For example, if they are strings (such as labels or names) then both may be transformed into lower case, to prevent a mix of lower and upper case leading to a mismatch. The linkage rules also define the comparators used to compute the similarity of the two data items. When comparing two strings, an exact match may be required. Alternatively, similarity may be computed as a Levenshtein edit distance: the number of single-character changes that would need to be made in order to turn one string into the other. This provides a way of matching data items that may contain typos. Similarity metrics can also be used for other data types such as dates. Finally, aggregations can be computed from data items associated with the entity. For example, two potentially matching albums could be compared in terms of the number of tracks that they contain. If the number of tracks is equal then this is further evidence that the two entities refer to the same album.
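A minimal version of such a comparator can be written directly. The function below computes the Levenshtein edit distance and then, as a linkage rule might, lowercases both values and scales the distance into a 0-1 similarity score; the normalisation by the longer string's length is one common choice, not SILK's exact formula.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def label_similarity(name, label):
    """Lowercase both values, then scale edit distance to a 0..1 score."""
    a, b = name.lower(), label.lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("Beatles", "Beetles"))               # one substitution
print(label_similarity("The Beatles", "the beatles"))  # 1.0 after lowercasing
```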

The SILK Workbench is a web application built on top of SILK that can be used to create projects and manage the creation of links between two RDF datasets. The SILK Workbench has a graphical editor that can be used to create linkage rules. Support is also provided for the automatic learning of linkage rules. Figure 3.46 shows a snapshot of the SILK Workbench. A new project has been created with the name “MyLinkedMusic”. Two datasets have been added, labelled as DBpedia and MyMusicBrainz. The sections below this are concerned with the specification of linkage rules and the location and format of the output.

Figure 46: Overview of a Silk Workbench project.

Figure 47 shows how the graphical editor can be used to specify linkage rules. In this example the foaf:name of the MusicBrainz entity and the rdfs:label of the DBpedia entity are both transformed to lower case and then compared in terms of their Levenshtein edit distance.

Figure 47: Adding a linkage rule in the SILK workbench.

The linkage rules can then be used to generate a set of links as shown in Figure 3.48. Each of these links between a MusicBrainz and DBpedia identifier has a confidence score. The larger the Levenshtein edit distance between the foaf:name and rdfs:label, the lower the confidence. All of the examples listed have a confidence score of 100% indicating a zero edit distance between the two literals. Confidence can be accumulated from a number of sources. When comparing music groups, this could include other data such as their membership and formation date.

Figure 48: Generating links with the SILK workbench.

The SILK Workbench also provides an interface for examining automatically learned rules. These suggested rules can then be added to the set of linkage rules or rejected.

Figure 49: Rule learning with the SILK workbench.

3.10 Further reading

[31] Ciravegna, F., Gentile, A., Zhang, Z.. (2012) LODIE - Linked Open Data for Web-scale Information Extraction. Workshop on Semantic Web and Information Extraction (SWAIE 2012). Workshop in conjunction with the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2012).


[33] LOD2 Webinar Series: Silk - (Simplified) Linking Workflow, by Robert Isele.

3.11 Summary

After studying this chapter you should achieve the following outcomes: 

  • Understanding of the linked data lifecycle and the different ways in which it has been characterised.
  • An understanding of how linked data is created including guidelines for the design of cool URIs.
  • An understanding of how links can be specified between datasets using rdfs:seeAlso, owl:sameAs and SKOS properties.
  • An understanding of the processes involved in the publication of linked data including the construction of dataset metadata, dataset validation and the mechanisms that can be used for making a dataset available. 
  • A knowledge of how datasets can be managed with private and public data catalogs and registered with the Linking Open Data cloud.
  • An understanding of how linked data is used to enrich search engine results.
  • A practical understanding of the tools that can be used to extract data from spreadsheets, relational databases and text.
  • A practical understanding of how SILK can be used to interlink datasets.