Chapter 4: Interaction with Linked Data

4.1 Introduction

In the previous chapter we described how linked data could be made available, for example, via a data dump or SPARQL endpoint. The emphasis was on providing data in a form readable by machines such as RDF/XML or Turtle. The results of a SPARQL query could be provided in a format such as JSON that a software developer could then use to develop an application. In this chapter, rather than focussing on how data can be made available to applications in machine-readable form, we will look at how linked data can be presented for use by a human audience.

In this chapter we will once again use music as a motivating example but focus on modules that enable people to interact with and explore music-related data. We consider how RDF data could be visualised and also how statistical and machine learning techniques can be used to extract interesting patterns from data.

Movie 1: An overview of the Linked Data visualisation process.

Applying visualization techniques to RDF data provides much more engaging ways of communicating data. As illustrated in figure 1, a user may initiate the process by entering a search query. The semantics of the resources returned from the search can then be used to construct a coherent presentation of the information. Text descriptions of RDF resources can be embedded with links to other resources and also other media such as images. Sets of resources, such as the single or album releases of a band, can be aggregated and presented in different forms such as a map (showing for example where they were released) and a bar chart (showing for example the number of individual releases of each work).

Figure 1: Visualization of music-related data (from [1]).

Visualizations have a number of advantages in how they communicate data. Visualizations make it possible to tell a story with the data, giving it some meaning and interpretation. Visualizations also make it easier for people to spot patterns in the data such as changes over time or between different locations. Visualizations can also reveal differences between datasets that may not be apparent from simple descriptive statistics such as the mean and variance of a set of values. An interesting example is Anscombe's quartet of datasets (see figure 2). The four datasets have a number of statistical properties in common such as the mean and variance of the x and y variables and the correlation and regression values between x and y. However, the differences between the datasets are clearly apparent when visualized.

Figure 2: Anscombe's quartet of datasets having similar statistical properties but appearing very different when plotted [2].
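The point about Anscombe's quartet can be checked numerically. The following is a minimal Python sketch that computes the mean, sample variance and Pearson correlation for each of the four datasets using only the standard library. The helper names (mean, variance, correlation, summarise) are our own and are not taken from any tool discussed in this chapter.

```python
from math import sqrt

# Anscombe's quartet: datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Sample variance (divides by n - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return covariance / spread


def summarise(xs, ys):
    return {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "corr": correlation(xs, ys),
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarise(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The printed summaries are almost identical across the four datasets, which is exactly why a plot is needed to tell them apart.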

4.2 Learning outcomes

On completing this chapter you should understand the following:

  • The process of extracting and transforming Linked Data for visualization.
  • The range of visualization techniques available for different types of data
  • The types of Linked Data visualization tools currently available
  • How the Information Workbench [3] can be used to visualize data
  • Approaches to visualizing the Linking Open Data cloud
  • The use of dashboards to provide summary information about a dataset
  • How semantics can be used to drive search and display search results
  • Tools that can be used to search for semantic data
  • How data can be aggregated and analysed statistically
  • How machine learning can be used to identify patterns in a dataset

 

Part I: Linked Data Visualization

4.3 LD Visualization Techniques

Linked Data visualization techniques aim to provide graphical representations of some information of interest within a dataset. Visualizations need to be selected that match the type of data, for example whether it be numerical data or location information. Visualizations also need to be selected that match the task that the user is trying to perform, bringing to the foreground the types of data and patterns in the data that they wish to work with.

Figure 3 illustrates the way in which raw RDF data needs to be transformed to produce visualizations. First, the data of interest is extracted from the dataset. This can be done with a SPARQL query. Second, the data needs to be transformed so that it can be displayed with the intended visualization methods. A simple example would be to extract a numerical value from a string so that it could then be visualized on a bar chart. Third, the data needs to be mapped to the constructs of the visualization. For example, the numerical value could be mapped to the y-axis of a bar chart. The view resulting from this process may not be a static image. It may provide ways in which the user can interact with the data, by zooming or clicking to trigger further visualizations.

Figure 3: Linked Data visualization process (partially based on [3])
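The three steps of extraction, transformation and mapping can be sketched in a few lines of Python. This is an illustration only. The sample result set is hand-written (it follows the standard SPARQL JSON results layout but is not real data), the variable names album and length are assumptions, and the "bar chart" is simply a list of x/y pairs printed as text.

```python
import json
import re

# Hand-written sample in the SPARQL JSON results layout (not real data).
SAMPLE_RESULTS = json.loads("""
{
  "head": {"vars": ["album", "length"]},
  "results": {"bindings": [
    {"album": {"type": "literal", "value": "Album A"},
     "length": {"type": "literal", "value": "35 min"}},
    {"album": {"type": "literal", "value": "Album B"},
     "length": {"type": "literal", "value": "42.5 min"}},
    {"album": {"type": "literal", "value": "Album C"},
     "length": {"type": "literal", "value": "unknown"}}
  ]}
}
""")


def extract(results):
    """Step 1: pull (album, length) string pairs out of the result set."""
    return [(b["album"]["value"], b["length"]["value"])
            for b in results["results"]["bindings"]]


def transform(rows):
    """Step 2: turn strings such as '35 min' into numbers."""
    transformed = []
    for label, text in rows:
        match = re.search(r"\d+(?:\.\d+)?", text)
        if match:  # rows without a numerical value are skipped
            transformed.append((label, float(match.group())))
    return transformed


def map_to_bar_chart(rows):
    """Step 3: map labels to the x-axis and values to the y-axis."""
    return [{"x": label, "y": value} for label, value in rows]


bars = map_to_bar_chart(transform(extract(SAMPLE_RESULTS)))
for bar in bars:
    print(f"{bar['x']:<10} {'#' * int(bar['y'] // 5)}")
```

Keeping the three steps as separate functions mirrors the process in figure 3 and makes it easy to swap in a different visualization for the final step.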

An example of the Linked Data visualization process is shown in figure 4. A SPARQL query is used to extract the data of interest from the MusicBrainz dataset, in this case the number of Beatles releases per country. In the second step the string value representing the country is transformed into a country code that can be used in the visualization. In the third step, this data is passed to a heatmap visualization. The country code is used to identify an area on the map. The number of releases is mapped to the warmth of the colour of that region. The resulting heatmap is shown in figure 5.

Figure 4: Example of the Linked Data visualization process.

Figure 5: Heatmap visualization of Beatles releases.
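The transformation and mapping steps of the heatmap example might look as follows in Python. The release counts are invented for illustration, the country lookup table is deliberately tiny (a real tool would use a complete ISO 3166 table), and the linear yellow-to-red colour scale is our own assumption about what "warmth" could mean.

```python
# Invented query output: (country name, number of releases).
ROWS = [
    ("United Kingdom", 120),
    ("United States", 90),
    ("Germany", 30),
    ("Atlantis", 2),  # a name with no known country code
]

# Small illustrative lookup table, not a complete ISO 3166 list.
COUNTRY_CODES = {"United Kingdom": "GB", "United States": "US", "Germany": "DE"}


def to_country_codes(rows):
    """Transform country names into codes, reporting names that cannot be mapped."""
    coded, unmapped = {}, []
    for name, count in rows:
        code = COUNTRY_CODES.get(name)
        if code is None:
            unmapped.append(name)
        else:
            coded[code] = coded.get(code, 0) + count
    return coded, unmapped


def to_colours(counts):
    """Map counts linearly onto a yellow (few releases) to red (many) scale."""
    highest = max(counts.values())
    colours = {}
    for code, count in counts.items():
        warmth = count / highest           # 0.0 .. 1.0
        green = round(255 * (1 - warmth))  # less green means a warmer colour
        colours[code] = f"#ff{green:02x}00"
    return colours


counts, unmapped = to_country_codes(ROWS)
colours = to_colours(counts)
print(colours)
print("Could not place on the map:", unmapped)
```

Reporting the unmapped names separately gives the visualization a way to tell the user which data could not be shown.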

4.4 Challenges for Linked Data visualization

The domain of Linked Data poses a number of challenges for visualization. First, there is the challenge of scalability. The data of interest, for example the results returned from a SPARQL query of a dataset, could be vast. Visualization techniques need to be used that can scale to large amounts of data. The visualization tools also need to be powerful and efficient enough to render the information within an acceptable timescale.

One useful way of addressing scale is to provide visualizations that enable user interaction. All data of potential interest then does not need to be provided within a single static view. The user has control over the visualization, allowing navigation through the data. User interaction functionality may also provide support for user editing of the data or annotation of the visualization itself. When visualizing the data, the user may spot errors or omissions that could be fixed interactively through the visualization. The user may also wish to highlight or make comments about some region of the data, essentially adding metadata to the dataset.

Linked data visualizations and the software mechanisms used to construct them should ideally be reusable. Developing tools to produce visualizations such as maps and timelines involves a lot of effort. It is therefore more efficient to produce generic tools that can be reused with many datasets. The emergence of standards for representing types of data (such as time and location information) facilitates the use of visualization tools. Ideally, the resulting visualizations should also be reusable and sharable using standard formats.

4.5 Challenges for Linked Open Data visualization

When we consider Linked Open Data rather than just Linked Data, further challenges need to be addressed. First, the data of interest may be partitioned across different repositories. Assembling the data of interest will therefore require access to multiple datasets. Second, the assembled heterogeneous data may model the concepts in different ways. Alternative formats may also be used for values. For example, in different repositories date information may be variously represented as DD-MM-YYYY, MM-DD-YYYY or just YYYY. Third, working with a dataset assembled from multiple repositories increases the likelihood of missing data. Visualizations will need to be able to handle missing data and perhaps also indicate to the user which data cannot be represented in the selected visualization.
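The date format problem can be made concrete with a short Python sketch. A value such as 03-04-1963 is ambiguous on its own, so the sketch assumes that the format used by each repository is known in advance. The repository names repo_a, repo_b and repo_c are hypothetical.

```python
from datetime import datetime

# Assumed date format for each (hypothetical) repository.
SOURCE_FORMATS = {
    "repo_a": "%d-%m-%Y",  # DD-MM-YYYY
    "repo_b": "%m-%d-%Y",  # MM-DD-YYYY
    "repo_c": "%Y",        # YYYY only
}


def normalise_date(value, source):
    """Return an ISO-style string, or None if the value is missing or malformed."""
    if not value:
        return None
    date_format = SOURCE_FORMATS[source]
    try:
        parsed = datetime.strptime(value, date_format)
    except ValueError:
        return None
    if date_format == "%Y":
        return parsed.strftime("%Y")  # keep year-only precision
    return parsed.date().isoformat()


records = [
    ("22-03-1963", "repo_a"),
    ("03-22-1963", "repo_b"),
    ("1963", "repo_c"),
    ("", "repo_a"),
]
normalised = [normalise_date(value, source) for value, source in records]
missing = normalised.count(None)
print(normalised, f"({missing} value(s) could not be used)")
```

Counting the values that could not be normalised gives the visualization something to report to the user rather than silently dropping data.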

4.6 Classification of visualization techniques

Visualization techniques can be classified according to the type of analytical task that the user is attempting to perform on the data. Visualization techniques such as pie charts are appropriate for comparing the attributes or values of different variables within the dataset. If the user wants to analyse relationships and hierarchies then graphs and other related techniques can be used. The analysis of data in time or space can be supported with timelines and maps. A scatter plot can be used to analyse three-dimensional data. Higher dimensional data can be visualised using techniques such as radar charts. The following subsections will describe in more detail a range of example visualization techniques and how they can be used.

Figure 6: Visualization techniques appropriate for different data analysis tasks.

4.6.1 Comparison of attributes/values

The most appropriate visualization for comparing attributes or values will depend on the nature of the data and the task. To compare absolute values (such as total number of sales) associated with a list of items (such as different albums), a bar chart would be appropriate. If only relative rather than absolute values were of interest then a pie chart could be used. For bar charts and pie charts, the items associated with the values (e.g. albums) do not necessarily have any predefined order or position in the chart. If the items do have a predefined order (for example the release date of the albums) a line chart can be used to show the trend. Finally, frequency distributions for an ordered variable can be visualised using a histogram. This could be used for example to plot the frequency of tracks of varying lengths.

Figure 7: Visualizations for comparing attributes or values. Top left: Using a bar chart to compare values across a set of categories [1]. Top right: Using a pie chart to compare proportions [4]. Bottom left: Using a line chart to visualise a series of data points against an ordered set of points on the x-axis [6]. Bottom right: Using a histogram to visualise frequency distribution [4].
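As a small illustration of the histogram case, the following Python sketch groups track lengths into bins of equal width and counts the tracks in each bin. The track lengths are invented, and the choice of a 60 second bin width is arbitrary.

```python
from collections import Counter

# Invented track lengths in seconds.
TRACK_LENGTHS = [125, 143, 150, 178, 182, 185, 199, 204, 211, 260, 431]


def histogram(values, bin_width):
    """Count values per bin; each key is the lower bound of a bin.

    Assumes a non-empty list of non-negative integers.
    """
    counts = Counter((value // bin_width) * bin_width for value in values)
    lowest, highest = min(counts), max(counts)
    # Include empty bins so that gaps in the distribution remain visible.
    return {start: counts.get(start, 0)
            for start in range(lowest, highest + bin_width, bin_width)}


bins = histogram(TRACK_LENGTHS, 60)
for start, count in bins.items():
    print(f"{start:>3}-{start + 59:<3} s  {'#' * count}")
```

Because the bins are ordered, the empty bins between the typical track lengths and the one long outlier show up clearly in the output.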

4.6.2 Analysis of relationships and hierarchies

Relationships between nodes can be visualised using a standard graph notation in which relationships are represented as lines. Graph data can also be visualized using an arc diagram in which the nodes are organised linearly. Relationships between nodes are represented as half circles connecting the two nodes. When using an adjacency matrix, the nodes of the graph are placed on both the x-axis and y-axis. Relationships between the nodes are represented as entries in the grid.

Figure 8: Visualizing relationships using a graph (left), an arc diagram (middle) and adjacency matrix diagram (right) [4].
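The adjacency matrix representation is straightforward to derive from a list of relationships. The sketch below assumes undirected relationships between four placeholder nodes (A to D) and builds the grid as a list of rows.

```python
# Placeholder relationships between nodes (treated as undirected).
EDGES = [("A", "B"), ("B", "C"), ("A", "D")]


def adjacency_matrix(edges):
    """Return the ordered node list and a square 0/1 matrix of relationships."""
    nodes = sorted({node for edge in edges for node in edge})
    index = {node: position for position, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
        matrix[index[target]][index[source]] = 1  # mirror: undirected graph
    return nodes, matrix


nodes, matrix = adjacency_matrix(EDGES)
print("  " + " ".join(nodes))
for node, row in zip(nodes, matrix):
    print(node + " " + " ".join(str(cell) for cell in row))
```

For directed relationships, such as RDF triples, the mirroring line would be dropped so that the matrix is no longer symmetric.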

Some visualizations are specifically designed for hierarchical graph data. The indented tree is a familiar formalism commonly used for visualising hierarchies and navigating file directories. The node-link tree is a tree visualization in which the root node is placed in the centre. This provides a visual cue as to the population level of different sections of the hierarchy.

Figure 9: Visualizing hierarchies using an indented tree (left) and node-link tree (right) [4].

A number of space filling visualization techniques have been designed specifically to give an indication of the population level of different parts of the hierarchy. Treemaps visualise nodes within a hierarchy as a set of rectangles. Containment can be used to represent hierarchical relationships between nodes. The size of the rectangle is generally used to represent the number of individuals of a node (i.e. class) within the dataset. Colour can be used to represent some feature of the nodes (i.e. classes) such as the discrete set of superclasses to which they belong.

The icicle visualization can be used to show a node hierarchy and gives a clear indication of depth at different parts of the hierarchy. The sunburst essentially folds the icicle visualization into a circle. Similarly, the rose diagram uses the size of sectors to indicate the population of parts of the hierarchy. Finally, a circle-packing visualization uses containment to represent the hierarchy and the size of each circle to represent population.

Figure 10: Space filling visualization of a hierarchy using treemaps (left) and icicles (right) [4].

Figure 11: Space filling visualization of a hierarchy using sunburst (left), circle-packing (middle) and a rose diagram (right) [4].
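Space filling visualizations such as treemaps and sunbursts need a population figure for every node, including the individuals of all its descendants. The following Python sketch shows one way this could be computed. The class hierarchy and the instance counts are invented for illustration.

```python
# Invented class hierarchy (child -> parent) and direct instance counts.
PARENT = {
    "Agent": None,
    "MusicGroup": "Agent",
    "SoloArtist": "Agent",
    "Work": None,
    "Track": "Work",
    "Album": "Work",
}
DIRECT_COUNTS = {
    "Agent": 5, "MusicGroup": 40, "SoloArtist": 60,
    "Work": 0, "Track": 900, "Album": 100,
}


def populations(parent, direct_counts):
    """Add each class's direct count to itself and to all of its ancestors."""
    totals = dict.fromkeys(parent, 0)
    for cls, count in direct_counts.items():
        node = cls
        while node is not None:
            totals[node] += count
            node = parent[node]
    return totals


totals = populations(PARENT, DIRECT_COUNTS)
top_level = {cls: total for cls, total in totals.items() if PARENT[cls] is None}
whole = sum(top_level.values())
for cls, total in top_level.items():
    print(f"{cls}: {total} individuals, {total / whole:.0%} of the area")
```

The same totals could drive the area of a treemap rectangle, the angle of a sunburst sector or the radius of a packed circle.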

4.6.3 Analysis of temporal or geographical events

Timeline visualizations can be used with discrete data where, for example, individual events are marked as dots on the timeline. Timelines can also be used to represent continuous data. For example, changes in the frequency of different types of event over time could be represented using colour to indicate event type and thickness of the band to indicate the frequency of those event types at that point in time.

Figure 12: Visualizing discrete [1] (left) and continuous [5] (right) data over time.

Data can also be associated with different types of map visualization. This could involve plotting coordinate points on the map. If the granularity of interest is areas (such as countries) rather than specific points in space, then a choropleth map can be used. Colour can indicate some feature of the area such as the number of specific data points associated with it. The heat map of Beatles releases shown in figure 5 is an example of a choropleth map. If it is not necessary to indicate the borders between areas (such as country borders) then a Dorling cartogram can be used in which the centre of each circle falls within its associated area and both colour and size of the circle are used to visualize additional data.

Figure 13: Visualising data on a map [6] (left), choropleth map [1] (middle) and Dorling cartogram [4] (right).

4.6.4 Analysis of multidimensional data

Some visualizations can be used to represent data having 3 or more dimensions. Three-dimensional data can be represented using a scatter plot. As well as the x-axis and y-axis of the chart, the size of each dot placed on the scatter plot is used to represent a third dimension. Radar charts or parallel coordinates can be used to represent higher dimension data. In a radar (or star) chart, each multi-dimensional point is represented as a shape whose border connects each axis. The axes of a radar chart are represented as spokes of a wheel. In a parallel coordinates visualization, the axes correspond to vertical lines. A multi-dimensional point is shown as a line connecting each axis.

Figure 14: Visualizing multidimensional data using a scatter plot (left) and radar or star chart (right) [4].

Figure 15: Visualizing multidimensional data using parallel coordinates [4].
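The geometry behind radar charts and parallel coordinates is simple enough to sketch in Python. In both cases each value is first scaled by the maximum of its axis. The function names and the decision to start the first spoke at angle zero are our own choices, and every axis maximum is assumed to be positive.

```python
from math import cos, pi, sin


def radar_vertices(values, maxima):
    """Polygon vertices for one multi-dimensional point on a radar chart.

    Axis k is a spoke at angle 2*pi*k/n; the distance along the spoke is
    the value scaled to the range 0..1 by that axis's maximum.
    """
    count = len(values)
    vertices = []
    for k, (value, maximum) in enumerate(zip(values, maxima)):
        radius = value / maximum
        angle = 2 * pi * k / count
        vertices.append((radius * cos(angle), radius * sin(angle)))
    return vertices


def parallel_polyline(values, maxima):
    """Polyline for the same point in parallel coordinates.

    Axis k is the vertical line x = k; the scaled value gives the height.
    """
    return [(k, value / maximum)
            for k, (value, maximum) in enumerate(zip(values, maxima))]


point = [2, 4, 2, 4]    # one point with four dimensions
maxima = [4, 4, 4, 4]   # maximum of each axis
print(radar_vertices(point, maxima))
print(parallel_polyline(point, maxima))
```

The two functions differ only in how the axes are laid out, which is why the two visualizations can usually be generated from the same transformed data.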

4.6.5 Other visualization techniques

Text-based visualizations use word size to represent frequency. In a standard tag cloud, words or phrases indicate tags or annotations that have been associated with resources. The larger the word, the more commonly it has been used to tag the resources. A variation on the standard tag cloud is the phrase net visualization. This is often used to visualise a document or larger text corpus. Size reflects the frequency of each word, and lines connect words that are in close proximity in the text.

Figure 16: Text-based visualization of a tag cloud [7] (left) and network of phrases [8] (right).
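The mapping from tag frequency to word size can be sketched as follows. The tags are invented, and the linear scaling between a minimum and maximum point size is just one possible choice (logarithmic scaling is a common alternative when a few tags dominate).

```python
from collections import Counter


def font_sizes(tags, min_pt=10, max_pt=40):
    """Scale tag frequencies linearly onto a range of font sizes (in points)."""
    counts = Counter(tags)
    lowest, highest = min(counts.values()), max(counts.values())
    span = (highest - lowest) or 1  # avoid dividing by zero if all counts are equal
    return {tag: round(min_pt + (count - lowest) / span * (max_pt - min_pt))
            for tag, count in counts.items()}


# Invented tags attached to a set of resources.
TAGS = ["rock"] * 5 + ["pop"] * 3 + ["skiffle"]
sizes = font_sizes(TAGS)
for tag, size in sizes.items():
    print(f"{tag}: {size}pt")
```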

4.7 Applications of Linked Data visualization techniques

Visualizations can be applied to Linked Data in order to satisfy a number of aims. First, particularly given the potential scale of the Linked Data of interest, visualizations can be used to provide an overview of the data to guide further analysis. Visualization can be used to identify and analyse relevant resources, classes or properties in the dataset.

Visualizations can be used to reveal a great deal about the vocabulary or taxonomy used in the dataset to define resources. For example, the various methods of visualising a hierarchy, shown in section 4.6.2, would quickly reveal attributes related to the depth and breadth of the hierarchy and the frequency of use of different concepts.

Visualizations can also be used to reveal different types of desired and undesired patterns in the data. A graph visualization could reveal missing links between nodes or uncover new paths between resources. Visualizations may also be used to uncover hidden patterns, errors or outliers in the data (along the lines illustrated in figure 2).

4.8 Summary of Linked Data visualization tool requirements

As described in section 4.4, Linked Data visualizations need to offer data navigation and exploration capabilities. In particular, given the scale of the data, it is unlikely that a static visualization will provide an adequate view on the data for all purposes. This will involve providing user interaction capabilities such as being able to query data of interest, filtering values and folding or expanding parts of the visualization. Visualizations should also exploit the data structures that are inherent in Linked Data such as ontology or taxonomy hierarchies. Ideally, it should be possible for the user to publish or share their visualizations using standard presentation formats for easy distribution. This ability to share should also apply to the extracted data, that is, the particular viewpoint on the data established in the selection and user manipulation of the visualization tool.

4.9 Linked Data visualization tool types

Linked Data visualization tools can be organised into four categories. These range from text-based Linked Data browsers to toolkits with extensive functionality for data transformation and graphical visualization.

1) Linked Data browsers with text-based representation

A text-based LD browser dereferences URIs to retrieve a resource description that is then presented to the user. Most LD browsers will present not only text descriptions but also other media such as images associated with the resource. Text-based browsers will generally include hypertext links to connected resources. These may be part of the immediate description (i.e. triples that link from this to another resource) or backlinks (i.e. triples that link from other resources to this resource). See section 3.5.2 of chapter 3 for guidance on presenting resource descriptions.

2) Linked Data and RDF browsers with visualization options

Some Linked Data and RDF browsers make more extensive use of media associated with the resource and organise this media into more coherent or engaging presentations of the data. These browsers also offer greater user interaction. As well as links to connected resources they provide ways of querying and filtering the data. These types of browser can therefore be used to analyse as well as traverse the data.

3) Visualization toolkits

Visualization toolkits bring together a range of visualization techniques. They generally incorporate methods for transforming the raw data in order that it can be rendered by the visualization. Some visualization toolkits are specifically designed to consume Linked Data.

4) SPARQL visualization

Finally, SPARQL visualizations are tools that dynamically transform the output of SPARQL queries to produce visualizations. These tools can be used to support analysis of the dataset that is accessed by the SPARQL endpoint.

As summarised in the figure below, a number of current tools fall into these four categories. In the next section we will introduce Sig.ma and Sindice as examples of Linked Data browsers and then the Information Workbench will be introduced as an example of a visualization toolkit that can also visualize the results of SPARQL queries.

Figure 17: Types of Linked Data visualization tool.
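The distinction between the immediate description and backlinks, mentioned above for text-based browsers, can be illustrated with a small set of triples. The triples and the ex: prefix below are invented, and a real browser would obtain them by dereferencing URIs rather than from a hard-coded list.

```python
# Invented triples: (subject, predicate, object).
TRIPLES = [
    ("ex:TheBeatles", "rdfs:label", '"The Beatles"'),
    ("ex:TheBeatles", "ex:member", "ex:JohnLennon"),
    ("ex:WithTheBeatles", "ex:maker", "ex:TheBeatles"),
    ("ex:JohnLennon", "rdfs:label", '"John Lennon"'),
]


def describe(resource, triples):
    """Split the triples that mention a resource into forward links and backlinks."""
    forward = [(p, o) for s, p, o in triples if s == resource]
    backlinks = [(s, p) for s, p, o in triples if o == resource]
    return forward, backlinks


forward, backlinks = describe("ex:TheBeatles", TRIPLES)
print("Description of ex:TheBeatles")
for predicate, obj in forward:
    print(f"  {predicate} -> {obj}")
print("Linked from")
for subject, predicate in backlinks:
    print(f"  {subject} ({predicate})")
```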

4.10 Linked Data visualization examples

4.10.1 Sig.ma

Sig.ma [9] is a text-based Linked Data browser. Interaction is usually initiated by a text search. Sig.ma returns all triples associated with the search terms. The returned triples are grouped according to predicate. Sig.ma lists all found values for each predicate. The sources of each triple are displayed in the right hand panel. Each source has a number that is used to reference each triple in the main body of the page.

Figure 18: The Sig.ma Linked Data browser.

For a particular predicate, such as title or label, Sig.ma may display the same information in multiple languages. Property values that are URIs can be followed in order to view the RDF data of that connected resource.

Figure 19: The Sig.ma Linked Data browser showing multiple values for a predicate.
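The style of presentation described here, with values grouped by predicate and each value annotated with numbered sources, can be approximated in a few lines of Python. This is only a sketch of the idea and not a description of how Sig.ma is implemented. The values and source URLs are invented.

```python
from collections import defaultdict

# Invented search hits: (predicate, value, source URL).
FOUND = [
    ("label", "The Beatles", "http://example.org/source-a"),
    ("label", "The Beatles", "http://example.org/source-b"),
    ("label", "Beatles, The", "http://example.org/source-b"),
    ("homepage", "http://example.org/beatles", "http://example.org/source-a"),
]


def group_by_predicate(found):
    """Group values by predicate and number the sources in order of appearance."""
    sources = {}                 # source URL -> reference number
    grouped = defaultdict(dict)  # predicate -> {value: [reference numbers]}
    for predicate, value, url in found:
        number = sources.setdefault(url, len(sources) + 1)
        references = grouped[predicate].setdefault(value, [])
        if number not in references:
            references.append(number)
    return dict(grouped), sources


grouped, sources = group_by_predicate(FOUND)
for predicate, values in grouped.items():
    print(predicate)
    for value, references in values.items():
        print(f"  {value} {references}")
print("Sources:", {number: url for url, number in sources.items()})
```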

4.10.2 Sindice

Interaction with Sindice [10] also begins with a keyword search. A set of results that match the query are displayed. These can be filtered by document type, for example filtering results to only RDF documents. For each document, the user can inspect the RDF triples that it contains. The user can either inspect a cached set of triples or retrieve triples live from the resource.

Figure 20: The Sindice Linked Data browser.

This list of cached or live triples is displayed as a subject/predicate/object table. Sindice also offers other viewing options such as a graph of the triples.

Figure 21: Using Sindice to view the live or cached triples contained in a document.

4.10.3 Information Workbench

The Information Workbench [11] is a platform targeted at the whole lifecycle of linked data application development including integrating, managing, analysing and exploring linked data. Here we focus on the visualization capabilities of the Information Workbench. Generally, visualizations are constructed using data returned from a SPARQL query. A number of different visualization techniques can be applied to the data including bar charts, pie charts, Google maps and timelines. The visualizations allow for user interaction including browsing and exploring the data. A demo system visualizing MusicBrainz using the Information Workbench is available at [12].

Movie 2: Interacting with Linked Data using the Information Workbench.

Movie 3: Search Capabilities of the Information Workbench

Movie 4: Visualizing SPARQL Query Results with the Information Workbench

As with Sig.ma and Sindice, user interaction may begin via text search. In the figure below “The Beatles” has been used as a search term. The user may select full text search. In this case, different types of resources will appear in the results set. For example the resource representing the album “With The Beatles” will be returned as well as a resource representing the band itself. The search can also be limited to particular types of resources, such as music artists. More structured searches can be specified, for example, matching the query only to artists from a specified country. Structured search queries are re-represented as SPARQL queries issued against the dataset.

Figure 22: Keyword and structured search using the Information Workbench.

Having searched for the Beatles and selected the band from the results page, the user is directed to an information page about the Beatles. The top of the page is composed of mash-ups with web services. The panel to the top right shows tweets mentioning the Beatles. Below this is a Google map showing the UK as their country of origin. Bottom right is a mash-up with last.fm and YouTube data.

Figure 23: Top of the Beatles page showing mash-ups with web services.

Further down the page we see actual visualizations of the data, each making use of particular SPARQL queries against the dataset. The visualizations show: a table of the track numbers and playing times for Beatles albums (top right), the locations of Beatles releases (top left), a timeline of Beatles releases (bottom right) and the number of releases for different albums (bottom right).

Figure 24: Visualizations of Beatles data using a table, map, timeline and bar chart.

The visualizations provided by the Information Workbench also enable user interaction. To the left of the figure below we see a tag cloud of music artists. The size of the text represents number of releases. The text label of each artist is a hypertext link that directs the user to data associated with the resource representing that artist. From there, the user can continue to navigate across the dataset.

Figure 25: Linking from a tag cloud to an information page about this artist.

In the Information Workbench all visualizations are implemented as widgets. The region of the dataset of interest is specified using a SPARQL query. The SPARQL query below requests the top ten Beatles releases based on their duration. The easiest way to visualize a result set is to display it in a table.

Figure 26: Returning a result set for a SPARQL query.

For many visualizations some level of configuration may need to be specified as well as the SPARQL query. In the example below, we can see that the type of visualization is specified on the first line. This is followed by the SPARQL query. Finally, any configuration settings are provided. If the returned data is to be presented on a bar chart then the author of the visualization will need to specify in the configuration settings the variables returned from the SPARQL query that correspond to the x-axis and y-axis. In the example below, release labels are placed on the x-axis and number of releases is placed on the y-axis. Other features of the visualization may be configured such as the colour and height.

Figure 27: Using a widget to specify a bar chart.

Using the same SPARQL query but with different widgets and configuration settings, it is possible to create other visualizations of the same data such as a line chart or pie chart.

Figure 28: Line chart and pie chart visualizations of the same SPARQL query result.

The system can also suggest appropriate widgets for visualization depending on the returned data. In the example below, the results return the playing times for different Beatles albums (labelled 2). The widget auto-suggestion link (labelled 1) provides a list of visualization types on the left. On the right (labelled 3) is the selected bar chart visualization. The auto-suggest facility provides a good way to experiment with different ways of visualizing the data.

Figure 29: Using the Information Workbench to auto-suggest visualization widgets.

4.11 Other Linked Data visualization tools

There are other tools available for the visualization of Linked Data. LOD live [13] provides a graph visualization of Linked Data resources. Clicking on the nodes can expand the graph structure. LOD live can be used for live access to SPARQL endpoints. LOD visualization [14] can produce visual hierarchies using treemaps and trees from live access to a SPARQL endpoint.

Figure 30: LOD live [13] and LOD visualization [14] tools.

4.12 Visualizing the Linking Open Data cloud

In chapter 3 (section 3.8.1) we looked at the Linking Open Data cloud diagram that represents connections between the Linked Open Data datasets. This is constructed by hand but there are also tools that can visualize connections between datasets.

Figure 31: The Linking Open Data cloud diagram [15].

Gephi is a platform for visualizing networks, graphs and hierarchies [16]. Gephi can be used to visualize the Linking Open Data cloud. As in the hand-crafted representation, dbpedia.org is the largest node in the network densely connected to other datasets. Colour as well as size is used to represent properties of the dataset. Link length is also used to encode information about the data structure.

Figure 32: Linking Open Data cloud generated using the Gephi platform [16].

Protovis [17] can also be used to automatically visualize the Linking Open Data cloud. The colour of the node reflects the CKAN rating for the dataset (see section 3.8.1 of chapter 3 for a description of CKAN). The intensity of the colour reflects the number of ratings. The proximity of nodes reflects the level of interconnection between the datasets. Outlying nodes in the graph could indicate broken links to other datasets or a genuine lack of semantic relatedness to other datasets [18]. Clicking on a node takes the user to the CKAN page for that dataset.

Figure 33: Linked Open Data Graph by Protovis [17].

4.13 Linked Data reporting and Google’s Structured Data Dashboard

Visualization techniques can also be used in the creation of reports that provide descriptive statistics for a dataset. Often visualizations are displayed in a dashboard that enables user interaction. Several tools exist that can be used for the construction of dashboards including Google Webmaster tools [19], Information Workbench [11] and eCloudManager [20]. We saw earlier some of the visualization capabilities of Information Workbench. The eCloudManager is a specific solution for data centres and cloud management. In the rest of this section we focus on Google Webmaster tools and how they can be used to provide webmasters with information about structured data embedded in websites that is recognised by the Google search engine.

Google Webmaster tools can be used to provide general data about a website, in terms of its traffic and how it is indexed by the Google search engine. As a part of this, Google Webmaster tools provides dashboards on the structured data within a website. The Structured Data Dashboard has three levels. A site-level view aggregates the structured data across all pages according to the classes defined in the vocabulary. An item-type-level view provides separate details for each type of resource. A page-level view shows attributes of every type of resource in a given web page.

Figure 34: A site-level view showing the number of resources of different types that have been detected. The chart shows how the amount of structured data is evolving over time [21].

Figure 35: A page-level view showing the metadata of the imaginary product featured on that page of the website. The detected metadata defines the resource type, image, name and description [21].

 

Part II: Linked Data Search

In the first part of this chapter we have seen a number of examples of Linked Data visualization. Many of these have used search to identify the data of interest and initiate interaction with the visualization. In this part we will look in more detail at search and in particular semantic search.

4.14 Semantic search process

The figure below provides a way of thinking about semantic search and the potential roles for visualization within that. A user query will be specified possibly as keywords or in natural language. This query can be matched to the underlying graph of data, in order to rank and retrieve results to be returned to the user. Visualization can play two roles. First, visualization can be used to present the search results. For example, search results related to locations may be visualized on a map. Second, visualization can be used to present the query. The user may be able to refine the query by interacting directly with the visualization. Faceted search (discussed more later) would be one example of this. In this case the description of the resources returned in the search can be used to construct a range of filters based on their properties such as time, people and locations. Selecting from the facets limits the part of the data graph of interest.

The user can alter the query by manipulating the visualization. This allows them to modify the underlying SPARQL query without needing to use SPARQL directly.

Figure 36: The role of visualization in search (based on [22]).
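Faceted search, as outlined above, can be sketched in Python as two small steps: deriving the facets from the properties of the returned resources, and then narrowing the results when the user selects a facet value. The search results and property names (country, year) are invented.

```python
from collections import Counter

# Invented search results, each described by a few properties.
RESULTS = [
    {"label": "Release A", "country": "GB", "year": 1963},
    {"label": "Release B", "country": "US", "year": 1964},
    {"label": "Release C", "country": "GB", "year": 1964},
]


def build_facets(results, properties):
    """For each property, count how many results carry each value."""
    return {prop: Counter(r[prop] for r in results if prop in r)
            for prop in properties}


def apply_selection(results, selection):
    """Keep only the results that match every selected facet value."""
    return [r for r in results
            if all(r.get(prop) == value for prop, value in selection.items())]


facets = build_facets(RESULTS, ["country", "year"])
print(facets)
narrowed = apply_selection(RESULTS, {"country": "GB", "year": 1964})
print([r["label"] for r in narrowed])
```

In a Linked Data setting each selection would typically add a further triple pattern or filter to the underlying SPARQL query rather than filtering a list in memory.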

4.15 Semantic search

One possible method of input to semantic search is a pseudo natural language query. Entity extraction techniques can then be applied to the text query (a number of entity extraction tools were presented in chapter 3, section 3.9.3).

Figure 37: Extraction of entities from a pseudo-natural language query.

By a process termed query expansion, the entities can be mapped to resources within our dataset. For an entity such as “song” this may involve expanding to a number of candidate synonyms and then attempting to map these to resources in the dataset.

Figure 38: Mapping the entity “song” to the class Track in the music ontology.
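The expansion step can be sketched in a few lines of Python. This is a minimal illustration rather than the method of any particular system: the synonym table and the label index are hypothetical, hand-written stand-ins for a thesaurus and for the labels of the resources in the dataset.

```python
# Hypothetical synonym table standing in for a thesaurus lookup.
SYNONYMS = {
    "song": ["song", "track", "tune"],
    "written by": ["written by", "composer", "author"],
}

# Hypothetical index from lower-cased labels to resources in the dataset.
LABEL_INDEX = {
    "track": ["mo:Track"],
    "composer": ["mo:composer"],
    "member of": ["mo:member_of"],
}


def expand(entity):
    """Return the candidate terms for an entity (the entity itself if no synonyms are known)."""
    key = entity.lower()
    return SYNONYMS.get(key, [key])


def map_entity(entity):
    """Map an extracted entity to candidate resources via its expansions."""
    candidates = []
    for term in expand(entity):
        for resource in LABEL_INDEX.get(term, []):
            if resource not in candidates:
                candidates.append(resource)
    return candidates


print(map_entity("song"))        # ['mo:Track']
print(map_entity("written by"))  # ['mo:composer']
print(map_entity("member of"))   # ['mo:member_of']
```

The third call shows the direct case, in which the entity matches a label without any expansion.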

Similarly, the entity “written by” may be mapped to the composer property in the music ontology.

Figure 39: Mapping the “written by” entity to the composer property.

In some cases there may be a direct mapping between an entity and resource in the dataset. For example, the entity “member (of)” may be mapped to the member_of property in the music ontology. This is the inverse of the member property in the music ontology that is used to define artists as members of groups.

Figure 40: Mapping the “member (of)” entity.

A process of contextual analysis is used to decide between candidate mappings. If we start to piece together the parts of the query into the same subgraph we see that the range of the mo:member property is the class mo:MusicGroup. According to this subgraph we would expect “the Beatles” to be a music group. We would therefore select this expansion of the entity over musical works, posters and books that also have “the Beatles” as their label.

Figure 41: Piecing together a subgraph for the query.
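One way to picture contextual analysis is as a filter over candidate resources, keeping only those whose class fits the position they occupy in the subgraph. The sketch below assumes a hypothetical set of candidates sharing the label “the Beatles” and a hand-written table of the class expected in the object position of a property; neither is taken from a real dataset.

```python
# Hypothetical candidates that all carry the label "the Beatles", with their classes.
CANDIDATES = {
    "ex:TheBeatles_band": "mo:MusicGroup",
    "ex:TheBeatles_album": "mo:Record",
    "ex:TheBeatles_book": "ex:Book",
}

# Class expected in the object position of each property (assumed for illustration).
EXPECTED_OBJECT_CLASS = {"mo:member_of": "mo:MusicGroup"}


def disambiguate(candidates, prop):
    """Keep only the candidates whose class matches what the property expects."""
    expected = EXPECTED_OBJECT_CLASS[prop]
    return [uri for uri, cls in candidates.items() if cls == expected]


print(disambiguate(CANDIDATES, "mo:member_of"))  # ['ex:TheBeatles_band']
```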

Once the entity mappings have been established, the pseudo natural language query can then be expressed as a SPARQL query (shown visually below) that retrieves tracks (variable ?x) composed by someone (variable ?y) who is a member of The Beatles. (See chapter 2 for an introduction to SPARQL.)

Figure 42: The SPARQL query (shown visually) for tracks by members of The Beatles.
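As a rough textual counterpart to the visual query, the sketch below writes the graph pattern down and evaluates it against a handful of triples using a small pure-Python matcher. The toy triples, the `ex:` resources and the use of rdfs:label to identify the group are assumptions made for illustration. A real application would send the query to a SPARQL endpoint rather than match triples by hand.

```python
# Textual form of the query (prefix declarations omitted; the label lookup is an assumption).
QUERY = """
SELECT ?x WHERE {
  ?x rdf:type mo:Track .
  ?x mo:composer ?y .
  ?y mo:member_of ?z .
  ?z rdfs:label "The Beatles" .
}
"""

# Hypothetical toy data as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:Yesterday", "rdf:type", "mo:Track"),
    ("ex:Yesterday", "mo:composer", "ex:PaulMcCartney"),
    ("ex:Something", "rdf:type", "mo:Track"),
    ("ex:Something", "mo:composer", "ex:GeorgeHarrison"),
    ("ex:Heroes", "rdf:type", "mo:Track"),
    ("ex:Heroes", "mo:composer", "ex:DavidBowie"),
    ("ex:PaulMcCartney", "mo:member_of", "ex:TheBeatles"),
    ("ex:GeorgeHarrison", "mo:member_of", "ex:TheBeatles"),
    ("ex:TheBeatles", "rdfs:label", "The Beatles"),
]

# The same graph pattern as a list of triple patterns; terms starting with "?" are variables.
PATTERN = [
    ("?x", "rdf:type", "mo:Track"),
    ("?x", "mo:composer", "?y"),
    ("?y", "mo:member_of", "?z"),
    ("?z", "rdfs:label", "The Beatles"),
]


def match(pattern, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not pattern:
        yield binding
        return
    first, rest = pattern[0], pattern[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its existing binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


tracks = sorted({b["?x"] for b in match(PATTERN, TRIPLES)})
print(tracks)  # ['ex:Something', 'ex:Yesterday']
```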

4.16 Semantic search versus SPARQL query

As we can see from the above example, semantic search aims at understanding the meaning of the entities specified in the query. This can involve expanding the entity into a number of candidate synonyms and then matching those to the dataset. Contextual analysis can also be used to decide between alternative meanings by comparing against the subgraph that is being produced by mapping the entities. In some cases (though not seen in the example above) reasoning may be applied in order to derive answers that are not explicitly contained in the data, but can be derived from the data.

Comparing semantic search against querying a dataset using SPARQL we see a number of differences. In semantic search, entities extracted from the search string can be expanded and mapped to resources in the dataset. In SPARQL, generally direct reference will be made to the resources. Semantic search may also allow fuzzy matching in which a weighting or certainty is applied to a mapping between a search term and a resource. Both SPARQL and semantic search work with graph patterns. SPARQL queries are graph patterns applied against the dataset. In semantic search, the sub-graph built up to represent the query can also be used to analyse context. Finally, semantic search can apply reasoning to identify new paths in the data and derive links between resources not explicit in the dataset.

Figure 43: A comparison of semantic search and SPARQL queries.
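The fuzzy matching mentioned above can be illustrated with the standard library's `difflib.SequenceMatcher`, which scores the similarity of two strings between 0 and 1. This is only one possible weighting, chosen here because it needs no extra packages. The labels and `ex:` URIs are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical resource labels from the dataset.
LABELS = {
    "ex:TheBeatles": "The Beatles",
    "ex:BeastieBoys": "Beastie Boys",
    "ex:TheBeachBoys": "The Beach Boys",
}


def fuzzy_match(term, labels):
    """Rank resources by string similarity between the search term and their labels."""
    scored = [
        (uri, SequenceMatcher(None, term.lower(), label.lower()).ratio())
        for uri, label in labels.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Each mapping carries a weight instead of being simply accepted or rejected.
for uri, weight in fuzzy_match("beatles", LABELS):
    print(f"{uri}: {weight:.2f}")
```

A SPARQL query, by contrast, would normally name `ex:TheBeatles` directly or match the label exactly.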

4.17 Semantic search in Google

Many of the features of semantic search are increasingly found in web search. For a number of queries, Google, as well as providing a page ranked list of links, uses the Google Knowledge Graph to provide direct answers to questions. As we see below, a search for “paul mccartney albums” results in a horizontal panel of albums. On the right we see the disambiguation pane (described in section 3.8.2 of chapter 3 in the context of Rich Snippets) giving additional information about the primary entities extracted from the query and mapped to the Google Knowledge Graph.

Figure 44: Semantic search in Google.

4.18 Semantic search in DuckDuckGo

Similar semantic features are found in other web search engines such as DuckDuckGo [23], which combines pseudo natural language expressions and semantics to help the user narrow down their search. In the example below a natural language query has produced a list of answers rather than the more conventional list of documents matching the search query.

Figure 45: DuckDuckGo producing a list of answers.

DuckDuckGo can also perform disambiguation to offer alternative matches for a search query. In the example below, alternative meanings of the search term jaguar have been offered, grouped into classes such as companies and music. The user can select one of these classes to drill down to their intended meaning.

Figure 46: Query disambiguation in DuckDuckGo.

4.19 Faceted Search

The Information Workbench, which we saw earlier, is one tool that can provide faceted search over a dataset. Facets are derived from the properties used to describe the resources returned from the search query. Within each facet, the user can select values. In the figure below the location facet (using the foaf:based_near property) allows the user to drill down to artists (represented as images) from a particular country. The values of the property are sorted according to frequency, giving a higher rank to values that identify the largest number of resources.

Movie 5: An introduction to faceted search and the challenges associated with supporting faceted search.

If the facets are largely independent (i.e. do not demarcate the same subsets of resources) then a small set of facets can be used to quickly filter a large dataset down to a small number of items of interest.

Figure 47: Faceted search using the Information Workbench.
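The basic mechanics of a facet, counting the values of a property, ranking them by frequency and filtering on a selected value, can be sketched as follows. The result set is hypothetical and the property is abbreviated to `based_near`. This is not how the Information Workbench itself is implemented.

```python
from collections import Counter

# Hypothetical search results: artists and the country they are based near.
RESULTS = [
    {"artist": "ex:A", "based_near": "United Kingdom"},
    {"artist": "ex:B", "based_near": "United Kingdom"},
    {"artist": "ex:C", "based_near": "United States"},
    {"artist": "ex:D", "based_near": "United Kingdom"},
    {"artist": "ex:E", "based_near": "United States"},
    {"artist": "ex:F", "based_near": "Germany"},
]


def facet_values(resources, prop):
    """Count each value of a property, most frequent first."""
    counts = Counter(r[prop] for r in resources if prop in r)
    return counts.most_common()


def select(resources, prop, value):
    """Restrict the result set to resources with the selected facet value."""
    return [r for r in resources if r.get(prop) == value]


print(facet_values(RESULTS, "based_near"))
# [('United Kingdom', 3), ('United States', 2), ('Germany', 1)]
print([r["artist"] for r in select(RESULTS, "based_near", "United States")])
# ['ex:C', 'ex:E']
```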

An interesting challenge when supporting faceted search is determining which of the potentially large number of facets to prioritize in the interface. This can be a particular problem with heterogeneous data, typical of Linked Open Data, where different properties (and therefore facets) apply to different parts of the dataset. Addressing this challenge involves prioritizing facets that not only have good coverage of the resources of interest but also have values that discriminate between them. Facets whose values split the resources into subsets of similar size would then be optimal for filtering.

As the most appropriate facets depend on the properties and values of the resources of interest, they can be expected to change if keyword search is used to select an initial set of resources for faceted browsing.
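One possible heuristic for this prioritization, offered here only as an assumption and not as the method of any particular tool, is to multiply a facet's coverage by a measure of how evenly its values split the resources. The sketch below uses normalised entropy for the second factor. The album descriptions and property names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical resources of interest; "producer" is only present on some of them.
ALBUMS = [
    {"decade": "1960s", "format": "LP", "producer": "ex:P1"},
    {"decade": "1960s", "format": "LP"},
    {"decade": "1970s", "format": "LP", "producer": "ex:P2"},
    {"decade": "1970s", "format": "LP"},
]


def facet_score(resources, prop):
    """Score a facet as coverage times how evenly its values split the resources."""
    values = [r[prop] for r in resources if prop in r]
    if not values:
        return 0.0
    counts = Counter(values)
    if len(counts) == 1:
        return 0.0  # a single value cannot discriminate between resources
    coverage = len(values) / len(resources)
    total = len(values)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    balance = entropy / math.log2(len(counts))  # 1.0 when all subsets have equal size
    return coverage * balance


ranking = sorted(["decade", "format", "producer"],
                 key=lambda p: facet_score(ALBUMS, p), reverse=True)
print(ranking)  # ['decade', 'producer', 'format']
```

Because the score depends on the current result set, rerunning it after a keyword search would generally produce a different ranking.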

Another challenge is the real time computation of previews [24]. With a large dataset it can be computationally too expensive to calculate the frequencies for all values of all facets. Counts then need to be predicted from a sample giving the user an indication of what they can expect.
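A simple way to predict counts from a sample is to count facet values in a random subset and scale up by the sampling ratio. The sketch below assumes uniform random sampling and a hypothetical collection of publication records. Real systems may use more elaborate estimators.

```python
import random
from collections import Counter


def estimate_counts(resources, prop, sample_size, seed=0):
    """Estimate facet value counts for the full collection from a random sample."""
    if not resources or sample_size <= 0:
        return {}
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(resources, min(sample_size, len(resources)))
    scale = len(resources) / len(sample)
    counts = Counter(r[prop] for r in sample if prop in r)
    return {value: round(n * scale) for value, n in counts.items()}


# Hypothetical collection: 7,000 articles and 3,000 conference papers.
RECORDS = [{"type": "article"}] * 7000 + [{"type": "inproceedings"}] * 3000

# Counts are computed on 500 records instead of 10,000.
print(estimate_counts(RECORDS, "type", sample_size=500))
```

The estimates are approximate, which is usually acceptable for a preview that only needs to indicate roughly how many results each facet value will yield.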

4.20 FacetedDBLP

A well-known example of faceted search is FacetedDBLP [25], an interface for browsing DBLP [26], a bibliography of computer science publications. Faceted browsing is generally initiated by a keyword search to identify a region of interest. The returned results (i.e. publication records) can then be filtered using facets to restrict the results in terms of publication year, publication type, venue and author. The publication records queried via FacetedDBLP are stored in a MySQL database. An RDB2RDF server is used to provide a SPARQL interface to the dataset (see section 3.9.2 of chapter 3 for a description of how to map a relational database to RDF).

Figure 48: FacetedDBLP.

4.21 Classification of search engines

The figure below classifies a number of search systems as to whether they support semantic search and/or faceted search. A small number are at the intersection of these two including the Information Workbench that we described earlier.

Figure 49: Semantic search systems and faceted search systems.

4.22 Searching for semantic data

In the examples we have seen so far, search is primarily about finding content. Semantics assist in the search for content by, for example, disambiguating search terms. However, some search engines are directed at finding semantic data. These tools can be used to search for ontologies, vocabularies or particular RDF documents. They are therefore aimed more at data specialists than at the end user. One of the first semantic search tools was Swoogle [27]. It adopted a Google-style interface, listing RDF documents associated with the query. Watson [28] provides a snippet or summary under each result, giving a clearer indication of how the document matches the query.

Figure 50: Swoogle semantic web search engine [27].

Figure 51: Watson semantic web search engine [28].

Vocabularies can be searched using the LOV portal (this was mentioned as a source of reusable vocabularies in section 3.4 of chapter 3). A keyword search can be used to retrieve a set of vocabularies that can then be filtered using a number of provided facets. The results returned also have a confidence score giving an explicit indication of relevance of that vocabulary to the query. Similar to Watson, the snippets or previews also provide an indication of how the search terms match the vocabulary.

Figure 52: Using the LOV portal to search for vocabularies.

Finally, SWSE and Sindice are aimed more at retrieving instance data than ontologies or vocabularies. SWSE [29] brings together data related to the same instance, similar to the way disambiguation panes in Google present data related to an entity in the search term. Sindice [10], as we saw in section 4.10.2, can be used to search for resources and filter results by document format.

Figure 53: SWSE (Semantic Web Search Engine).

Figure 54: Sindice.

 

Part III: Methods for Linked Data Analysis

4.23 Data analysis methods

In the first part of this chapter we looked at how visualization techniques can be used to reveal patterns in the data. In this part we consider how statistical and machine learning techniques can be used to identify patterns in data. Statistical and machine learning techniques are complementary to visualization rather than an alternative to it. When analysing data statistically it is good practice to first visualize the data to get an idea of what statistical patterns may be expected and therefore which statistical methods to apply. Similarly, visualization is commonly used to explore and describe any pattern detected from statistical analysis and machine learning.

A preliminary step in data analysis is data aggregation. This may be used to merge or summarise the data. This creates the view over the data required for more advanced analysis. This is described in section 4.23.1. Once the appropriate view over the data has been constructed then statistical techniques can be applied, finding for example correlations between the properties. Statistical analysis of linked data is introduced in section 4.23.2. Finally, machine learning can be applied to data in order to learn new groupings or clusters in the data not explicitly defined in the dataset. Machine learning is introduced in section 4.23.3.

4.23.1 Aggregation and filtering of Linked Data

Much of the data aggregation and filtering that might be required to construct the desired view over the data can be carried out using SPARQL. Data can be aggregated using SPARQL functions such as COUNT, SUM and AVG. COUNT returns the number of times the expression has a bound value when the query is run. The functions SUM and AVG return the sum and the average of the bound values. The GROUP BY operator is used in SPARQL to divide the solutions into the groups for which an aggregate value should be calculated. Chapter 2 describes a SPARQL query for returning the playing time of an album by calculating the SUM of its track durations.

SPARQL also has operators that can be used for filtering the results to be returned. FILTER restricts results to those solutions for which a specified expression, such as a comparison on the value of a variable, evaluates to true. HAVING operates in the same way as FILTER but on the groups of solutions produced by the GROUP BY operator, so it can test aggregate values. See chapter 2 for examples of SPARQL queries using the FILTER and HAVING operators.

Figure 55: Aggregation and filtering using SPARQL.
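To make the roles of FILTER, GROUP BY and HAVING concrete, the sketch below shows an illustrative SPARQL query as a string alongside an equivalent computation in plain Python over a few hypothetical rows. The property names in the query (mo:track, mo:duration), the millisecond unit and the threshold are assumptions for the example, not a query taken from chapter 2.

```python
from collections import defaultdict

# Illustrative query (prefix declarations omitted; property names are assumptions).
QUERY = """
SELECT ?album (COUNT(?track) AS ?tracks) (SUM(?duration) AS ?playingTime)
WHERE {
  ?album mo:track ?track .
  ?track mo:duration ?duration .
  FILTER (?duration > 0)
}
GROUP BY ?album
HAVING (SUM(?duration) > 600000)
"""

# Hypothetical solutions before aggregation: (album, track, duration in milliseconds).
ROWS = [
    ("ex:AlbumA", "ex:a1", 200000),
    ("ex:AlbumA", "ex:a2", 250000),
    ("ex:AlbumA", "ex:a3", 300000),
    ("ex:AlbumB", "ex:b1", 180000),
    ("ex:AlbumB", "ex:b2", 0),
]


def aggregate(rows, min_total):
    """Mirror the query: filter rows, group by album, aggregate, then filter groups."""
    groups = defaultdict(list)
    for album, _track, duration in rows:
        if duration > 0:                    # FILTER (?duration > 0)
            groups[album].append(duration)  # GROUP BY ?album
    return {
        album: {
            "tracks": len(durations),                    # COUNT
            "playing_time": sum(durations),              # SUM
            "average": sum(durations) / len(durations),  # AVG
        }
        for album, durations in groups.items()
        if sum(durations) > min_total       # HAVING (SUM(?duration) > ...)
    }


print(aggregate(ROWS, 600000))
# {'ex:AlbumA': {'tracks': 3, 'playing_time': 750000, 'average': 250000.0}}
```

FILTER removes individual solutions (the zero-length track), whereas HAVING removes a whole group (AlbumB, whose total falls below the threshold).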

4.23.2 Statistical analysis of Linked Data

More complex forms of statistical analysis go beyond what can currently be supported by SPARQL. As we have seen above, SPARQL can be used to calculate an average or sum but cannot be used to perform other statistical techniques such as regression or analysis of variance. Some approaches [30, 31] can be used to apply statistical techniques directly to data retrieved from a SPARQL endpoint. Without these techniques, the data to be analysed would need to be downloaded in a tabular format and then opened using a statistical package.
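As an illustration of an analysis that has to happen outside SPARQL, the sketch below fits a simple linear regression by ordinary least squares to tabular data of the kind a SELECT query might return. The figures are invented for the example. In practice a statistical package would also provide significance tests and diagnostics.

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical query results: number of tracks and playing time (minutes) per album.
tracks = [8, 10, 12, 14, 16]
minutes = [31.0, 38.5, 44.0, 52.5, 59.0]

slope, intercept = linear_regression(tracks, minutes)
print(f"minutes = {slope:.2f} * tracks + {intercept:.2f}")
```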

R [32] is a free statistical computing package that can be used to carry out a range of statistical techniques including linear and non-linear modelling, time series analysis and analysis of variance. It can also be used to perform machine learning tasks such as clustering and classification. R has a graphical user interface and can also be used to generate visualizations of the data.

Figure 56: The R statistical computing package.

The R for SPARQL package [30] can be used to retrieve data from a SPARQL endpoint over HTTP. The result set of a SELECT query is returned as a data frame, the tabular data structure used in R. Visualizations, using some of the techniques described in section 4.6, such as a choropleth map, can also be generated from a data frame.
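The same idea, turning SPARQL results into a table, can be sketched in Python using only the standard library. The sketch parses a response in the SPARQL 1.1 Query Results JSON format into rows with one column per variable. The endpoint URL and the sample response are hypothetical, and no request is actually sent. This is not the R package's implementation.

```python
import json
from urllib.parse import urlencode

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def build_request_url(endpoint, query):
    """Build a GET URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def bindings_to_rows(results_json):
    """Convert a SPARQL JSON result document into (column names, list of row dicts)."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        row = {}
        for var in variables:
            cell = binding.get(var)  # unbound variables are simply absent
            if cell is None:
                row[var] = None
            elif cell.get("datatype") == XSD_INTEGER:
                row[var] = int(cell["value"])
            else:
                row[var] = cell["value"]
        rows.append(row)
    return variables, rows


# Hypothetical response from an endpoint.
SAMPLE = """
{
  "head": {"vars": ["album", "tracks"]},
  "results": {"bindings": [
    {"album": {"type": "uri", "value": "http://example.org/AlbumA"},
     "tracks": {"type": "literal",
                "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                "value": "14"}},
    {"album": {"type": "uri", "value": "http://example.org/AlbumB"}}
  ]}
}
"""

columns, rows = bindings_to_rows(SAMPLE)
print(columns)
for row in rows:
    print(row)
print(build_request_url("http://example.org/sparql", "SELECT * WHERE { ?s ?p ?o }"))
```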

Movie 6: An introduction to how statistical techniques can be applied to SPARQL query results.

4.23.3 Machine learning on Linked Data

Machine learning techniques can be used to discover hidden patterns within the dataset. A number of different machine learning techniques can be applied to linked data for different purposes.

Clustering is a technique used to organise a set of items into groups or clusters based on their attributes. In the case of Linked Data, clustering could be used to organise a set of resources based on their properties and values. For example, resources representing music albums could be clustered on properties such as duration, artist and year of release. It may be possible to associate names with some of the identified clusters, for example the music albums in a cluster may form a particular genre not explicitly represented in the dataset. This may lead the data analyst to explicitly specify this class in the dataset and assign the albums in the cluster to this class.
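A minimal sketch of clustering is given below, using the k-means algorithm as one common choice (the text above does not prescribe a particular algorithm). Each album is reduced to two hypothetical numeric features, and the initial centroids are simply the first k points, a naive assumption that keeps the example deterministic.

```python
def kmeans(points, k, iterations=20):
    """Cluster points with k-means; returns (cluster label per point, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic start (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical albums as (playing time in minutes, number of tracks).
ALBUM_FEATURES = [(20, 5), (22, 6), (18, 4), (70, 18), (75, 20), (68, 17)]

labels, centroids = kmeans(ALBUM_FEATURES, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Here the two clusters might be given the names "EP" and "double album" by an analyst, which could then be added to the dataset as classes.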

Association rule learning is a data mining technique often used to discover relations between variables. Association rules express some regularity or predictability in the data. In the case of Linked Data, if the dataset contained information on albums and which of these are “liked” or owned by a number of people, a rule may be identified that can predict a preference for one album from preferences for other albums. This could potentially be used to assert additional relations between the albums included in the rule.
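Rules of this kind are commonly assessed by their support (how often the albums occur together) and confidence (how often the prediction holds when the premise does). The sketch below computes both for single-album rules over a hypothetical set of album collections. It enumerates all pairs directly rather than using an optimised algorithm such as Apriori.

```python
from itertools import permutations

# Hypothetical data: the albums owned by each person.
OWNED = {
    "ex:anna": {"Abbey Road", "Revolver", "Pet Sounds"},
    "ex:ben": {"Abbey Road", "Revolver"},
    "ex:cara": {"Abbey Road", "Pet Sounds"},
    "ex:dev": {"Revolver", "Abbey Road"},
    "ex:eli": {"Pet Sounds"},
}


def count(itemset, baskets):
    """Number of collections that contain every album in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def rule_stats(antecedent, consequent, baskets):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    both = count(antecedent | consequent, baskets)
    support = both / len(baskets)
    confidence = both / count(antecedent, baskets)
    return support, confidence


def mine_pair_rules(baskets, min_support, min_confidence):
    """Find rules {a} -> {b} that meet both thresholds."""
    albums = sorted(set().union(*baskets))
    rules = []
    for a, b in permutations(albums, 2):
        support, confidence = rule_stats({a}, {b}, baskets)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


baskets = list(OWNED.values())
print(mine_pair_rules(baskets, min_support=0.5, min_confidence=0.9))
# [('Revolver', 'Abbey Road', 0.6, 1.0)]
```

The reverse rule, from Abbey Road to Revolver, has a confidence of only 0.75 and is therefore not reported, which shows that association rules are directional.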

Decision tree learning is a machine learning technique that can be used to define and then predict the classification of a set of items. The decision tree produced by the learning process can predict the classification of an item in answer to a series of questions that lead from the root node to a leaf of the decision tree. In a Linked Data context, decision tree learning could be used to describe and then predict the class membership of a set of instances from their properties and values. For example a subclass of albums representing albums of a particular genre could be predicted from properties such as artist and recording label. This could be used to propose class membership for a set of albums in the dataset that have this information missing.
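The central step in decision tree learning is choosing the property that best separates the classes. The sketch below illustrates that step using information gain and then builds a single-level tree from the chosen property. A full learner would repeat the step recursively on each branch. The albums and the property names (`record_label`, `decade`, `genre`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: albums with a known genre.
TRAINING = [
    {"record_label": "Blue Note", "decade": "1960s", "genre": "jazz"},
    {"record_label": "Blue Note", "decade": "1970s", "genre": "jazz"},
    {"record_label": "Motown", "decade": "1960s", "genre": "soul"},
    {"record_label": "Motown", "decade": "1970s", "genre": "soul"},
]


def entropy(items, target):
    """Shannon entropy of the target class over a set of items."""
    counts = Counter(item[target] for item in items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(items, prop, target):
    """Reduction in entropy obtained by splitting the items on a property."""
    subsets = defaultdict(list)
    for item in items:
        subsets[item[prop]].append(item)
    remainder = sum(len(s) / len(items) * entropy(s, target) for s in subsets.values())
    return entropy(items, target) - remainder


def learn_stump(items, props, target):
    """Pick the most informative property and map each of its values to the majority class."""
    best = max(props, key=lambda p: information_gain(items, p, target))
    by_value = defaultdict(list)
    for item in items:
        by_value[item[best]].append(item[target])
    return best, {value: Counter(classes).most_common(1)[0][0]
                  for value, classes in by_value.items()}


prop, tree = learn_stump(TRAINING, ["record_label", "decade"], "genre")
print(prop, tree)  # record_label {'Blue Note': 'jazz', 'Motown': 'soul'}

# Propose a genre for an album that has this information missing.
unclassified = {"record_label": "Motown", "decade": "1960s"}
print(tree[unclassified[prop]])  # soul
```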

WEKA [33] is a data mining framework that can be used to apply machine learning techniques to a dataset represented in a tabular format.

The application of machine learning techniques to Linked Data raises a number of challenges [34]. Linked Data is heterogeneous. Different URIs from different datasets may refer to the same resource. Also, similar properties with different constraints may be drawn into the same dataset from different sources. There can also be a high level of redundancy among strongly related parts of the dataset that have different origins. This noise and duplication can create performance problems for machine learning algorithms. Another problem for machine learning is the lack of negative examples. For example, decision tree learning can be used to predict membership among disjoint classes. However, even if the classes in a dataset are disjoint, this is rarely specified. Similarly, the property owl:differentFrom, which states that two resources are not the same, can assist the application of machine learning, but such negative statements are rarely used.

Machine learning can have a number of applications to Linked Data. First, machine learning can be used to rank nodes according to their relevance to a query. This can be used to prioritize results in a user interface. Second, machine learning can be used for link prediction, proposing new edges between nodes in the RDF graph. Third, entity resolution can be supported, identifying URIs that potentially refer to the same real-world object from similarities in their properties and values. Fourth, techniques such as clustering can be used to propose taxonomies classifying a set of instances. This can be particularly useful when the taxonomy available to classify a set of instances is weak or absent in the dataset.
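As a very simple illustration of the third application, the sketch below compares resources by the overlap of their property-value pairs (Jaccard similarity) and proposes pairs above a threshold as candidates for referring to the same object. This is a similarity heuristic rather than a learned model. The URIs, property names and threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical descriptions of resources drawn from two different datasets.
DESCRIPTIONS = {
    "a:The_Beatles": {("label", "The Beatles"), ("origin", "Liverpool"), ("formed", "1960")},
    "b:beatles": {("label", "The Beatles"), ("origin", "Liverpool"), ("type", "Group")},
    "a:The_Beach_Boys": {("label", "The Beach Boys"), ("origin", "Hawthorne"), ("formed", "1961")},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_matches(descriptions, threshold):
    """Return pairs of URIs whose descriptions are at least as similar as the threshold."""
    matches = []
    for left, right in combinations(sorted(descriptions), 2):
        score = jaccard(descriptions[left], descriptions[right])
        if score >= threshold:
            matches.append((left, right, score))
    return matches


print(candidate_matches(DESCRIPTIONS, threshold=0.5))
# [('a:The_Beatles', 'b:beatles', 0.5)]
```

Each proposed pair would still need to be confirmed before a link between the two URIs is asserted in the dataset.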

4.24 Further reading

[1] http://musicbrainz.fluidops.net

[2] http://en.wikipedia.org/wiki/Anscombe's_quartet

[3] Brunetti, J.M.; Auer, S.; García, R. The Linked Data Visualization Model.

[4] http://mbostock.github.io/protovis

[5] http://www.kottke.org/08/08/2008-movie-box-office-chart

[6] Google Map API

[7] http://www.wordle.net

[8] http://many-eyes.com

[9] http://sig.ma

[10] http://sindice.com

[11] http://www.fluidops.com/information-workbench

[12] http://musicbrainz.fluidops.net

[13] http://en.lodlive.it

[14] http://lodvisualization.appspot.com

[15] http://lod-cloud.net

[16] http://twitpic.com/17qj1h

[17] http://inkdroid.org/lod-graph

[18] Dadzie, A.-S. and Rowe, M. (2011). Approaches to Visualising Linked Data: A Survey. Semantic Web surveys and applications, 2 (2), pp. 89-124.

[19] https://www.google.com/webmasters/tools

[20] http://www.fluidops.com/ecloudmanager

[21] http://googlewebmastercentral.blogspot.de/2012/07/introducing-structured...

[22] Tran, T., Herzig, D., Ladwig, G. SemSearchPro - Using semantics through the search process.

[23] https://duckduckgo.com

[24] Teevan, J., Dumais, S., Gutt, Z. Challenges for Supporting Faceted Search in Large, Heterogeneous Corpora like the Web.

[25] http://dblp.l3s.de

[26] http://dblp.uni-trier.de/db

[27] http://swoogle.umbc.edu

[28] http://watson.kmi.open.ac.uk

[29] http://swse.deri.org

[30] “R for SPARQL” by Willem Robert van Hage & Tomi Kauppinen

[31] “Performing Statistical Methods on Linked Data” by Zapilko & Mathiak

[32] http://www.r-project.org

[33] http://www.cs.waikato.ac.nz/ml/weka

[34] http://www.cip.ifi.lmu.de/~nickel/iswc2012-slides

4.25 Summary

After studying this chapter you should achieve the following outcomes:

  • An understanding of the processes involved in transforming RDF data for visualization.
  • An understanding of visualization techniques and the types of data to which they can be applied.
  • A knowledge of the types of Linked Data visualization tool currently available.
  • An understanding of the semantic search process and how faceted browsing can be used to navigate search results.
  • A knowledge of tools that can be used to search for semantic data rather than content.
  • An understanding of how RDF data can be aggregated for use in statistical analysis or machine learning.
  • An understanding of how statistical and machine learning techniques can be applied to Linked Data and some of the tools currently available.