Chapter 6: Scaling up Linked Data

6.1 Introduction

In the previous chapters we presented all of the components required to build a Linked Data application. Chapter 2 described how RDF data can be managed and accessed using SPARQL. Chapter 3 illustrated how datasets can be mapped to vocabularies and interlinked, and described mechanisms for providing Linked Data to other applications. Chapter 4 looked at how humans can interact with Linked Data using visualisation, search and browsing tools. Chapter 5 brought all of these components together to describe how a Linked Data application is built and how development frameworks can be used to implement one. All of these were described in the context of the Music application scenario, which incorporates: (i) consuming data from different sources, (ii) mapping, interlinking and cleansing the data to create an integrated dataset, and (iii) giving access to the integrated dataset for human interaction and consumption by other applications.

Figure 1: The Music Application Scenario.

In this chapter we outline current research on how Linked Data applications can be scaled up to deal with very large volumes of Linked Data.

6.2 Learning outcomes

On completing this chapter you should understand the following:

  • What is meant by the term Big Data and the relationship between Big Data and Linked Data.
  • NoSQL databases and how they can be used to manage large volumes of Linked Data.
  • The Hadoop framework and how it can be applied to large scale RDF reasoning.
  • How Linked Data streams, such as sensor data, can be processed.
  • Some current research initiatives related to working with very large datasets.

 

Part 1: Introduction to Big Linked Data

6.3 The 3 Vs of Big Data

Big Data refers to a variety of approaches for managing complex and large volumes of data using non-traditional tools and solutions. The term Big Data is not only used to refer to the size of the dataset. Size only covers one of the three key aspects referred to as the 3 Vs of Big Data: variety, volume and velocity.

Variety indicates that the data comes from a number of sources that may provide structured, semi-structured or unstructured data. A high degree of variety among the data sources poses a challenge of data integration. Semantic Web technologies can help solve this problem. As described in Chapter 3, Linked Data techniques can assist with tasks such as interlinking datasets and extracting data from plain text.

A dataset comprising petabytes of data would be considered to be high volume.

The challenge posed by a high volume of data is being able to reason over and query the whole dataset with a satisfactory level of efficiency. Techniques such as distributed or parallel processing can be used to develop practical approaches to reasoning and querying over high volume datasets. For example, Section 9 of Chapter 5 describes how the Information Workbench, a Linked Data application development framework, can be used to perform federated search across multiple data sources. Later in this chapter we describe how NoSQL databases can be used to query and reason over very large volumes of data.

A dataset that is rapidly changing can be described as having high velocity. Examples would be data streams originating from sensors or other sources. The challenge posed by high velocity data is also reasoning and querying, though with the added difficulty that this may need to be carried out in (close to) real-time to ensure responses to requests are up-to-date. Later in this chapter we describe approaches to working with Linked Data streams.

Figure 2: The 3 Vs of Big Data.

Movie 1: The 3Vs of Big Data.

6.4 The extended Vs of Linked Data

The original three Vs of Big Data can be extended by another three Vs. These are veracity, variability and value. These characterise additional features of Big Data and introduce additional challenges. The veracity (i.e. accuracy) of any part of the overall dataset may be low or uncertain. The dataset may contain variability in which the meaning (e.g. of a particular term) may vary across different contexts. Finally, the value of the overall dataset may be unclear. Even if the data is accurate, it may not be possible to extract the required insights from it.

Unlike the original three Vs (variety, volume and velocity), the additional three Vs (veracity, variability and value) are much harder to measure. The original three Vs can largely be measured objectively, using characteristics such as the size of the dataset and the speed of response to a query. These objective measures can be used to formulate benchmarks against which approaches to addressing the original three Vs can be compared. However, the veracity, variability and value of a dataset, and how they can be measured, may differ across contexts.

These additional three Vs serve as a reminder that dealing with variety, volume and velocity is not enough. The dataset is being consumed and processed in order to meet some end. Diego Basch (VP of Software Engineering at LinkedIn) said:

“Many companies think they have a ‘big data’ problem when they really have a big ‘data problem’.”

Again, what this emphasises is that a key challenge can often be working out how to gain maximum benefit from the data as well as dealing with the first three Vs.

Linked Data and Semantic Web technologies do, though, offer ways of addressing these challenges. For example, mapping terms to vocabularies and interlinking vocabularies can help to address variability. The Gartner report on 2013 technology trends [1] identifies Semantic Technologies as a way of maximally exploiting available information, especially when it has the characteristics of Big Data.

6.5 Big Linked Data and Linked Big Data

Having outlined the properties of Big Data, we can now consider Big Linked Data. Returning to the 3 Vs of Big Data, variety is the characteristic that applies most broadly to Linked Data. RDF provides a simple and flexible data model that can be used to describe varied data sources. In Chapter 3 we saw example tools that can be used to extract and transform data from varied sources. OpenRefine can be used to transform spreadsheet data (Section 3.8.1). R2RML can transform data held in relational databases to RDF (Section 3.8.2). Tools such as DBpedia Spotlight, Zemanta and GATE can be used to extract data from text (Section 3.8.3).

In terms of volume, the Linking Open Data Cloud has grown exponentially (see figure showing the growth from 2007 to 2011). However, this only gives a partial view of the scale of the data being stored and used in Linked Data applications. A number of enterprise applications are built on Linked Data but are held privately and do not feature in the Linking Open Data Cloud. For example, in 2011 it was reported that Franz’s AllegroGraph RDF database had become the first to load one trillion RDF triples [2]. This data store was developed to support applications for a major telecom company.

Figure 3: Growth of the Linking Open Data Cloud from 2007 to 2011.

In terms of velocity, much of the work on Big Linked Data is concerned with using Linked Data principles to manage data streams, including streams of sensor data. Research initiatives into RDF streams and semantic sensors are described later in the chapter.

Big Data and Linked Data can be seen to complement each other in two important ways. First, Big Data techniques can be applied to Linked Data challenges. As described above, there has been a huge growth in Linked Data over the past five years. Techniques developed in the Big Data research community can be adopted to handle large amounts of Linked Data. This adoption of Big Data techniques by the Linked Data community could be described as Big Linked Data. This application of Big Data techniques is particularly appropriate for dealing with high volume and high velocity data.

Second, Linked Data techniques can be applied to Big Data challenges. Big Data can have a high level of variety and Linked Data techniques can be applied to the problem. Linked Data tools can be used to enrich legacy content and improve data discovery and integration. Interlinking datasets using a common format can help to reduce data duplication. This application of Linked Data techniques to Big Data challenges can be described as Linked Big Data.

Part 2: NoSQL Databases for Linked Data

6.6 RDF databases

A number of RDF databases are available for storing RDF datasets. These may be native RDF stores or built on top of a relational database. Examples of RDF databases include OWLIM [3], Virtuoso Universal Server [4], Stardog [5], AllegroGraph [6], Systap Bigdata [7], Jena TDB [8], RDF-3X [9], TripleBit [10], and 4store [11].

Storing the data in RDF allows shared or global URIs to be used as names for resources, allowing interlinking across datasets. The schema is also flexible and extensible. In the case of a relational database, accommodating new data structures may require the addition or modification of database tables. Another advantage of RDF databases is that reasoning can be used to infer additional facts from those explicitly represented in the RDF database. The RDF database may employ a forward-chaining, backward-chaining or hybrid reasoning strategy to infer these facts. This characteristic means that an RDF database, unlike a relational database, can directly perform more of the application or business logic that would usually be carried out independently of data storage (see the discussion of multitier architecture in Section 5.5 of Chapter 5).

RDF databases also have a more expressive query language in the form of SPARQL. This, again, allows more of the logic to be performed in the database layer, rather than the data first being extracted from the store and then processed separately into the required structure.

RDF and SPARQL also adhere to a well-defined set of W3C standards [12, 13], and together they currently provide the only standardised way of storing and accessing graph-structured data. This is not the case with the NoSQL databases that we will consider next. NoSQL databases are an emerging technology without established standards, and as a consequence there can be portability problems across different NoSQL databases.

6.7 NoSQL databases

The term NoSQL was originally used to describe database technologies that did not use SQL, but this definition has since been widened and the term is now generally taken to mean “Not Only SQL”.

NoSQL databases do not adhere to the relational data model. This shift away from the relational model was made in order to address the requirements of the data to be stored and the ways in which it is used. A key requirement that led to the development of NoSQL databases was the need to handle large volumes of data, provide high availability and cope with high query loads. These big datasets are also often distributed across a number of data centres.

As well as requirements related to performance and size, NoSQL databases were also driven by requirements typically met by RDF databases, in having a flexible schema and being able to handle hierarchical and graph structures. NoSQL technologies can be classified into four groups as follows:

  • Key/value stores in which each key is associated with a value. These essentially function as persistent hash tables.
  • Wide-column stores in which each key is associated with many attributes rather than a single value.
  • Document databases in which each key is associated with a structured document. The document associated with the key may itself contain other nested sub-documents.
  • Graph databases in which data is represented as nodes and edges as in the RDF data model.

Figure 4: How data is structured in key/value stores (top left), wide-column stores (top right), document databases (bottom left) and graph databases (bottom right).

These are described in more detail below.

6.7.1 Key/value stores

Key/Value stores provide a very efficient way of doing key/value lookups. Key/Value stores are schema-less. Unlike wide-column stores or document databases, key/value stores associate a key with a relatively simple and unstructured data value. The simplicity and lack of structure means that read and write operations can be carried out with very low latency. Examples of Key/Value stores include Amazon DynamoDB [14], Windows Azure Table Storage [15], Riak [16], Redis [17], MemcacheDB [18] and Voldemort [19].
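
As a rough illustration of this model, the sketch below uses an in-memory Python dictionary as a stand-in for a persistent key/value store such as Redis or Riak; the key names and values are invented for the Music scenario rather than taken from any particular product.

    # Minimal sketch of the key/value model: a persistent hash table in which
    # the value is opaque to the store (illustrative only).
    store = {}

    def put(key, value):
        store[key] = value          # write: the store does not interpret the value

    def get(key):
        return store.get(key)       # read: a single low-latency lookup by key

    put("artist:the-beatles", '{"name": "The Beatles", "origin": "Liverpool"}')
    print(get("artist:the-beatles"))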

6.7.2 Wide-column stores

A Wide-column store associates each key with a set of attributes organised into columns. Each column is associated with a particular type of value such as artist, album or song. Wide column stores can be an efficient way of storing aggregations over data of the type that might be returned by an SQL or SPARQL query. Wide-column stores are also schema-less or have dynamic schema and additional columns can easily be added to the store. A set of columns can also be grouped together into a column family. For example, a number of columns associated with characteristics of a music artist (e.g. name, age, career sales) may be grouped together in a column family. Examples of Wide-column stores include HBase [20] and Cassandra [21].
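
The sketch below illustrates the wide-column model using nested Python dictionaries to stand in for a store such as HBase or Cassandra; the row key, column families and column names are invented for the Music scenario.

    # Each row key maps to one or more column families, and each family holds a set
    # of named columns. Columns can differ from row to row (dynamic schema).
    wide_column_store = {
        "artist:the-beatles": {
            "profile": {                 # column family grouping artist characteristics
                "name": "The Beatles",
                "origin": "Liverpool",
                "career_sales": "600M",
            },
            "albums": {                  # a second column family; columns added as needed
                "1966": "Revolver",
                "1969": "Abbey Road",
            },
        }
    }

    # Reading a single cell: row key -> column family -> column
    print(wide_column_store["artist:the-beatles"]["profile"]["origin"])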

HBase [20] is an open-source wide-column store based on Google’s BigTable. It is built on top of the Hadoop distributed computing framework, which we describe later in the chapter. HBase is horizontally scalable. This means that large tables can be split over different servers in order to balance load, but that the tables still function as a logical whole. HBase also has automatic sharding meaning these subparts of the data table can be automatically distributed across multiple machines.

In terms of performance, HBase has high availability. This is supported by automatic failover, meaning the system can automatically switch to standby or duplicate data stores in order to maintain performance. HBase has strong consistency for read/writes. This means that once data has been written successfully it is immediately accessible by a read operation. There is no delay in which an out-of-date result may be accessed. HBase can also be used via a Java/REST API.

6.7.3 Document databases

Document Databases associate a key with complex data structures called documents. The data therefore has a more complex structure than is found with Key/Value or Wide-column stores. Documents can contain key/value pairs, key/array pairs or even nested structures. Document databases are schema-less and dynamic. New fields can easily be added to the document structure. Typical document formats used are JSON and XML. Examples of document databases include Couchbase [22] and MongoDB [23].

The figure shows an example of how RDF data could be represented as a document in Couchbase. The document about The Beatles contains key/value pairs, such as the origin of the band being Liverpool. The albums of The Beatles are represented as an array of values. There is further nesting of the data, as each album is also represented as an array.

Figure 5: Example documents representing The Beatles and Elvis Presley.
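
As a rough illustration of the document structure described above, the sketch below builds a Beatles document as a Python dictionary mirroring a JSON layout; the field names and album details are illustrative rather than copied from Figure 5.

    # A document stored under a single key: key/value pairs, an array of albums,
    # and a nested structure for each album (illustrative JSON-style layout).
    import json

    beatles_doc = {
        "name": "The Beatles",
        "origin": "Liverpool",
        "albums": [
            {"title": "Revolver", "year": 1966},
            {"title": "Abbey Road", "year": 1969},
        ],
    }

    # In a document database the whole structure would be stored and retrieved
    # under one key; here we simply serialise it to JSON to show its shape.
    print(json.dumps(beatles_doc, indent=2))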

Couchbase is one of the most popular document oriented stores. Documents are represented in JSON. Couchbase has a flexible schema and the structure of the document is easy to change. Couchbase is optimised to run in memory and then eventually stores all of the data to disk in case of failure. This process is called eventual persistence. Couchbase also uses ejection, which is the process of removing data from memory to free up room for frequently used items. The ejected data can be recovered from disk as required.

Couchbase also allows indexes and views to be defined over the data to speed up queries. Couchbase has a number of characteristics to deal with big data. It is highly scalable. The data can be expanded or shrunk across a number of nodes of the server cluster, a process called rebalancing. Couchbase also has replication and failover capabilities, allowing switching between duplicates of the data in order to maintain performance. Couchbase also has a RESTful API.

Movie 2: Introduction to Document databases.

6.7.4 Graph databases

Graph databases are based on the property graph model. They represent data as nodes and relationships between those nodes. In the figure, nodes (e.g. The Beatles) are represented as circles. Relationships are represented as lines connecting the nodes (e.g. created). Properties as key-value pairs can be associated with nodes and possibly relationships within the graph. The homepage and origin of The Beatles are represented as key-value pairs associated with the node representing The Beatles.

Figure 6: The Property Graph Model.

Graph databases provide support for query languages and also core graph-based tasks such as reachability, traversal, adjacency and pattern matching. Examples of graph databases include Neo4j [24], Dex [25], and HyperGraphDB [26].

Neo4j [24] is a graph database that models data as nodes and the relationships between those nodes. A path can be defined as a series of nodes and relationships in the graph. Neo4j also allows properties (i.e. key/value pairs) to be assigned directly to any node or relationship in the graph. Neo4j has a flexible schema and a graph query language called Cypher. Neo4j supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, a set of properties that guarantee that database transactions will be processed reliably. Neo4j provides high reliability and can be deployed as a distributed cluster. It also has RESTful and Java APIs.
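
The sketch below first encodes the property graph of Figure 6 as plain Python data structures and then shows the kind of Cypher query that could traverse it in Neo4j; the labels, property names and relationship type are illustrative rather than taken from a real dataset.

    # Nodes carry properties as key/value pairs; relationships connect node ids
    # and may carry properties of their own (the property graph model).
    nodes = {
        "n1": {"label": "Artist", "name": "The Beatles", "origin": "Liverpool"},
        "n2": {"label": "Album", "title": "Abbey Road"},
    }
    relationships = [
        {"from": "n1", "to": "n2", "type": "CREATED", "year": 1969},
    ]

    # Traversing the Python structures directly: albums created by The Beatles.
    for rel in relationships:
        if rel["type"] == "CREATED" and nodes[rel["from"]]["name"] == "The Beatles":
            print(nodes[rel["to"]]["title"])

    # An illustrative Cypher query expressing the same traversal in Neo4j.
    cypher_query = """
    MATCH (a:Artist {name: 'The Beatles'})-[:CREATED]->(album:Album)
    RETURN album.title
    """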

DEX [25] is a database engine that relies on labeled attribute multigraphs to model linked data. To speed up different graph-based tasks, DEX offers several types of indexing: (i) attributes, (ii) unique attributes, (iii) edges indexed by their neighborhood, and (iv) indexes on neighborhoods. A DEX graph is stored in a single file; values and resource identifiers are mapped by mapping functions, and the maps are modeled as B+-trees. Bitmaps are used to store nodes and edges of a certain type. DEX provides an API to create and manage graph datasets. Different methods are offered to traverse and select graphs: (i) sub-graphs whose nodes or edges are associated with labels that satisfy certain conditions, (ii) explode-based methods that visit the edges of a given node, and (iii) neighbor-based methods that visit the neighbors of a given node.

De Abreu et al. [27] report on an evaluation of three graph databases (Neo4j [24], Dex [25] and HyperGraphDB [26]) and the RDF database RDF-3X [9] for consuming and mining Linked Data (see also [28, 29]). A set of graph-based tasks was devised for the evaluation. HypergraphDB was found to perform poorly across the specified tasks. Neo4j and DEX were found to perform similarly well but with complementary strengths. Neo4j was found to be superior on depth-first and breadth-first traversals of the graph. DEX was found to be superior on adjacency tasks, such as finding the set of nodes adjacent to a specified node or the set of edges between two nodes. This work indicates the importance of evaluating graph databases across a range of tasks in order to understand their respective strengths and weaknesses.

6.7.5 NoSQL databases for RDF: Rya

Rya [30] is a distributed RDF store based on NoSQL technology. Rya is based on Accumulo [31], which is a variant of Google’s BigTable NoSQL store. Rya represents RDF triples using a 3-table index of SPO, POS, and OSP. The figure below represents an example triple in tabular form. This states that a particular Professor earned their degree from a particular university. The table below shows how the triple is stored in the three table indexes SPO, POS, and OSP.

Figure 7: A triple and its storage in the SPO, POS and OSP table indexes (from [30]).

These three tables are sufficient for all triple patterns. For example, if O (i.e. Object) is unknown then it can be retrieved by scanning the SPO table. If only O is known, then S and P can be retrieved by scanning the OSP table.

Figure 8: Triple patterns mapped to table scans (from [30]).

In each table, all parts of the triple are encoded in the index for that row. The rows are sorted in lexicographical ascending order to allow queries to be executed very efficiently.
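
The sketch below illustrates the idea rather than Rya's actual implementation: each triple is written into three sorted indexes, and a triple pattern is answered by a prefix scan over whichever index places the known terms first. The URIs are invented and the separator character is an arbitrary choice.

    from bisect import bisect_left

    SEP = "\x00"  # separator between the triple components inside a row key

    def build_indexes(triples):
        """Store each (s, p, o) triple in three sorted indexes: SPO, POS and OSP."""
        indexes = {"spo": [], "pos": [], "osp": []}
        for s, p, o in triples:
            indexes["spo"].append(SEP.join((s, p, o)))
            indexes["pos"].append(SEP.join((p, o, s)))
            indexes["osp"].append(SEP.join((o, s, p)))
        for rows in indexes.values():
            rows.sort()  # lexicographic order makes prefix (range) scans possible
        return indexes

    def prefix_scan(rows, prefix):
        """Return every row key that starts with the given prefix (a range scan)."""
        i = bisect_left(rows, prefix)
        matches = []
        while i < len(rows) and rows[i].startswith(prefix):
            matches.append(rows[i].split(SEP))  # components come back in that index's order
            i += 1
        return matches

    indexes = build_indexes([
        ("ex:Professor0", "ex:degreeFrom", "ex:University0"),
        ("ex:Professor0", "ex:worksFor", "ex:Department0"),
    ])

    # Pattern (S, P, ?o): only the object is unknown, so scan SPO with the prefix S + P.
    print(prefix_scan(indexes["spo"], SEP.join(("ex:Professor0", "ex:degreeFrom")) + SEP))

    # Pattern (?s, ?p, O): only the object is known, so scan OSP with the prefix O.
    print(prefix_scan(indexes["osp"], "ex:University0" + SEP))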

Rya has a Sesame parser and query engine on top of the Accumulo store. Queries are translated into a series of scans against the SPO, POS and OSP table indexes. Parallel scans allow multiple parts of the SPARQL query to be executed simultaneously and can lead to a 20-fold improvement in response times. Batch scans are used to limit the number of times any part of the data needs to be scanned and allow joins across triples in the query to be performed more efficiently.

Statistics are used to predict the number of results that will be returned by any particular triple pattern. These predictions are used to reorder the parts of the query so that the triple patterns expected to return the fewest results are evaluated first. This reduces the overall amount of processing required to execute the query.
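
A minimal sketch of this kind of reordering is given below; the triple patterns and cardinality estimates are invented and would in practice come from statistics maintained by the store.

    # Reorder the triple patterns of a query so that the pattern expected to return
    # the fewest results is evaluated first (a simple greedy heuristic).
    estimated_counts = {  # illustrative statistics that a store might gather at load time
        ("?s", "rdf:type", "ub:Professor"): 5000,
        ("?s", "ub:degreeFrom", "?u"): 200000,
        ("?s", "ub:name", "?n"): 1000000,
    }

    def reorder(patterns):
        return sorted(patterns, key=lambda p: estimated_counts.get(p, float("inf")))

    print(reorder(list(estimated_counts)))  # the most selective pattern comes first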

When evaluated against the Lehigh University Benchmark (LUBM), there is no degradation in performance with a two- or three-fold increase in the data. However, this is not the case for all queries: performance can degrade, particularly with more complex queries. This indicates that a particular RDF store is not universally superior across all contexts; task characteristics such as query complexity need to be considered when selecting the most appropriate store. This work does, though, indicate the potential benefit of using a NoSQL database for handling high volumes of data. The next section reports on an empirical evaluation of four NoSQL databases.

6.7.6 Evaluation of NoSQL databases for RDF

Cudré-Mauroux et al. [32] carried out an evaluation of four different systems for storing and querying RDF data on top of a NoSQL database. The four systems they compared were HBase, Couchbase, Hive and CumulusRDF.

HBase (see section 6.7.2) is a wide-column store. It uses Jena for SPARQL querying. HBase uses a 3-table index of SPO, POS, and OSP (see section 6.7.5), in which all three elements of the triple are encoded in the index for the table. The SPARQL query is translated into a series of look-ups against the three tables.

Hive is a data warehouse system based on HDFS and Hadoop (see Part 3 of this chapter for a description of Hadoop and HDFS). Hive stores the triples in a Property table. A row of the table is used to represent each unique subject in the SPO triple. A column is used to represent each unique property. The object of the triple is stored in the cell for that particular row and column. Multiple object values for the same subject and property are stored with different timestamps in the same cell.
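
The sketch below illustrates the property-table layout using nested Python dictionaries; the subjects, properties and timestamps are invented rather than taken from Hive itself.

    # One row per unique subject and one column per unique property. Multiple objects
    # for the same subject and property are kept in the same cell under different timestamps.
    property_table = {
        "ex:Beatles": {                                          # row key = subject
            "ex:origin": {1001: "Liverpool"},
            "ex:album": {1001: "Revolver", 1002: "Abbey Road"},  # two timestamped values
        },
        "ex:Elvis": {
            "ex:origin": {1001: "Tupelo"},
        },
    }

    # Reconstructing the triples stored for one subject:
    for prop, cell in property_table["ex:Beatles"].items():
        for timestamp, obj in cell.items():
            print(("ex:Beatles", prop, obj, timestamp))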

CumulusRDF uses the Cassandra [21] wide-column store and Sesame for SPARQL queries. Similarly to Rya [30] and HBase [20] it uses the SPO, POS, and OSP index tables. SPARQL queries are translated into a series of lookups against the three index tables held in the Cassandra store.

Couchbase (see section 6.7.3) is a document database. All triples with the same subject are stored in the same document. JSON arrays are used to structure the associated properties and objects in each document. Jena is used as the SPARQL query engine. Three indexes (termed views) are built for the SPO, POS and OSP triple patterns.

The four systems were tested and compared against the 4store native RDF store [11] using the Berlin SPARQL Benchmark (BSBM) [33]. The benchmark comprises a series of queries performed against datasets containing 10 million, 100 million and 1 billion triples. All experiments were performed using the Amazon Cloud infrastructure. This allowed the measurement of not only query response time but also cost of the calculations. The benchmarks were performed on environments composed of 1, 2, 4, 8 and 16 nodes.

The overall cost in US dollars of running the benchmarks against each system is shown in the table. As can be seen, the native RDF store was significantly cheaper to run than the four NoSQL systems.

Figure 9: Cost in US dollars of running the benchmarks against 100 million triples using a 16-node environment (from [32]).

The figure below shows the performance of each of the five systems for 12 different queries against 100 million triples using a 16-node environment. As can be seen, relative performance differs considerably across the 12 queries, illustrating that the types of queries to be performed need to be considered when selecting a data store. Overall, the execution times for the NoSQL stores are comparable with, and sometimes faster than, the native RDF store. Complex queries, however, performed significantly worse on the NoSQL databases than on the native RDF store, while the NoSQL stores performed well with relatively simple queries. Further work on query optimisation, such as reordering parts of the query based on the number of results each part is predicted to return, can be expected to improve performance for complex queries.

Figure 10: Execution time for each of the five systems for 12 different queries against 100 million triples using a 16-node environment (from [32]).

The MapReduce operations, which are specific to Hive and Couchbase, introduce high latency when updating the view indexes and therefore have a negative impact on query execution. This is to be expected, as MapReduce is intended for batch processing rather than returning results in real time. MapReduce is described in the next section.

Part 3: Hadoop for Linked Data

6.8 Working with distributed data

So far we have looked at how NoSQL databases can support the storage and querying of RDF data. Now we consider a large-scale approach to reasoning. The inference rules predefined in RDFS or OWL specify how new triples can be inferred from those already represented. Current NoSQL systems are unsuitable for large-scale reasoning. However, current research is investigating how Hadoop could be used to infer new triples from existing triples.

Hadoop is an open source implementation of MapReduce, a framework for the distributed batch processing of large volumes of data. The Map phase partitions the input set into a number of distributed subparts. The Reduce phase performs aggregated computations over the individual partitions. Each node is responsible for dealing with one of the partitions. As described in the previous section, MapReduce and Hadoop are not really suitable for operations that require low latency. MapReduce involves tasks such as copying data across nodes, which negatively affects response times. MapReduce is therefore better suited to the offline batch processing of large amounts of data.

Hadoop is generally employed on large clusters of up to 1000 nodes. A core component of Hadoop is its distributed file system, known as HDFS (Hadoop Distributed File System). Each node is responsible for processing its local data and contributing to part of the overall process. Hadoop is suitable for reasoning with large-scale datasets. However there is still a significant challenge in understanding how this can be applied to perform RDF reasoning. The work of Urbani et al [34], described below, investigates the specific problems related to implementing RDF reasoning on top of the MapReduce paradigm and proposes concrete optimizations and solutions.

RDFS reasoning to produce new triples follows the 13 rules shown in the figure. In each rule, one or two antecedents (i.e. existing triples) are used to infer a new triple. The rules with one antecedent are relatively easy to handle as they are applied to individual triples and have no dependency on other triples in the dataset. The rules with two antecedents are more complex, as two triples have to be joined in order to make the inference. For example, rule 9 infers rdf:type statements using the sub-class hierarchy. This rule states that if s is of type x, and x is a subclass of y, then s is of type y. The difficulty in performing such reasoning on a large dataset is that the triples to be joined may be located on different parts of a large cluster. Each node in the cluster only has a partial view of the data available in the system. Therefore, data needs to be exchanged efficiently within the cluster in order that the required reasoning can be carried out.

Figure 11: The RDFS reasoning rules (from [34]).

MapReduce is employed so that the reasoning can be carried out efficiently across the distributed cluster. Map is used to partition the data so that triples matching the antecedents of the same rule are collocated. The Reduce function then performs the join on the antecedent triples in order to infer the new triple.

The mappers translate the triples into a series of key/value pairs. The key indicates the partition to which the data should belong, and the value is the triple itself. The reducer then processes the data within each partition to make the inference. The key, which is used to determine the partition, is the term on which the triples are being joined. In the example shown in the figure, the term C1 is used to index three triples. For two of the triples, C1 is the object: these state that a and b are of type C1. The third triple has C1 as the subject and states that C1 is a subclass of C3. The Map operation assigns the key, allocating these three triples to the same partition. The Reduce operation then infers two new triples, stating that both a and b are of type C3.

Figure 12: Encoding RDFS rule 9 in MapReduce (from [34]).
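
The sketch below mimics the encoding of rule 9 shown in Figure 12 in plain Python rather than actual Hadoop code: the map function emits the join term as the key, the shuffle step collocates triples by key, and the reduce function joins the sub-class triple with the matching type triples.

    from collections import defaultdict

    RDF_TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

    def map_rule9(triple):
        """Emit (key, triple) pairs keyed on the term the join is performed on."""
        s, p, o = triple
        if p == RDF_TYPE:
            yield o, triple          # join on the class appearing as the object
        elif p == SUBCLASS:
            yield s, triple          # join on the class appearing as the subject

    def reduce_rule9(key, triples):
        """Join rdf:type and rdfs:subClassOf triples sharing the key to infer new types."""
        instances = [s for s, p, o in triples if p == RDF_TYPE]
        superclasses = [o for s, p, o in triples if p == SUBCLASS]
        for inst in instances:
            for sup in superclasses:
                yield (inst, RDF_TYPE, sup)

    data = [("ex:a", RDF_TYPE, "ex:C1"),
            ("ex:b", RDF_TYPE, "ex:C1"),
            ("ex:C1", SUBCLASS, "ex:C3")]

    partitions = defaultdict(list)
    for triple in data:
        for key, value in map_rule9(triple):
            partitions[key].append(value)        # shuffle: collocate triples by key

    for key, triples in partitions.items():
        for inferred in reduce_rule9(key, triples):
            print(inferred)    # (ex:a, rdf:type, ex:C3) and (ex:b, rdf:type, ex:C3)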

When evaluated, the system was found to perform sub-optimally. This was caused by the number of duplicate triples produced as a result of applying the set of RDFS rules. The duplicate triples were found to outnumber the unique triples by 50 times. An optimisation method used to reduce the number of duplicates was to hold all schema triples in memory. Schema triples have the predicate rdfs:domain, rdfs:range, rdfs:subPropertyOf, or rdfs:subClassOf. These define the structure rather than the instances of the dataset. Schema triples generally represent a very small proportion of the overall dataset but are included in every inference that requires a join between two triples. It is therefore possible to hold the schema triples in memory and dynamically join these to matching instance triples.
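
The sketch below illustrates this optimisation, again in plain Python rather than Hadoop code: the small set of sub-class schema triples is held in an in-memory dictionary, so each instance triple can be joined against it locally without a reduce-side join. The class names are invented.

    # Schema triples (here just the sub-class hierarchy) held in memory on every node.
    subclass_of = {
        "ex:C1": {"ex:C3"},
        "ex:C3": {"ex:C4"},
    }

    def infer_types(instance_triple):
        """Map-side join: derive rdf:type triples using the in-memory schema."""
        s, p, o = instance_triple
        if p == "rdf:type":
            for superclass in subclass_of.get(o, ()):
                yield (s, "rdf:type", superclass)

    for inferred in infer_types(("ex:a", "rdf:type", "ex:C1")):
        print(inferred)   # (ex:a, rdf:type, ex:C3); full closure needs a further pass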

Ordering the application of the RDFS rules can further optimise performance. Reordering the rules can limit the number of iterations needed for full closure (i.e. to make all possible inferences). An optimal ordering was determined by categorising the rules in terms of the predicate of the triple inferred and the predicates of the triples in the antecedent. The figure shows an optimal ordering, starting from the bottom, that minimizes the number of iterations required. A cross is used to mark seemingly possible dependencies between rules that are in fact not required.

Figure 13: Relations between the RDFS rules (from [34]).

The optimized system was then tested on some large datasets. The figure shows the performance on computing full closure with the DBpedia dataset with differing numbers of nodes. Going beyond 16 nodes did not significantly improve performance. Beyond this point, adding computational power does not significantly speed up the return of results.

Figure 14: Performance on computing full closure with the DBpedia dataset (from [34]).

This significant drop in efficiency once computational power exceeds a particular level is a general problem of Hadoop and other distributed systems. To maintain efficiency, distributed systems need to balance tasks and data across the available nodes. Often this is not the case and some nodes in the cluster may become either overloaded or underused. For example, a node may be responsible for an RDFS rule that is little used or extensively used in the dataset. As the cluster becomes larger it becomes increasingly difficult to ensure optimal loading. There is also the additional overhead of scheduling and the associated transfer of data between nodes in the cluster. Allocating a significant amount of processing power to scheduling may help optimise the load balance but reduces the amount of processing power applied to performing the actual tasks.

Huang et al [35] also employ Hadoop to support reasoning across a large dataset partitioned across multiple machines. Hadoop is used to manage the processing of queries that draw on distributed sources of data. Queries are decomposed into chunks that can be performed independently in parallel. The chunks are then reconstructed using the Hadoop MapReduce framework.

6.9 Lessons learned from large-scale reasoning

Drawing on the above experiments with Hadoop and other related work, Urbani [36] formulates some lessons learned from large scale reasoning in the form of three laws. These are titled as follows:

  • 1st Law: Treat schema triples differently
  • 2nd Law: Data skew dominates data distribution
  • 3rd Law: Certain problems only appear at a very large scale

The first law proposes that schema triples should be treated differently. Schema triples make up a small part of the database but feature in all joins that need to be made when reasoning over the dataset. As proposed in section 6.8, replicating the schema triples across all nodes significantly improves performance.

The second law concerns data skew. As described in section 6.8, in a distributed system data required for the same task is collocated on the same node. Characteristics of the data will therefore affect the load on each node. The level of data exchange between nodes may also vary considerably. Datasets of the same overall size may therefore show markedly different performance depending on their internal characteristics. Dealing with such variations in the data would require a smart strategy for finding the optimal data distribution depending on the characteristics of the dataset.

The third law points out that some problems only appear at a very large scale. A prototype may be tested on a limited proof-of-concept dataset but behave very differently when applied at web scale. A number of workload-specific problems, concerning for example memory management, optimisation and communication between modules, may occur that could not be predicted when working at a small scale.

Part 4: Stream Processing for Linked Data

6.10 Stream processing

The work we have considered so far has mainly been concerned with dealing with large volumes of data. When working with data streams, velocity becomes a particularly important Big Data characteristic. Data streams involve data that is continuously updating or changing at a rapid rate. Examples of data streams include social networking data (such as Facebook or Twitter), traffic data, sensor networks and financial markets.

Stream data cannot be processed by traditional data warehousing approaches in which all data is gathered in a repository and then queried. Instead there is a need to perform continuous queries over transient or changing data. One important reason why traditional approaches are not appropriate is that, in the case of streams, the most recent data is generally the most valuable. In the case of traffic data, the current state of the roads is probably of most interest. Historical data is generally only of use for prediction and modelling applications.

Data streams are observed through windows. A window is a temporal region of the data. The data contained in the window changes as the stream updates. Continuous queries are iteratively run against the continuously changing window, generating a stream of answers. The continuous query can also draw on other sources of data when producing the stream of answers. For example, locations of traffic congestion appearing in the stream might be supplemented with additional geographical information to produce the stream of answers.

Figure 15: Continuous queries against a data stream.
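
The sketch below illustrates the window idea in plain Python: timestamped triples arrive on a stream, a sliding window retains only the most recent timespan, and a continuous query is re-evaluated each time the window is updated. The timings and predicate names are invented.

    from collections import deque

    WINDOW_SECONDS = 40          # timespan covered by the window
    window = deque()             # (timestamp, (s, p, o)) pairs currently in the window

    def update_window(timestamp, triple):
        """Add the new triple and evict anything older than the window timespan."""
        window.append((timestamp, triple))
        while window and window[0][0] <= timestamp - WINDOW_SECONDS:
            window.popleft()

    def continuous_query():
        """Toy continuous query: which cars are currently observed in the window?"""
        return {o for _, (s, p, o) in window if p == "ex:registersCar"}

    stream = [(1, ("ex:toll1", "ex:registersCar", "ex:carA")),
              (30, ("ex:toll2", "ex:registersCar", "ex:carB")),
              (55, ("ex:toll1", "ex:registersCar", "ex:carC"))]

    for timestamp, triple in stream:
        update_window(timestamp, triple)
        print(timestamp, continuous_query())   # the answers change as the window slides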

Movie 3: Introduction to stream processing.

6.11 Linked Stream Data

The Linked Data principles (see Part 2 of Chapter 1) can bring particular benefits to stream data. Sensor data can be enriched with semantics by representing it using HTTP URIs and including links to other vocabularies and datasets. RDF can also provide a common representation for heterogeneous stream sources. Adopting a Linked Data approach can therefore support data discovery and data integration.

Data streams also present a range of challenges to Linked Data. RDF triples representing stream data must be annotated with timestamps. Extensions to the SPARQL query language are then required to deal with windows, continuous queries and stream operators. Semantic queries over the data need to be handled continuously, in contrast to static queries against a relatively constant dataset. Tools for dealing with Linked Stream Data need to cope not only with large volumes of data but also with the high velocity and low latency requirements of sensor data. As we saw in section 6.8, systems like Hadoop and MapReduce are not really suitable for low latency scenarios and are more suited to batch, offline processing. Approximate reasoning, in which soundness or completeness is sacrificed for speed [37], may be used as a strategy to address latency requirements.

6.12 Querying streams with SPARQL extensions

When querying a stream rather than a static dataset, extensions to the SPARQL query language are required. In this section we will look at two approaches to this: C-SPARQL and SPARQLStream. Continuous queries are different from static queries in that new results to the query are returned as new data enters the window that the query is operating on. C-SPARQL and SPARQLStream extend the SPARQL standard and use streaming operators based on CQL (Continuous Query Language) [38].

6.12.1 C-SPARQL

The input to C-SPARQL is a sequence of RDF triples each annotated with a timestamp. Some extensions are required to run SPARQL continuous queries against the data stream. The FROM clause is used in the SPARQL standard to specify a named graph against which the query should be executed. In C-SPARQL, FROM STREAM is used to specify the stream and the window within that stream.

A window extracts the most recent data from the stream. A window may be specified as a Logical Window or Physical Window. A Physical Window is specified in terms of a number of triples (using TRIPLES). A Logical Window is specified in terms of a timespan (using RANGE) and rate at which the window is refreshed (using STEP). For example, the window may be defined with a timespan corresponding to the last 40 seconds of the stream and a refresh rate of every 5 seconds. A logical window may be defined as tumbling or sliding. For a tumbling window, the size of the timespan is the same as the refresh rate (e.g. a window size of 5 seconds refreshed every 5 seconds). If a window is defined as tumbling, then each data triple only appears in one window. For a sliding window, the refresh rate is shorter than the timespan, meaning the same data triple appears in more than one window.

Figure 16: C-SPARQL query extensions.

Another extension in C-SPARQL is the registration of the query either as a stream (using STREAM) or as a continuous query (using QUERY). Only CONSTRUCT and DESCRIBE queries (see Part 1 of Chapter 2) can be registered as streams as these are the queries that return RDF triples which, once annotated with timestamps, can be provided as a stream.

Figure 17: Registration of a stream.

In the example, the query has been registered with the name CarsEnteringInDistricts. The query is a SELECT query and therefore cannot be registered as a stream. The query is run against a stream named <www.uc.eu/tollgates.trdf>. The window covers the last 40 seconds of the stream and is refreshed every 10 seconds. The query returns a list of cars registering at a tollgate as they enter a district.

Figure 18: An example C-SPARQL query (from [39]).
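
A rough reconstruction of the kind of query shown in Figure 18 is given below. It is held in a Python string simply to keep the chapter's examples in one language; the prefix and predicate names are illustrative and the exact query text in [39] differs in its details.

    # Approximate C-SPARQL: a SELECT query registered under a name, run against a
    # stream with a 40-second logical window refreshed every 10 seconds.
    c_sparql_query = """
    REGISTER QUERY CarsEnteringInDistricts AS
    PREFIX ex: <http://example.org/tolls#>
    SELECT DISTINCT ?car ?district
    FROM STREAM <www.uc.eu/tollgates.trdf> [RANGE 40s STEP 10s]
    WHERE {
        ?toll ex:registers ?car .
        ?toll ex:placedIn ?district .
    }
    """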

The figure shows data triples passing through the window of the continuous query. The triple labeled d1 is present in the first window shown on the left and moves further along the next three windows. The triple d1 is no longer included in the fifth window.

Figure 19: Representation of a time window moving across the available data triples (from [40]).

6.12.2 SPARQLStream

SPARQLStream [41] uses a similar approach to C-SPARQL. Again, the data stream is represented as triples annotated with a timestamp. A unit of time is used to specify both the size of the window and the rate at which the window is refreshed.

Figure 20: Definition of SPARQLStream.

The example query retrieves the last 10 minutes of a stream of data and refreshes the query every minute.

Figure 21: An example SPARQLStream query.
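
A rough sketch of such a query is given below, again as a Python string. The window clause follows the general form used by SPARQLStream, but the stream URI, prefix and graph pattern are illustrative and the exact syntax varies between versions of the language.

    # Approximate SPARQLStream: query the last 10 minutes of a sensor stream,
    # re-evaluated every minute (illustrative URIs and predicates).
    sparql_stream_query = """
    PREFIX ex: <http://example.org/sensors#>
    SELECT ?sensor ?value
    FROM NAMED STREAM <http://example.org/sensors.srdf> [NOW - 10 MINUTES SLIDE 1 MINUTE]
    WHERE {
        ?observation ex:observedBy ?sensor ;
                     ex:hasValue ?value .
    }
    """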

6.12.3 Classification of existing systems

C-SPARQL and SPARQLStream are two of the approaches to querying data streams. A larger set of systems, including these two, is compared in the figure below. Each has its own advantages and disadvantages depending on the type of data and the scenario in which it is to be used.

Figure 22: Classification of existing systems (from [40]).

6.13 W3C semantic sensor networks

The Semantic Sensor Network (SSN) ontology [42] was developed by the W3C Semantic Sensor Network Incubator Group. The SSN ontology was developed using the OWL DL dialect of the Web Ontology Language (see Chapter 2 for more coverage of knowledge representation and reasoning with OWL). The ontology is used to semantically describe sensors, sensor networks, observations made by the sensors and sensor outputs. The group also provided recommendations and best practices for applying the SSN ontology to various scenarios dealing with real-time streams from large sensor networks.

Figure 23: Overview of the SSN ontology.

The SSN ontology provides different perspectives on the sensors, the data observed by the sensors, the sensor system and the properties or features that can be sensed.

Figure 24: Perspectives of the SSN ontology.

Part 5: Scaling up even further

In this final part of the chapter we briefly outline some case studies in the successful application of Linked Data to Big Data scenarios.

6.14 A Trillion RDF Triples

As mentioned in section 6.5, Franz’s AllegroGraph RDF database has been used to successfully load 1 trillion triples. This was done during the development of a customer management application for a major telecoms company. As mentioned earlier, the Big Data requirements of enterprise applications can be significantly greater than those of applications using the Linking Open Data Cloud.

Figure 25: Press release on the loading of 1 trillion triples using Franz’s AllegroGraph [2].

6.15 uRiKA appliance

The uRiKA Appliance was developed by YarcData, which is a subsidiary of Cray, the supercomputing company. uRiKA is a Big Data appliance for RDF analytics and graph analytics. It has 1TB of RAM and 8K processors. It provides an in memory RDF database with SPARQL 1.1 support. It is therefore able to process very high volumes of RDF data.

Figure 26: uRiKA Appliance.

6.16 RDFS Reasoning on GPUs

Earlier we looked at how Hadoop and MapReduce (section 6.8) can be used to perform reasoning over distributed RDF data. The approach taken can be described as a shared-nothing architecture: the system is comprised of a set of components or nodes that independently hold and process their own data, and neither processing nor data is shared among the components. Heino and Pan [43] propose an alternative approach in which memory is shared within a massively parallel architecture. The approach makes use of Graphics Processing Units (GPUs), which provide a high level of parallelism by distributing processing among a large number of relatively simple computational units. Therefore, although the approach is similar to Hadoop-based reasoning, it exploits GPUs in a shared memory architecture in which multiple processing units access a shared memory store.

As in the case of Hadoop and MapReduce, reasoning rules that have a single triple as the antecedent are simple to deal with. The challenge is to deal efficiently with rules that require a join to be performed between two triples. As in Hadoop and MapReduce, a reordering of the rules is proposed to minimise the number of times rules need to be applied to compute full closure. The application of each rule is followed by a global synchronisation step in which all inferred triples are written back to shared memory. These are then available to all parallel processing units when executing the subsequent rule.

Figure 27: Rule ordering for RDFS reasoning on GPUs (from [43]).

The efficient implementation of a shared memory architecture with GPUs is complex. The amount of memory to which each processing unit has direct access is very small, and accessing other parts of memory is costly. Synchronisation is therefore a complex and expensive operation.

The approach taken involves assigning a thread to each triple in the dataset and joining that triple, according to the RDFS rule, to a schema triple where possible. Hundreds or thousands of threads work in parallel. As in the case of Hadoop and MapReduce, the approach produces a high number of duplicate triples. The number of duplicates produced differs depending on the RDFS rule; some rules can produce 40,000 duplicates for each unique triple. The number of duplicates can therefore be much higher than the ratio of 50 duplicates per unique triple found with Hadoop and MapReduce. Given the complexity of memory access and transfer on GPUs, duplicate removal can be very challenging and expensive.

In tests, the GPU implementation was found to be 5 times faster than an equivalent CPU implementation. However, once the overhead of duplicate removal was included in the calculation the GPU and CPU performance was similar. RDF reasoning with GPUs is therefore a promising area of research which could provide significant advantages if new methods were developed for handling duplicate removal and minimising costly memory access.

6.17 Benchmarks

In section 6.7.6 we described how the Berlin SPARQL Benchmark (BSBM) [33] was used to compare the performance of four NoSQL databases for RDF against a native RDF store. BSBM is one of the oldest SPARQL and RDF benchmarks. The latest version of BSBM, published in April 2013, includes benchmarks with up to 150 billion triples, 750 times more than in the previous version. This confirms a trend toward testing tools against real-world scenarios involving realistic volumes of data rather than small proof-of-concept scenarios.

The LDBC (Linked Data Benchmark Council) [44] is an industry-neutral non-profit organisation that develops benchmarks for RDF and graph databases. It is similar to the TPC (Transaction Processing Performance Council) [45], which devises benchmarks for the transaction and database performance of systems. As in the case of BSBM, LDBC is offering benchmarks reflecting realistic scenarios that involve big data volumes and complex queries.

Angles et al. [46] propose a benchmark for reasoning over social network data represented as a graph. The benchmark was tested on a range of graph, RDF and relational databases. Reachability queries (such as finding the friends of friends of a given person, or the shortest path between two people in a social network graph) were found to be the most challenging.

XGDBench [47] is a platform designed for benchmarking graph databases on large scale computing systems. XGDBench is designed to operate on a cloud service infrastructure. The platform has been used to evaluate graph databases including AllegroGraph [6] and Neo4j [24].

6.18 Further reading

[1] http://www.gartner.com/newsroom/id/2359715

[2] http://www.franz.com/about/press_room/trillion-triples.lhtml

[3] http://www.ontotext.com/owlim

[4] http://virtuoso.openlinksw.com

[5] http://stardog.com

[6] http://www.franz.com/agraph/allegrograph

[7] http://www.systap.com

[8] http://jena.apache.org/documentation/tdb

[9] Neumann, T. and Weikum, G. (2010). x-rdf-3x: Fast querying, high update rates, and consistency for rdf databases. PVLDB, 3 (1), 256-263.

[10] Yuan, P., Liu, P., Wu, B., Jin, H., Zhang, W. and Liu, L. (2013). TripleBit: a Fast and Compact System for Large Scale RDF Data. PVLDB, 6 (7), 517-528.

[11] http://4store.org

[12] http://www.w3.org/RDF

[13] http://www.w3.org/TR/sparql11-overview

[14] http://aws.amazon.com/dynamodb

[15] http://www.windowsazure.com/en-us/documentation/articles/storage-dotnet-...

[16] http://basho.com/riak

[17] http://redis.io

[18] http://memcachedb.org

[19] http://www.project-voldemort.com/voldemort

[20] http://hbase.apache.org

[21] http://cassandra.apache.org

[22] http://www.couchbase.com

[23] http://www.mongodb.org

[24] http://neo4j.org

[25] http://sparsity-technologies.com/dex.php

[26] http://www.hypergraphdb.org

[27] De Abreu, D., Flores, A., Palma, G., Pestana, V., Piñero, J., Queipo, J., Sánchez, J. and Vidal, M. (2013). Choosing Between Graph Databases and RDF Engines for Consuming and Mining Linked Data. ISWC 2013 Workshop on Consuming Linked Data. Sydney, Australia.

[28] http://www.slideshare.net/maribelacosta/slides-tutorialgraphdatabases-ma...

[29] http://graphium.ldc.usb.ve

[30] Punnoose, R, Crainiceanu, A., Rapp D. (2012). Rya: A Scalable RDF Triple Store for the Clouds. Cloud-I ’12, Istanbul, Turkey.

[31] http://accumulo.apache.org

[32] Cudré-Mauroux, P. Enchev, I., Fundatureanu, S. et al. (2013) NoSQL Databases for RDF: An Empirical Evaluation. International Semantic Web Conference (ISWC 2013), Sydney, Australia.

[33] Bizer, C., Schultz, A. (2009). The Berlin SPARQL Benchmark. International Journal on Semantic Web and Information Systems (IJSWIS), 5 (2), 1–24.

[34] Urbani, J., Kotoulas, S., Oren, E. and van Harmelen, F. (2009). Scalable Distributed Reasoning with MapReduce. International Semantic Web Conference (ISWC 2009), Washington, DC.

[35] Huang, J., Abadi, D. J. and Ren, K. (2011). Scalable SPARQL querying of large RDF graphs. PVLDB, 4 (11), 1123-1134.

[36] Urbani, J. (2013). Three Laws Learned from Web-scale Reasoning. AAAI Symposium on Semantics for Big Data. Arlington, VA.

[37] Rudolph, S., Tserendorj, T., Hitzler, P. (2008). What is approximate reasoning? 2nd International Conference on Web Reasoning and Rule Systems (RR2008), Karlsruhe, Germany.

[38] Arasu, A., Babu, S., and Widom, J. (2003). The CQL continuous query language: Semantic foundations and query execution. Technical report, Stanford University.

[39] Barbieri, D. F., et al.  (2010). Querying RDF streams with C-SPARQL. ACM SIGMOD Record, 39 (1), 20-26.

[40] Balduini, M. et al. (2013). Tutorial on Stream Reasoning for Linked Data. Tutorial at International Semantic Web Conference (ISWC’2013), Sydney, Australia.

[41] Calbimonte, J-P. and Corcho, O. (2013). SPARQLStream: Ontology-based access to data streams. Tutorial at International Semantic Web Conference (ISWC’2013), Sydney, Australia

[42] http://www.w3.org/2005/Incubator/ssn/ssnx/ssn

[43] Heino, N. and Pan, J. (2012). RDFS Reasoning on Massively Parallel Hardware. International Semantic Web Conference, Boston, MA, USA.

[44] http://www.ldbc.eu

[45] http://www.tpc.org

[46] Angles, R., Prat-Perez, A., Dominguez-Sal, D. and Larriba-Pey, J.-L. (2013). Benchmarking database systems for social network applications. International Workshop on Graph Data Management Experience and Systems (GRADES 2013), New York, NY, USA.

[47] Dayarathna, M. and Suzumura, T. XGDBench: A Benchmarking Platform for Graph Stores in Exascale Clouds. International Conference on Cloud Computing Technology and Science (CloudCom), Taipei, Taiwan.

6.19 Summary

After studying this chapter you should achieve the following outcomes:

  • An understanding of what is meant by Big Data and how it can be characterised in terms of variety, volume and velocity.
  • An appreciation of the complexity introduced by the extended characteristics of veracity, variability and value.
  • An understanding of both how Linked Data can help to address Big Data challenges and how Big Data introduces new challenges for Linked Data.
  • Knowledge of the different types of NoSQL database and how they aim to address the challenges of Big Data.
  • An understanding of how NoSQL databases for RDF perform relative to native RDF stores.
  • An understanding of the additional challenges presented by reasoning over RDF data on a large scale, how they have been addressed and on-going research issues.
  • Knowledge of how stream data can be represented and queried using extensions to the SPARQL query language.
  • An appreciation of current research initiatives concerned with handling Big Linked Data.
  • Knowledge of current benchmarking schemes for the evaluation of Big Linked Data.