The goal of this tutorial is to introduce, motivate and detail techniques for integrating heterogeneous structured data from across the Web. Inspired by the growth in Linked Data publishing, our tutorial aims at educating Web researchers and practitioners about this new publishing paradigm. The tutorial will show how Linked Data enables uniform access, parsing and interpretation of data, and how this novel wealth of structured data can potentially be exploited for creating new applications or enhancing existing ones.

As such, the tutorial will focus on Linked Data publishing and related Semantic Web technologies and standards, introducing scalable techniques for crawling, indexing and automatically integrating structured heterogeneous Web data through reasoning.


Introduction to RDF and Linked Data

The first session gives an overview of RDF and Linked Data publishing. We will discuss the RDF data model and Linked Data principles for publishing RDF data on the Web. In particular, this session will cover:

Scalable Linked Data Crawling

This session gives an overview of the state of the art in efficient data-retrieval techniques, including novel challenges and techniques for crawling Linked Data from the Web. We will present the architecture of a crawler for small to medium-sized datasets in the range of up to several hundred million triples. In particular, this session will cover:

Scalable RDF Indexing Techniques

This session presents scalable techniques for indexing and querying local repositories of Linked Data. We will discuss the standardised SPARQL query language and thereafter discuss the state of the art in RDF storage with respect to research, directions and applications. In particular, this session will cover:

Reasoning: Motivation and Overview

This session gives an introduction to the RDFS and OWL (2) standards and to rule-based reasoning, with heavy emphasis on motivating reasoning for the Linked Data use-case and for integrating heterogeneous data from a large number of diverse sources. We also introduce algorithms which incorporate information about the provenance of data during reasoning to ensure robustness in the face of noisy or impudent remote data. In particular, this session will cover:

Scalable Distributed Reasoning over MapReduce

This session presents scalable distributed reasoning using the MapReduce distribution framework, enabling high performance over a cluster of commodity hardware. This session details the MapReduce framework (employed by Google and Yahoo, among others) and the award-winning WebPIE system which integrates optimised execution strategies for rules supporting a (pragmatic) fragment of OWL semantics.
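
To give a flavour of how a single inference rule can be phrased as a MapReduce job, the following is a minimal single-process sketch in Python. It is not WebPIE's implementation. It assumes only the standard RDFS subclass rule (if C is a subclass of D and x has type C, then x has type D), assumes the schema triples are small enough to be held in memory by every mapper, and simulates the shuffle step with a dictionary. The function names and the ex: identifiers are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def map_phase(triple, superclasses):
    """Map: for one instance triple, emit (subject, inferred class) pairs."""
    s, p, o = triple
    if p == RDF_TYPE:
        for parent in superclasses.get(o, ()):
            yield (s, parent)


def reduce_phase(subject, classes):
    """Reduce: drop duplicate inferences for one subject and emit triples."""
    for cls in sorted(set(classes)):
        yield (subject, RDF_TYPE, cls)


def run_job(triples):
    """Run one map/shuffle/reduce pass; return only newly derived triples."""
    triples = set(triples)

    # Assumption: schema triples fit in memory and are shared with all mappers.
    superclasses = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            superclasses[s].add(o)

    # Simulated shuffle: group the mapper output by key.
    groups = defaultdict(list)
    for triple in triples:
        for key, value in map_phase(triple, superclasses):
            groups[key].append(value)

    derived = set()
    for key, values in groups.items():
        derived.update(reduce_phase(key, values))
    return derived - triples


def closure(triples):
    """Repeat the job until a pass derives nothing new (a fixpoint)."""
    known = set(triples)
    while True:
        new = run_job(known)
        if not new:
            return known
        known |= new


if __name__ == "__main__":
    data = [
        ("ex:Cat", SUBCLASS_OF, "ex:Mammal"),
        ("ex:Mammal", SUBCLASS_OF, "ex:Animal"),
        ("ex:tom", RDF_TYPE, "ex:Cat"),
    ]
    for t in sorted(closure(data)):
        print(t)
```

In this sketch each pass climbs one level of the class hierarchy, so deeper hierarchies need more passes before the fixpoint is reached. Reducing the number of such passes is one example of the kind of execution strategy a real system would need to optimise.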

Implementing a LarKC Workflow

This session allows attendees to get hands-on with building scalable Linked Data applications. Some of the technologies presented in the previous sessions will be put together using a scalable workflow engine tailored for Linked Data: the Large Knowledge Collider (LarKC).


Session 1: Introduction to Linked Data

Session 2: Integrating Web Data with Reasoning

Session 3: Distributed Reasoning: Because Size Matters

Session 4: Putting Things Together (LarKC)