Friday, 3:30–5:00 PM
Chair: Jun Zhao
Mind the Data Skew: Distributed Inferencing by Speeddating in Elastic Regions
Spyros Kotoulas, Eyal Oren, Frank van Harmelen
Semantic Web data exhibits very skewed frequency distributions among terms. Efficient large-scale distributed reasoning methods should maintain load balance in the face of such a highly skewed distribution of input data. We show that term-based partitioning, used by most distributed reasoning approaches, has limited scalability due to load-balancing problems. We address this problem with a method for data distribution based on clustering in elastic regions. Instead of being assigned to fixed peers, data flows semi-randomly through the network. Data items “speed-date” while being temporarily collocated in the same peer. We introduce a bias in the routing to allow semantically clustered neighborhoods to emerge. Our approach is self-organising, efficient and does not require any central coordination. We have implemented this method on the MaRVIN platform and have performed experiments on large real-world datasets, using a cluster of up to 64 nodes. We compute the RDFS closure over different datasets and show that our clustering algorithm drastically reduces computation time, calculating the RDFS closure of 200 million triples in 7.2 minutes.
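The load-balancing problem described in this abstract can be illustrated with a small sketch. It is not the MaRVIN implementation: the synthetic dataset, the choice to hash only the predicate, and all function names are assumptions made for illustration, and round-robin placement is used only as a stand-in for a term-independent assignment, not for the speed-dating method itself.

```python
import zlib
from collections import Counter


def make_skewed_triples(n_predicates=10, top_count=1000):
    """Build synthetic triples whose predicate frequencies fall off as 1/rank."""
    triples = []
    for rank in range(1, n_predicates + 1):
        for i in range(top_count // rank):
            triples.append((f"s{rank}_{i}", f"p{rank}", f"o{rank}_{i}"))
    return triples


def term_partition(triples, n_peers):
    """Send each triple to the peer that owns its predicate (hash of the term)."""
    load = Counter()
    for _, predicate, _ in triples:
        load[zlib.crc32(predicate.encode()) % n_peers] += 1
    return [load[peer] for peer in range(n_peers)]


def round_robin_partition(triples, n_peers):
    """Spread triples over peers regardless of their terms."""
    load = [0] * n_peers
    for i in range(len(triples)):
        load[i % n_peers] += 1
    return load


def imbalance(load):
    """Ratio of the busiest peer's load to the mean load (1.0 is perfectly balanced)."""
    return max(load) / (sum(load) / len(load))
```

Calling `imbalance(term_partition(make_skewed_triples(), 4))` gives a ratio well above 1, because the peer that owns the most frequent predicate receives every triple that uses it, while the round-robin ratio stays close to 1.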
Data Summaries for On-demand Queries over Linked Data
Andreas Harth, Katja Hose, Marcel Karnstedt, Axel Polleres, Kai-Uwe Sattler, Jürgen Umbrich
Typical approaches for search and querying over structured Web data collect (crawl) and pre-process (index) large amounts of data in a central data warehouse before allowing for query answering. This time-consuming pre-processing phase decreases the freshness of query results and exploits only to a limited degree the benefits of Linked Data, where structured data is accessible live and up to date at distributed Web resources that may change constantly. An ideal query answering system for Linked Data should always return current answers in a reasonable amount of time, even on corpora as large as the Web. Query processors that evaluate queries directly on the live sources require knowledge of the contents of data sources. In this paper we develop and evaluate a probabilistic index structure covering the graph-structured content of sources adhering to Linked Data principles, provide an algorithm for answering conjunctive queries over Linked Data on the Web exploiting this structure, and evaluate the system using synthetically generated queries. We find that our lightweight index structure enables more complete query results over Linked Data compared to direct lookup approaches, while keeping the overhead for additional lookups and index maintenance low.
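The idea of selecting live sources through a probabilistic summary can be sketched as follows. The abstract does not specify the index structure, so this sketch swaps in a plain Bloom filter per source as a simple stand-in; the class and function names, the filter parameters, and the example.org URLs in the usage note are all hypothetical.

```python
import hashlib


class SourceSummary:
    """Bloom-filter summary of the terms found in one source (illustrative stand-in)."""

    def __init__(self, n_bits=256, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # the bit array, stored in a Python int

    def _positions(self, term):
        # Derive n_hashes bit positions from salted SHA-256 digests of the term.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def build_index(sources):
    """Map each source URL to a summary of all terms in its triples."""
    index = {}
    for url, triples in sources.items():
        summary = SourceSummary()
        for triple in triples:
            for term in triple:
                summary.add(term)
        index[url] = summary
    return index


def select_sources(index, pattern):
    """Return the sources that may match a triple pattern; None marks a variable."""
    constants = [term for term in pattern if term is not None]
    return sorted(
        url
        for url, summary in index.items()
        if all(summary.might_contain(term) for term in constants)
    )
```

For a conjunctive query, `select_sources` would be called once per triple pattern, for example `select_sources(index, (None, "foaf:knows", "ex:bob"))`, and only the returned sources would be fetched live. A source that really holds a matching triple is never missed, but the result may include sources that turn out to be irrelevant.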
Identification and Disambiguation of Graph-structured Concepts for Enterprise Search
Falk Brauer, Michael Huber, Gregor Hackenbroich, Ulf Leser, Felix Naumann, Wojciech Barczyński
Enterprise Search (ES) is a major challenge for a number of reasons, the most important being the high level of ambiguity and the implicitly addressed concepts in query and document terms. What most distinguishes ES from ordinary search problems is the existence of graph-structured enterprise data (ontologies) that describe the concepts of interest and their relationships to each other. We present a method to leverage this type of information to improve the quality of query answers. Our method identifies concepts from the enterprise ontology in the query and in the corpus. To this end, we propose a ranking scheme for top-k ontology subgraphs on top of approximately matched token q-grams between text and ontology. The ranking scheme leverages the graph structure of the ontology to identify and disambiguate concepts that are not explicitly mentioned. It improves on previous solutions by using a fine-grained ranking function that leverages relevance ratings derived from the enterprise data and a confidence rating that takes into account constituent match situations derived from the document. Query- and document-specific subgraphs are used for ranking documents based on the similarity of those subgraphs. This method is able to capture much more of the semantics of queries and documents than previous techniques. We support this claim with an evaluation of our method on three real-life document sets and two knowledge bases.
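The approximate q-gram matching step mentioned in this abstract can be sketched in a few lines. This covers only the token-to-label matching, not the subgraph ranking; the padding scheme, the Dice coefficient as the similarity measure, and the function names are assumptions chosen for illustration rather than the ranking function used in the paper.

```python
def qgrams(token, q=3):
    """Return the set of character q-grams of a token, padded at both ends."""
    padded = "#" * (q - 1) + token.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def dice(a, b, q=3):
    """Dice coefficient over the q-gram sets of two tokens (1.0 for identical tokens)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def top_k_concepts(token, labels, k=2):
    """Rank ontology labels by q-gram similarity to a text token and keep the best k."""
    scored = sorted(((dice(token, label), label) for label in labels), reverse=True)
    return [label for score, label in scored[:k] if score > 0]
```

With the hypothetical labels `["invoice", "inventory", "customer"]`, the misspelled token `"invoce"` shares six trigrams with `invoice` and only three with `inventory`, so `top_k_concepts("invoce", labels, k=1)` selects `invoice`. A full system along the lines of the abstract would then use the ontology graph to disambiguate among such candidates.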