Dynamic Planning for Link Discovery
Author(s): Kleanthi Georgala, Daniel Obraczka, Axel-Cyrille Ngonga Ngomo
Full text: submitted version
Abstract: With the growth of the number and the size of RDF datasets comes an
increasing need for scalable solutions to support the linking of resources. Most
Link Discovery frameworks rely on complex link specifications for this purpose.
We address the scalability of the execution of link specifications by presenting the
first dynamic planning approach for Link Discovery dubbed Condor. In contrast
to the state of the art, Condor can re-evaluate and reshape execution plans for
link specifications during their execution. Thus, it achieves significantly better
runtimes than existing planning solutions while retaining an F-measure of 100%.
We quantify our improvement by evaluating our approach on 7 datasets and 700
link specifications. Our results suggest that Condor is up to 2 orders of magnitude
faster than the state of the art and requires less than 0.1% of the total runtime of
a given specification to generate the corresponding plan.
Keywords: Link Discovery; Semantic Web; Linked Data; Planning; Scalability
Review 1 (by Paul Groth)
(RELEVANCE TO ESWC) Link discovery is a key part of creating the web of data. This is a core consideration of the semantic web. (NOVELTY OF THE PROPOSED SOLUTION) The proposed solution is novel in terms of looking at the planning problem for link discovery. It is unclear whether this stresses existing planning algorithms. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The solution is well described and appears to be correct. (EVALUATION OF THE STATE-OF-THE-ART) The state of the art with respect to link discovery is well contextualized. While the authors briefly mention that their inspiration was planning approaches from the area of databases, I wish there was more here. It's hard to determine how much of this is novel with respect to planning algorithms or just an application of existing planning algorithms. How different is this planning problem to existing DB or AI planning problems? (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper provides a precise definition of link specifications and the planning operators. The algorithm for planning is well specified and clear. The complexity of the planning is briefly explored. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental data is provided as well as how the experimental data was extracted. The experiment appears reproducible although does rely heavily on the authors prior tools for link discovery. Overall, an excellent evaluation agains multiple standard datasets and the introduction of a new dataset to explore scalability. The results are well described and clearly beat the state-of the art. (OVERALL SCORE) * Summary of the Paper This paper explores the efficient execution of link discovery algorithms by exploiting knowledge about link specifications and their runtime properties to better plan the execution of link discovery. The approach outperforms current link discovery systems in terms of execution. * Strong Points (SPs) - Interesting exploration of dynamic planning for the execution of link discovery - Better than state of the art results - Really well written * Weak Points (WPs) - Too quickly glances over other planning literature - A large reliance on the author's existing infrastructure * Questions to the Authors (QAs) - Can you clarify what the value is in "the runtime value of a plan for a complex LS additionally includes a value for the filtering or set operations, wherever present."? Where does this come from? - How different is this problem from existing planning problems? Can you clarify is this challenging planning problem in general or is it just the case that no one has done this for link specifications. Comments after rebuttal: I'm keeping my scores the same. Yes I agree that DB planning is related but there's also the AI planning literature which looks at plans with less or more knowledge of the world (i.e. data). It's worth to have a look at.
Review 2 (by Danh Le Phuoc)
(RELEVANCE TO ESWC) The contributions of the paper is highly related to ESWC. (NOVELTY OF THE PROPOSED SOLUTION) The paper propose the first dynamic planner of link discovery between knowledge bases published using Linked Data Principles. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The problem and solution is nicely formulated and explained with provided algorithms. (EVALUATION OF THE STATE-OF-THE-ART) The paper gives a short analysis of state of the arts. An evaluation over two other systems developed by one of the author is given. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The good level of technical details is given. Example and illustrations are sufficient to follow the complicated technical details. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The system is not accessible, hence, evaluation is not reproducible. (OVERALL SCORE) The paper addresses the scalability of the execution of link specifications by presenting the A dynamic planning approach for Link Discovery which is claimed the first of its kind. The advantage of the system built by this approach is that it can re-evaluate and reshape execution plans for link specifications during their execution. The paper provides a nice formalisation and algorithm of the problem and given solution. The evaluation results look quite impressive. ***Strong Points*** 1) The paper proposes an interesting and innovative solution for dynamic planing of link discovery. The solution is nice formulated and explained with provided algorithms 2) The paper carried evaluations on a lot of data settings with several experiments. 3) The content is interesting to reading, the storyline is consistent and easy to follow. ***Weak Points*** 1) The evaluation was conducted against two algorithms of the systems built by one of the author, it seems this is the internal competition. This will create doubts to reader as this is an heuristic approach. 2) The implementation is not accessible, so, evaluation is not reproducible. The output of experiments are not much useful here. 3) In section 3.2 authors mentioned that they use linear model in equation(1) for approximating runtime as experiments in their previous work and  found that it’s sufficient approximation. I wonder it is the case given the data is in graph structure and uniform distributions are not universal in this settings? It seems authors agree with my doubt as their future work aims to study more complicate regression model. Q&A 3) Please clarify the question in Weak Point no. 3)
Review 3 (by Tomi Kauppinen)
(RELEVANCE TO ESWC) Indeed Link Discovery is one of the key tasks - and challenges - for creating useful linked data sets. The authors address this by improving performance of executing link specifications. (NOVELTY OF THE PROPOSED SOLUTION) Authors state that they "build upon these solutions and tackle the problem of executing link specifications efficiently.", referring to existing algorithms and continue about their innovation: "it has a significant effect on the performance of LD systems as shown by our evaluation". (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The theory behind the approach is well described, including a walk through of the steps. There is an evaluation with an experimental setup providing specific questions (Q1-Q3) on the performance. (EVALUATION OF THE STATE-OF-THE-ART) The related work and their relevance to different aspects of the work is discussed in depth. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The algorithms and approach are described very well given their complexity. The evaluation reveals performance characteristics of the approach in detail. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The paper support reproducibility in following ways: - the algorithm is explained in detail. There is, however, no implementation provided as open source that would facilitate rerunning of the experiments - "The new data and a description of how they were constructed are available..." online (OVERALL SCORE) Summary of the Paper Introduction explained the research challenge and the contributions very well, and also opened results. Strong Points (SPs) This is a in a way very theoretical paper, yet it manages to describe the practical problems and also through an experimental setting communicate the value (in terms of improving performance) Weak Points (WPs) With link discovery algorithms it is vital to support intuition. Now in the paper there could be a few convincing examples of linkages, and discussion of challenges, accordingly. Questions to the Authors (QAs) - are you planning to release the implementation of the algorithms as open source? This can substantially support both reproducibility and further development of link specification execution machinery. - can you provide examples of linkages between the datasets to support intuition?
Review 4 (by Wouter Beek)
(RELEVANCE TO ESWC) Link Discovery is an important topic for the Semantic Web, where links between datasets are used to provide context, which spurs reuse. The specific topic of _efficient_ Link Discovery is also important, because more efficient approaches are needed in order to link the increasing number of Semantic Web datasets. (NOVELTY OF THE PROPOSED SOLUTION) The novelty of this paper is relatively simple: apply replanning to Link Discovery. This novel approach makes sense, because executing a Link Discovery algorithm is a very good way of determining the efficiency of each of its component steps. (CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is explained clearly, and results are proven to be complete for multiple datasets in the empirical evaluation. (EVALUATION OF THE STATE-OF-THE-ART) The evaluation is performed against SotA tools and over established benchmark datasets. In addition, the evaluation is run over three new datasets in order to tests the scalability properties of the presented approach. (DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The Related Work presents existing SotA systems that are also used in the Evaluation section. The example run is helpful, since the 2p pseudo-code would otherwise be a bit difficult to get through as a reader. The conclusion/discussion part is very brief, but this makes sense for this kind of paper, where the research questions are almost directly answered by the evaluation results themselves. (REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The datasets and evaluation results are shared online, but the implementation code does not to be shared? Even without the implementation code, the pseudo-algorithm -- together with the example run described in the text -- should be sufficient to allow reproducibility (although having the code available as well would still be nice). (OVERALL SCORE) This paper applies dynamic planning to the task of Link Discovery. The idea is simple (but this is a good thing), and the evaluation results show that the proposed combination results in far better results, especially for relatively complex link specifications.
Metareview by Hala Skaf
The paper proposes the first dynamic planner of link discovery between knowledge bases published using Linked Data Principles. The paper is relevant to ESWC. Reviewers agree that the paper is well written, the approach is well described, algorithms and related works are discussed in depth. The authors answer questions of reviewers related to linear model, examples and the reproducibility by providing the URL of the project. Although the evaluation was conducted against two algorithms of the systems built by one of the author, and some related works from AI planning literature are missed, the contribution of the submission satisfies the requirements of the ESWC.