# Paper 59 (Research track)

DRAGON: Decision Tree Learning for Link Discovery

Author(s): Daniel Obraczka, Axel-Cyrille Ngonga Ngomo

Full text: submitted version

Abstract: The provision of links across RDF knowledge bases is regarded as fundamental to ensure that they can be used in real-world applications. The growth of knowledge bases both with respect to their number and size demands the development of time-efficient and accurate approaches for the computation of such links. In this work we present DRAGON, a fast decision-tree-based approach that is both efficient and accurate. Our approach is based on the insight that the semantics of decision trees is equivalent to that of link specifications, which are commonly used to compute links across knowledge bases. We use this insight and the characteristics of link specifications to derive a decision-tree learning algorithm dedicated to discovering link specifications. We evaluate DRAGON by comparing it with state-of-the-art link discovery approaches as well as the common decision-tree-learning approach J48. Our results suggest that our approach achieves state-of-the-art performance with regard to F-measure while being up to 6 times faster than existing algorithms for link discovery on RDF knowledge bases.
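The equivalence between decision trees and link specifications that the abstract refers to can be illustrated with a minimal sketch. Everything below (the measure names, thresholds, tree encoding, and helper function) is invented for illustration and is not the paper's implementation: each root-to-leaf path ending in a +1 leaf yields a conjunction of similarity conditions, and the link specification is the disjunction of those conjunctions.

```python
# Hypothetical sketch: reading a binary decision tree over similarity
# measures as a link specification (the disjunction of accepting paths).
# All names, thresholds, and the tree itself are invented for illustration.

def tree_to_ls(node, path=()):
    """Collect the conjunctive condition of every path ending in a +1 leaf."""
    if node["kind"] == "leaf":
        return [list(path)] if node["label"] == +1 else []
    cond = (node["measure"], node["threshold"])
    left = tree_to_ls(node["left"], path + ((cond, ">="),))   # split holds
    right = tree_to_ls(node["right"], path + ((cond, "<"),))  # split fails
    return left + right

# A toy tree: split on jaccard over names, then qgrams over labels.
tree = {
    "kind": "split", "measure": "jaccard(name,name)", "threshold": 0.4,
    "left": {
        "kind": "split", "measure": "qgrams(label,label)", "threshold": 0.7,
        "left": {"kind": "leaf", "label": +1},
        "right": {"kind": "leaf", "label": -1},
    },
    "right": {"kind": "leaf", "label": -1},
}

paths = tree_to_ls(tree)
# Each accepting path is a conjunction; the link specification is their disjunction.
ls = " OR ".join(
    "(" + " AND ".join(f"{m} {op} {t}" for ((m, t), op) in p) + ")"
    for p in paths
)
print(ls)  # (jaccard(name,name) >= 0.4 AND qgrams(label,label) >= 0.7)
```

The converse direction (building a tree from a given specification) is what the paper's Proposition 1 concerns; this sketch only shows why reading a tree as a specification is straightforward.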

Decision: reject

Review 1 (by anonymous reviewer)

(RELEVANCE TO ESWC) The problem tackled in the paper, entity resolution as link discovery, is in the scope of the conference. The forms of semantics exploited are not necessarily related to the context.
(NOVELTY OF THE PROPOSED SOLUTION) The idea is very much in the line of other approaches by the same authors: casting the link discovery as a classification problem.
The novelty is essentially in a clever application of yet another ML algorithm to this problem.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution appears to be sound and sufficiently developed for a conference paper. There is still margin for further extensions following the line of these methods.
The very general problem stated in Def. 2 regards a generic relationship R, but the measures that are taken into account are similarity measures, which seem to restrict the domain of applicability of the link discovery approach.
The entire feature selection part (the choice of measures and their fine-tuning) and how the measures relate to the underlying KBs is only partly covered, while it appears to be decisive for the performance of any ML algorithm that would employ them.
(EVALUATION OF THE STATE-OF-THE-ART) The related work section is rather short and a bit too self-referential (see also the Bibliography section).
I would have liked to see a discussion in the context of more general (instance) matching solutions exploiting the semantics of the underlying ontologies.
I’m not sure that the evaluation covers a comparison with possible SotA alternatives.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Borderline:
The discussion is quite essential and can be improved;
The improvement in efficiency seems a by-product of the choice of the particular classifier compared to the referenced own previous systems.
It would be interesting to find harder datasets/problems to be solved: it seems to me that most of the difficulty lies in the feature selection problem rather than in the algorithm for combining the features into a classifier (be it a DT, a set of rules, etc.).
This part seems to be given [or pre-computed], while the proposed algorithm just focuses on ways to determine the split points.
Classifiers are assumed to have available binary information about the links: it would be interesting to take into account information about their degree of uncertainty or cases related to an open-world setting where new info may be made available over time.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiment is in line with previous papers of the same authors on related methods.
As mentioned, I’m not sure about the reasons for other systems to be ruled out.
Also, the choice of the datasets seems to be due to the availability of results with the other systems claimed to be SotA by the same authors.
The only “outsider” is J4.8, a vanilla implementation of C4.5 whose setup was left at the standard values and could instead be tuned to improve at least its efficiency.
More details would be required, for example, regarding the distribution of the training mappings w.r.t. the two classes (and the formation of the test sets in the cross validation).
It would be interesting to investigate a principled way for the fine tuning of the underlying parameters.
Code not available?
== [ack: answered in the rebuttal]
(OVERALL SCORE) The problem tackled by this work is the construction of classifiers that predict the linkage between resources of two KBs. The problem is cast as a case of induction of simple binary decision trees with numeric features in the inner nodes. The authors show how each such tree corresponds to a sort of declarative description accounting for the mappings predicted by the tree.
The paper proposes the usage of two measures to guide the tree construction process.
An experiment on various datasets tackles the problem of finding a good value for a key parameter and then shows some advance in terms of effectiveness and efficiency with respect to other in-house systems and a basic Java implementation of C4.5.
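The two tree-guiding measures the review refers to (Gini index and global F-measure) can be sketched for the Gini case with a textbook split score; this is a standard formulation and not necessarily the paper's exact one, and all the data and thresholds below are invented:

```python
# Hedged sketch of Gini-based split scoring over similarity features,
# a standard technique; not DRAGON's exact formulation.

def gini(labels):
    """Gini impurity of a set of +1/-1 link labels."""
    if not labels:
        return 0.0
    p = sum(1 for y in labels if y == +1) / len(labels)
    return 2 * p * (1 - p)

def split_score(pairs, threshold):
    """Weighted impurity after splitting on similarity >= threshold.
    `pairs` is a list of (similarity, label); lower score = better split."""
    left = [y for s, y in pairs if s >= threshold]
    right = [y for s, y in pairs if s < threshold]
    n = len(pairs)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pairs = [(0.9, +1), (0.8, +1), (0.3, -1), (0.2, -1)]
print(split_score(pairs, 0.5))  # perfectly separating split -> 0.0
```

The global F-measure criterion instead evaluates each candidate split by the F-measure of the whole tree it would produce, which is precisely the aspect the reviewer finds too informally described in Sect. 5.3.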
Strong Points (SPs)
- study of the correspondence between the class of declarative description of the linking rules and the class of proposed binary trees
- clarity of writing and organization
- replicability of the experiments
Weak Points (WPs)
- complementary forms of semantics used, mostly based on the measures that are selected and re-used; links based on mere similarity
- the discussion on the significance of the results is weaker than the quality of the rest of the paper
- quite self-referential as regards related works, references, experiment.
Questions to the Authors (QAs)
Q1 How can the KB semantics be exploited in the link discovery process?
Q2 Do all compared methods and those discussed adopt the same class of features?
Q3 Is there any form of stratification used in the preparation of the training and test sets for the cross-validation?
Q4 From the examples it seems that only datatype properties are taken into account: doesn’t that undermine the generality of the approach?
Q5 Are pairs with both (opposite) classifications (s,t,+1) (t,s,-1) ruled out beforehand?
Q6 Is there any form of stop condition in the growth based on some notion of purity?
Q7 Why was the system Coala not considered in the experiment?
Section 3 on related works is quite short.
The usefulness of the correspondence of DTs with link specifications based on the proposed features would deserve a more in-depth discussion. A convincing justification for the study of the DT-LS equivalence in sect. 4 would be advisable.
It is well known that rules can be derived from DTs by algorithms like C4.5Rules, hence the correspondence that was proved is not a surprising result.
Some of the presented algorithms (1-2) are structurally quite simple and in the line of classical tree-induction ones: space can be saved for required extensions.
By the way, is there really nothing to do when the leading condition does not hold (else-branches)?
A bottleneck of the supervised methods is certainly the availability of training mappings.
So the construction of TS is a problem per se (and likely also error prone).
They may vary for each pair of KBs considered.
It may be implausible then that the parameters are fixed without a preliminary phase for adapting them per single dataset (i.e., each pair of KBs).
In 5.2 subsets S’ and T’ should be better introduced. Moreover, regarding training (and test) set, it would be interesting to consider also uncertain links for which the classification is unknown.
The usage of the symbol m_i seems a bit confusing
(ok for a measure but then a set of measures?)
Sect. 5.2 should also better illustrate algorithm 3.
In Sect. 5.3: the account regarding “Learning with the global F-measure” seems a bit too informal.
Pruning is touched very lightly in sect. 5.4.
In the experiments it would be nice to have a look at the precision/recall decomposition of the F-measures.
The results of J4.8 in terms of time are omitted because of its inferior performance in terms of effectiveness; however, it has been used without tuning the standard parameters. I suspect some fine-tuning may help improve both efficiency and effectiveness (especially with the help of some form of pruning).
It would be interesting also to know about the average complexity of the structure of the resulting trees.
=== After rebuttal


Review 2 (by Claudia d’Amato)

(RELEVANCE TO ESWC) The paper proposes a tree learning based approach for performing efficient link discovery between RDF resources (either in the same knowledge base or between different knowledge bases) while maintaining comparative performances with state of the art systems. For the purpose, a system called Dragon, has been developed and its performances have been compared with state of the art systems.
The paper focuses on an interesting problem providing promising experimental results. However, whilst a good experimental evaluation is presented, the presentation of the proposed solution requires several improvements as reported in the comments below.
The paper and the focused problem certainly fit the conference topics.
(NOVELTY OF THE PROPOSED SOLUTION) The paper presents a moderate advance with respect to the state of the art, but considering that the main goal is the improvement of the performances, particularly in terms of efficiency, the goal has been reached, as shown in the experimental evaluation.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The formalization of the proposed solution lacks clarity and some aspects do not appear fully correct. See detailed comments in the "Overall score" section.
(EVALUATION OF THE STATE-OF-THE-ART) The related works section is mostly a list of (certainly appropriate) related works without any discussion concerning the advances of the proposed solution with respect to the state of the art. Particularly, the value added of the proposed solution with respect to the existing methods based on decision trees that have been developed for solving the record linkage problem (which is actually closely related to the problem focused in this paper) should be discussed.
Additional and complementary studies that should be mentioned are also those based on key discovery for assessing particularly sameAs link between RDF resources (see for instance Symeonidou D., Armant V., Pernelle N., Saïs F. (2014) SAKey: Scalable Almost Key Discovery in RDF Data. In: Mika P. et al. (eds) The Semantic Web – ISWC 2014. ISWC 2014. Lecture Notes in Computer Science, vol 8796. Springer, Cham) and maybe some self-citations can be removed (for instance you can report just the most representative survey).
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Appreciated both the analysis of the computational complexity of the proposed solution as well as the experimental evaluation.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) All the important information for making the experiments reproducible seems to be within the paper.
(OVERALL SCORE) The paper proposes a tree learning based approach for performing efficient link discovery between RDF resources (either in the same knowledge base or between different knowledge bases) while maintaining comparative performances with state of the art systems. For the purpose, a system called Dragon, has been developed and its performances have been compared with state of the art systems.
The paper focuses on an interesting problem providing promising experimental results. However, whilst a good experimental evaluation is presented, the presentation of the proposed solution requires several improvements as reported in the comments below.
The declarative link discovery framework should be introduced in a more formal way. An implicit closed world assumption is made when reformulating the problem as a classification problem; as such, the usage of negation for the class -1 does not appear fully appropriate.
- See also what has been suggested within the related works section
As for the proof on page 6, given that \theta_{I,j} is a threshold, it is not clear why its value becomes 0 in the else condition. Additionally, this does not seem to be applied in figure 2. Overall, the proof appears a bit artificial, since the claim could be directly asserted by construction, namely by the text reported after proposition 1. Indeed, in the so-called inductive step, such a construction is actually exploited; otherwise what is reported in the inductive step and also in the figure cannot be asserted, e.g. the conjunction and disjunction cannot be added, and the only operator that could be considered is the difference introduced in the base step.
The text reported after proposition 1 should be clearly formalized by means of an algorithm and it should be reported that this part is clearly inspired by traditional methods for learning First Order Logic Decision Trees as well as the most recent methods for learning Terminological Decision Trees. The authors also refer to L'_2 that has not been introduced before its usage.
In section 5.1, the necessity of updating the training data when building the tree suddenly appears. This problem should be properly introduced and explained; currently it is clear neither why the training data need to be updated (particularly when using the Gini index) nor how such an update is computed.
It would be better to sketch the procedure for growing the tree by an algorithm. In case of lack of space, algorithm 3 could be omitted.
On page 13, the authors refer to noisy datasets. Which datasets are noisy should be reported at the beginning of section 6, possibly jointly with information concerning the percentage of noise in such datasets.
The analysis of the computational complexity is highly appreciated.
MINOR:
- end of page 13: "the quality of the LSs it generate are inferior..." --> "the quality of the LSs it generates is inferior..."
----- AFTER REBUTTAL PHASE -----
I would like to thank the authors for the clarification. I would suggest to make it clear within the paper.

Review 3 (by Guillermo Palma)

(RELEVANCE TO ESWC) An important idea behind the Linked Data paradigm is to interlink very large data sources with time-efficient link discovery tools.
(NOVELTY OF THE PROPOSED SOLUTION) This paper presents a decision-tree based approach for link discovery. Link discovery has been studied as a classification problem before.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed solution maps the Link Discovery problem to a classification problem and uses an algorithm based on decision trees.
(EVALUATION OF THE STATE-OF-THE-ART) This paper presents the state-of-the-art methods for link discovery.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The equivalence between the semantics of Link Specifications and the semantics of decision trees has been proven. The proposed algorithm that uses decision trees to learn the Link Specifications is explained.
In Algorithm 2: Overview using global F-measure, the variable “splitAttribute” is not used.
On page 8, in section 5.1 Overview, it is indicated: "DRAGON continues recursively by constructing the left and right child of the tree if it found a split attribute and has not yet reached the maximum height."
However, in Algorithm 2, line 3, it is indicated: "if currentHeight < maxHeight then"
Should it be as in Algorithm 1, line 3: "if splitAttribute != null AND currentHeight < maxHeight then"?
On page 8, in section 5.1 Overview, it is indicated: “Any similarity values below the lower bound \lambda are disregarded, i.e., set to 0.”
Why are similarity values below the lower bound set to 0?
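The quoted lower-bound behaviour can be sketched as a simple filter. This is a hypothetical illustration only (the function name, data, and λ value are invented, not the paper's code): similarity values under λ are zeroed out, i.e., treated as definite non-matches, which also means such pairs never become candidate split points.

```python
# Hedged sketch of the lower-bound filter the review quotes: similarity
# values below lambda are set to 0. All names and values are invented.
def apply_lower_bound(similarities, lam):
    """Zero out similarity values below the lower bound lam."""
    return {pair: (s if s >= lam else 0.0) for pair, s in similarities.items()}

sims = {("s1", "t1"): 0.82, ("s2", "t2"): 0.10}
print(apply_lower_bound(sims, 0.4))  # {('s1', 't1'): 0.82, ('s2', 't2'): 0.0}
```

One plausible rationale, which the reviewer is asking the authors to state explicitly, is that very low similarity values are almost never part of a useful split point, so discarding them prunes the search space and speeds up tree construction.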
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The proposed approach is evaluated on four real-world datasets and on a synthetic dataset. The evaluation of the parameters is given. In the experimental study the algorithm proposed (DRAGON) is compared with state-of-the-art link discovery algorithms.
DRAGON presents a solution quality slightly inferior to the state of the art.
The DRAGON runtime outperforms the state of the art by several orders of magnitude on the datasets used in the experimental evaluation.
In Table 2, for some settings (e.g. UP, \lambda = 0.05 and UP, \lambda = 0.2) the “error estimate pruning” obtains better results than the “Global F-measure pruning”. However, Dragon performs best with \lambda = 0.4 and “Global F-measure pruning”.
Why does Dragon obtain better results with the “error estimate pruning” in some configurations and with the “Global F-measure pruning” in others?
(OVERALL SCORE) This paper presents a decision tree learning approach for link discovery.
Strong Points (SPs)
1) The proposed approach was evaluated in an experimental study with several different datasets.
2) The runtime of the proposed link discovery algorithm exceeds the state of the art by several orders of magnitude.
3) The DRAGON algorithm and its properties are well explained.
Weak Points (WPs)
1) The proposed link discovery algorithm presents solution quality similar to the state of the art.
2) There is no clear explanation of the difference between the results of the Global F-measure approach and the Gini approach.
3) There is no analysis of the results regarding the effect of the pruning methods.
Minor comment
1) Typo on page 9: “jaccard(name, name) ≥ 0.4” should be “qgrams(name, name) ≥ 0.4”
----- AFTER REBUTTAL PHASE -----
I would like to thank the authors for their responses. However, the following observations of the reviewers are maintained:
• The related work section is too self-referential.
• The related work section does not contain any discussion regarding the differences of the proposed approach with respect to the state of the art.
• In the proof of the Proposition 1, the authors refer to L'_2 that has not been introduced before its usage.
• In the evaluation section the values of precision and recall could be useful to characterize the proposed approach.

Review 4 (by Shawn Bowers)

(RELEVANCE TO ESWC) Link discovery approaches fit with ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The paper seems largely incremental, extending prior work by one of the authors. The basic idea is to use fairly standard decision trees for determining link specifications. Other approaches have done similar things using different classification algorithms (including the author).
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) It would be nice to see a wider range of evaluations. In a few places the formalisms were hard to follow (mainly in terms of the details). As an example, when defining the "size" of a link specification (which I don't believe is ever used later), it states that if L is empty, s(L) = 0; however, the syntax doesn't appear to allow empty link specifications.
(EVALUATION OF THE STATE-OF-THE-ART) The related work section in particular does not properly position the paper's work against other similar work. How does this work differ or complement existing work? How is it similar? Also, there is quite a bit of work on object fusion techniques that aren't considered.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The evaluation suggests that the classification performs as well as existing approaches, yet has a fairly significant reduction in runtime cost. It would be good to have an explanation of the reason behind the runtime costs as opposed to just observation. Also, how might the approaches perform on larger or more complex data sets? Where are the weaknesses of the approaches, etc.?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Again, it would be good to look at a wide range of datasets (in terms of complexity and size).
(OVERALL SCORE) This paper extends the authors' prior work on inferring link specifications for link discovery using ML classification approaches (e.g., see [15-19]). This paper uses decision tree classifiers, tailored to link specifications, and not that surprisingly, the d.t. approach works well. The paper provides an experimental evaluation comparing the approach to existing approaches.

Metareview by Andreas Hotho

The paper addresses the fast prediction of links between RDF knowledge bases. The focus is on speed. The major contribution is the use of a decision tree learner to solve the task. As it is known that such learners are fast, the challenge is to reach the same predictive quality with this kind of learner. All reviewers see value in the work but also point to a number of weaknesses. These include the related work (self-referentiality), the generality of the experiments, a missing complexity analysis beside the provided runtime analysis, and the analysis of the proposed algorithms with respect to the used measures and pruning behaviour. It started as a borderline paper and moved towards weak reject during rebuttal and discussion among the reviewers (a category not available on EasyChair). So we can only recommend a reject, given the moderate novelty and advance with respect to the state of the art and all the other points already mentioned.
