Using Ontology-based Data Summarization to Develop Semantics-aware Recommender Systems
Author(s): Tommaso Di Noia, Corrado Magarelli, Andrea Maurino, Matteo Palmonari, Anisa Rula
Full text: submitted version
Abstract: In the current information-centric era, recommender systems are gaining
momentum as tools able to assist users in daily decision-making tasks. They
may exploit users’ past behavior combined with side/contextual information to
suggest new items or pieces of knowledge they might be interested in.
Within the recommendation process, Linked Data (LD) have already been proposed
as a valuable source of information to enhance the predictive power of
recommender systems not only in terms of accuracy but also of diversity and
novelty of results. In this direction, one of the main open issues in using LD to
feed a recommendation engine is related to feature selection: how to select only
the most relevant subset of the original LD dataset thus avoiding both useless
processing of data and the so-called “curse of dimensionality” problem. In this
paper we show how ontology-based (linked) data summarization can drive the
selection of properties/features useful to a recommender system. In particular,
we compare a fully automated feature selection method based on ontology-based
data summaries with more classical ones and we evaluate the performance of
these methods in terms of accuracy and aggregate diversity of a recommender
system exploiting the top-k selected features. We set up an experimental testbed
relying on datasets related to different knowledge domains. Results show the feasibility
of a feature selection process driven by ontology-based data summaries
for LD-enabled recommender systems.
Keywords: schema summarization; feature selection; recommender systems
Review 1 (by Valentina Janev)
Review 2 (by Alasdair Gray)
(RELEVANCE TO ESWC) The paper presents a recommender system that uses Linked Data to improve its results.
(NOVELTY OF THE PROPOSED SOLUTION) Other systems have adopted this approach in the past and indeed the authors have a similar paper in SAC2017. This is an iterative improvement on their system.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach seems appropriate and well worked through.
(EVALUATION OF THE STATE-OF-THE-ART) Most of the related work section highlights the different approaches that have been tried but does not compare and contrast that work with the work presented, i.e. it is predominantly just a summary of the field.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The framework is explained reasonably well, although a small running example to exemplify the various definitions would have been beneficial. One issue is the assumption that datasets publish summaries in the specific form that is required for their framework. I'm not aware of any datasets publishing data summaries. I think it is a stretch to say that you don't need to download the whole dataset because summaries are readily available, or to expect publishers or third parties to make these available, particularly when there are multiple competing data summary approaches. The data summarisation step should be factored into your preprocessing figures.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) There is a lot about the evaluation and the presentation that is unclear to me. The scripts and the datasets to rerun the experiments have not been made available. How are datasets such as MovieLens (relational database) connected to DBpedia? Is DBpedia the only LD resource considered? What about the numerous other datasets for each of the domains considered? It is unclear to me what is being presented in tables 2-4; the first column of each table is not descriptive as to the system under test. Which results refer to which system? What indicates a good value? Thus I can't verify the claim that ABSTAT is better in films and books.
(OVERALL SCORE) The paper presents a recommender system that uses summaries of linked data (although I believe this is only for selected parts of DBpedia) to improve its recommendations. The work is an iterative improvement over their previous work. The approach adopted is appropriate and thorough. The evaluation is not fully explained and cannot be reproduced. The approach relies on data summaries being published by data providers, which is not a reality.
Minor Issues:
- The various algorithms mentioned in the introduction should be cited
- The various experimental datasets should be appropriately cited
- Rather than an ow.ly link to the alternative approaches you should publish the work on zenodo or figshare and get a doi
Review 3 (by Dejing Dou)
(RELEVANCE TO ESWC) It is very relevant to the SW community.
(NOVELTY OF THE PROPOSED SOLUTION) It's a novel method in recommender systems.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) It reads as correct.
(EVALUATION OF THE STATE-OF-THE-ART) It has been compared with the state of the art.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) It is well written.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) It seems to be reproducible.
(OVERALL SCORE) It is well written. Interesting topic (recommendation) and novel method (ontology-based data summarization).
Minor issue in writing: In LaTeX, left " should be ``
Review 4 (by John McCrae)
(RELEVANCE TO ESWC) The paper concerns the use of linked data and ontologies for feature selection in machine learning. The application of SW technologies to other tasks is very welcome at ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The paper builds on the existing work of ABSTAT and applies it to the task of feature selection. There have been some other approaches in this direction but the methodology proposed in this paper is novel and interesting.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) While the methodology is quite limited in that it focusses on a single aspect, it is well supported by experiments. I wonder why the authors did not consider combining the information gain metric with the ontology-based feature selection, given that IG still outperforms this method in some cases.
(EVALUATION OF THE STATE-OF-THE-ART) The authors use a large dataset that provides significant results, and the evaluation presents many results with different configurations. However, I did have some difficulty connecting the categories in the results to the text. Why are there three categories of ontological features on page 9, but they are not always evaluated? Furthermore, the metrics are not defined... I personally do not know either catalogCoverage or aggrEntropy and some readers may not know MRR. Please add references or formulae. The authors should also explain why they consider TFIDF as a baseline, when IG is already a well-established method and seems to be the focus of this paper.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The results are a bit mixed and the authors claim "higher sparsity in the knowledge graph may give statistical methods *a chance* to beat ontological ones". It is a shame that this is not explored more fully... perhaps by comparing the sparsity of the data with the performance of the proposed approach against IG. Still, the results are better than IG in some cases and as such this would certainly be an interesting step.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The method seems to be generally well-explained and the authors have provided code on GitHub. The datasets are public, enabling comparison to other methods.
(OVERALL SCORE) The paper presents a small but interesting contribution and the strong evaluation supports this paper.
Strong points
-------------
* Good Evaluation
* Clear methodology
* Should be reproducible
Weak points
-----------
* Results show only inconsistent improvements for this system
* Some errors in English
* Contribution is a little incremental
Minor Errors:
p1. "Linked data" is not the same as "knowledge graphs"
p2. "means to discovering" => "means to discover"; "properties *such* as"; "prone to be" => "easily"
p6. "f_i is higher as the lower is the value of entropy" => "higher values of f_i are correlated with lower values of entropy"??
p7. "step only to the benefit of IG" => "step only for IG"; "interactive time" => I think you mean real time
p8. "i.e. movies, books..." a colon would be better here
p9. "splitted" => "split"
p11. "no particular differences" => "no clear differences"
p12. I don't understand "triples associated result"; "give chance to" should be "give X a chance to"
p13. What is "minimalization approach"?? "Limited work" => "Some work"; please make sure there is a space between text and references, e.g., Heitmann and Hayes~
p14. "are allowed to" => "are able to"
Metareview by Valentina Presutti
The paper presents a method for performing feature selection in machine learning for recommender systems, using linked data. It is a nice example of applying semantic web technologies to machine learning and is of high interest to the semantic web community. Although the advancement is only incremental with respect to previous work, the evaluation is robust and reproducible and shows positive results.