PageRank and Generic Entity Summarization for RDF Knowledge Bases
Author(s): Dennis Diefenbach, Andreas Thalhammer
Full text: submitted version
Abstract: Ranking and entity summarization are operations that are tightly connected and recurrent in many different domains. Possible application fields include information retrieval, question answering, named entity disambiguation, co-reference resolution, and natural language generation. Still, the use of these techniques is limited because there are few accessible resources. PageRank computations are resource-intensive and entity summarization is a complex research field in itself.
We present two generic and highly reusable resources for RDF knowledge bases: a component for PageRank-based ranking and a component for entity summarization. The two components, namely PageRankRDF and SummaServer, are provided in the form of open source code along with example datasets and deployments. In addition, this work outlines the application of the components for PageRank-based RDF ranking and entity summarization in the question answering project WDAqua.
Keywords: RDF; ranking; PageRank; entity summarization; question answering; linked data
Review 1 (by Heiko Paulheim)
The paper describes a set of tools and resources: an implementation of PageRank over RDF datasets together with a collection of pre-computed PageRank values for different datasets, and a service for creating entity summaries for RDF entities, again with a collection of pre-computed summaries. The value of both in the area of question answering is discussed. Generally, given the number of contributions, the value of the paper is high. In particular, the PageRank dataset has been acknowledged and used in quite a few works already. The paper itself is clearly structured and written, with only a few minor shortcomings. With respect to the resources provided, it is unclear why DBpedia is not included in the list of R2.x resources in section 2, in particular since it is listed in section 4.2. As far as the computation of PageRank is concerned, the discussion of HDT as a format for reducing the computational cost is very interesting. While the reduction of memory consumption is straightforward (due to the replacement of URIs by integers), it is not so easy to see why the computation time decreases so drastically. Here, I would like to see a bit more detail. The comparison in section 3.2 is interesting, but deserves a more critical look. Stating that a PageRank computed on Wikipedia links is better than one computed solely on the RDF graph because it correlates better with SubjectiveEye3D is a bit short-sighted. An RDF graph has a different purpose than Wikipedia; hence, for a ranking in an RDF graph, one may actually want a different ranking than for Wikipedia. For the future, an evaluation in a setting that actually exploits the RDF graph would be more interesting (e.g., analyzing the impact on QA tasks when using different rankings).
Minor points:
* The headline of 4.2 should be "Implementation" rather than "Implementation Guide"
* On p.9, G is not defined (I assume it is the RDF graph for which summaries are created)
* p.4: it's -> its
* p.5: Dictionray -> Dictionary
* p.9: where -> were

Summarizing, this is a valuable contribution to the ESWC resource track.

=== POST REBUTTAL ===
I still think that this is a valuable contribution to the ESWC resource track, and retain my positive rating.
Review 2 (by Christophe Guéret)
Update: I thank the authors for their rebuttal and the care taken in addressing the reviewers' concerns. It seems the questions around the impact of a mix of ABox and TBox statements on the ranking of the nodes remain valid and are acknowledged as being serious enough to warrant a journal paper. My impression is that it would have been best to work on that journal paper first, and then publish the dataset produced for it as a resource.

---

The presented resource is an algorithm and implementation of PageRank for RDF with a matching web service. Some datasets with pre-computed PageRank scores are also provided. Besides PageRank itself, a service for summarisation is also provided.

Pluses:
* Ranking and summarisation of entities are two relevant problems for our community
* RESTful APIs and datasets are provided as well as the open source implementation

Minuses:
* There is no indication of the level of support for the API. If I were to build a service relying on it, I would not know how much I could trust it to remain alive (and for how long).
* The related work is weak. I would have expected to see things like http://graphdb.ontotext.com/documentation/free/rdf-rank.html mentioned.
* More attention should be paid to the impact of statements like rdf:type on the graph structure! It seems PageRank is computed on the graph as is, and then the rankings are compared to those of Wikipedia, which do not have any TBox-style links. A bit later in the paper the impact of this type of predicate is acknowledged when it is indicated that wdd:P31 and others are filtered out, but that raises two more questions: 1) how do you decide which predicates to filter out? and 2) why are they not taken out of the graph completely before PageRank is computed? The specific structure of an RDF graph, and the fact that the same information can lead to different structures!, is something that is IMHO quite important to study in detail when looking into applying PageRank and other network metrics.

Misc:
* It could be a good idea to move the example found in Section 5 earlier in the paper and use it as a running example.
* Typo on "Septermber"

In summary, I would recommend that the authors clarify these two points in the rebuttal and then in the paper:
* Provide a stable archive of the resource and pointers to documentation (DOI, DCAT, ...) to improve on the reusability and availability aspects of the submission.
* Give some insights into the potential impact of the inherent structure of an RDF graph (TBox+ABox, reified statements, sequence modeling, etc.) on the value of the PageRank scores.
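
The concern about rdf:type statements can be illustrated on a toy graph. The following is a minimal sketch and not the PageRankRDF implementation: the triples, the `ex:` names, the `pagerank` function and its `excluded_predicates` parameter are all hypothetical and chosen only for illustration. It computes a plain power-iteration PageRank once on the graph as is and once with rdf:type edges removed before the computation.

```python
def pagerank(triples, excluded_predicates=frozenset(), damping=0.85, iterations=50):
    """Power-iteration PageRank over the directed graph induced by (s, p, o) triples.

    Triples whose predicate is in excluded_predicates contribute no edge,
    but their subjects and objects stay in the node set so that both runs
    rank the same nodes.
    """
    nodes = sorted({n for s, _, o in triples for n in (s, o)})
    out_links = {v: [] for v in nodes}
    for s, p, o in triples:
        if p not in excluded_predicates:
            out_links[s].append(o)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for s, targets in out_links.items():
            if targets:
                share = damping * rank[s] / len(targets)
                for o in targets:
                    new_rank[o] += share
        rank = new_rank
    return rank


# Hypothetical toy data: three instances, one class node, three related entities.
triples = [
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Paris", "rdf:type", "ex:City"),
    ("ex:Rome", "rdf:type", "ex:City"),
    ("ex:Berlin", "ex:capitalOf", "ex:Germany"),
    ("ex:Paris", "ex:capitalOf", "ex:France"),
    ("ex:Rome", "ex:capitalOf", "ex:Italy"),
]

full = pagerank(triples)
filtered = pagerank(triples, excluded_predicates={"rdf:type"})

for node in sorted(full, key=full.get, reverse=True):
    print(f"{node:12s} as-is={full[node]:.3f} without rdf:type={filtered[node]:.3f}")
```

In this toy setting the class node `ex:City` collects rank from every instance when the graph is used as is, and falls back to the baseline score once rdf:type edges are dropped. Whether real knowledge bases behave the same way is exactly the open question raised in this review.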
Review 3 (by Marco Luca Sbodio)
Interesting work that describes useful tools for computing PageRank on RDF graphs and for entity summarization. The authors describe the tools that they have developed and open-sourced for the community. They also describe an interesting application of these tools in the WDAqua project. The only suggestion that I have for the authors is to add information about the hardware that they used to run their performance tests: this would help readers who intend to rerun these experiments, or want to try to improve the proposed techniques.
Review 4 (by Mari Carmen Suárez-Figueroa)
The paper presents two software components: one for ranking and the other for summarization. These two activities are quite interesting and useful, but it is not clear enough why the paper is about them. That is, the abstract and/or the introduction should clarify the relation between both activities and why another component for them is needed. In addition, the authors should clearly state the current problem with respect to existing modules for these activities. To aid the understanding of the paper, the authors should include a definition and examples of entity summarization at the very beginning of the document. As the track is about resources, the authors should explicitly mention the potential use of these components as well as how and where they can be reused.

Other comments:
- Acronyms should be explained (LOD, HDT)

Typos:
- it's scalability --> its scalability (Section 3)
- Dictionray --> Dictionary (Section 3.1)

------ POST REBUTTAL PHASE:
Thank you very much for your letter. It would be much appreciated if you could address in your paper the issues that were considered not sufficiently clear in the submission.