Build a corpus of scientific articles with semantic representation
Author(s): Jean-Claude Moissinac
Full text: submitted version
Abstract: As part of the SemBib project, we undertook a semantic representation of the scientific production of Telecom Paristech. Beyond the internal objectives, this enriched corpus is a source of experimentation and a teaching resource. This work is based on the use of text mining methods to build graphs of knowledge, and then on the production of analyses from these graphs. The main proposal is the disjoint graph production methodology, with clearly identified roles, to allow for differentiated uses, and in particular the comparison between graph production and exploitation methods. This article is above all a methodological proposition for the organization of semantic representation of publications, relying on methods of text mining. The proposed method facilitates progressive enrichment approaches to representations with evaluation possibilities at each step.
Keywords: semantic publishing; publication; Linked Data; SPARQL
Review 1 (by Giuseppe Rizzo)
This paper addresses the need to turn the ParisTech library into a richer archive of data and relations among authors and papers, and into a better indexed and thus more searchable archive. This need extends to the library of any university or organization, so the objective is certainly of help to the community and to society as a whole. The approach applies basic text mining notions (identified in the paper as TF-IDF only) to the generation of a knowledge base of scientific concepts and publications (scientific papers), where concepts and publications have explicit relations. Thus, the approach is of interest for the Resources track as defined in the requirements of the call for papers. However, the paper lacks the necessary formalism and depth in describing core concepts (see below) and in presenting the knowledge base. The paper would also benefit from proofreading, as numerous paragraphs are hard to understand (I omit the long list of typos, but I recommend checking plurals such as "different strategy" and the numerous sentences ending with three dots, which left me unsure about what the authors had understood and what they meant to convey to readers).
Strengths
- idea and impact for scientists and society
- availability of the resource (with a big question mark, see below)
Weaknesses
- the portal http://givingsense.eu/sembib/ is in French, which introduces an access barrier to the reuse of the data instances indexed by the portal. Moreover, from the portal I cannot find instructions on how to access the instances. There is a "technical section" at http://givingsense.eu/sembib/state.php , but it remains unclear how to traverse the graph.
Despite the reported technicalities, the simplest requirement, allowing access to and reuse of the data instances, is not met
- the approach is halfway between a research paper and a resource paper, because it does not describe in the necessary depth either the text mining approaches used to create the KB (only a few mentions here and there, such as TF-IDF) or the KB itself as a data model and a set of instances
- the value chain of the approach is based on a traditional process of data acquisition -> corpus creation -> data publishing. However, Sec 2.5 "Gather docs and perform basic treatments" (I reckon you meant basic processing), which should describe the first two stages of the value chain, does not answer the question "how has this been done?". The authors then mention that the ontology used for modeling data points is based on a set of SPAR ontologies. Which ones? In principle I can agree with the choice; however, the paper should give the rationale and possibly a comparison explaining why those ontologies suit this context (a citation could already be enough)
- the authors mention that the approach is grounded on the decoupling of concepts from publications, and this part is underlined as important for the creation of a "ground truth". Which ground truth, by the way? If the choice of decoupling is as crucial as claimed, it needs to be described comprehensively
- the other stage of the value chain is the publishing, which should ease exploration. This part cannot be assessed (see above). The paper states that the authors made a technology choice for deploying a SPARQL endpoint. This choice is only introduced, not elaborated, so scientists lack the input needed to understand the motivations. Imagine: if I want to replicate or extend such an approach, how could I know which technology is more efficient, better performing, ... (and we could list many more evaluation dimensions here)?
- a wrap-up section on the achievements based on SemBib is presented. This section should be expanded to show the actual benefit of the approach, ideally using a proper scientific validation (hypothesis formulation, experimental setup definition, KPI derivation and quantification)
======= AFTER REBUTTAL ========
Thanks for answering. However, my points concerning the how (3rd, 4th, 5th) have not been answered. I have to confirm my previous mark, as I do not have the necessary information to judge the relevance of this approach and thus of the resource that is generated.
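For reference, the TF-IDF weighting that this review identifies as the only text mining notion named in the paper can be sketched as follows. This is a minimal sketch of one common TF-IDF variant (relative term frequency times the logarithm of inverse document frequency), not the authors' implementation, which the paper does not describe. The function name `tf_idf` and the toy token lists are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: two already tokenized "publications".
toy_docs = [
    ["graph", "sparql", "graph"],
    ["sparql", "ontology"],
]
toy_weights = tf_idf(toy_docs)
```

A term that occurs in every document ("sparql" here) gets weight zero under this variant, which is why a description of the exact variant and preprocessing would matter for anyone trying to replicate the keyword graph.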
Review 2 (by Cristina Sarasua)
This paper presents a data set containing RDF metadata about scientific articles by Telecom ParisTech. The author reused state-of-the-art ontologies from the bibliographic domain (e.g. Fabio and dcterms), used state-of-the-art tools to mine the text of publications to obtain their keywords, and enabled a SPARQL endpoint. The author claims in the abstract that the contribution of the paper is the organization of publication metadata in different graphs. My main concern with this submission is that it provides a very limited contribution and ignores the related work done in the field of scientific RDF data publishing. Even if the submission is a Resources submission, a data set submission in this track should somehow contribute to the state of the art. One way to do that is to provide data that can be used for benchmarking methods (and hence includes an exhaustive set of test cases manually reviewed by experts). From the description provided in the text and the content visible from the SPARQL endpoint, this data set does not appear to provide a solid benchmark, nor is it a data set intended to help validate new methods. While a new data set about publications is in general appreciated, the submission does not provide a novel research contribution and looks like an engineering effort. Moreover, the submission provides a data set that ignores some Semantic Web publishing practices and lacks proper documentation (see more details below). I recommend that the author work in the direction of new methods to improve the annotation and interlinking of scientific publications and submit a novel contribution to a Semantic Web venue in the future.
**Positive aspects**
- It adds RDF data to the Web of Data.
- The author reused some existing ontologies, such as Fabio and dcterms.
- The data set contains a concept graph with the papers’ keywords.
**Negative aspects**
- The contribution of this paper lacks novelty in terms of research. Many data sets with scientific articles have been published in the past (see the publications colour in the LOD diagram http://lod-cloud.net/versions/2017-08-22/lod.svg), and the submission does not refer to any of the existing data sets.
- The submission lacks specificity. For example, when describing the representation of documents, the author says that “a semantic representation of the metadata has been realised by relying on SPAR ontologies (bibo, cito) and ontologies commonly used for documents (Dublin Core, schema, foaf …);”, without providing a comprehensive description of the exact parts of the ontologies used or examples of representative resources. A graphical representation or a Turtle/N3 description of the RDF resources that the data set contains would help the reader understand exactly which data is published. Moreover, the author mentions that data is exposed via a SPARQL endpoint, but the text does not provide the URL of the endpoint. Guessing that it would end in “*/sparql” I found it on the Web. However, the author should be aware that a Resources submission should indicate such details (see the call for papers at ESWC 2018 https://2018.eswc-conferences.org/resources-track/ and last year’s analogous call and program at ISWC https://iswc2017.semanticweb.org/calls/call-for-resources-track-papers/ ).
- The work presented seems to be incomplete, as the text mentions that the authors have started to evaluate methods for learning word vectors and are still working on enrichment.
- The submission does not fulfil the requirements of a high quality submission in the Resources track.
** As mentioned before, the data set does not break new ground, does not advance the state of the art, and does not look like it will have an impact on the adoption of Semantic Web technologies: the publications domain is quite well researched, and (despite the challenge of convincing specific organisations to do so) more and more librarians and archivists are involved in standard Semantic Web metadata publishing.
** The Web site of the project is exclusively in French. Considering that it will have an international scientific audience, English would be needed.
** The author says in the conclusions and outlook section that “An important next step is the RDF documentation of this dates with DCAT, and then the publication of this data”. I assume the data is published (since it is queryable via the SPARQL endpoint), and again, the DCAT description is something the author should have provided with the submission (see also https://2018.eswc-conferences.org/resources-track/ ). The author should make the data FAIR (see also https://www.nature.com/articles/sdata201618).
** It is very difficult to assess whether the resource will be useful for a wider audience, since the author does not indicate the extent to which the data set intersects with other publication data sets.
** The technical quality of the resource should be revised, applying best practices in Semantic Web publishing. See the suggestions below to (i) interlink the instances to other data sets and (ii) revise the classification information for e.g. ResearchPapers.
** The author does not provide enough descriptive statistics about the data set in the submission. For example, what are the AVG and STDEV of the number of keywords per publication in the data set? The RDF description of the paper I looked up, for example, does not have any keyword in the RDF response obtained from the SPARQL endpoint, while the publication does contain keywords (https://eprint.iacr.org/2013/303).
It is unclear whether that is due to multiple named graphs or because the data is not there. It would also be advisable to showcase queries that use the various graphs mentioned in the text.
** There is no information about the license of the data, or at least it is not mentioned in the submission or on the main Web site (http://givingsense.eu/sembib/sparql/ or http://givingsense.eu/sembib/).
** The author does not provide any entry for the data set in Zenodo, GitHub, DataHub, Figshare etc.
** There is no sustainability plan specified. The author barely indicates technical tasks that they are currently working on (section 3.2) but does not mention how the maintenance of the data set will be carried out.
- The semantic forms developed to display the resource descriptions in a more human-friendly way (e.g. http://givingsense.eu/sembib/onto/persons/David_Bertrand) are not a novel contribution either. There have been plenty of projects developing user interfaces for RDF data (e.g. all the work around semantic wikis and semantic portals).
- The conclusions indicate that the paper describes "a general methodological approach for testing different approaches to the description of bibliographic entities by association with concepts” while the content of the paper mainly refers to the engineering process of preparing metadata about bibliographic entries, and the text indeed lacks details about processes.
- The quality of the writing should be improved: (i) The author should have the text revised by a native English speaker. (ii) In some parts, the text is not rigorous enough for a scientific publication (e.g. when the author lists the achievements based on SemBib, when the author mentions “the idea is to use external graphs”, in sentences like “we can try different strategy to associate..” and “Important progress remains to be made to improve the exploitation of semantic graphs.
their adoption, especially on the Web, is still very limited compared to the potential of these representations, especially driven by thematic operations- music, events ... - carried by major search engines (cf schema.org).”) (iii) The text contains explanations about Semantic Web standards or principles that are unnecessary for a Semantic Web audience and consume valuable space.
**Suggestions**
** I recommend that the author translate the explanations offered on their Web site (http://givingsense.eu/) from French to English, to provide the information in both languages.
** I recommend that the author link the resources in this data set to other external data sets. For example, the author could already have linked the instances via owl:sameAs links to resources in other existing bibliographic data sets such as DBLP. If this (SemBib) data set contains keywords that other data sets do not have, then this new data set would enrich the description of the article.
** I went to the SPARQL endpoint http://givingsense.eu/sembib/sparql/
** I executed the following query to look up the description of a ResearchPaper:
describe <http://givingsense.eu/sembib/onto/tpt/biblio/14420>
<rdf:Description rdf:about="http://givingsense.eu/sembib/onto/tpt/biblio/14420">
  <rdf:type rdf:resource="http://purl.org/spar/fabio/ResearchPaper"/>
  <ns0:publicationDate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2013</ns0:publicationDate>
  <ns1:firstAuthor rdf:resource="http://givingsense.eu/sembib/onto/persons/Bhasin_S_"/>
  <ns1:publicationMonth>jan</ns1:publicationMonth>
  <ns1:state>published</ns1:state>
  <ns2:audience>2</ns2:audience>
  <ns2:category>grandpublic</ns2:category>
  <ns2:entrytype>article</ns2:entrytype>
  <ns2:fromDpt rdf:resource="http://givingsense.eu/sembib/onto/tpt/COMELEC"/>
  <ns2:fromGroup rdf:resource="http://givingsense.eu/sembib/onto/tpt/SEN"/>
  <ns2:hasTptId rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">14420</ns2:hasTptId>
  <ns2:ref>SB:theorymasking-13</ns2:ref>
  <ns3:creator rdf:resource="http://givingsense.eu/sembib/onto/persons/Carlet_C_"/>
  <ns3:creator rdf:resource="http://givingsense.eu/sembib/onto/persons/Bhasin_S_"/>
  <ns3:creator rdf:resource="http://givingsense.eu/sembib/onto/persons/Guilley_S_"/>
  <ns3:language>en</ns3:language>
  <ns3:title>Theory of masking with codewords in hardware: low-weight $d$th-order correlation-immune Boolean functions</ns3:title>
  <ns4:url>http://eprint.iacr.org/2013/303</ns4:url>
</rdf:Description>
** And I saw that the description of the paper does not link to any other external representation of the paper
** While DBLP (http://dblp.l3s.de/d2r/snorql/) contains the representation of that paper ( http://dblp.l3s.de/d2r/snorql/?describe=http%3A%2F%2Fdblp.l3s.de%2Fd2r%2Fresource%2Fpublications%2Fjournals%2Fiacr%2FBhasinCG13) as well as the representation of the main author (http://dblp.l3s.de/d2r/resource/author/Shivam_Bhasin)
- I encourage the author to identify the intersection and the differences that this data set has with other bibliographic data sets, in terms of concepts, resources (people, publications, topics), and facts (i.e. statements). I would also recommend trying to enrich the data set with other information, not necessarily from the bibliographic domain.
**Other things**
- Why is the property “entrytype” used as a datatype property in the description of a ResearchPaper, given that in this form it does little to help classify the resource? Moreover, the description already contains an rdf:type statement. What value does “entrytype” add to it?
In the resource description shown above (http://givingsense.eu/sembib/onto/tpt/biblio/14420) there are two statements as follows:
<rdf:type rdf:resource="http://purl.org/spar/fabio/ResearchPaper"/>
<ns2:entrytype>article</ns2:entrytype>
- Section 2.3 mentions “Our hypothesis is that advances in the semantic web are able to give us new ways to efficiently exploit the data we collect, in order to provide research and analysis functions.” That very much depends on the specific definition of the objective, which is not clearly specified. The hypothesis is not tested in any way in the paper.
- On page 3 the submission says “Large warehouses of bibliographic data exist elsewhere. Unfortunately, they give a very truncated view of our production, in particular because they are not able to resolve the changes in the name of our institution and their usual variants. Moreover these bases do not have information on internal structures of research: projects, departments and groups of research …” . Has the author considered that information about projects, institutions etc. is usually present in separate data sets? That is why it is so important to link the data to other data sets (for the sake of information coverage and data maintenance).
- The author says that publications were crawled, but the text does not discuss matters such as the copyright of publications. How was this issue handled?
- "Publication venue" sounds better than the term used ("publication channel”) when one refers to conferences and journals.
- When one tries to execute the query on pages 8-9 with the PREFIX statements included, the endpoint gives an ARC2 error; without the PREFIX statements, as the author indicated, the query gives a “this site cannot be reached” error. See also https://goo.gl/VT5iHt
- The plot shown in Figure 1 does not provide a very useful insight to the reader. The author should think of communicating the information differently.
Perhaps the authors could show a table that includes information about the shared keywords, clusters the information by conference or topic, and gives statistics based on the number of keywords shared.
** After rebuttal **
Thank you for replying to our questions and comments. I would like to add that the comment about "ignoring other publication data sets" did not refer to citing them in the text, but rather to the process of integrating the data. Finding different ways to reuse existing technology is a valid approach to explore a field and come up with new solutions to new problems. However, I still think that this work requires more novel components and further elaboration to be accepted.
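As an illustration of the descriptive statistics requested in this review (AVG and STDEV of the number of keywords per publication), the computation could look roughly like the sketch below. It assumes the keywords have already been retrieved into a plain mapping from publication identifier to keyword list. The identifiers, keywords and the function name `keyword_stats` are hypothetical and are not taken from SemBib.

```python
from statistics import mean, stdev


def keyword_stats(keywords_by_paper):
    """Return (average, sample standard deviation, papers without keywords)."""
    # Count distinct keywords per publication.
    counts = [len(set(kws)) for kws in keywords_by_paper.values()]
    avg = mean(counts)
    # stdev needs at least two data points; fall back to 0.0 otherwise.
    sd = stdev(counts) if len(counts) > 1 else 0.0
    # Publications with no keyword at all deserve a separate look.
    missing = sorted(pid for pid, kws in keywords_by_paper.items() if not kws)
    return avg, sd, missing


# Hypothetical sample, for illustration only.
sample = {
    "paper-a": ["masking", "boolean functions"],
    "paper-b": [],
    "paper-c": ["tf-idf", "text mining", "rdf", "sparql"],
}
avg, sd, missing = keyword_stats(sample)
print(f"AVG={avg:.2f} STDEV={sd:.2f} without keywords={missing}")
```

Reporting the list of publications without any keyword next to the two figures would also help clarify whether an empty result comes from the named graph being queried or from data that is genuinely absent.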
Review 3 (by Jodi Schneider)
Thanks for your many comments and especially for trying to make your materials more accessible to non-French speakers!
----
This paper describes the process of producing a semantic representation of publications in the Telecom Paristech full-text repository. While semantic metadata for publications has been extensively curated, work on the full text of papers is novel. Even as an expert in this area, I find the paper useful in going further than the published literature on details of corpus enrichment. The fit with the resource track is not completely clear; I would think of it as an industry/in-use application. It would be even better to reframe the text as a tutorial with select data released (e.g. presumably the non-copyrighted data).
Some suggestions:
- avoid dates in the form 1/12/2017 in international publications (whether this is Jan 12 or Dec 1 will depend on the location of the reader). YYYY-MM-DD (ISO 8601) is always safe. (From the closing note "In not differently specified, all links were last followed on January 12, 2018.", which should read "if not...", it appears that this is actually Jan 12 2018.)
- remove the space before footnote markers so that they follow the letters directly (not a space)
- carefully proofread (e.g. "citet Larsen: 2010: Scientometrics: 20700371"), and ideally have a native English speaker proofread (e.g. in English there is no space between a number and the % mark).
- check references for consistency and usefulness. For instance, I don't know what ToTh is. Technical reports should ideally have URLs given to make them easier to find.
- Figure 1 needs improvement. It's not clear what the colors mean. Do you have any comment to make on the disconnectedness of keywords?
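The date ambiguity raised in the first suggestion can be shown with a short sketch. It parses the same string under the two common conventions using Python's standard `datetime` module and prints both readings in ISO 8601 form. The variable names are illustrative only.

```python
from datetime import datetime

raw = "1/12/2017"

# Day-first reading (common in Europe): 1 December 2017.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()

# Month-first reading (common in the US): January 12, 2017.
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

# ISO 8601 output is unambiguous for every reader.
print(day_first.isoformat())
print(month_first.isoformat())
```

Both parses succeed without error, which is the problem: nothing in the string itself tells the reader which convention was intended.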
Review 4 (by anonymous reviewer)
The paper presents a collection of scientific articles with semantic representation. It follows a graph-based approach for capturing the metadata. The overall goal would be to enable connection with further distributed repositories of scientific publications. In the current state of the repository, I am not sure that it is ready to be released to the community. As the authors point out, there are already quite a few large and established repositories for academic articles. A new one would have to have very definite and clear advantages. The main concerns are:
1) Why not use a standardised format for the metadata? This would also include linking to existing graphs of authors, for example (or reusing existing ontologies).
2) The first paragraph of 2.5 does not clearly state what the state of the repository is. Is it an anvil pilot with only a few articles, are there any further repositories linked, is it in a final and stable version?
Overall, the work is interesting but the potential impact for the community is not absolutely clear.