TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets
Author(s): Pavlos Fafalios, Vasileios Iosifidis, Eirini Ntoutsi, Stefan Dietze
Full text: submitted version
Abstract: Publicly available social media archives facilitate research in a variety of fields, such as data science, sociology or the digital humanities, where Twitter has emerged as one of the most prominent sources. However, obtaining, archiving and annotating large amounts of tweets is costly. In this paper, we describe TweetsKB, a publicly available corpus of currently more than 1.5 billion tweets, spanning almost 5 years (Jan’13-Nov’17). Metadata information about the tweets as well as extracted entities, hashtags, user mentions and sentiment information are exposed using established RDF/S vocabularies. Next to a description of the extraction and annotation process, we present use cases to illustrate scenarios for entity-centric information exploration, data integration and knowledge discovery facilitated by TweetsKB.
Keywords: Twitter; RDF; Entity Linking; Sentiment Analysis; Social Media Archives
Review 1 (by Stefano Faralli)
The authors present a paper describing TweetsKB, a very interesting dataset of anonymized tweets. Tweets are collected as a set of N3 files containing metadata such as entities and sentiment. The resource is well described, and the authors made an effort to cover all the necessary requirements of the resource track. An evaluation is also made of the automatically extracted metadata, performed using well-known evaluation benchmarks. Since the dataset consists of a very high number of annotated tweets, I consider this resource of high interest to a community that is always searching for large datasets of such a sensitive nature. I personally consider this paper a very good example of how a "resource" paper should be written. I think the authors adequately addressed the reviewers' questions.
Review 2 (by anonymous reviewer)
The paper describes the creation, content, application and maintenance of the TweetsKB resource, a large corpus of annotated tweets. In general, the availability of an annotated tweet corpus is useful to the community because it saves crawling and preprocessing effort. However, the nature of the annotations is also critical. For instance, extracting what is explicitly present in a tweet is straightforward (e.g., hashtags, user mentions), while annotations that are synthesized and of wider applicability are more significant (e.g., sentiments, network-oriented features). Moreover, in most situations the corpus will be analyzed for new insights and applications beyond the provided annotations, including after retrieving the textual content of the tweets. The paper argues for a number of applications that a tweet corpus can be put to and shows a few examples, such as the temporal evolution of entity popularity and attitude, thus demonstrating its current utility. The approach also follows best practice by reusing well-known ontologies for annotation. It discusses the relevant issues of evolving the corpus by growing it in size and enhancing its expressiveness. While a richer set of annotations is always nice to have, annotations can be added incrementally. Overall, the paper is well-written.

Minor point to be clarified: in Section 3, the paper refers to “metadata” as a separate type. It is unclear why this is separate from the other five types listed, which are also metadata.

I have read the rebuttal, which has been responsive.
Review 3 (by Krzysztof Janowicz)
The resource paper entitled 'TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets' describes a large corpus of tweets collected over five years, together with annotations such as sentiments and extracted entities, e.g., people mentioned in these tweets. While there are some minor language issues, the paper is overall well-written and readable. Nonetheless, the focus should be adjusted in a revision. For instance, Figure 1 seems unnecessary while, on the other hand, a lot of relevant information is missing. To give just one example, I would have hoped to learn more about the types of entities being extracted. All SPARQL examples are about politicians; there is one more example (provided as a figure) about a tennis player. Ideally, the authors would also give examples that involve objects, events, places, and so forth, if present. To avoid confusion: I do realize that the examples contain places (Greece and Germany), but these places are part of the query formulation, not of the annotated data.

A second criticism one could bring to the table is the selection of annotations. While the authors give examples that include sentiment analysis, it seems like an arbitrary choice nonetheless. In general, one would assume that a TweetsKB would contain the tweets with their metadata, and that the analysis would be up to the domain scientists. What makes sentiment analysis so different from, say, providing location estimates? Both are readily available and supported by tools. I assume the authors have an interest in sentiment analysis, but this part could have been better justified. Similarly, and while the authors have provided guidance in a footnote, the retweet count cannot be used as is but will always require going back to the data. Finally, I am not sure why the authors host 5% of the data via a SPARQL endpoint and not the entire dataset.

While the dataset is certainly very large, Linked Data that is only available as a data dump is not very usable. At the same time, one could argue that RDF-based Linked Data is a very wasteful encoding for such data, as shown by the fact that 1.5 billion tweets result in 48 billion triples. Ideally, the authors could have made a stronger case here. On a side note, I am unsure how long Twitter will tolerate such a public dataset that violates their usage agreements.

Summing up, this is a well-written resource paper with explicitly stated maintenance and update policies and with interesting examples that illustrate its usefulness. Whether all these data should really be encoded as Linked Data, and whether the sentiment analysis should be part of it, is another question. I would suggest focusing a bit more on providing interesting statistics about the data, e.g., the distribution of hashtags and so on.

*Update after the rebuttal*: I would still argue that a Linked Dataset should be served via a query endpoint and not just a data dump; after all, this is a core part of the Linked Data vision. Also, a large part of the argument for such a dataset in the first place is that it is difficult for individuals to collect and enrich the data on their own. This will also be true for hosting the endpoint, especially given the size of the data. The rest of the rebuttal is good, and the paper seems ready for acceptance.
Review 4 (by Anisa Rula)
The resource paper presents the representation of tweets in RDF. The tweets were gathered from 2013 to 2017; after all the pre-processing, about 1.5 billion tweets remain, which were transformed into about 48 billion RDF triples.

Positive aspects:
* There already exist similar datasets, but this is the first large-scale dataset made publicly available.
* This work advances the state of the art by providing an approach that is able to efficiently handle the extraction, storage and computation of identity statements and their closure.
* The state of the art is comprehensive enough.
* This dataset is of interest to the Semantic Web community since it can be used in different scenarios, as mentioned by the authors.
* A queryable subset of the dataset is available on the Web, and the whole dataset is available for download.
* There is a discussion about the maintenance of the dataset, which states that there will be updates every 3 months.
* The repository is available on GitHub.
* The dataset can be easily reused since there is good documentation about it.
* The dataset is designed following Linked Data best practices. The authors perform an appropriate re-use of ontologies.

Negative aspects:
* In the related work section, at the end of the first paragraph, there is no real comparison with the mentioned works. So the question is: how do you distinguish your work from the others? Also, Twarql is very interesting since it translates tweets in real time; it is necessary to compare that approach against yours. If it can be supported by big-data ingestion tools, it might even be better for the maintenance of the dataset, which could then be updated almost in real time.
* Although you state that the dataset is too big, for the moment you do not provide a service that can query the whole dataset. As a consequence, this may limit the reusability of the dataset. The problem could be mitigated by converting the triples to an HDT file.
* You never discuss the execution time it takes to transform 1.5 billion tweets into about 48 billion triples.

The rebuttal has answered almost all of my concerns except the one on the SPARQL endpoint, which is not very convincing. For this reason I will downgrade my score, and if the paper gets accepted I would expect to see a SPARQL endpoint for the whole dataset.