Semi-Automatic Semantic Annotation of Videos in Ballet for Meta-data Tagging, Analysis and Retrieval
Author(s): Swati Dewan, Shubham Agarwal, Navjyoti Singh
Full text: submitted version
Abstract: Over the last decade, the volume of user-generated content on the web has skyrocketed. The Semantic Web, however, has not grown at the same rate, especially when it comes to videos. This is primarily due to the high cost in time and resources of manually annotating such data. Here, we leverage advancements in Machine Learning to reduce these costs by building a faster multimedia annotation system for videos for the Semantic Web. We propose a semi-automatic annotation model which automatically generates semantic annotations over a big dataset of videos using only a small number of manually annotated clips per semantic category. We provide a new semantically annotated dataset on ballet and test our model on it. High-level concepts such as ballet poses and steps are used to build the semantic library. These also act as descriptive meta-tags for any ballet video, making the videos retrievable using a semantic or video query.
Keywords: Automatic Annotation; Semantic Annotation; Ballet; Laban Movement Analysis; Meta-data Tagging; Machine Learning; Retrieval; Neural Networks
Review 1 (by Valerio Basile)
(RELEVANCE TO ESWC) More or less nothing of this paper relates to the Semantic Web, or the Web in general. This work involves machine learning for semantic annotation of video content. The original videos could be taken from the Web, and the final resource might become a SW dataset, but no detail is given in this regard.
(NOVELTY OF THE PROPOSED SOLUTION) As far as the annotation of ballet videos goes, this approach is novel, as explicitly pointed out in 5.2. As a general framework for semantic annotation, there is not much novelty, as the procedure is quite straightforward: gather videos, annotate them at the frame level, use an LSTM to predict the labels, which makes sense because the data is sequential in nature. The statistical model itself is very much studied and popular for a wide variety of tasks, e.g. in NLP.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The approach is very simple, thus it seems to be correct in all its parts. The only bit of evaluation provided is the accuracy score of the LSTM on a small test set. No further analysis is provided, nor alternative methods.
(EVALUATION OF THE STATE-OF-THE-ART) No evaluation of the state of the art nor comparison with other systems is provided.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The theoretical backbone underlying this work is quite well explained. However, when it comes to the computational contribution, almost no discussion is provided as to why this particular model should work, what its strengths and drawbacks are, what the lower and upper bounds for this task are, and so on. This is a supervised approach, so it relies on manual annotation of videos, which is hard and time-consuming. Despite the seemingly high figure of 94% accuracy, it is not clear that the problem is solved, that the approach generalizes to other types of videos (or even other dance types), and therefore that we can produce high-quality semantic annotation and skip the manual process entirely.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experiment should be easy to reproduce if the code and specifications of the implementation details are given, along with the annotated data. The only experiment presented in the paper seems very limited in scope, and the results do not give any information about the applicability of this method to other types of data.
(OVERALL SCORE) This paper is a preliminary work of machine learning applied to a very interesting use case, with a sound theoretical backbone. I suggest that the authors perform more experimental evaluation, especially with respect to the generalization power of the approach, and perhaps consider a venue more fitting than the Semantic Web conference.
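[Editor's note] The pipeline this review summarizes (per-frame features fed to an LSTM whose final state predicts a clip's semantic label) can be sketched as a minimal, self-contained forward pass. All dimensions, weights, and the category count below are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_sequence_logits(frames, W, U, b, W_out, b_out):
    """Run a single-layer LSTM over per-frame feature vectors and
    return class logits computed from the final hidden state.

    frames: (T, d) array of per-frame features (e.g. skeleton descriptors)
    W, U, b: stacked gate parameters of shapes (4h, d), (4h, h), (4h,)
    W_out, b_out: linear read-out to the semantic categories
    """
    h_dim = U.shape[1]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    for x in frames:
        z = W @ x + U @ h + b              # all four gates in one product
        i = sigmoid(z[0 * h_dim:1 * h_dim])  # input gate
        f = sigmoid(z[1 * h_dim:2 * h_dim])  # forget gate
        o = sigmoid(z[2 * h_dim:3 * h_dim])  # output gate
        g = np.tanh(z[3 * h_dim:4 * h_dim])  # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
    return W_out @ h + b_out

# Tiny smoke run with random (untrained) weights: 5 frames of
# 8-dimensional features, hidden size 6, 3 hypothetical categories.
rng = np.random.default_rng(0)
d, h_dim, n_cls, T = 8, 6, 3, 5
W = rng.normal(size=(4 * h_dim, d)) * 0.1
U = rng.normal(size=(4 * h_dim, h_dim)) * 0.1
b = np.zeros(4 * h_dim)
W_out = rng.normal(size=(n_cls, h_dim)) * 0.1
b_out = np.zeros(n_cls)
logits = lstm_sequence_logits(rng.normal(size=(T, d)), W, U, b, W_out, b_out)
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over categories
```

In a trained system the weights would of course be learned from the manually annotated clips; the sketch only shows why the model fits sequential frame data.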
Review 2 (by Victor de Boer)
(RELEVANCE TO ESWC) The relevance to the ESWC community is not made clear. The annotation task includes the LMA framework, which makes a lot of sense for this task, but the paper does not port or link this to ideas about Knowledge Representation, (Semantic) Web technologies, etc. The most relevant related work here (from el Raheb et al.) is mentioned in the related work section, but it is unclear why their ontologies were not used or what the benefits of using the LMA model are.
(NOVELTY OF THE PROPOSED SOLUTION) The method itself, given the specific task, seems novel and builds on the state of the art.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The method seems correct and complete, but the algorithms cannot be checked in detail.
(EVALUATION OF THE STATE-OF-THE-ART) The ML state of the art is well-defined. However, the paper provides only a basic discussion of related work in dance video annotation and dance ontologies.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The method is well-described. Limitations and an in-depth discussion of its properties are missing.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) Not reproducible.
(OVERALL SCORE) This paper presents a semi-automatic method for annotating ballet videos using a pre-defined set of semantic annotations. The method is built on state-of-the-art machine learning technologies and uses posture detection and end-to-end learning to match video frames to postures and movements. These postures and movements are based on the Laban Movement Analysis framework.
The authors present the outline of their method and test it on a human-annotated set of test videos, where the claimed accuracy is an AUC of 95%.
The strong points:
- A very interesting and challenging problem is tackled using state-of-the-art Machine Learning.
- The paper makes a technical contribution in a field where not much has been done.
- The paper is well-written and the steps and decisions are easy to follow.
The weak points:
- The authors describe an evaluation of the automatic method, but this evaluation is very briefly described (only the AUC is mentioned). The evaluation setup is not clear, nor is the evaluation data shared. The datasets used to train and evaluate the model are also not shared at this moment. The authors promise to do so, but at this point the lack of resources makes the reproducibility and transparency of the experimental study quite weak.
- A similar problem exists with the method itself. The method is described in the paper at a higher level, but the algorithms are not shared. Here too, this hinders reproducibility and transparency.
- Finally, as noted above, the relevance to the ESWC community is not made clear, and it is unclear why the ontologies of the most relevant related work (el Raheb et al.) were not used or what the benefits of the LMA model are.
Based on this, I would not argue for accepting the paper at the ESWC conference, even though I find it a very interesting and well-written piece of text. I think that, with improvements in reproducibility and transparency, this will be an interesting contribution to the field of media analysis, computer vision or digital humanities.
Minor issues and typos:
- p1: "In this paper we propose, a semi-automatic annotation system which automatically generates semantic annotations for ballet videos" -> "In this paper we propose a" (without the comma). This sentence is also a bit confusing, as it talks about semi- and fully automatic at the same time. The point is elaborated later but is confusing here.
- Fig. 1: it seems that some numbers got lost in the layers (2x..).
- p9: "We use a total of 23 semantic annotations to divide the dataset. Of these 22 semantic events" -> is it 23 or 22? Later in the paragraph there is a confusion between 14 and 15 dynamic events.
- Section 5: The related work is rather limited. There is some work on dance annotation and retrieval, for example: Ramadoss, B., & Rajkumar, K. (2007). Semi-automated annotation and retrieval of dance media objects. Cybernetics and Systems: An International Journal, 38(4), 349-379.
Review 3 (by Judie Attard)
(RELEVANCE TO ESWC) The authors exploit machine learning technologies to create a semi-automatic annotation model for video.
(NOVELTY OF THE PROPOSED SOLUTION) Whilst the application to ballet videos might be novel, the semi-automatic annotation of videos is not. Unfortunately, the authors fail to provide any related work in order to compare against and indicate how their approach improves upon the existing state of the art.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors do not provide details about the evaluation process and simply state that they "obtain an AUC accuracy of 94.56% on the dataset".
(EVALUATION OF THE STATE-OF-THE-ART) The authors fail to provide an appropriate state-of-the-art/related work section. The authors mention that there is substantial literature on enhancing manual annotation processes, and on ontological tools that automatically build ontologies from manually annotated data. However, the authors state that they "haven't seen much work done on generic automatic semantic annotation frameworks". A simple Google search shows that there is at least some obviously relevant related work that the authors could have exploited in this publication. In particular, it is worth noting that there is also literature on the semantic annotation of videos. Moreover, the authors re-use work from other authors for the ballet ontologies and Labanotation, yet fail to provide any relevant literature on these topics. Discussion of the approaches and methods used for the semi-automated tagging is also missing.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors fail to describe their approach in substantial detail. They do not provide any clarification on why they used the selected tools/techniques as opposed to others.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The proposed solution can be adapted for use on similar dance videos by adapting the vocabulary used.
However, the authors fail to provide details about their experiments, which impacts the reproducibility of this research.
(OVERALL SCORE) In this paper the authors propose a semi-automatic annotation model that generates annotations on videos. This semi-automatic process requires the use of a small number of manually annotated clips. The authors generate a vocabulary of ballet poses (and generic movements). They extract human motion from frames in the videos using a machine learning approach. This data is then used to recognise and classify semantic events (ballet movements) in a video. The problem tackled in this paper is the semi-automatic annotation of videos; the authors create a semantic vocabulary of ballet terms and exploit it to semi-automatically extract ballet movements from videos.
Strong points:
- Relevant topic.
Weak points:
- No related work discussion (and therefore no indication of the improvement over the state of the art).
- No description of the evaluation process.
- No description of or clarification as to why the authors used the specific tools/techniques/approach.
Review 4 (by Ralph Ewerth)
(RELEVANCE TO ESWC) The task of automatically generating annotations for motion data in video in order to generate a semantically rich database is related to the conference tracks on information retrieval and machine learning.
(NOVELTY OF THE PROPOSED SOLUTION) The authors apply a fine-tuned state-of-the-art method for skeleton extraction that generates a 3D model of the human body from two-dimensional video data. Next, they derive a descriptive feature extraction method to encode ballet movements from an existing methodology called "Laban Movement Analysis" (LMA). To the best of our knowledge, this is (at least) novel for the target domain of ballet videos.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) Section 3.2 describes the aforementioned adaptation. It lacks some details, which will presumably make it difficult to reproduce this method, for example "we also calculate orientation of certain bodyparts". An in-depth example that illustrates the generation of feature vectors from the extracted 3D skeletons would be helpful.
(EVALUATION OF THE STATE-OF-THE-ART) The related work section of this paper is not sufficiently comprehensive. While it is understandable that the niche of ballet movement detection is a sparse field of research, a general overview of the field of human motion detection and applications of LMA should have been provided.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) As mentioned below, the demonstration of the proposed framework could benefit from a more detailed explanation of and reasoning behind certain model choices.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) This part is unclear and should be described in a dedicated section. A manual labelling process is conducted for 2-4 videos per category, and these are used for training. It remains unclear how the remaining dataset that was not manually annotated can be used for testing.
(OVERALL SCORE) A semi-automatic annotation system is presented which automatically generates semantic annotations for a big dataset of videos using only a small number of manually annotated clips per semantic category. For this purpose, high-level concepts are utilized, such as different ballet poses, ballet steps, and static as well as dynamic visual events. The goal is to provide a semantically searchable database of these videos that will be publicly released. This database can be queried via text or video snippets.
Strong points:
+ LMA adaptation
+ Goal to make an open access database
+ Fine-tuned skeleton extraction model
Weak points:
- Experimental results; insufficient size of the test set
- Related work section
- Completeness of the framework description
The presented work is interesting, but the experimental results do not sufficiently support the advantages of the proposed approach. A dedicated related work section would help make the contribution of the paper clearer, and the contribution and novelty should be emphasized in the introduction. Finally, reasoning regarding the selected parameters is needed (e.g., the "threshold" of the annotation generator).
Some minor comments:
- Section 4: Numbers in the second paragraph are not coherent (23 vs 22 semantic events, 14 vs 15 dynamic events).
- Section 4: "keep only those that fall above a certain threshold": which threshold, and how is it determined?
- Section 4: "punk bars" instead of "pink bars".
- Fig. 3: The last sentence of the caption should be extended, for instance describing which two semantic events are meant.
- Conclusion: "crossoads".
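[Editor's note] The kind of worked example this review asks for (feature vectors derived from extracted 3D skeletons, e.g. joint orientations) could look like the following minimal sketch. The joint names and coordinates are hypothetical and not taken from the paper:

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by 3D points a-b-c,
    e.g. a knee angle computed from hip, knee, and ankle coordinates.
    One such angle per joint of interest yields one entry of a
    per-frame skeleton feature vector."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    cos_t = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp rounding error
    return math.degrees(math.acos(cos_t))

# A fully straight leg: hip, knee, and ankle collinear -> 180 degrees.
straight = joint_angle((0, 2, 0), (0, 1, 0), (0, 0, 0))  # 180.0
# A leg bent at a right angle at the knee -> 90 degrees.
bent = joint_angle((1, 0, 0), (0, 0, 0), (0, 1, 0))      # 90.0
```

Concatenating such angles (plus the body-part orientations the paper mentions) across the tracked joints of each frame would give the per-frame descriptors the sequential model consumes.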
Metareview by Andreas Hotho
The paper presents a deep learning approach to automatically annotate ballet videos. The work is only weakly related to the scope of the conference, as only the outcome could become a Semantic Web dataset. The approach is a standard machine learning method and the overall evaluation is not well described. There was no author response to clarify the open questions. Given the clear statements of the reviewers, we suggest rejecting this work.