Paper 60 (Research track)

Soccer Event Detection

Author(s): Abdullah Khan, Beatrice Lazzerini, Gaetano Calabrese, Luciano Serafini

Full text: submitted version

Abstract: The research community is interested in developing automatic systems for the detection of events in video. This is particularly important in the field of sports data analytics. This paper presents an approach for identifying major complex events in soccer videos, starting from object detection and spatial relations between objects. The proposed framework first detects objects in each single video frame, providing a set of candidate objects with associated confidence scores. The event detection system then detects events by means of rules which are based on temporal and logical combinations of the detected objects and their relative distances. The effectiveness of the framework is preliminarily demonstrated over different events like “Ball possession” and “Kicking the ball”.

Keywords: Event detection in video; Simple events; Complex events

Decision: reject

Review 1 (by Petar Ristoski)

(RELEVANCE TO ESWC) The topic of the paper doesn't exactly match the topic of the conference, at least not in the current presentation. The authors do not utilize any Semantic Web technologies, ontologies, or standards. The paper might be a better fit for a computer vision conference.
(NOVELTY OF THE PROPOSED SOLUTION) The presented approach is trivial, and the authors fail to clarify the contributions and the novelty of the proposed approach. The authors don't provide a related work section, which makes it difficult to position the paper relative to existing work and to identify the contributions and the novelty of the proposed approach. A couple of related approaches are mentioned in the introduction of the paper; however, the description is very vague and the reader cannot identify the novelty of the proposed approach over the related work.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed approach is very limited. First, the approach is limited to soccer, without even discussing possible generalization to other sports. Second, the approach is able to identify only 2 events in soccer, using manually defined rules, which include several thresholds that are not further explained. As such, I cannot identify the novelty of such an approach.
(EVALUATION OF THE STATE-OF-THE-ART) The authors don't provide a related work section, which makes it difficult to position the paper relative to existing work and to identify the contributions and the novelty of the proposed approach. A couple of related approaches are mentioned in the introduction of the paper; however, the description is very vague and the reader cannot identify the novelty of the proposed approach over the related work.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors fail to give a clear demonstration of, and a clear motivation for, the approach.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation of the approach cannot be considered valid. First, the generation of the evaluation dataset is not clear. Second, the number of events is too small to draw any conclusions, i.e., the dataset can be considered a toy example. In my opinion, annotating soccer videos should be a relatively easy and fast task, thus the number of events in the dataset can easily be increased. The authors don't discuss such issues at all. Third, the approach is not compared to any baseline approaches or related work.
(OVERALL SCORE) The authors propose an approach for identifying events in soccer videos. First, the authors formalize the definition of events in soccer, then propose a pipeline for identifying such events in soccer videos. The approach uses an existing object detection tool (Single Shot Multi-Box Detector) to identify objects in soccer videos, which are later analyzed with user-defined rules in order to identify an event.
SP:
1. Possibly an interesting research direction.
WP:
1. The topic of the paper doesn't exactly match the topic of the conference.
2. The authors don't provide a related work section.
3. The proposed approach is very limited with regard to the event types and the event domain.
4. The evaluation of the approach cannot be considered as valid.
Detailed Review:
The authors propose an approach for identifying events in soccer videos. First, the authors formalize the definition of events in soccer, then propose a pipeline for identifying such events in soccer videos. The approach uses an existing object detection tool (Single Shot Multi-Box Detector) to identify objects in soccer videos, which are later analyzed with user-defined rules in order to identify an event.
Detecting events in videos is an interesting and important area of research; however, it is difficult to identify the contributions and the novelty of the proposed approach in the paper.
The paper has several major drawbacks:
- The topic of the paper doesn't exactly match the topic of the conference, at least not in the current presentation. The authors do not utilize any Semantic Web technologies, ontologies, or standards. The paper might be a better fit for a computer vision conference.
- The authors don't provide a related work section, which makes it difficult to position the paper relative to existing work and to identify the contributions and the novelty of the proposed approach. A couple of related approaches are mentioned in the introduction of the paper; however, the description is very vague and the reader cannot identify the novelty of the proposed approach over the related work.
- The proposed approach is very limited. First, the approach is limited to soccer, without even discussing possible generalization to other sports. Second, the approach is able to identify only 2 events in soccer, using manually defined rules, which include several thresholds that are not further explained. As such, I cannot identify the novelty of such an approach.
- The evaluation of the approach cannot be considered valid. First, the generation of the evaluation dataset is not clear. Second, the number of events is too small to draw any conclusions, i.e., the dataset can be considered a toy example. In my opinion, annotating soccer videos should be a relatively easy and fast task, thus the number of events in the dataset can easily be increased. The authors don't discuss such issues at all. Third, the approach is not compared to any baseline approaches or related work.
-------------------------------------------------------------------------------------------
After the authors' response:
Thanks to the authors for the response. 
I still think this is a preliminary study, and in its current shape it might be a better fit for a workshop paper or a poster.


Review 2 (by Jeremy Debattista)

(RELEVANCE TO ESWC) This work was submitted to the Semantic Data Management and Big Data track. Whilst video data can be considered sizeable data, there is only a small element of semantics in this work and no data management.
(NOVELTY OF THE PROPOSED SOLUTION) Whilst this preliminary approach sounds interesting, as a football (soccer) fan myself, I feel that such event detection is already done at a large scale (see, for example, VAR; I think it works a bit better in rugby than in football). Furthermore, large data statistics companies (such as Opta) have systems in place to get such statistics intelligently; however, these companies would not normally divulge exactly how this is done. Nonetheless, the authors did not provide a research question for the task at hand, and did not provide adequate related work in this area.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) I found a number of flaws in both the simple and the complex event definitions. These should be described better. Furthermore, given the nature of a football match, one cannot rule out unexpected events and/or combinations of events that rarely happen in a match. Therefore, in such a preliminary work, I would have expected the authors to formally prove (null and alternate cases) the notation in order to show that it would work for any event that might occur. Some comments with regard to the notation presented:
Simple events:
1) The example in section 2.1 is not consistent with the definition of SE. The ID is not mentioned in (1) and time t is not included either.
2) The simple event type notation seems to be heterogeneous. I suspect that this should have been a tuple SE = <seType , <Role>>, where Role = <role_x, oType_x>  1 < x <= n
3) Considering the original definition and description of SE, I am not sure why the authors need the roles and types when they can be inferred from the seType, e.g., taking "Throwing the ball", semantically a machine should be able to infer that the doer is a player/ball boy and the "ball" is the object being thrown.
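The tuple suggested in comment 2 can be written down as a small data structure. This is only a minimal sketch of the reviewer's proposed notation, not the paper's definition: the class names, the field names, and the role labels "doer" and "object" are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Role:
    """One <role_x, oType_x> pair (names are illustrative assumptions)."""
    role: str    # e.g. "doer"
    o_type: str  # type of object that fills the role, e.g. "player"


@dataclass(frozen=True)
class SimpleEventType:
    """SE = <seType, <Role>> as suggested in the review comment."""
    se_type: str
    roles: Tuple[Role, ...]


# Hypothetical instance for the "Throwing the ball" example in comment 3.
THROW = SimpleEventType(
    se_type="Throwing the ball",
    roles=(Role("doer", "player"), Role("object", "ball")),
)


def roles_for(se: SimpleEventType) -> Dict[str, str]:
    """Map each role name to the object type expected to fill it."""
    return {r.role: r.o_type for r in se.roles}
```

With such a structure, the question raised in comment 3 becomes whether `roles` needs to be stored at all or could be derived from `se_type` alone.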
Complex Events:
1) Is Type the same as seType?
2) I have trouble understanding why you need both logical and temporal events. As far as I know, there are no events in a football match that would require the "OR" operator. I see a football match as a sequence of events. Therefore, I would recommend that LCE be removed, as it is unnecessary.
3) The authors state that a complex event can be composed of simple events or other events. This is not demonstrated well in the notation, and the current notation actually does not cater for this supposition.
(EVALUATION OF THE STATE-OF-THE-ART) The authors did not provide adequate state of the art description, and the few articles mentioned were not compared theoretically or evaluated with the proposed solution.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The authors give two examples of the simple event which are also up for discussion. The authors failed to mention what the threshold value is and why it is important. Would the threshold be the same for all players? Would that hold for all players? Consider myself as one example, and Messi as another.
With regard to the first example (ball possession), my first impression was that the authors should have considered ball possession as a team. I also question the use of time t in order to calculate the distance D. The event, as described here, shows that not all "ball possession" techniques were considered. For example, what if a player pushes the ball ahead to get past a defender using his pace? It is still ball possession, it is not kicking the ball, but this event would fail. For the second example, I suggest that the authors take into consideration the speed of the ball. For example, in the pace dribbling I mentioned before, the ball will travel with less pace, but when I am kicking/shooting the ball, the pace and power on the ball are different!
The proposed architecture was adequately explained, however, I miss a lot of details on how to actually do it. Therefore, an algorithm to show how rules are triggered would be very helpful in such a paper.
In the evaluation section, I have no idea how the authors decided on the magic number 5. Was it arbitrary? If so, did the authors try different values to analyse the results? Furthermore, the complex event "Pass the ball" seems to be wrong according to the complex event definition.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) No algorithm (or an online code repository link) and test data were provided.
(OVERALL SCORE) The authors describe their preliminary work on event detection in soccer matches using videos and images. The paper, as the authors also highlight, is still in a preliminary phase. The lack of a research question or research problem definition, together with the lack of related work, makes it difficult to highlight the main contribution, apart from a couple of notations that are not correctly defined. I don't feel there are any strong points in this work, whilst the weak points were described in detail in the evaluation points above.
-----
After the authors' rebuttal:
Thanks to the authors for the response. I suggest that this preliminary work be submitted to the poster and demo session or, even better, to a relevant workshop. My scores remain unchanged.


Review 3 (by Mario Cannataro)

(RELEVANCE TO ESWC) The paper is relevant to the topics of the conference. It proposes a novel rule-based soccer event detection method. Moreover, as pointed out by the authors, the statistical analysis of soccer games is a new and interesting topic.
(NOVELTY OF THE PROPOSED SOLUTION) The novelty of the paper seems to be:
- the workflow of the proposed methodology based on the implementation of SSD-based object detector and rule-based event detector;
- identification of low-level complex events in soccer video.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors describe the entire workflow in detail and formally, from the definition of events to the description of events detector.
(EVALUATION OF THE STATE-OF-THE-ART) The authors have analyzed the context. They have reported and discussed some previous related works in the "Introduction" section. Reference 1 is not clear. I suggest improving this section by adding recent references about the use of semantic-based methodologies in soccer event detection. Specifically, the readability of the paper could be improved by introducing a dedicated section on the state of the art. Moreover, it could be useful to compare the proposed methodology to other approaches proposed in the literature, in terms of accuracy, precision, and computational cost.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The demonstration and discussion of the properties of the proposed approach is quite good.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) I suggest performing more experimental tests on different videos. It would be interesting if the authors evaluated other performance measures of the proposed approach, not only accuracy and precision. Specifically, the paper lacks any detail about the computational implementation and performance.
(OVERALL SCORE) The paper "Soccer Event Detection" proposes a new approach for identifying major complex events, such as "Ball possession" and "Kicking the ball" in soccer videos. 
The proposed methodology is based on the use of the Single Shot Multi Box Detector (SSD) for object detection in each frame. The SSD object detector provides objects expressed in terms of bounding boxes with a given confidence score. Then, a filtering step is applied to select objects associated with a confidence score higher than a specific threshold. Finally, by examining the distance between the bounding boxes of selected objects and using logical and temporal operators, events are detected. The demonstration and discussion of the properties of the proposed approach is quite good.
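To make the three steps summarized above concrete (confidence filtering, distance between boxes, a temporal rule), a minimal sketch follows. It is not the authors' implementation: the `Detection` type, the use of box-center distance, the labels "player" and "ball", and all parameter names and values are assumptions made for illustration.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


@dataclass
class Detection:
    label: str    # assumed labels: "player" or "ball"
    score: float  # detector confidence in [0, 1]
    box: Box


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def filter_detections(dets: List[Detection], min_score: float) -> List[Detection]:
    """Keep only detections whose confidence reaches the threshold."""
    return [d for d in dets if d.score >= min_score]


def player_ball_distance(dets: List[Detection]) -> Optional[float]:
    """Smallest center distance between the best ball and any player."""
    balls = [d for d in dets if d.label == "ball"]
    players = [d for d in dets if d.label == "player"]
    if not balls or not players:
        return None
    bx, by = center(max(balls, key=lambda d: d.score).box)
    return min(
        hypot(center(p.box)[0] - bx, center(p.box)[1] - by) for p in players
    )


def possession_starts(frames: List[List[Detection]], min_score: float,
                      max_dist: float, min_run: int) -> List[int]:
    """Start indices of runs of at least min_run consecutive frames in
    which the ball stays within max_dist of some player."""
    starts, run = [], 0
    for i, dets in enumerate(frames):
        dist = player_ball_distance(filter_detections(dets, min_score))
        if dist is not None and dist <= max_dist:
            run += 1
            if run == min_run:
                starts.append(i - min_run + 1)
        else:
            run = 0  # the temporal condition is broken; start over
    return starts
```

Every threshold here (`min_score`, `max_dist`, `min_run`) is a free parameter, which is why the other reviews ask how such values were chosen.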
The authors have tested the proposed methodology on a short 5-minute video consisting of approximately 7.5k frames. The reported accuracy is 92% for "Ball possession" and 84% for "Kicking the ball".
The paper is very interesting and well organized. The authors explain in detail the workflow of the proposed novel methodology based on SSD-based object detector and, then, a rule-based event detector. 
Some improvements of the paper could be:
- a more detailed section on the state of the art;
- a larger number of experimental tests;
- a comparison of the proposed approach to other methodologies proposed in the literature.  
I suggest performing more experimental tests on different videos. It would be interesting if the authors evaluated other performance measures of the proposed approach, not only accuracy and precision. Specifically, the paper lacks any detail about the computational implementation and performance.


Review 4 (by Ralph Ewerth)

(RELEVANCE TO ESWC) In this paper, an approach for recognizing soccer events using spatial relations between objects is suggested. Objects are detected with a fine-tuned SSD object detector. Several simple and complex events are defined using a tuple representation. Spatial distances of the detected bounding boxes of players and the ball over several frames are measured in order to detect two events: “Ball Possession” and “Kicking the ball”. Detecting events using a tuple representation is a promising idea and makes it relevant to ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The novelty of the presented approach is incremental, since it simply uses distances between objects extracted with the popular SSD visual object detector in a tuple representation to find simple events (ball possession, kicking the ball) in soccer games.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) It is not clear on which data the SSD object detector is fine-tuned. If image data from the same soccer match (as used for testing) was used, the applicability to other soccer matches is reduced. There is no explanation of how the thresholds for filtering and for detecting both events were selected. Since the thresholds are decisive for the quality of the results, they should have been properly evaluated.
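One simple way to evaluate a distance threshold is to sweep it over candidate values and report precision and recall against ground-truth labels. The sketch below assumes per-frame player-ball distances and boolean ground-truth labels are available; the function names and the data layout are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def precision_recall(pred: Sequence[bool], truth: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of boolean predictions against ground truth."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum((not p) and t for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def sweep(distances: Sequence[float], truth: Sequence[bool],
          thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """For each threshold, predict an event where distance <= threshold
    and return (threshold, precision, recall)."""
    rows = []
    for th in thresholds:
        pred = [d <= th for d in distances]
        p, r = precision_recall(pred, truth)
        rows.append((th, p, r))
    return rows
```

Reporting the full sweep, ideally on a held-out clip, would show how sensitive the results are to the chosen value rather than leaving it as a single unexplained number.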
(EVALUATION OF THE STATE-OF-THE-ART) There are many proposals for event detection in football/soccer games or sport events in general, which should be addressed in the related work section. Furthermore, the advantages of using a tuple representation over consecutive frames compared to, e.g., deep learning approaches using temporal information like 3d-ConvNets (Tran et al.: Learning spatiotemporal features with 3d convolutional networks. ICCV 2015) or RNNs (Feichtenhofer et al.: Spatiotemporal residual networks for video action recognition. NIPS 2016) for the recognition of actions should be explained in detail and evaluated using an appropriate test dataset. There are also some CNN-based approaches for soccer event detection:
- Jiang et al.: Automatic soccer video event detection based on a deep neural network combined cnn and rnn. ICTAI 2016
- Liu et al.: Soccer Video Event Detection Using 3D Convolutional Networks and Shot Boundary Detection via Deep Feature Distance. ICONIP 2017
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) See "Correctness and Completeness of the Proposed Solution".
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The experimental setup consisting of 7,500 consecutive frames of a single video with only 14 and 19 ground-truth events is not sufficient to evaluate the proposed approach. The approach is not able to recognize "Ball possession" and "Kicking the ball" events if the players are close to one another and if the ball hits another player within 5 frames, respectively. There is no comparison with other approaches, making it difficult to evaluate the results.
(OVERALL SCORE) Strong Points (SPs):
+ Tuple representation to detect events is a promising idea.
+ The paper is very clearly structured and therefore easy to understand.
+ Very detailed explanation of the events in Section 2.
Although we believe that detecting events using a tuple representation is a promising idea, there are some concerns regarding the approach:
Weak Points (WPs):
- Experimental evaluation.
- Fine-tuning of SSD object detector is not clear. 
- No explanations regarding threshold setting for filtering and detecting events 
- Related work using deep learning etc. should be discussed
- Incremental novelty 
Questions to the Authors (QAs):
1.	On which data was the SSD object detector fine-tuned?
2.	How were the thresholds for detecting the events evaluated?
3.	How did you extract the ground-truth labels in the test clip?
--------------------------------
*** After the authors' response:
We thank the authors for their response and additional explanations. However, as stated by the authors, we also believe that the work is preliminary. It would be useful, for instance, if the paper considered related approaches based on computer vision and deep learning in more detail in the future (two references were given above as examples) and demonstrated, in a more comprehensive experimental study, in which ways the proposed approach is beneficial. Also, the questions that we had raised in our review should be addressed.
There are no changes in our scores.


Metareview by Maribel Acosta

This work presents a two-fold approach to automatically detect events in soccer videos. The approach relies on generating a set of candidate objects and then identifying events modelled as rules. The authors conducted an empirical evaluation to measure the 
The reviewers indicated that this paper follows an interesting research direction with promising ideas. Nonetheless, the reviewers have identified major issues in this work, which include: limited scope of the proposed solution, lack of discussion of related work in the area of ML, and insufficient experimental evaluation. In addition, some reviewers indicated that the paper -- as it is presented -- does not exactly fit with the ESWC topics as the role of semantics in the approach is rather unclear. 
In summary, the reviewers agreed that this submission is still too preliminary for acceptance. We highly encourage the authors to consider the reviewers' suggestions to advance this work.

