Paper 113 (Research track)

MaskLink- Efficient Link Discovery for Spatial Relations via Masking Areas

Author(s): Georgios Santipantakis, Christos Doulkeridis, George Vouros, Akrivi Vlachou

Full text: submitted version

Abstract: In this paper, we study the problem of spatial link discovery (LD), focusing primarily on topological and proximity relations between spatial objects. The problem is timely due to the large number of sources that generate spatial data, including streaming sources (e.g., surveillance of moving objects) but also archival sources (such as static areas of interest). To address the problem of integrating data from such diverse sources, link discovery techniques are required to identify various spatial relations efficiently. Existing approaches typically adopt the filter and refine methodology by exploiting a blocking technique for effective filtering.
In this paper, we present a new spatial LD technique, called MaskLink, that improves the effectiveness of the filtering step. We show that MaskLink outperforms the state-of-the-art algorithm for link discovery of topological relations, while also addressing some of its limitations, such as applicability for streaming data, low memory requirements, and parallelization. Furthermore, we show that MaskLink can be extended and generalized to the case of proximity-based link discovery, which has not been studied before for spatial data.
Our empirical study demonstrates the superiority of MaskLink against the state-of-the-art in the case of topological relations, and its performance gain compared to a baseline technique in the case of proximity-based LD.

Keywords: Link Discovery; topological relations; proximity relations

Decision: reject

Review 1 (by Ziqi Zhang)

(RELEVANCE TO ESWC) Linking spatial objects is an important and interesting topic to the semantic web, but since this paper is submitted to the linked data track, it is not very clear how it contributes to that area (using linked data, or creating linked data?)
(NOVELTY OF THE PROPOSED SOLUTION) A novel approach is introduced addressing the main problem of efficiency in existing linking algorithms. It is also the first study on linking Point-to-Region Proximity-based Relation, as the authors claim.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The proposed method is sound and interesting, it is well describe and easy to follow. But you should better motivate why improving the efficiency of the methods is more of an issue than just some engineering tweaks (e.g., by parallelisation)
(EVALUATION OF THE STATE-OF-THE-ART) Evaluation includes comparison against some well known state of the art methods. Efficiency of these systems are evaluated, but as scientific research you should also discuss how effective your methods are, especially you are claiming your method is the first solution to some problems.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) Results are reasonably discussed. But I would like to know something about the effectiveness of your method.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The method is well described and should be easy to re-implement. It would be good if the authors can share their code and data.
(OVERALL SCORE) **short description**: this paper introduces a novel method that addresses the efficiency of linking spatial objects including areas, points and lines. Compared to state of the art, results show that the proposed method uses less computation.
**strong points**: 
1. paper is well written, easy to follow
2. tackles Point-to-Region Proximity-based Relation lining, which is novel
3. comparison against state of the art
**weak points**:
1. motivation for more efficient algorithms is not very strong
2. no results or observations/comments on the effectiveness/accuracy of the method
3. lack of elaboration on the link of this work to the Linked Data track
**QAs**
1. Did you compare the accuracy of linking of your methods and the state of the art? If so can you comment on that? If not, why not?
2. Can we parallelise some of the procedures of some existing methods and achieve the same efficiency improvement? If yes, why developing more efficient methods is crucial? If not, why not?
This paper introduces a novel method that addresses the efficiency of linking spatial objects including areas, points and lines. Link discovery in the geo-spatial domain is a 'niche' and perhaps under-studied area. It is certainly very interesting and important subject, and the paper is generally well-written, easy to follow, and the idea is fairly interesting. But I have some reservations regarding its contribution to this conference and I am yet to be convinced. 
Firstly, I cannot see an explicit link between this work and Linked Data (the track this work is submitted to) and Semantic Web in general, other than that you are findings 'links' in data. But finding links in data is a very general subject that can belong to many domains, e.g., databases. I accept that I am not an expert in this domain, but the majority of audience at this conference may as well not be. You should better pitch your work to the audience by explaining its link to the Semantic Web - are your experiments using linked data? do we currently have lots of geospatial data on the Web that can benefit from your link discovery? what are the semantic applications that can benefit from this?
Secondly, your method addresses efficiency of link discovery methods. Again I think you should better motivate this. It is an expectation that most scientific work advances 'effectiveness', while 'efficiency' can often be dealt with by engineering tweaks. If efficiency is so important for this task, you should elaborate on it - why is it not something that we can simply achieve by using parallelism, and larger memories? Why is it so important to keep resource requirements low?
Thirdly, even though your method focuses on efficiency, it is not clear whether and how it impacts on effectiveness, or accuracy. Again I am not entirely aware how linking accuracy is measured/evaluated in this domain and I think the authors should inform the readers because most of them may be unfamiliar with the problem as well. But I would imagine some measures out there, as you cannot expect the linking to be 100% correct, right? For example, on linking 'proximity/nearby', how exactly do you create the buffered area? Is it possible that this buffered area can only be an approximation such that it can 'miss' some legitimate links? On the Point-to-Region Proximity-based Relation experiments, you said your method is the first that is able to do this. If so, you should definitely show us how accurate your discover links are.
*******************
** taking into account author's rebuttal **
I can accept author's response and hence my changed scores. However, authors should clearly discuss the link of this work to the semantic web - as you mentioned, some linked data are generated. But this should be discussed in more detail.
ANOTHER CRUCIAL ISSUE is, as one reviewer pointed out, you have the same work currently under review at the semantic web journal. I would urge the authors to review the policies of the two venues (ESWC & SWJ) regarding double submission.


Review 2 (by anonymous reviewer)

(RELEVANCE TO ESWC) The proposed approach is relevant for the computation of semantic interpretations of topologies of spatial entities. Likewise, it could be a valuable contribution to other communities than ESWC as well, such as geoinformatics.
(NOVELTY OF THE PROPOSED SOLUTION) This work builds upon the current state of the art in spatial link discovery and offers substantial improvement in terms of supported features, performance and memory efficiency based on a novel filtering technique. The authors claim that to the best of their knowledge there is no other comparable approach to optimizing spatial link discovery in such manner.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors provide sound, comprehensive and detailed investigations of the capabilities and limitations of their approach. However, the aspects of performance-by-relation as well as performance with multiple masks per cell are not considered.
(EVALUATION OF THE STATE-OF-THE-ART) The list of technical solutions to the problem of link discovery shows that the authors have investigated and rated current approaches.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) There is reasonable elaboration of the characteristics of the approach, although in very few parts it is a bit difficult to follow.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The steps for reproducing the experimental setup can be easily followed, however, some details on the preparation process of the merged data set would be helpful: Was there some special ordering or preprocessing involved in the merging of datasets, of which the MaskLink process could subsequently take advantage of? Further: do the used datasets cover all possible combinations of entities and masks per cell, so that there might also occur spatial configurations not described in the paper?
(OVERALL SCORE) *Short description
The paper proposes a novel and improved solution for spatial link discovery between spatial entities, which is an important feature especially of location-based applications. The approach is properly motivated and convincingly elaborated. Using the established grid cell blocking technique, this approach treats the non-empty areas within cells, i.e. those not covered by spatial entities, as separate entities called "masks" to compare other entities against. By doing so, the masks act as filters that reduce the number of required comparison operations between different types of entities from different data sets while being computationally more efficient than existing solutions. This approach and its properties are convincingly discussed and experimental results are reported.
*SPs
- The paper describes an interesting and improved solution to a known problem in a well-structured and well-written form.
- The approach can be extended to handle both spatial entities other than areas and proximities of areas, which is important for various applications in this field.
- The authors provide comprehensive elaboration of the topic, permitting easy reimplementation of their approach.
*WPs
- No answers are given to the - admittedly special - questions of the computational properties of MaskLink when presented with "artificial" configurations of entities, i.e. such that might not typically occur in the data sets used, or what the typical cost ratio of mask computation vs. actual link discovery is.
- The authors mention the applicability of MaskLink in streaming scenarios, however this distinctive feature seems to have not been investigated further.
- (I do not see any other really weak points at this time.)
*QAs
1) Cost-per-Relation: Considering the variety of datasets and possible spatial configurations of entities, are there relation-specific cases in which MaskLink could actually perform worse than other approaches while at the same time performing better for other relations?
===
Some remarks/questions by section:
1:
- "[...] leaves a significant part of the cell empty.": I presume that the property of emptiness is type-specific for the types of entities considered, thus there might be different masks possible within a single cell?
4.3:
- Is there always only one single connected mask area per cell or could there be multiple disconnected mask areas per cell whose union would constitute the mask to be used? How would MaskLink perform in such cases?
- In the "Cost Analysis" section: Intuitively, I would expect the probability p to be 0.5, since A can either be within the mask or not. If the probability p can be other than 0.5, then this would be considered a dataset-specific a-posteriori probability? It might also be a bit difficult to follow that the average number of comparisons is integer only for p = 0. Maybe this section could be refined? Further, are there estimates for the ratio of the costs for mask computation vs. actual LD computation?
5.1:
- What are the effects on proximity detection and performance if a mask is composed from multiple areas, thus permitting an entity to lie within θ of more than one mask area within the same cell? 
6:
- Why was the timeout limit set to 55 hours?
===
Consideration of the authors' rebuttal: I can accept the response. The work shows the knowledge discovery efforts necessary to generate proper linked data in the spatial domain.


Review 3 (by anonymous reviewer)

(RELEVANCE TO ESWC) The paper presents an approach to improve both topological and proximity linking in the context of link discovery, which is relevant for ESWC.
(NOVELTY OF THE PROPOSED SOLUTION) The approach, as the authors claim, is applicable to both topological and proximity link discovery. To the best of my knowledge, this is the first generic approach. Moreover, MaskLink seems to outperform the baselines in both the contexts.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) The authors present the strong points of the work in a clear way, which step-by-step guide the reader to understand the approach.
They clearly define the topological relations, select and explain the baseline approach from the state of the art, then they present their solution.
The only suggestions I have are:
-  add more examples to guide the non-expert reader through the paper even better.
- elaborate more on the streaming aspects
Can you please check the lines 10 in both the algorithm listings?
I think that a return statement there could be erroneous.
(EVALUATION OF THE STATE-OF-THE-ART) To the best of my knowledge, the state of the art is complete. 
A couple of passages could be supported by a reference, e.g., topological relations.
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The approach is deeply discussed and the authors' claims are supported by the evaluation.
My only concerns regard their call to a streaming-ready approach. What the authors' say about the memorylessness is true and can be a good starting point towards a streaming solution. However, they lack discussing the responsiveness requirement which is crucial in a streaming context (that is reactive).
To be streaming ready, the problem should be characterized accordingly. Which sources are streaming and which are static, what is the expected responsiveness? What is correctness?
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The evaluation is based on public datasets and the approach is explained in detail with two algorithms.
Although I would have appreciated the availability of a prototype, I think all the elements for reproducing the experiments are present.
(OVERALL SCORE) the paper present MaskLink a novel approach for link discovery. MaskLink aims at reducing the number of comparisons to check, by a characterization of the empty space in cells after the grid filtering step.
**Strong Points**
the paper is well written and well structured;
the evaluation is performed by-the-book: baseline selection; shared dataset, and fair comparison
the state of the art looks complete
**Weak Points*
the claim to be streaming-ready should be elaborated more
a running example would help to understand the paper and what state-of-the-art approaches cannot do
some sentences require a reference to be supported.


Review 4 (by anonymous reviewer)

(RELEVANCE TO ESWC) The paper proposed a new approach to build links/relations between spatial entities that are extracted from different sources. This is a hot topic in semantic web and could potentially solve issues such as spatial data integration and federation.
(NOVELTY OF THE PROPOSED SOLUTION) Based on the observation that "some space in the cells of space-tiling blocking is never covered by spatial representation", the authors proposed a more efficient approach, named as MaskLink, to build the topological and proximity relations between spatial entities. In the approach, the authors innovatively build a mask on cells to exclude the target spatial entities, and then compare the candidates to the mask to decide either as "disjoint" or any other type of topologies. By doing so, the computation would largely be reduced compared to the pairwise comparison between spatial entities.
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) This paper only demonstrated their approach on region-to-region relations. More explanations and experiments should be made to other types of spatial entities. See my detailed comments in the "Overall Evaluation".
(EVALUATION OF THE STATE-OF-THE-ART) Although the authors tested their approach on some data sets and showed the improved performance, there are some interesting evaluations missing, such as the size of the grid and the relation between p and k, which are proposed in the paper. Again, see details on these points in the "Overall Evaluation".
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) The paper demonstrates the proposed approach very clear by providing the pseudo-code and text-based explanations. Figure 1 also clear shows the core idea of the approach! In terms of the properties of the approach, I think the authors would do better if they explained more about generalizing the approach to other types of spatial entities.
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) This paper barely talks about some important details on reproducing the experiment, such as the grid size, the platform and algorithm to perform the masking task. So it would a little bit hard to reproduce the experiment study.
(OVERALL SCORE) Summary of the paper:
The paper proposed a new approach to build links/relations between spatial entities that are extracted from different sources. This is a hot topic and could potentially solve issues such as spatial data integration and federation. The paper is well-organized and easy to follow in general. 
The MaskLink proposed in this paper could efficiently build the topological and proximity links for spatial entities. As the approach name indicated, it contributes by using the technique of mask to reduce the computation of comparing spatial entities. A series of experiments also show that their approach outperforms the traditional ones. 
Strong Points (SPs):
(1). The idea of using mask first to reduce the computations is innovative and makes a lot senses. 
(2). The experiments conducted is well-organized, well-explained and results are demonstrated very clear 
(3). The paper is easy to follow.
Weak Points (WPs):
(1). The generalization of the approach is skeptical. 
(2). The paper lacks explanations of some important details.
(3). More experiments could be conducted to test the sensitivity of the approach to some parameters.
Questions to the Authors:
1. One important step of the proposed approach is to generate grids and then assign spatial entities to the grids. Then the core challenge is how to select the appropriate cell size. The authors have to explicitly explain it in the paper.
A relevant issue is that the authors talked about the relation between the the “probability that A is enclosed in the mask of c” and the “number of areas” k that are included in B: p>1/k (in the Cost Analysis at section 4.3). It is a very interesting point, but the authors never explained or tested it further in their work. Interesting questions should be answer such as: How to evaluate the p and k? What is the relation between p and the grid size? What is the value of p and k respectively in your experiments?  
2. The authors claimed that the proposed technique could be generalized to other types of spatial entities. But it is not well explained and tested in this paper. At least to polylines, it would be hard, if not impossible, to use the proposed MaskLink, because: what is the mask of a polyline? Therefore, I think the authors’ claim on this point is too weak! In addition, to make such a claim, the authors have to explicitly explain and experiment on building relations for all the combinations of spatial entities (e.g., “polyline-to-polyline” and “polyline-to-polygon”) rather than just demonstrating only “region-to-region” and “point-to-region”.
3. In Section 3: Notation and Definitions, Paragraph 2, the authors spend a lot of texts on explaining the distance between two points. But where did you use it in the paper.
4. By applying the proposed approach, there are cases that two spatial entities are detected to be both “nearby” (using algorithm 2) and “overlay”, or other topological relations like “within”, “disjoint”, (using algorithm 1), right? How to deal with these cases in the approach? 
5. It would be great if the author could discuss more about the impact of mask time on the performance of the proposed approach. Intuitively, there would be a tradeof between building the mask and the complexity/granularity of the geometry. The authors briefly talked about it in section 6.2., but more detailed experiments are needed.


Review 5 (by Roberta Cuel)

(RELEVANCE TO ESWC) Section 2, 3 and part of the four are copied by this already published paper
(NOVELTY OF THE PROPOSED SOLUTION) poor in term of innovation. Most of the content has been already published
(CORRECTNESS AND COMPLETENESS OF THE PROPOSED SOLUTION) I'm not an expert on that
(EVALUATION OF THE STATE-OF-THE-ART) I'm not an expert but the Stata of the Art has been copied from a previous paper  http://www.semantic-web-journal.net/system/files/swj1760.pdf
(DEMONSTRATION AND DISCUSSION OF THE PROPERTIES OF THE PROPOSED APPROACH) I'm not an expert on that
(REPRODUCIBILITY AND GENERALITY OF THE EXPERIMENTAL STUDY) The results seem interesting
(OVERALL SCORE) I do not see how much innovative is the contribution of this paper


Metareview by Hala Skaf

This submission studies the problem of spatial link discovery (LD), focusing primarily on topological and proximity relations between spatial objects.
Reviewers reveal that some parts of this submission are similar to another submission of the same authors "A Stream Reasoning System for Maritime Monitoring"  which is currently also unpublished (listed as "under review" according to http://www.semantic-web-journal.net/content/stream-reasoning-system-maritime-monitoring).
After the rebuttal, the authors argue that the journal paper presents the architecture of a complex event recognition system that is assisted by spatial relations detected by a specific configuration of our method. While the ESWC submission generalizes MaskLink to support any combination of geometries, and provide experimental comparison to other LD methods. 
There is an overlapping between the two submissions, this overlapping is not clearly identified by the authors. As a result, the submission does not fulfill the ESWC requirements.


Share on

One thought to “Paper 113 (Research track)”

  1. Dear authors,

    In your paper you compare your approach to RADON, stating assumptions that are not true and we would like to kindly ask you to correct:

    (1) RADON can perform area – point relationship discovery.

    On page 3 you state that “[…] RADON employs an optimization based on minimum bounding boxes (MBBs) for deciding the granularity of the grid. This means that a data source providing point geometries (e.g. for touches or within relations) would force the construction of an “infinite” number of cells. This indicates that RADON cannot handle point-to-region topological relations.”. RADON indeed employs a heuristic for calculating the grid granularity, but it will take the mean of the average MBB sizes for source and target. So for a point (S) – area (T) discovery task, where the granularity factor in each dimension d would simply be 2 * averageSize_d(T).

    (2) RADON can perform streaming

    On page 6 you state that “Despite the benefits in efficiency offered by these techniques, their adoption also limits the applicability of RADON. For instance, swapping requires that both data sets are available beforehand, thereby rendering this technique (and consequently RADON) inapplicable in the case of streaming data sources.”
    The RADON paper indeed specifies a swapping strategy, but just as a micro-optimization in the case that both datasets are available beforehand. It is not used to “select[…] the data set that will be organized in the cells”, but to determine which dataset to index first. It is in no way elementary to the core RADON algorithm and its contribution to performance is actually very small.
    So while RADON is implemented within LIMES, which currently does not offer a streaming API, its algorithmic design does in no way prevent an implementation for streaming data sources.

    (3) RADONs caching is not memory expensive

    The caching mechanism in RADON is implemented using Map<String, Set>. The strings stored in this data structure are the URLs of the geometries, which are already in memory anyway, so the data structure just references these. Therefore, the memory footprint of caching is neglectable.

    However, we like your contribution “MaskLink”, as it presents a truly novel way of filtering comparisons and indeed will advance the state of the art.

Leave a Reply

Your email address will not be published. Required fields are marked *