Stefan Kufer

Effective and Efficient Summarization of Two-Dimensional Point Data

Approaches for Resource Description and Selection in Spatial Application Scenarios

Reihe:

Effective and Efficient Summarization of Two-Dimensional Point Data
DOWNLOAD COVER

Space is everywhere, and so is spatial data. The digital revolution has led to an enormous pool of available spatial data, and the amount of newly generated spatial data is increasing day by day. Often, this spatial data is two-dimensional point data that in addition is associated with data objects such as media objects (e.g. pictures or texts). As a consequence of the huge amount of data objects which is generated and then has to be maintained, there is a need for effective and efficient search systems which are capable of addressing the spatial properties of the data objects. In this context, different search scenarios exist: The data objects might be maintained in a distributed system (i.e. on a set of different machines) or in a centralized system (i.e. on a single machine) which, in each case, leads to varying requirements for appropriate search systems. However, in both scenarios, the concept of resource description and selection is an applicable paradigm for conducting similarity searches with regard to the spatial properties of the data objects. Hereby, when focusing on the spatial aspects, a resource is an abstract entity that administers spatial point data. For describing the spatial footprint of a resource, its spatial data point set can be ‘summarized” by means of geometrically delineated areas which cover this data point set. These resource descriptions are then usable for a targeted selection of the resources which administer the relevant data objects based on spatial properties.

This thesis is concerned with the effective and efficient summarization of two-dimensional spatial point data as a means for resource description and selection in spatial application scenarios. The term ‘effective” refers to a spatially very accurate geometric delineation of the data point set to describe whereas the term ‘efficient” relates to its storage-space-efficient representation. Two search-based spatial application scenarios are assessed in this thesis, and in both, the search task can be modelled by adhering to the concept of resource description and selection.

The first scenario is a distributed application scenario in which spatial point data is maintained by a set of independent resources such as peers in a peer-to-peer network. Given a concrete query, the resource descriptions shall enable the targeted selection of the peers administering the relevant data points while ignoring ‘irrelevant” resources. The distributed application scenario is the main part of this thesis, and our summarization approaches are specifically developed for this purpose. Hereby, a variety of concepts is considered which include the compression of the data points, the use of arbitrarily complex bounding volumes to delineate them, the representation of complex objects (such as complex bounding volumes) with a set of simple objects, and a division of the data point set to describe into groups and the subsequent concise delineation of each group. Overall, 14 summarization approaches are presented in this thesis which can be categorized into data partitioning approaches, space partitioning approaches, and hybrid approaches. Following the specification of suitable resource selection schemes which are based on the various summaries, an extensive evaluation is conducted by means of assessing the approaches” performances for k nearest neighbor (kNN) queries. The evaluation considers different data collections and varying environmental conditions in order to assess the robustness of the approaches.

In the second application scenario, a utilization of our summarization approaches in a multidimensional data structure is assessed—which constitutes a centralized application scenario. More specifically, the two summarization approaches of the distributed application scenario which are most suitable for this purpose are integrated into an R-tree which administers sets of two-dimensional point data. The R-tree is a tree-based, centralized multidimensional data structure which has been the subject of intensive research over many years, starting in 1984. Within an R-tree, the ‘resources” are the nodes of the hierarchical tree structure. The resource descriptions shall enable the targeted traversal of paths in the tree structure when searching for the relevant data points given a specific query. Traditionally, these nodes are described by Minimum Bounding Rectangles (MBRs) which are very storage-space-efficient but rather coarse descriptions of a spatial footprint. After presenting the classical R-tree, the modifications to the R-tree that are necessary to integrate our summarization approaches are discussed. Also, appropriate algorithms for calculating summaries from a set of rectangles are outlined as this is a prerequisite for the efficient use of our sophisticated summarization approaches in R-trees. Following the specification of appropriate range query and kNN query algorithms, an extensive evaluation is conducted which assesses both the improvement potential over the traditional R-tree using MBR summaries as well as the degree of achievement by means of the straightforward approach to the integration that we pursue. Also in this evaluation, different data collections and varying environmental conditions are considered.

Overall, the thesis presents a wide variety of summarization approaches for describing sets of two-dimensional point data. These summarization approaches are excellently suited for usage in the investigated distributed application scenario. Furthermore, we assess the utilization of the two summarization approaches which are most appropriate for this purpose in an already very intensively researched centralized application scenario. Here, we examine further improvement potentials and identify important obstacles which are to overcome for our summarization approaches in such an environment.