Crowd-sourced data collection to support automatic classification of building footprint data

Human settlements are mainly formed by buildings with their different characteristics and usage. Despite the importance of buildings for the economy and society, complete regional or even national figures on the entire building stock and its spatial distribution are still hardly available. Available digital topographic data sets created by National Mapping Agencies or mapped voluntarily by a crowd via Volunteered Geographic Information (VGI) platforms (e.g. OpenStreetMap) contain building footprint information but often lack additional information on building type, usage, age or number of floors. For this reason, predictive modeling is becoming increasingly important in this context. The capabilities of machine learning allow for the prediction of building types and other building characteristics and thus the efficient classification and description of the entire building stock of cities and regions. However, such data-driven approaches always require a sufficient amount of ground truth (reference) information for training and validation. The collection of reference data is usually cost-intensive and time-consuming. Experiences from other disciplines have shown that crowdsourcing can support the process of obtaining ground truth data. Therefore, this paper presents the results of an experimental study aiming at assessing the accuracy of non-expert annotations on street view images collected from an internet crowd. The findings provide the basis for a future integration of a crowdsourcing component into the process of land use mapping, particularly the automatic building classification.


Introduction
Digital building models from National Mapping and Cadastral Agencies (NMCA) or Volunteered Geographic Information (VGI) platforms often lack attribute information, such as the building usage, housing type, number of floors, building height, and year of construction. However, this information is of particular importance for various research domains and applications such as spatial science, geography, urban planning, architecture, and disaster management. Supervised machine learning techniques help to classify the building footprints according to a predefined building typology and to semantically enrich the datasets with additional information. Such data-driven approaches provide promising results with high accuracies for single cities and regions (e.g. Römer and Plümer 2010, Henn et al. 2012, Hecht et al. 2015, Wurm et al. 2016). One of the main challenges is the limited transferability of the classifiers due to strong regional dependencies (Steiniger et al. 2008, Hecht et al. 2015). A trained machine learning classifier is only applicable to cities with a similar building structure and history. Changing the spatial and cultural context (e.g. other regions, countries, continents etc.) requires the collection of additional ground truth data in the specific area under investigation for model training and validation. To overcome these regional differences, an efficient strategy for ground truth data collection needs to be elaborated. In recent years, crowdsourcing has proven suitable for collecting training and validation data in a variety of research disciplines. In this study, we want to explore the potential of crowdsourcing in the context of mapping and monitoring urban land use, particularly the classification of building footprints in digital topographic databases.

Background and Related Work
Today citizens are becoming more and more important as a new source of geo-information. In the last few years, a number of different terms from different disciplines have emerged that describe the process of citizen-based sensing of geographic information, namely crowdsourcing, citizen science, collaborative mapping or the crowd-sourced information itself, such as Volunteered Geographic Information (VGI) or User-Generated Content (UGC). The form of data collection can be very different. According to See et al. (2016), crowdsourced geographic information can be either contributed actively as part of a crowdsourcing system/campaign (e.g. OpenStreetMap, Wikimapia) or contributed passively by mapping already existing crowdsourced data that has been collected for other purposes (e.g. mobile data, location-based social media content). Furthermore, the types of information (e.g. spatial vs. aspatial, labels vs. geometry etc.) or the forms of motivation strategies (gamification, paid crowd etc.) can vary. In our context, we prefer using the term crowdsourcing defined as a type of participative online activity, particularly the process of a voluntary undertaking of specific tasks (Estellés-Arolas and Ladrón-de-Guevara 2012). Crowdsourcing appeared first in Howe (2006) describing the business practice of outsourcing activity to the crowd, which is today an attractive way of acquiring cheap and fast annotations from non-expert contributors over the Web that have almost the same quality as expert labels (Snow et al. 2008). The idea of using online users to label images goes back to Luis von Ahn, who designed the ESP game (von Ahn and Dabbish 2004) and further developed reCAPTCHA (von Ahn et al. 2008), a system that verifies that a user is human while simultaneously assisting the digitization of books by solving complex OCR problems with crowdsourced labels.
Today crowdsourcing is used in different research domains to collect large datasets that could not otherwise be collected with the researcher's own resources. In addition, it can be applied to solve computationally expensive and difficult problems. Annotations such as boxes, contours, correspondences or labels are of research interest, for example, in medical image processing (Maier-Hein et al. 2014) or autonomous driving (Donath and Kondermann 2013). In the context of land cover mapping and remote sensing, crowdsourcing is used to collect data (primarily labels) for validation and training, such as in the well-known Geo-Wiki platform (Fritz et al. 2012, Laso Bayas et al. 2016). The Geo-Wiki developments go hand in hand with studies on data quality (See et al. 2013, Salk et al. 2016). In addition to the classification task (labeling), there are also conflation tasks and digitization tasks (Albuquerque et al. 2016). Hillen and Höfle (2015) have proposed a prototype implementation of a system for digitizing building footprints, namely Geo-reCAPTCHA. They adapted the reCAPTCHA idea to create geographic information via web-based micro-mapping tasks and assessed the time needed and the quality of the data. Further, in the EU project CAP4Access, crowdsourcing was also tried out for the acquisition of sidewalk information that is necessary for routing and navigation services tailored to the needs of wheelchair users (Hahmann et al. 2016).

System for crowd-sourced data collection supporting building type recognition
In this section, we outline an integrated system for automatic classification of building footprints supported by crowdsourcing. The automatic classification of building footprints uses a supervised machine learning approach as described, for example, in Hecht et al. (2015). Crowd-sourced annotations on geo-coded street view images from the internet support the training and validation of the classifier. The conceptual model for crowd-sourced collection of ground-truth data mainly consists of the following steps, also shown in Figure 1:
1) Definition of building types and visual characteristics
2) Construction and design of image annotation tasks
3) Collection of street view imagery
4) Image annotation
5) Post-processing, quality assessment and data filtering
6) Inference of building types based on the crowdsourced data
At first, a target building classification scheme needs to be defined (1). This includes the definition of the building types that are subject to the subsequent classification process as well as their visual characteristics in the images intended to be used. Subsequently, the classification problem has to be decomposed into individual image annotation tasks (2). This is carried out by constructing and designing very simple tasks, which requires some a priori expert knowledge of the relevant visual characteristics for distinguishing different building types. A simple task would be the boolean query of whether a particular property is true or false. A more complex task offers more than two answer options in single-choice or multiple-choice mode. The annotation tasks can be implemented in different crowdsourcing/micro-task platforms (e.g. CrowdFlower, Amazon Mechanical Turk, Crowdcrafting) or in custom applications and games (e.g. Cropland Capture, www.cropland.geo-wiki.org) for different devices. Easy handling is one of the most important aspects of task design. The task should be solvable in an acceptable amount of time. Further, limitations of the display size (desktop computer vs. smartphone) should also be considered to ensure readability.
The next step (3) is to collect the images to be annotated by the crowd. Generally, street view or bird's-eye perspective images are desired since these types of images allow for the recognition of several building properties. Potential image sources are Google Street View, Microsoft Bing Maps Bird's Eye Views, or the street-level imagery from the VGI platform Mapillary. In addition to these sources, any kind of geotagged imagery in social media (Flickr, Facebook, Instagram etc.) can be used as long as automated access by means of a spatial query is available through a provided API. The basis for the selection of the images can be a random sample of addresses (address list) created in advance from a given spatial database. Once the images are collected, the image annotation can be performed (4). The task responses are usually recorded along with metadata (time, user name, country, etc.). In order to reduce noise and to allow for intrinsic quality control, redundant labels are gathered by assigning a task to different annotators.
In a post-processing step (5), the results are aggregated by majority voting. With the help of intrinsic measures, the quality of annotations can be assessed. Based on these measures, bad annotations or unreliable annotators can be identified and excluded from further processing. In a final step (6), the building types are derived from the crowd-sourced building characteristics, resulting in categorical data. This ground truth data can be used for training and/or validation in the context of an automatic building classification.
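The aggregation in step (5) can be illustrated with a minimal sketch. It assumes the annotations arrive as (image identifier, label) pairs; the function name `majority_vote`, the example labels, and the choice to return no label for a tie are illustrative assumptions, not details taken from the study.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Aggregate redundant labels per image by majority voting.

    `annotations` is an iterable of (image_id, label) pairs.
    Returns {image_id: label}; an image whose top labels tie maps to None
    (an assumed policy, chosen here only for illustration).
    """
    labels_per_image = defaultdict(list)
    for image_id, label in annotations:
        labels_per_image[image_id].append(label)

    result = {}
    for image_id, labels in labels_per_image.items():
        ranked = Counter(labels).most_common()
        # A tie between the two most frequent labels has no clear majority.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result[image_id] = None
        else:
            result[image_id] = ranked[0][0]
    return result


# Hypothetical responses for two images.
votes = [
    (1, "detached"), (1, "detached"), (1, "row house"),
    (2, "row house"), (2, "semi-detached"),
]
aggregated = majority_vote(votes)
```

Returning `None` for ties keeps ambiguous images visible, so that they can be excluded or re-annotated in the filtering step rather than being assigned an arbitrary class.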

Implementation of the experimental study
The aim of the study is to validate the performance of crowdsourcing for obtaining ground truth information on specific building characteristics using street view imagery from online resources. According to the conceptual workflow, we further specify the design of the study including problem definition, the annotation tasks, implementation and the validation.

Definition of building types
In our experiment, we sought to qualify crowdsourcing annotations in the context of building type recognition. We focus on the classification of the residential building stock in Germany. Several different typologies for different purposes can be found in the literature. We use a hierarchically structured typology already used in

Task Design
Since the tasks for the crowd members need to be as easy as possible, we chose very basic questions that anybody should be able to answer. To this end, the building criteria that are necessary to separate the individual building types were identified. The identified criteria are: the morphological type, number of floors, housing type, roof type, and the building age. We defined six questions in single-selection mode, each requesting a different building criterion. The building age is surveyed via the façade type, separately for single-family houses (SFH) and multi-family houses (MFH). In this case, the annotator is asked to assign the most similar façade (out of a set of typical façades of a certain building period) to a building.

Image data capture
For the cities of Dresden and Hamburg, reference data were available that had been collected by experts through previous fieldwork. In addition to the building geometry, these include postal addresses as well as information about the building type (considering nine types), the building height (in m) and the period of construction. This building-based reference data is the basis for drawing a random subset of 2,000 building addresses.
The address list was used to create image requests using the Google Street View Image API. Using the API, static (non-interactive) views can be defined and embedded into web pages using URL parameters sent through a standard HTTP request. After the creation of the initial (default) URL list, the street views were examined manually with regard to their usefulness and the recognizability of the image content and, if necessary, URL parameters (e.g. size, fov and pitch) were revised. Approximately 46% of the street views could not be used because the houses are blurred in Google Street View due to privacy concerns in Germany. The final data set containing 924 buildings (approx. 100 per building type) has been stored in a database including address data, URL request, x, y coordinates of the buildings' centroids as well as the ground truth information on the building type, building height, roof type, etc.
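How a default URL list could be generated from an address sample is sketched below. The endpoint and the parameters `location`, `size`, `fov`, `pitch`, `heading` and `key` follow the public Google Street View Static API; the helper names, the default values, the fixed random seed and the example addresses are assumptions made for illustration and do not reproduce the requests used in the study.

```python
import random
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/streetview"


def sample_addresses(addresses, n, seed=42):
    """Draw a reproducible random subset of n addresses (seed is an assumption)."""
    return random.Random(seed).sample(addresses, n)


def street_view_url(address, api_key, size="640x480", fov=90, pitch=0, heading=None):
    """Build a static street view request URL for one postal address."""
    params = {"location": address, "size": size, "fov": fov, "pitch": pitch, "key": api_key}
    if heading is not None:
        # Heading is optional; if omitted, the service chooses the view direction.
        params["heading"] = heading
    return f"{BASE_URL}?{urlencode(params)}"


# Hypothetical address list; "YOUR_API_KEY" is a placeholder.
addresses = ["Example Street 1, Dresden", "Example Street 2, Hamburg", "Example Street 3, Dresden"]
subset = sample_addresses(addresses, 2)
urls = [street_view_url(a, api_key="YOUR_API_KEY", pitch=10) for a in subset]
```

Keeping the parameters in a dictionary makes the manual revision step easy to reproduce: a revised `fov` or `pitch` for a single building only changes one entry before the URL is rebuilt.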

Implementation
We chose to implement the tasks in an online gaming environment with the support of Pallas Ludens GmbH (www.pallas-ludens.com), a company located in Heidelberg, Germany, that specializes in these activities. In our study, tasks were embedded in computer games by replacing commercial ads with crowdsourcing tasks. We use online games such as Farmerama of the game publisher Bigpoint (www.bigpoint.net), which attract millions of desktop users in social networks around the world. The number of monthly active users available as a "crowd" has been estimated at around 250,000 (Pallas Ludens 2014).
The user interface for desktops consists of two components: a display field with the street view image of a building to be interpreted and a selection field for labeling. Interactive radio buttons with symbolic illustrations (showing category text on hover) support the annotation process. Users are asked to just click on one of the categories (see example in Figure 2). The results of the annotation process, conducted and controlled by Pallas Ludens GmbH, are delivered as structured output files in JSON (JavaScript Object Notation).
After conversion into a comma-separated values (CSV) file, the following values are available for each annotation:
• annotation_ID: annotation identifier (integer)
• task_ID: task identifier number (integer)
• image_ID: image identifier (integer)
• creator: user name as annotator identifier (text)
• result: label / category (text)
This data is the basis for the statistical analysis and validation. To enable a comparison with external reference data, the majority class is determined for each image from the multiple responses.
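The conversion step can be sketched as follows. The five field names are those listed above; the structure of the JSON input (a flat list of objects carrying exactly these keys) is an assumption, since the actual output format of the platform is not described here.

```python
import csv
import io
import json

FIELDS = ["annotation_ID", "task_ID", "image_ID", "creator", "result"]


def json_to_csv(json_text):
    """Convert a JSON list of annotation objects into CSV text.

    Keys other than the five listed fields are ignored (assumed behaviour).
    """
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical annotation records for one image.
sample = json.dumps([
    {"annotation_ID": 1, "task_ID": 1, "image_ID": 17, "creator": "user_a", "result": "detached"},
    {"annotation_ID": 2, "task_ID": 1, "image_ID": 17, "creator": "user_b", "result": "row house"},
])
csv_text = json_to_csv(sample)
```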

Quality assessment
There are several ways of assessing the quality of crowd-sourced annotations. A common approach is to compare the data with external ground truth information and to calculate accuracy measures (external quality assessment). Introductions to measures of thematic classification accuracy are given by Congalton and Green (1998), Foody (2002) and Liu et al. (2002). We used the overall accuracy, which is calculated by dividing the number of correct annotations by the total number of annotations. Further, category-level accuracy measures such as the producer's accuracy (PA) and the user's accuracy (UA), representing individual accuracies for each category, have been computed based on an error matrix (Congalton and Green 1998). In the error matrix, the reference data is represented in the columns and the classified data in the rows. The PA is the ratio of the correctly annotated objects of a category to the total number of reference objects of that category. The UA is the ratio of the correctly annotated objects of a category to the total number of objects annotated as belonging to that category.
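The three external measures can be computed directly from an error matrix, as in the following sketch. It follows the layout described above (classified data in rows, reference data in columns); the function name and the example counts are invented for illustration, and the sketch assumes that every category occurs at least once in both the reference and the classified data.

```python
def accuracy_measures(matrix):
    """Return (overall accuracy, producer's accuracies, user's accuracies).

    matrix[i][j] is the number of objects annotated as category i (row)
    whose reference category is j (column).
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    row_totals = [sum(matrix[i]) for i in range(k)]                       # annotated per category
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # reference per category

    overall = correct / total
    producers = [matrix[j][j] / col_totals[j] for j in range(k)]
    users = [matrix[i][i] / row_totals[i] for i in range(k)]
    return overall, producers, users


# Hypothetical 3-category error matrix (100 objects in total).
matrix = [
    [18, 2, 0],
    [6, 40, 4],
    [0, 8, 22],
]
oa, pa, ua = accuracy_measures(matrix)
```

In this example, 80 of the 100 objects lie on the diagonal, so the overall accuracy is 0.8, while the first category has a PA of 18/24 = 0.75 and a UA of 18/20 = 0.9.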

[Table: Measures, Notation. Error matrix taken from Congalton and Green (1998).]

In our experiment, we focused on aspects of agreement and diversity at the instance level (each image) using the measures given in Table 3. The inter-annotator agreement (IAA) is a measure that reflects how reliable/confident a majority vote is by calculating the ratio of the number of annotations in the majority category to the total number of annotations per image.
In other words, it is the agreement among annotators. In order to measure the diversity, we use Shannon's Diversity Index (SHDI) and Shannon's Evenness Index (SHEI) known from the domain of landscape structure analysis (McGarigal and Marks 1995). SHDI is a quantitative measure reflecting the amount of information, in particular how many different classes occur per image, and simultaneously takes into account the occurrence of each class. Since SHDI is very sensitive to the number of possible categories k, the SHEI was

Experimental Results
In this section, we present the first results of our experimental study. After presenting descriptive statistics of the output data, the results are evaluated using the defined intrinsic and external quality measures.

Descriptive Statistics
Table 4 gives an overview of the data in terms of the number of images, categories, annotations, annotators and their relations. The last column shows the number of annotations guaranteed for most of the images, which means that more than 95% of the images of each task have more than 14 annotations (see also histogram in Figure 3).

Intrinsic and external validation
In the following, the results of each task are described and evaluated using the defined internal and external measures (Table 5). The overall accuracy (OA) was determined by comparing the results of the majority vote with external reference data and computing the number of correct annotations and false annotations. The table shows the highest accuracy for task 2 (number of floors), task 3 (housing type) and task 4 (roof type) with OA values over 0.84. A similar picture is obtained by considering the intrinsic dimensions. The values of the inter-annotator agreement (IAA) are also high for tasks 3 and 4, which means that there is a high agreement between the annotators. The corresponding values for the diversity index SHDI are low, which suggests that many annotators have chosen the same class. However, the accuracy of the detection of façade types used for the reconstruction of the building age (tasks 5 and 6) is limited. Apparently, the assignment of the buildings to a certain type of façade may be too difficult, or only a few users are able to make these assignments correctly. Furthermore, the quality of the results of task 1 (morphological type) is at this stage unsatisfactory. Further investigations are needed in order to identify the causes of this misclassification. Initial checks indicate that there is a frequent confusion between the row houses and the semi-detached houses. The reason for this confusion is most likely a large number of street view images with an unfavorable view frame (image section) that does not allow the recognition of the neighboring buildings. Surprisingly, the accuracy of the recognition of the housing type is relatively high when looking at OA (0.86) and IAA (0.84). Here we had expected a lower accuracy.

Discussion and future research
In this paper, we propose an integrated system for automatic classification of building footprints that is supported by a crowd-sourced data collection component that can be used for training and validation. In an experimental study, the quality of crowd-sourced annotations on street view imagery is assessed. The annotations are related to a set of selected building characteristics relevant for distinguishing residential building types. These first results provide only a rough overview of the quality. A deeper insight could be obtained from a more detailed analysis that looks at the quality for different building types, calculates error matrices, and computes building-type-specific measures such as the producer's accuracy and the user's accuracy. Furthermore, the quality of the building types automatically derived from the building characteristics still needs to be evaluated. For this experimental study, we chose an online game environment for task implementation. However, open micro-task platforms such as Crowdcrafting can also be considered in future studies. The advantage of this platform is that, in contrast to commercial platforms, it does not incur any costs. With regard to the image data used, the suitability of alternative data sources such as Wikimapia or Mapillary can be investigated. The VGI platform Wikimapia contains a large stock of geocoded images of buildings, while Mapillary offers street-level images. Another interesting data source might be the Bird's Eye Views from Microsoft Bing Maps offering multi-perspective views of buildings. The views can be provided to the crowd as an embedded interactive window using the provided API. A comparison of the different data sources could show that a specific data set is particularly suitable for certain tasks. For example, the morphological type is likely to be more recognizable in the Bird's Eye View than in Google Street View images.
A further step will be to explore the relationship between the intrinsic measures and the data quality based on the external measurements. Thus, the question can be investigated whether the quality of an annotation can be estimated from the intrinsic measures alone. Furthermore, the data at annotator level can be analyzed in order to estimate the annotators' credibility and to identify good and bad annotators. These findings would then form the basis for the development of suitable filters (selection criteria) in the post-processing/quality control step. By using only the high-quality annotations from the best annotators, the quality of the ground truth data can be improved. This, in turn, improves the accuracy of the whole system, particularly the machine learning classifier for predicting the building types based on the digital topographic data.
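One simple way to approach the annotator-level analysis is sketched below: for each annotator, the share of annotations that coincide with the per-image majority vote is computed. This is only one possible credibility proxy and is an assumption for illustration, not a method evaluated in this study; the function name and the example data are hypothetical, and ties in the majority vote are resolved here in favour of the label encountered first.

```python
from collections import Counter, defaultdict


def annotator_agreement(annotations):
    """Share of each annotator's labels that match the per-image majority.

    `annotations` is a list of (image_id, creator, label) tuples.
    """
    labels_per_image = defaultdict(list)
    for image_id, _creator, label in annotations:
        labels_per_image[image_id].append(label)
    # most_common(1) returns the first-encountered label in case of a tie.
    majority = {
        image_id: Counter(labels).most_common(1)[0][0]
        for image_id, labels in labels_per_image.items()
    }

    hits, totals = Counter(), Counter()
    for image_id, creator, label in annotations:
        totals[creator] += 1
        if label == majority[image_id]:
            hits[creator] += 1
    return {creator: hits[creator] / totals[creator] for creator in totals}


# Hypothetical annotations of two images by three annotators.
annotations = [
    (1, "a", "detached"), (1, "b", "detached"), (1, "c", "row house"),
    (2, "a", "row house"), (2, "b", "row house"), (2, "c", "row house"),
]
scores = annotator_agreement(annotations)
```

A filter could then keep only annotators whose score exceeds a chosen threshold; a suitable threshold would have to be determined empirically.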
Even if further research is necessary, we believe that crowdsourcing in combination with geospatial web technologies has the potential to massively reduce the time and costs of collecting ground truth data for training and validating all kinds of predictive models. The large time savings in particular can lead to much faster mapping, which is essential in disaster mapping.

Fig. 1. Workflow of crowd-sourced data collection to support automatic building classification

Fig. 2. Example of prototype interface (task 1) containing the display field with a street view of a semi-detached single-family house (left) and selection field with the three clickable answer options: detached, semi-detached and row house (right).

Table 1. Defined tasks and their characteristics

Table 2. Measures for external validation

External validation always requires sufficient reference data, which is not always available. Therefore, researchers have developed approaches to evaluate the quality of a dataset with the aid of intrinsic indicators as a proxy (Senaratne et al. 2016).

[...] on SHDI normalized by dividing by the maximum diversity present in the case of an equal class distribution (McGarigal and Marks 1995). $k$ is the number of possible categories, and $p_i = n_i / N$ is the proportion of annotations in the $i$-th category ($i = 1, \dots, k$), where $n_i$ is the number of annotations in category $i$ and $N$ is the total number of annotations.
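The intrinsic measures can be illustrated per image with the following sketch. It assumes the usual definitions from landscape structure analysis, namely SHDI as the negative sum of p ln p over the occurring categories (natural logarithm) and SHEI as SHDI divided by ln k, together with the IAA as the share of annotations in the majority category. The function name and the example labels are hypothetical.

```python
import math
from collections import Counter


def agreement_and_diversity(labels, k):
    """Return (IAA, SHDI, SHEI) for the labels of one image.

    `labels` is the list of annotations for the image,
    `k` the number of possible categories of the task.
    """
    counts = Counter(labels)
    n_total = len(labels)
    # IAA: annotations in the majority category / all annotations.
    iaa = max(counts.values()) / n_total
    # SHDI: only categories that actually occur contribute to the sum.
    shdi = -sum((n / n_total) * math.log(n / n_total) for n in counts.values())
    # SHEI: SHDI relative to the maximum diversity ln(k) of an equal distribution.
    shei = shdi / math.log(k) if k > 1 else 0.0
    return iaa, shdi, shei


# Hypothetical image with 16 annotations in a three-category task.
labels = ["detached"] * 12 + ["semi-detached"] * 3 + ["row house"]
iaa, shdi, shei = agreement_and_diversity(labels, k=3)
```

With complete agreement, SHDI and SHEI are both zero, and with an equal distribution over all k categories SHEI reaches 1, which makes SHEI comparable across tasks with different numbers of answer options.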