Semantic Similarity Assessment of Volunteered Geographic Information

The recent development of communication technologies between individuals has enabled more informal, collaborative map data projects, known as volunteered geographic information (VGI). These projects, such as the OpenStreetMap (OSM) project, seek to create free alternative maps that let users add new material to the data of others. The information in different VGI data sources often does not comply with any standard, and each organization produces datasets at a different level of richness. In this research the assessment of semantic data quality provided by web sources, e

Despite the positive aspects of OSM data, quality assurance remains the major concern of geographical information (GI) users. In recent years, VGI quality has been an active topic in GIS research.
One of the first systematic attempts to assess OSM quality was conducted by
The objective of this study is to evaluate the semantic accuracy of OSM "tags", also called "features". The main idea is to develop a methodological framework based on a confusion matrix approach that determines classification accuracy using the full diversity of information in the OSM project. The remainder of this article is structured as follows: section 2 describes the main characteristics of the features of OSM data. Section 3 presents the approaches for estimating classification accuracy. Section 4 discusses the system prerequisites and the program code. In section 5, the results and findings are illustrated and analysed. The last section concludes with a discussion and provides an outlook on future research.

THE FEATURES OF OPENSTREETMAP DATA
OpenStreetMap (OSM) is not the only source of feature classification data that is available free of charge. Services such as Google Maps, Yahoo Maps, or Microsoft's Bing Maps offer very good mapping for viewing via the Internet, and they do not require payment.
In 2008, Google introduced "Map Maker", an edition that allows users to trace map data from satellite or aerial imagery and upload it to Google servers. Google uses Map Maker primarily in countries, such as Iraq, where it cannot buy suitable map data from traditional geodata suppliers. All these offerings grant users only very limited rights to download maps with feature classifications. If users would like to add features to web maps, for example, these free Internet sources are usually not usable. With OpenStreetMap, on the other hand, any form of reproduction or processing is allowed, and users do not have to ask anybody for permission.
The objects of OSM data may be classified into two main types: nodes (also called points) and ways. A node consists of geographical coordinates (latitude and longitude), while a way consists of an ordered list of at least two nodes. Attributes assigned to these objects in order to describe what they represent are called tags. A tag consists of a key and a value and is usually written with an equals sign between the two parts, "key=value". Both can be arbitrary strings of up to
The OSM data can be exported directly in a variety of formats, such as XML data or a Mapnik image (e.g. PNG, JPEG, and PDF). The OSM raw data can be processed with suitable OSM software. This export has a size limit; users can only use it when they are looking at a reasonably small area of the map (approximately 10 km x 10 km), Ramm, et al., 2011. In this research the OSM data was exported in XML format and imported into ArcGIS 9.3 software for processing and manipulation. In order to obtain the required features, a pre-processing (filter) step was adopted. This step was essentially applied to separate the undesired data from the attribute table.
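The node, way and tag structure described above, together with the filtering step, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study itself used ArcGIS 9.3 for this step, the sample XML is hand-written rather than real OSM data, and the function name `extract_tagged_objects` and the chosen keys (`highway`, `amenity`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the OSM XML layout (illustrative, not real data).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="33.3152" lon="44.3661"/>
  <node id="2" lat="33.3160" lon="44.3670">
    <tag k="amenity" v="university"/>
  </node>
  <way id="10">
    <nd ref="1"/>
    <nd ref="2"/>
    <tag k="highway" v="primary"/>
    <tag k="name" v="Example Road"/>
  </way>
</osm>"""


def extract_tagged_objects(osm_xml, wanted_keys):
    """Return (object type, id, selected tags) for nodes and ways.

    Objects that carry none of the wanted tag keys are dropped, which plays
    the role of the pre-processing filter (hypothetical helper).
    """
    root = ET.fromstring(osm_xml)
    kept = []
    for elem in root:
        if elem.tag not in ("node", "way"):
            continue
        # Each <tag k="..." v="..."/> child is one key=value attribute.
        tags = {t.get("k"): t.get("v") for t in elem.findall("tag")}
        selected = {k: v for k, v in tags.items() if k in wanted_keys}
        if selected:
            kept.append((elem.tag, elem.get("id"), selected))
    return kept


features = extract_tagged_objects(SAMPLE_OSM, {"highway", "amenity"})
for kind, object_id, tags in features:
    print(kind, object_id, tags)
```

Untagged nodes, such as the first node of the sample, are discarded because they only carry geometry and no classification.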

Confusion (Error) Matrix
An error or confusion matrix evaluates classification accuracy by comparing the actual or reference land class with the map data. The matrix has two dimensions, with the same number of rows and columns. The rows and columns express the labels of samples assigned to a particular category in one classification relative to the labels of samples assigned to a particular category in another classification (Fig. 2). Each column is assumed to be correct and displays the ground reference information, while each row of the matrix represents the map labels obtained from the map classification. The main diagonal of the matrix represents the correct feature classifications, Congalton and Green, 2009.
The confusion matrix can be considered one of the most effective ways to represent classification accuracy, because it describes the accuracy of each map category in terms of omission and commission errors. The error of omission refers to the proportion of observed features excluded from their map class, whereas commission error arises when features on the map are categorised incorrectly. Besides the omission and commission errors, the overall, producer's and user's accuracy can also be determined from the confusion matrix, Ismail and Jusoff, 2008. The overall accuracy is the sum of the elements on the main diagonal of the confusion matrix divided by the total number of samples in the matrix. The user's and producer's accuracies measure the accuracy of individual categories rather than the overall accuracy, as will be discussed in the following section.
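As a minimal sketch of the two error types, the following Python fragment computes omission and commission errors directly from the off-diagonal counts. The 3 x 3 matrix, its class names and the function names are invented for illustration and are not taken from the study; rows hold the map labels and columns the reference labels, as described above.

```python
# Rows = map (tested) labels, columns = reference labels.
# Counts are invented for illustration only.
CLASSES = ["primary", "secondary", "residential"]
MATRIX = [
    [6, 1, 0],   # mapped as primary
    [3, 8, 1],   # mapped as secondary
    [2, 1, 18],  # mapped as residential
]


def omission_errors(matrix):
    """Per reference class: share of its samples that the map put elsewhere."""
    k = len(matrix)
    errors = []
    for j in range(k):
        column_total = sum(matrix[i][j] for i in range(k))
        errors.append((column_total - matrix[j][j]) / column_total)
    return errors


def commission_errors(matrix):
    """Per map class: share of its samples that belong to another class."""
    errors = []
    for i, row in enumerate(matrix):
        row_total = sum(row)
        errors.append((row_total - row[i]) / row_total)
    return errors


for name, om, com in zip(CLASSES, omission_errors(MATRIX), commission_errors(MATRIX)):
    print(f"{name}: omission={om:.3f}, commission={com:.3f}")
```

Omission error is the complement of producer's accuracy and commission error is the complement of user's accuracy, so either pair carries the same information.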

Mathematical Representation of the Error Matrix
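Suppose that there is a square matrix with $k^2$ cells and $n$ samples, where each sample is assigned to one of $k$ categories in the map classification (rows) and to one of the same $k$ categories in the reference data set (columns). Let $n_{ij}$ denote the number of samples classified into category $i$ in the map and category $j$ in the reference data set. Let

$$n_{i+} = \sum_{j=1}^{k} n_{ij} \tag{1}$$

be the number of samples classified into category $i$ in the map classification, and

$$n_{+j} = \sum_{i=1}^{k} n_{ij} \tag{2}$$

be the number of samples classified into category $j$ in the reference data set.
Overall accuracy between the map classification and the reference data can be computed as follows:

$$\text{overall accuracy} = \frac{\sum_{i=1}^{k} n_{ii}}{n} \tag{3}$$

Producer's accuracy of category $j$ can be computed by:

$$\text{producer's accuracy}_j = \frac{n_{jj}}{n_{+j}} \tag{4}$$

And the user's accuracy of category $i$ can be computed by:

$$\text{user's accuracy}_i = \frac{n_{ii}}{n_{i+}} \tag{5}$$

In addition to the above measures, the kappa coefficient can also be used as an index of classification accuracy, as follows:

$$\hat{K} = \frac{n \sum_{i=1}^{k} n_{ii} - \sum_{i=1}^{k} n_{i+}\, n_{+i}}{n^{2} - \sum_{i=1}^{k} n_{i+}\, n_{+i}} \tag{6}$$

Where: $n_{ii}$ is the number of samples on the main diagonal for category $i$, $n_{i+}$ and $n_{+i}$ are the row and column totals of category $i$, and $n$ is the total number of samples.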
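The four measures described in this section (overall, producer's and user's accuracy, and the kappa coefficient) can be sketched as follows. This is a minimal Python illustration, not the Matlab program used in the study; the function name `accuracy_report` and the example counts are assumptions, and the standard computational form of kappa is used, with rows as map labels and columns as reference labels.

```python
def accuracy_report(matrix):
    """Compute accuracy measures from a square confusion matrix.

    matrix[i][j] = number of samples mapped as class i whose reference
    class is j. Assumes every row and column has at least one sample.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)                  # total samples
    row_totals = [sum(row) for row in matrix]            # map totals per class
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    diagonal = sum(matrix[i][i] for i in range(k))       # correct classifications

    overall = diagonal / n
    producers = [matrix[j][j] / col_totals[j] for j in range(k)]
    users = [matrix[i][i] / row_totals[i] for i in range(k)]

    # Chance agreement term: sum of (row total x column total) per class.
    chance = sum(row_totals[i] * col_totals[i] for i in range(k))
    kappa = (n * diagonal - chance) / (n * n - chance)
    return {"overall": overall, "producers": producers, "users": users, "kappa": kappa}


# Invented 3-class example (not the study's 16-class matrix).
example = [
    [6, 1, 0],
    [3, 8, 1],
    [2, 1, 18],
]
report = accuracy_report(example)
print(round(report["overall"], 3), round(report["kappa"], 3))
```

For the invented matrix, 32 of the 40 samples lie on the diagonal, so the overall accuracy is 0.8, while kappa is lower because part of that agreement would also be expected by chance.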

METHODOLOGY IMPLEMENTATION AND PROGRAM STRUCTURE
To achieve the main goal of this research, the classification quality of OpenStreetMap (OSM) information must be checked carefully. This is indispensable because the information in different Volunteered Geographic Information (VGI) data sources often does not comply with any standard, and each organisation produces datasets at a different level of richness. In this study the classification quality of data provided by web sources is assessed by comparison with information from other sources. In other words, information from sources of known quality is used to evaluate the data provided by sources of unknown quality. As Thakkar, et al., 2007 observed in relation to VGI data sources and high quality sources, using this technique can produce the most accurate results. Attribute and feature based accuracy will be assessed using statistical indices such as the kappa coefficient (as described in the previous section). It is proposed that the true or actual classifications obtained by field surveying are used as the geospatial reference data for the selected location (Baghdad / Iraq). The data set will consist of the self-generated field survey and the open access data from web-based VGI (e.g. OSM).

Journal of Engineering
The intention is to compare the classifications of these datasets. Initially this can be done by visual comparison of the derived maps to get a general picture, but the methods indicated above will be applied to construct a quantitative approach, identifying the strength of the accuracy assessment strategy. This exercise will help in developing methods of evaluating classification quality based on a rule set, coded into the data handling flowline. The result will be a set of operators which can measure data quality, allow for the preparation of datasets prior to successful integration, and actually undertake the integration of data. The proposed methodology was developed using the Matlab 7.10.0 programming language, for academic and research purposes. In the first step the program reads the feature names or classifications from a data file. Then, the input data is used to generate the confusion matrix by comparing the feature classes in the reference and tested datasets. Subsequently, the program saves the completed confusion matrix and assesses the accuracy of the classification data. The assessment is based on determining the producer's, user's and overall accuracy and the kappa coefficient. The report of the final outcome can be exported and saved in text (.txt) or Excel (.xlsx) format. Fig. 3 presents the flowchart of the program that was used to implement the methodology of this study.
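The first program steps (reading the class labels and generating the confusion matrix) can be sketched as below. The original program was written in Matlab and its code is not reproduced in this article, so this Python fragment is only an approximation of the described flow; the two-column input layout, the sample labels and all function names are hypothetical.

```python
import csv
import io

# Hypothetical input layout: one feature per line, "reference,tested".
SAMPLE_FILE = """reference,tested
primary,primary
primary,secondary
residential,residential
residential,residential
secondary,secondary
building,building
building,parking
"""


def read_label_pairs(handle):
    """Read (reference, tested) class labels from a CSV-like file object."""
    reader = csv.DictReader(handle)
    return [(row["reference"], row["tested"]) for row in reader]


def build_confusion_matrix(pairs):
    """Rows = tested (map) classes, columns = reference classes."""
    classes = sorted({label for pair in pairs for label in pair})
    index = {name: position for position, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for reference, tested in pairs:
        matrix[index[tested]][index[reference]] += 1
    return classes, matrix


def format_report(classes, matrix):
    """Tab-separated text table, suitable for saving as a .txt report."""
    header = "tested/reference\t" + "\t".join(classes)
    rows = [
        name + "\t" + "\t".join(str(count) for count in row)
        for name, row in zip(classes, matrix)
    ]
    return "\n".join([header] + rows)


pairs = read_label_pairs(io.StringIO(SAMPLE_FILE))
classes, matrix = build_confusion_matrix(pairs)
print(format_report(classes, matrix))
```

Taking the union of both label sets keeps the matrix square even when a class, such as parking in this sample, appears in only one of the two datasets.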

EXPERIMENT AND ANALYSIS
In this research, the confusion matrix was constructed from 543 features and 16 classes, as shown in Fig. 4. In this matrix the x-axis represents the true (reference) class labels, whereas the y-axis lists the tested classes. The correct classifications lie on the main diagonal of the matrix, while the other elements of the matrix show the misclassifications. The overall number of classifications can be seen in the bottom right cell. The accuracy measures of the confusion matrix were obtained from equations 3, 4, 5 and 6. The producer's accuracy of the primary road serves as one example of determining the producers' accuracy in this article. Figure 5 shows the producers' accuracy of the other features.
The user's accuracy of the primary road serves as one example of determining the users' accuracy in this article. Figure 6 shows the users' accuracy of the other features.
The kappa coefficient (k) was computed for the error matrix included in Fig. 4. The accuracy assessment reports of the OSM classifications from the 543 reference data are illustrated in Figures 4, 5, and 6. The results showed that the overall accuracy of the tested data was 86%; the lowest user's and producer's accuracies were 32% and 50%, respectively, and the highest user's and producer's accuracies were both 100%. The classification accuracy varies from one class to another. For instance, primary road has only 55% accuracy since it was confused with secondary road, service road, and path. The path has only 77% accuracy since it was confused with building. Another example is the building class, which has 90% accuracy since it was confused with secondary road, service road, parking and university. The most likely cause of this varying classification accuracy is that some of the OSM datasets were classified incorrectly.
The kappa coefficient was also determined from the calculations of the tested features and classes.
The kappa statistic was 0.826, which implies that the classification is 82.6% better than would be expected if a random unsupervised classification had been adopted. According to Landis and Koch, 1987, the agreement scale of the kappa value is excellent for k>0.75, good for 0.4<k<0.7, and poor for k<0.4. In another major study, Monserud, 1990, reported that the kappa statistic is poor when k<0.4, fair when 0.40<k<0.55, good when 0.55<k<0.70, very good when 0.70<k<0.85, and excellent when k>0.85. In general, therefore, the kappa coefficient of this study indicates excellent agreement on the first scale and very good agreement on the second.
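The two agreement scales quoted above can be expressed as simple lookup functions. This Python sketch is illustrative only: the function names are hypothetical, the thresholds are those given in the text, and the treatment of values that fall exactly on a boundary (or in the unlabelled interval between 0.7 and 0.75 of the first scale) is an assumption, since the text states only strict inequalities.

```python
def monserud_label(kappa):
    """Five-level scale attributed in the text to Monserud, 1990.

    Assumption: a value exactly on a boundary goes to the higher category.
    """
    if kappa < 0.40:
        return "poor"
    if kappa < 0.55:
        return "fair"
    if kappa < 0.70:
        return "good"
    if kappa < 0.85:
        return "very good"
    return "excellent"


def three_level_label(kappa):
    """Three-level scale attributed in the text to Landis and Koch.

    Assumption: values from 0.40 up to and including 0.75 are labelled
    "good", which also covers the interval the text leaves unlabelled.
    """
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "good"
    return "poor"


study_kappa = 0.826  # value reported in this study
print(three_level_label(study_kappa), "/", monserud_label(study_kappa))
```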

CONCLUSION
OpenStreetMap (OSM) is one of the most popular Volunteered Geographic Information (VGI) projects. OSM geospatial data is produced by non-professional volunteers with varying levels of mapping experience. Unlike authoritative or official datasets, OSM data does not follow any standard; therefore it is necessary to evaluate its quality continuously. The purpose of the current study was to present a method for assessing the quality of OSM semantic data. The methodology was implemented by designing a program in the Matlab 7.10.0 programming language. The program was used to assess the classification accuracy of the feature categories of OSM data. This included constructing the confusion matrix and calculating the overall accuracy, users' accuracy, producers' accuracy and kappa coefficient.
The outcome of this investigation showed that the confusion matrix contained 543 features arranged in 16 rows and 16 columns (as illustrated in Fig. 4). The number of features in each row and column varies with the number of features in each class. For instance, there are eleven features in the first column, classified as six primary roads, three secondary roads, and two residential roads. As another example, the fifth row contains 170 features, distributed as two primary roads and one hundred and sixty-eight residential roads. The research has also found that the overall accuracy was 86%; the users' accuracy was between 32% and 100%, while the producers' accuracy was between 50% and 100%, and the kappa statistic was 0.826. In general, therefore, it seems that the classification accuracy of OSM datasets is acceptable to some extent.
For future work, it is recommended that further studies be carried out in order to apply this method to different data sources such as governmental agency data, Google map, and
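The counts quoted above allow a short worked check of the per-class measures. The following is a sketch under two assumptions that the text does not state explicitly: that the first column is the reference column of the primary road class, and that the fifth row is the tested row of the residential road class.

```latex
% Producer's accuracy: diagonal count divided by the column (reference) total.
\text{producer's accuracy}_{\text{primary}} = \frac{6}{6 + 3 + 2} = \frac{6}{11} \approx 0.545

% User's accuracy: diagonal count divided by the row (tested) total.
\text{user's accuracy}_{\text{residential}} = \frac{168}{2 + 168} = \frac{168}{170} \approx 0.988
```

Under these assumptions, the first value is consistent with the 55% accuracy reported for the primary road class.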