<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saemi Jang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Megawati</string-name>
          <email>megawati@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiyeon Choi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mun Yong Yi</string-name>
          <email>munyi@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Knowledge Service Engineering</institution>
          ,
          <addr-line>KAIST</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The development of Semantic Web technology has fuelled the creation of a large amount of Linked Data. As the amount of data increases, various issues have been raised. In particular, the quality of data has become important. A number of studies have been conducted to evaluate the quality of linked data. However, most of the approaches are operational only when a data schema such as ontology exists. In this paper, we present a new approach for conducting linked data quality assessment by evaluating the quality of linked data without involving an ontology. Our approach consists of three activities: (1) pattern analysis, (2) pattern generation, and (3) data quality evaluation. A pattern is a structure used to measure the quality of data. For the validation of the proposed approach, we have conducted two studies - one involving English DBpedia, which has a relatively well-developed ontology, and the other involving Korean DBpedia, which lacks an ontology. Our approach shows comparable performances when compared with RDFUnit for English DBpedia and high accuracy results while assessing the quality of Korean DBpedia, for which RDFUnit cannot be used.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality Assessment</kwd>
        <kwd>Assessment without Ontology</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Pattern Generation</kwd>
        <kwd>DBpedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Linked data is an international endeavor to interconnect structured data on the Web.
The development of Semantic Web technology has fuelled the creation of a large amount
of Linked Data. More than a thousand linked datasets now exist, covering a
wide range of different domains<sup>1</sup>. As the amount of data increases, numerous syntactic
and semantic problems have been discovered in the data (e.g., invalid
data and data inconsistency). DBpedia still has such problems even though it is one
of the best-organized and most widely used linked data resources. In addition, errors
in existing linked data resources (e.g., DBpedia) may be amplified in other systems that
rely on those resources (e.g., Q&amp;A systems). Thus the quality of the data has become
important, and demand for accurate quality assessment methods has increased.</p>
      <p>
        The quality of data is defined as fitness for use [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] and includes various factors
such as accuracy, relevancy, representation, and accessibility [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. There have been many
prior data quality assessment approaches [3, 6–8]. Depending on the goal or target of
the assessment method, different factors are selectively employed and the assessment
processes also differ, ranging from semi-automatic approaches that require user
involvement to fully automatic approaches. What they have in common is that the data
quality assessment is based on an ontology built from the target linked data. Therefore
it is not feasible to use prior data quality assessment approaches for linked data that has no
ontology. Of course, we can build our own ontology. However, this is difficult and
time-consuming work since it is done manually or semi-automatically by domain experts [
        <xref ref-type="bibr" rid="ref12 ref14">12,
14</xref>
        ]. Although automatic ontology generation frameworks have been introduced, they
work only for English and for limited domains [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ].
      </p>
      <sec id="sec-1-1">
        <title>1 State of the LOD Cloud 2014 document published in April 2014</title>
        <p>
          (http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/)
        </p>
        <p>
          <bold>Main contributions:</bold> We propose a novel assessment method that performs the quality
assessment of linked data without requiring an ontology. In general, a large portion of the
data in a knowledge resource is valid because it usually goes through several debugging
passes, not only before but also after being released on the web [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This
observation is the basic assumption of our approach. We first analyze the data
patterns in a knowledge resource and rank the patterns by their appearance ratio.
Then we take the top k (e.g., five) patterns for each property and compute their average
ratio. This average value is the threshold used to decide whether a
given pattern is valid or not. Finally, we take the patterns that appear more frequently than
the threshold, and they become test case patterns. Based on the generated test case patterns,
we evaluate the quality of the knowledge resource. Since our approach directly uses a
knowledge resource without requiring an ontology, it can be applied to any language and any
domain. Our approach can also work with any kind of pattern structure.
        </p>
        <p>We validate our method in two respects: the accuracy of the generated test case
patterns and the accuracy of the assessment results. To measure the accuracy of the
test case patterns generated by our method, we use English DBpedia, which is one of the most
well-maintained knowledge resources, and compare the patterns generated by our method with
the patterns created from the ontology. To measure the accuracy of the assessment results, we use
Korean DBpedia, a localized, non-English DBpedia that lacks an ontology.
We found that our method shows a high consistency (up to 89%) between the generated
patterns and the existing patterns. It also reached a 79% F1-measure when assessing the quality
of the localized DBpedia. These results demonstrate the accuracy and flexibility of our
approach.</p>
        <p>In the rest of this paper, we first explain our approach to data quality assessment
in Sec. 2 and then validate the accuracy and usefulness of our approach from two
perspectives: the accuracy of the generated test case patterns (Sec. 3) and the
accuracy of the quality assessment results (Sec. 4). Finally, we discuss the implications of the
findings in relation to prior related research in Sec. 5 and conclude our work in Sec. 6.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Quality Assessment Without Requiring Ontology</title>
      <p>
        For the assessment of data quality, we first analyze the data structure of the given
knowledge resource. Specifically, we check whether a data schema exists. If the resource has
a data schema (e.g., an ontology), we use a prior assessment method that uses the data
schema, such as SWIQA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] or RDFUnit [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. When no data schema exists for the
knowledge resource, we cannot evaluate its quality with these prior approaches.
Although we can generate a data schema for the target data resource, doing so is time-consuming
and requires the involvement of domain experts. To address this issue, we propose a
novel quality assessment methodology that measures the quality of a given data resource
that has no data schema. In this section, we first provide an overview of our approach
(Sec. 2.1) and then explain the details of our test case pattern generation algorithm (Sec. 2.2
and 2.3).
      </p>
      <sec id="sec-2-1">
        <title>Overview of Our Approach</title>
        <p>
          Our method is a semi-automatic assessment approach that consists of three main
steps: 1) pattern analysis, 2) test case pattern generation, and 3) data quality evaluation
(see Fig. 1). To measure the quality of the data resource, a criterion must be defined.
The Data Quality Test Pattern (DQTP) is one of the widely employed standards for assessing the
quality of linked data, and it includes various types of patterns [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In the first step,
we define the pattern structures that will be used for the quality assessment in the form of
DQTPs. This is manual work in which domain experts determine patterns that suit the
target data resource. According to the pattern structures defined in the first step, we
derive from the linked data a set of test case patterns that represent valid data. This
second step is done with our automatic test case pattern generation algorithm, explained
below in detail. Finally, we evaluate the quality of the data by applying the generated
test case patterns to the data.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Quality Assessment Criteria</title>
        <p>
          We use the Data Quality Test Pattern (DQTP) to define the quality assessment criteria.
A DQTP is a tuple (V, S), where V is a set of typed pattern variables and S is a SPARQL<sup>2</sup>
query template with placeholders for the variables from V [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The quality assessment
criteria are defined by domain experts in the form of DQTPs. Data in a knowledge
resource is usually given in the form of RDF triples, each of which consists of a subject, a predicate, and
an object. Our method is designed to work for a knowledge resource that uses RDF triples. In
RDF, a predicate maps a subject to an object. The domain is the set of all types that the
subject may have, and the range is the set of all types that the object may have.
Literal values must have a data type determined by the property used, e.g. the string data
type is described as xs:string in English DBpedia. RDFSDOMAIN, RDFSRANGE and
RDFSRANGED in Zaveri et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] are well-known examples of DQTPs for RDF. In our
method, the role of a DQTP is the same as in theirs, but our method differs in that it
works directly on the knowledge resource and not on the ontology.
        </p>
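        <p>As a minimal illustration of the (V, S) definition above, the following Python sketch represents a DQTP as a tuple of variable names plus a SPARQL template and binds it to a property and a type. The class name DQTP, the %%NAME%% placeholder syntax, the query text, and the chosen bindings are assumptions made for this example only; prefix declarations are omitted.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DQTP:
    """A Data Quality Test Pattern: pattern variables V and a SPARQL template S."""

    variables: tuple  # V: placeholder names, e.g. ("P1", "T1")
    template: str     # S: SPARQL text containing %%NAME%% placeholders

    def instantiate(self, **bindings: str) -> str:
        """Bind every variable in V and return a concrete SPARQL test case."""
        missing = [v for v in self.variables if v not in bindings]
        if missing:
            raise ValueError(f"unbound pattern variables: {missing}")
        query = self.template
        for name in self.variables:
            query = query.replace(f"%%{name}%%", bindings[name])
        return query


# Hypothetical domain pattern: subjects of property P1 that are not typed as T1.
DOMAIN_PATTERN = DQTP(
    variables=("P1", "T1"),
    template=(
        "SELECT DISTINCT ?s WHERE { "
        "?s %%P1%% ?o . "
        "FILTER NOT EXISTS { ?s rdf:type %%T1%% } }"
    ),
)

test_case = DOMAIN_PATTERN.instantiate(P1="dbo:deathPlace", T1="dbo:Person")
print(test_case)
```
        </preformat>
        <p>Each subject returned by the instantiated query would count as a violation of the pattern. Because the template is kept separate from its bindings, the same pattern can be instantiated once per property and accepted type.</p>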
      </sec>
      <sec id="sec-2-3">
        <title>Test Case Pattern Generation Algorithm</title>
        <p>For a given pattern structure (i.e. DQTP), we generate test case patterns automatically.
The goal of our test case pattern generation algorithm is to find data patterns that carry
correct information. In general, most information in a knowledge resource is valid since
the resource is built with domain experts and goes through several bug-fixing passes. Based on this
observation, we assume that data patterns that appear more frequently are more credible
patterns.</p>
        <sec id="sec-2-3-1">
          <title>2 http://www.w3.org/TR/rdf-sparql-query/</title>
          <p>RDFUnit patterns: RDFSDOMAIN, RDFSRANGE, RDFSRANGED.</p>
          <p>To identify valid test case patterns from the whole dataset in the knowledge resource,
we use a two-step algorithm. In the first step, we check the pattern of all data according
to the given DQTP and compute the appearance ratio of each pattern. Then, for each
predicate, we select the top k patterns and compute the ratio of the number of RDF
triples that represent the selected patterns to the total number of RDF triples in the
knowledge resource. Once we have the ratios for all predicates, we compute their average
value. This average value becomes the threshold for selecting test case patterns.
In the second step, we build the set of test case patterns. We check all patterns in the
knowledge resource and add the patterns whose appearance ratio is higher than the
threshold to the test case pattern set. If a predicate has no pattern with an
appearance ratio above the threshold, we take the pattern with the highest ratio for that predicate.</p>
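          <p>The two-step procedure above can be sketched in Python as follows. The sketch assumes one possible reading of the description: the appearance ratio of a pattern is its share of the triples of its predicate, and the threshold is the mean ratio of the top-k patterns over all predicates. The function name, the input format, and the toy data (including the ex: names) are hypothetical.</p>
          <preformat>
```python
from collections import Counter


def generate_test_patterns(observations, k=5):
    """Return (threshold, selected patterns per predicate).

    observations: iterable of (predicate, pattern) pairs, one pair per triple.
    """
    counts = {}
    for predicate, pattern in observations:
        counts.setdefault(predicate, Counter())[pattern] += 1

    # Step 1: appearance ratio of each pattern within its predicate.
    ratios = {}
    for predicate, counter in counts.items():
        total = sum(counter.values())
        ratios[predicate] = {p: n / total for p, n in counter.items()}

    # Threshold: mean ratio of the top-k patterns over all predicates.
    top_ratios = [
        ratio
        for per_predicate in ratios.values()
        for ratio in sorted(per_predicate.values(), reverse=True)[:k]
    ]
    threshold = sum(top_ratios) / len(top_ratios)

    # Step 2: keep patterns above the threshold; fall back to the most
    # frequent pattern when no pattern of a predicate exceeds it.
    selected = {}
    for predicate, per_predicate in ratios.items():
        kept = {p for p, r in per_predicate.items() if r > threshold}
        if not kept:
            kept = {max(per_predicate, key=per_predicate.get)}
        selected[predicate] = kept
    return threshold, selected


# Toy data: (predicate, type of the subject) for each triple.
observations = (
    [("dbo:deathPlace", "dbo:Person")] * 6
    + [("dbo:deathPlace", "dbo:Agent")] * 3
    + [("dbo:deathPlace", "dbo:Place")] * 1
    + [("dbo:birthDate", "dbo:Person")] * 4
    + [("ex:mixed", t) for t in ("ex:A", "ex:B", "ex:C", "ex:D")]
)
threshold, selected = generate_test_patterns(observations, k=5)
print(round(threshold, 3), selected)
```
          </preformat>
          <p>In the toy data, only dbo:Person exceeds the threshold for dbo:deathPlace, while ex:mixed has no pattern above the threshold and therefore falls back to a single most frequent pattern.</p>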
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Data Quality Evaluation</title>
        <p>
          Data quality evaluation entails the measurement of quality dimensions that can be
considered the characteristics of the resource [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. In this paper, information accuracy and
logical consistency are the quality dimensions considered. While the existing method [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
uses only one type according to the definition in the ontology, our approach uses one or more
types determined by the threshold (Sec. 2.3). Where necessary, a superclass
type is used in our approach. For quality assessment, our approach evaluates whether
the data conform to one of the identified types or not.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Validation: Test Case Pattern Generation</title>
      <p>
        The best way to measure the accuracy of our quality assessment approach exactly is
to compare the evaluation results with the ground truth. However, building a ground truth
is not feasible because it requires manual examination of all the data in the knowledge
resource. Instead, we compare our method with a previous work that relies on
an ontology. We use English DBpedia<sup>3</sup> and RDFUnit [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as a benchmark, considering that
labels are found in almost all classes and properties in the ontology of English DBpedia
and that RDFUnit is the most recent pattern-based quality assessment method.
      </p>
      <sec id="sec-3-1">
        <title>Test Case Generation without Ontology for English DBpedia</title>
        <p>We defined three types of patterns: Domain Quality Pattern (DQP), Range
Quality Pattern (RQP), and dataType Quality Pattern (TQP)<sup>4</sup>. The three patterns have
the same criteria as RDFSDOMAIN, RDFSRANGE, and RDFSRANGED, respectively.
Table 1 shows the definition of each pattern. The only difference is that our method
works on the linked data itself, unlike RDFUnit, which uses the ontology. With the three
patterns, we first show the actual test case generation process of our method on English
DBpedia. Then, we validate the accuracy of our method by comparing the generated test
case patterns with the set of patterns generated from the ontology by RDFUnit.</p>
        <sec id="sec-3-1-1">
          <title>3 http://wiki.dbpedia.org/</title>
          <p>4 Test case patterns are available from https://github.com/KAIST-KIRC/SAQA</p>
          <table-wrap>
            <label>Table 2</label>
            <caption>
              <p>The number of unique predicates and unique patterns in each DBpedia.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Resource</th><th>Predicate</th><th>DQP</th><th>RQP</th><th>TQP</th></tr>
              </thead>
              <tbody>
                <tr><td>English DBpedia</td><td>2,750</td><td>1,368</td><td>601</td><td>739</td></tr>
                <tr><td>Korean DBpedia</td><td>1,070</td><td>955</td><td>317</td><td>166</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Table 2 shows the statistics of the DBpedia datasets we used. To generate test cases, we examined
all types of the subject and the object connected by a predicate, using SPARQL queries, to define the possible domain,
range, and/or data types. Then we calculated the ratio for each
pattern following our test case generation algorithm (Sec. 2.3). We took the top five patterns
(i.e. k = 5), and the average ratio (i.e. the threshold) was 22% for DQP. Following the same
process as for DQP, we obtained 17% as the threshold for RQP. Based on the
thresholds, we generated the test case patterns for DQP and RQP. For TQP, most of
the triples have a single data pattern. Consequently, we used the top pattern for each
predicate.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Analysis</title>
        <p>We compared our test case pattern generation results with those generated by RDFUnit.
For RDFUnit, we used RDFSDOMAIN, RDFSRANGE, and RDFSRANGED, which
correspond to DQP, RQP, and TQP, respectively. We do not consider RDFUnit patterns
that do not have any associated triples in the resource. Our approach generates patterns
from triples in the resource, whereas RDFUnit is able to generate patterns from the ontology
even if the knowledge resource does not contain a matching triple.</p>
        <p>Table 3 shows an overview of the comparison between our approach and RDFUnit.
Our approach achieves pattern generation rates of more than 97% (up to 99%) when triples exist. We
also measured the consistency, which is the ratio of matched patterns between the
set of patterns generated by our approach and the patterns generated by RDFUnit. Our
approach achieves 89.35%, 80.33%, and 67.7% consistency for DQP, RQP, and TQP,
respectively. The consistency rate is relatively lower for TQP because
the generated patterns have meanings equivalent to those generated by RDFUnit
but come from different resources. For instance, the DBpedia ontology defines the object
data type of dbo:alias<sup>5</sup> as rdf:langString, but the generated pattern has xsd:string
as the TQP of dbo:alias. This problem can be solved by adding a mechanism that maps
equivalent data types to a representative data type. In our framework, this mechanism is not
implemented yet and we leave it for future work. Nonetheless, our approach generally
achieves high pattern generation rates, and the generated patterns show high consistency
with the patterns generated from the ontology by RDFUnit. These high generation rates and
consistency rates support the reliable performance of the proposed approach
in an environment where an ontology is readily available.</p>
        <sec id="sec-3-2-1">
          <title>5 We used http://prefix.cc to express all namespaces as prefixes.</title>
          <p>The Korean DBpedia namespaces do not exist in prefix.cc, so we expressed
http://ko.dbpedia.org/property/ as prop-ko and http://ko.dbpedia.org/resource/ as db-ko.</p>
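          <p>A minimal sketch of the consistency measure follows, under the assumption that a predicate counts as matched when the type assigned by RDFUnit appears among the types generated for that predicate, and that only predicates covered by both methods are compared. It also illustrates, purely as a hypothetical example, how mapping equivalent datatypes to a representative datatype would change the result; the mapping table, the function names, and the sample data are not part of the framework.</p>
          <preformat>
```python
# Hypothetical mapping of equivalent datatypes to one representative datatype.
REPRESENTATIVE = {"rdf:langString": "xsd:string"}


def normalize(datatype):
    """Map a datatype to its representative, or return it unchanged."""
    return REPRESENTATIVE.get(datatype, datatype)


def consistency(generated, reference, normalizer=lambda t: t):
    """Share of shared predicates whose reference type is among the generated types.

    generated: dict mapping predicate to a set of generated types
    reference: dict mapping predicate to the single ontology-derived type
    """
    shared = [p for p in reference if p in generated]
    if not shared:
        return 0.0
    matched = sum(
        1
        for p in shared
        if normalizer(reference[p]) in {normalizer(t) for t in generated[p]}
    )
    return matched / len(shared)


# Toy data; the dbo:birthDate entry is an invented example.
generated = {"dbo:alias": {"xsd:string"}, "dbo:birthDate": {"xsd:date"}}
reference = {"dbo:alias": "rdf:langString", "dbo:birthDate": "xsd:date"}

print(consistency(generated, reference))             # 0.5
print(consistency(generated, reference, normalize))  # 1.0
```
          </preformat>
          <p>Without normalization the dbo:alias pattern counts as a mismatch even though both datatypes describe strings; with the mapping applied it counts as a match.</p>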
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Validation: Quality Assessment Accuracy</title>
      <p>There are localized versions of DBpedia in 125 languages, and most of them do not have
their own ontologies (Fig. 2). Our assessment method mainly aims to handle such
localized DBpedia editions and to evaluate the quality of the knowledge resource. To show the generality
and usefulness of our approach, we apply it to one localized DBpedia,
Korean DBpedia. In this section we first examine the test case generation process for
the localized DBpedia. Then, we analyze the quality assessment results produced by our
approach and validate its accuracy.</p>
      <sec id="sec-4-1">
        <title>Data Quality Assessment for Korean DBpedia</title>
        <p>The localized Korean DBpedia consists of 32 million triples with 18,617 different
properties. Korean DBpedia itself has 9,424 properties, while the rest are properties from
English DBpedia and external properties. Among those properties, we only used
properties that are carried by more than 100 triples. Only 1,070 properties satisfy this
condition. Korean DBpedia does not have an ontology, which would contain descriptions
of the domain and range for each subject and object connected by its properties. For
Korean DBpedia, we also used the three types of patterns (DQP, RQP and TQP) and
took the top five patterns, as in the case of English DBpedia.</p>
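        <table-wrap>
          <label>Table 4</label>
          <caption>
            <p>Example test cases of DQP and RQP. prop-ko:ý@ó is equivalent to dbo:deathPlace.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Resource</th><th>Pattern</th><th>Property</th><th>Certain type</th></tr>
            </thead>
            <tbody>
              <tr><td>English DBpedia</td><td>DQP</td><td>dbo:deathPlace</td><td>dbo:Agent, dbo:Person</td></tr>
              <tr><td>English DBpedia</td><td>RQP</td><td>dbo:deathPlace</td><td>dbo:Place, dbo:Wikidata:Q532, dbo:PopulatedPlace</td></tr>
              <tr><td>Korean DBpedia</td><td>DQP</td><td>prop-ko:ý@ó</td><td>dbo:Agent, dbo:Person</td></tr>
              <tr><td>Korean DBpedia</td><td>RQP</td><td>prop-ko:ý@ó</td><td>dbo:Place, dbo:Wikidata:Q532, dbo:PopulatedPlace</td></tr>
            </tbody>
          </table>
        </table-wrap>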
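        <table-wrap>
          <label>Table 5</label>
          <caption>
            <p>Overview of the quality assessment of the Korean DBpedia (TC: test cases).</p>
          </caption>
          <table>
            <thead>
              <tr><th colspan="2">Total</th><th colspan="3">Domain</th><th colspan="3">Range</th><th colspan="3">Datatype</th></tr>
              <tr><th>Triples</th><th>TC</th><th>TC</th><th>Pass</th><th>Error</th><th>TC</th><th>Pass</th><th>Error</th><th>TC</th><th>Pass</th><th>Error</th></tr>
            </thead>
            <tbody>
              <tr><td>1,492,331</td><td>2,452,023</td><td>1,470,389</td><td>1,075,953</td><td>394,436</td><td>613,535</td><td>176,423</td><td>437,112</td><td>368,099</td><td>309,286</td><td>58,813</td></tr>
            </tbody>
          </table>
        </table-wrap>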
        <p><bold>Test Case Pattern Generation:</bold> For DQP and RQP we computed the threshold
in the same way as for English DBpedia. In the case of TQP, we consider not
only the data type but also the language tag, since Korean DBpedia uses a language tag
instead of defining a language value as the data type of the object. For instance, the value of
prop-ko:t, which means "name" in English, can be in the form of a string data type or
can carry a language tag (e.g. @ko). The threshold ratios are about 18% and 16% for DQP
and RQP, respectively. We generated a set of test case patterns based on these thresholds.
As with English DBpedia, most of the triples have a single data pattern for TQP,
and we identified the top one for each predicate. Table 4 shows examples of the generated
test case patterns.</p>
        <p><bold>Data Quality Assessment:</bold> Our methodology generated 1,438 test case patterns for the
1,070 properties in Korean DBpedia. They were tested against more than 1.4 million triples
from Korean DBpedia. Table 5 provides an overview of the data quality assessment of
the resource. In total, about 2.4 million pattern matching tests were performed: about 1.47
million, 613 thousand, and 368 thousand tests for DQP, RQP, and TQP,
respectively. According to Table 5, about 64% of all tests passed, with pass rates of about
73%, 29%, and 84% for DQP, RQP, and TQP, respectively. The results are analyzed in more
detail below.</p>
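        <p>The pass and error rates for Table 5 can be recomputed directly from its counts. The short Python sketch below does so; the dictionary layout and the names are illustrative, while the counts are those reported in Table 5 (Domain, Range, and Datatype correspond to DQP, RQP, and TQP).</p>
        <preformat>
```python
# Test counts per pattern type, copied from Table 5.
TABLE_5 = {
    "DQP": {"tests": 1_470_389, "passed": 1_075_953, "errors": 394_436},
    "RQP": {"tests": 613_535, "passed": 176_423, "errors": 437_112},
    "TQP": {"tests": 368_099, "passed": 309_286, "errors": 58_813},
}


def rates(row):
    """Return (pass rate, error rate) of one pattern type as fractions."""
    return row["passed"] / row["tests"], row["errors"] / row["tests"]


for name, row in TABLE_5.items():
    # Every test either passes or fails.
    assert row["passed"] + row["errors"] == row["tests"]
    pass_rate, error_rate = rates(row)
    print(f"{name}: pass {pass_rate:.2%}, error {error_rate:.2%}")

total_tests = sum(row["tests"] for row in TABLE_5.values())
total_errors = sum(row["errors"] for row in TABLE_5.values())
overall_error_rate = total_errors / total_tests
print(f"overall error rate: {overall_error_rate:.2%}")
```
        </preformat>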
      </sec>
      <sec id="sec-4-2">
        <title>Accuracy Analysis</title>
        <p>
          To evaluate the accuracy of our assessment method on a localized DBpedia that has no
ontology, we built a data set of 1,000 randomly selected triples and employed
two human evaluators to check the validity of each triple against Wikipedia<sup>6</sup> data:
if they found the information in a Wikipedia page, the triple was labelled as true,
otherwise as false. The triples that the human evaluators marked as valid are considered
actually valid triples. Based on Krejcie et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], 1,000 samples are sufficient for a
95% confidence level with a 3.5% margin of error.
        </p>
        <p>
          We measured the inter-rater agreement using Cohen's kappa [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ];
the value between the two evaluators was 0.7207. Table 6 shows the accuracy of the
patterns in terms of precision, recall, and F1-measure. Precision is the fraction of the
triples determined as valid by our assessment method that are actually valid.
Recall is the fraction of the actually valid triples that our method assesses as valid.
The F1-measure is defined as
          <inline-formula><tex-math>F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}</tex-math></inline-formula>,
the harmonic mean of precision and recall. The average F1-measure weighted by the number
of triples is about 0.7 (up to 0.79). These high accuracy scores demonstrate the
usefulness and generality of the proposed approach.
        </p>
        <p>6 http://ko.wikipedia.org/</p>
        <p>
          The error rate over all tests is 36.31%. DQP was produced for most of
the triples and has an error rate of 26.83%. The most frequent error
in Korean DBpedia is the rdfs:range violation, as shown by the error rate
of RQP, which exceeds 71% and seems high relative to other studies [
          <xref ref-type="bibr" rid="ref18 ref3 ref4">3, 4, 18</xref>
          ].
        </p>
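        <p>The accuracy measures defined above can be computed from the human labels as in the following sketch. It assumes that the triples assessed as valid by the method and the triples judged valid by the evaluators are available as sets of identifiers; the function name and the sample identifiers are hypothetical.</p>
        <preformat>
```python
def precision_recall_f1(predicted_valid, actually_valid):
    """Compute precision, recall, and F1 from two sets of triple identifiers.

    predicted_valid: triples assessed as valid by the assessment method
    actually_valid:  triples marked as valid by the human evaluators
    """
    true_positives = len(predicted_valid.intersection(actually_valid))
    precision = true_positives / len(predicted_valid) if predicted_valid else 0.0
    recall = true_positives / len(actually_valid) if actually_valid else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with invented identifiers.
predicted = {"t1", "t2", "t3", "t4"}
gold = {"t1", "t2", "t5"}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```
        </preformat>
        <p>In the toy example two of the four triples assessed as valid are actually valid, and two of the three actually valid triples are found, so precision is 1/2, recall is 2/3, and F1 is 4/7.</p>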
        <p>
          DQP and RQP are the most commonly defined patterns in Linked Data. Therefore, quality
problems related to the domain and range are very commonly found in any dataset.
Similar problems were observed in previous studies [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ]. Particularly in our case,
there are many resources for which no rdf:type is defined in the Korean DBpedia data.
This caused many problems when checking the correct domain,
range, and/or datatype for a property. For example, the rdf:type of db-ko:ä,
which is the same as db:Canada, should be defined as dbo:Country. db:Canada has the type
dbo:Country, but db-ko:ä does not have any type. Moreover, in terms of range,
we cannot define the range only by looking at the object type. DBpedia triples are extracted
from the Wikipedia data stream as URIs or literals. In the latter case, the object range validation
cannot be performed [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. There are many cases in which the value of the object is extracted
as a string or literal, not as a URI. Although the value is represented in a different form and
has the same meaning, it still does not satisfy rdfs:range in the quality evaluation.
For these reasons, the recall of RQP is much lower than that of the
other two patterns.
        </p>
        <p>Other problems related to domain and range occurred when types were not given as
rdf:type but used as strings instead, particularly for range. We classified such cases
as datatype problems. We found other cases of quality problems regarding datatype,
i.e. incorrect datatype setting and incorrect object value. An example of the first case is
when data concerning a date must be set as xs:date but is set to xs:integer
instead. For the second case, take prop-ko:\Ù0, which means "active period"
in English, as an example. The object value should be a period of time but, instead of the duration,
only the beginning point of the duration is directly extracted from the Wikipedia page.</p>
        <p>In the case of datatype quality, we found that quality problems occurred in two cases.
First, the datatype does not match the object. For instance, the object of prop-ko:Ü´ó,
which means "birth place" in English, must be within dbo:Place or have the string datatype in
Korean DBpedia. However, some objects of prop-ko:Ü´ó are represented as xs:integer.
Second, property ambiguity is a common problem. One property can have more than
one meaning, which then affects its object type. This happens when the property is not
described by rdfs:label or dbo:abstract. For instance, the property prop-ko:©,
which means "event" (e.g. Olympic event), can have two totally different types of objects,
i.e. the name of the event itself or the number of events. This raised another problem
because we had to choose which datatype should be taken.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related work</title>
      <p>
        <bold>Linked data quality assessment.</bold> There are a number of data quality assessment
approaches for linked data. Zaveri et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] classified data quality dimensions into
accessibility, intrinsic, contextual, and representational categories by analyzing several approaches
and tools. Quality assessment tools are typically used for semi-automatic or automatic
measurement. LINK-QA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is an extensible tool that allows for the evaluation of linked
data mappings using network metrics. It is an automatic approach that can perform
quality assessment of the links.
      </p>
      <p>
        On the other hand, most quality assessment approaches are semi-automatic.
DaCura [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is able to collect and curate evolving linked data while maintaining quality over
time. It requires considerable human effort to modify the schema, involving domain experts,
data harvesters, and consumers. Another framework called SWIQA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] automatically
identifies data quality problems with SPARQL queries that represent quality rules. The rules are
defined by analyzing the ontology, so programming knowledge is not required.
Other studies have also proposed linked data quality assessment methods [
        <xref ref-type="bibr" rid="ref16 ref17 ref5">5, 16, 17</xref>
        ]. Even
though these studies introduced useful ways to assess data quality, they all require an
ontology or data schema. In our study, we semi-automatically generated patterns that
can evaluate the quality of linked data without requiring an ontology. We used
Korean DBpedia, which is one of the localized versions of DBpedia. In the following, we
review work on quality problems in DBpedia resources and on automatic ontology generation
methods related to our approach.
      </p>
      <p>
        <bold>Data Quality Assessment of DBpedia.</bold> DBpedia is a central hub of the LOD cloud.
The quality problems of DBpedia have been studied through manual, crowdsourcing [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and
semi-automated approaches [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a framework for DBpedia quality assessment
is presented. It involves manual and semi-automatic processes. In the semi-automatic
phase, the framework requires axioms, which are created by ontology learning [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] or
manual verification. Another study classified quality problems in more detail [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
In that study, 12 data quality test patterns were created from DBpedia user community
feedback, the Wikipedia maintenance system, and ontology analysis. Most localized DBpedia
editions do not have an ontology but have a similar data structure. We therefore devised our
approach to data quality assessment by building on this research.
      </p>
      <p>
        <bold>Automatic Ontology generation.</bold> Traditionally, ontologies are created by domain
experts. However, building an ontology for a huge amount of data is a difficult and
time-consuming task. Consequently, there are several studies on automatic ontology
generation. Text2Onto [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a framework for ontology learning from textual data that represents
the learned knowledge at a meta-level in the form of a Probabilistic Ontology Model. The
framework calculates a confidence score for each learned object and also allows a user
to trace the evolution of the ontology. The framework extracts ontologies from natural language
texts by employing natural language processing. As such, the framework is limited by
language: it only supports English, Spanish, and German texts.
      </p>
      <p>
        Sie and Yeh's study [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] combines the results of speci c knowledge network and
automatic ontology generation from metadata. This approach builds digital libraries that have
metadata documents and schema information. Another study generated OWL ontology
automatically from XML [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Those approaches generated ontologies from data schemas,
which are not available in our localized DBpedia. Recently, Pilehvar and Navigli's work
[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] addressed the alignment of an arbitrary pair of lexical resources independently of
their specific schema. They proposed inducing a network structure for dictionary
resources; however, textual similarity remains an important component of their approach.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future work</title>
      <p>In this paper, we proposed an approach for evaluating the quality of linked data without
requiring an ontology. The approach semi-automatically generates patterns from a
knowledge resource without using any data schema or ontology. A pattern is a structure that
is derived from data. Patterns are then instantiated into test cases to measure the quality
of data in terms of the domain, range, and datatype of a property. We evaluated our approach
in two phases. First, we compared the patterns generated without an
ontology against existing benchmark patterns that were generated using an ontology, on a
dataset from English DBpedia. The consistency between the generated patterns and
the existing patterns is high (89.35% for DQP, 80.33% for RQP, and 67.74% for DTQP).
Second, we applied our approach to evaluating data quality in a localized DBpedia
that does not have an ontology, using Korean DBpedia as an example.
Our approach generated 1,438 test case patterns from Korean DBpedia, and we evaluated
the quality of over 1.4 million triples in the resource using these patterns.
The evaluation results revealed several problems caused by
the lack of a schema, as well as problems in the data itself.</p>
      <p>Current approaches for assessing the quality of linked data work only in
the presence of a data schema or ontology. This work is the first step toward an
approach for evaluating data quality without requiring such a schema when automatic
generation of an ontology is difficult. Further research is needed to conduct a
full-scale evaluation of the potential of the proposed approach. We also plan to evaluate
data quality problems caused by a lack of schema by utilizing external resources (e.g.,
WordNet or a thesaurus). We are also looking for more varied patterns that can be applied
to quality assessment. Finally, we plan not only to improve the quality assessment but
also to create a complete validation system for determining the trustworthiness of triples.
Notwithstanding these limitations, the current findings show that the
proposed approach opens a new possibility of conducting quality assessment when a
knowledge resource that lacks a well-developed ontology has to be used.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the Institute for Information &amp; communications Technology
Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0101-15-0054,
WiseKB: Big data based self-evolving knowledge base and reasoning platform).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Batini</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappiello</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Francalanci</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Methodologies for data quality assessment and improvement</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          ),
          <volume>16</volume>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Zaveri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietrobon</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitzler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Quality assessment methodologies for linked open data</article-title>
          . Submitted to
          <source>Semantic Web Journal</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zaveri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sherif</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          , Buhmann, L.,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>User-driven quality evaluation of dbpedia</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Semantic Systems</source>
          (pp.
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          ). ACM (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Westphal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cornelissen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaveri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Test-driven evaluation of linked data quality</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World Wide Web</source>
          (pp.
          <fpage>747</fpage>
          -
          <lpage>758</lpage>
          ). ACM (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passant</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Weaving the pedantic web</article-title>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gueret</surname>
          </string-name>
          , Christophe and Groth, Paul and Stadler, Claus and Lehmann, Jens.
          <article-title>Assessing linked data mappings using network measures</article-title>
          .
          <source>In The Semantic Web: Research and Applications</source>
          (pp.
          <fpage>87</fpage>
          -
          <lpage>102</lpage>
          ). Springer Berlin Heidelberg (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bizer</surname>
          </string-name>
          , Christian and Cyganiak, Richard.
          <article-title>Quality-driven information filtering using the WIQA policy framework</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Feeney</surname>
          </string-name>
          , Kevin Chekov and O'Sullivan, Declan and Tai, Wei and Brennan, Rob.
          <article-title>Improving curated web-data quality with structured harvesting and assessment</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ),
          <fpage>35</fpage>
          -
          <lpage>62</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Bedini</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Automatic ontology generation: State of the art</article-title>
          .
          <source>PRiSM Laboratory Technical Report</source>
          . University of Versailles (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , Richard Y and Strong, Diane M.
          <article-title>Beyond accuracy: What data quality means to data consumers</article-title>
          .
          <source>Journal of management information systems</source>
          ,
          <fpage>5</fpage>
          -
          <lpage>33</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Juran</surname>
          </string-name>
          , Joseph and Godfrey, A. Blanton.
          <article-title>Quality handbook</article-title>
          . Republished McGraw-Hill
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Wachter, Thomas and Fabian, Gotz and Schroeder,
          <string-name>
            <surname>Michael.</surname>
          </string-name>
          <article-title>DOG4DAG: semi-automated ontology generation in obo-edit and protege</article-title>
          .
          <source>In Proceedings of the 4th International Workshop on Semantic Web Applications</source>
          and
          <article-title>Tools for the Life Sciences</article-title>
          (pp.
          <fpage>119</fpage>
          -
          <lpage>120</lpage>
          ). ACM (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sie</surname>
          </string-name>
          , Shun-hong and Yeh, Jian-hua.
          <article-title>Automatic ontology generation using schema information</article-title>
          .
          <source>In Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on</source>
          (pp.
          <fpage>526</fpage>
          -
          <lpage>531</lpage>
          ). IEEE (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Cimiano</surname>
          </string-name>
          , Philipp and Volker,
          <source>Johanna. Text2Onto. In Natural language processing and information systems</source>
          (pp.
          <fpage>227</fpage>
          -
          <lpage>238</lpage>
          ). Springer Berlin Heidelberg (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Wienand</surname>
          </string-name>
          , Dominik and Paulheim, Heiko.
          <article-title>Detecting incorrect numerical data in dbpedia</article-title>
          .
          <source>In The Semantic Web: Trends and Challenges</source>
          (pp.
          <fpage>504</fpage>
          -
          <lpage>518</lpage>
          ). Springer International Publishing (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Fürber</surname>
          </string-name>
          , Christian and Hepp, Martin.
          <article-title>SWIQA - a semantic web information quality assessment framework</article-title>
          .
          <source>In ECIS</source>
          (Vol.
          <volume>15</volume>
          , p.
          <fpage>19</fpage>
          )
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Fürber</surname>
          </string-name>
          , Christian and Hepp, Martin.
          <article-title>Using semantic web resources for data quality management</article-title>
          .
          <source>In Knowledge Engineering and Management by the Masses</source>
          (pp.
          <fpage>211</fpage>
          -
          <lpage>225</lpage>
          ). Springer Berlin Heidelberg (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Acosta</surname>
          </string-name>
          , Maribel and Zaveri, Amrapali and Simperl, Elena and Kontokostas, Dimitris and Auer, Sören and Lehmann, Jens.
          <article-title>Crowdsourcing linked data quality assessment</article-title>
          .
          <source>In The Semantic Web – ISWC</source>
          <year>2013</year>
          (pp.
          <fpage>260</fpage>
          -
          <lpage>276</lpage>
          ). Springer Berlin Heidelberg (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Wienand</surname>
          </string-name>
          , Dominik and Paulheim, Heiko.
          <article-title>Detecting incorrect numerical data in dbpedia</article-title>
          .
          <source>In The Semantic Web: Trends and Challenges</source>
          (pp.
          <fpage>504</fpage>
          -
          <lpage>518</lpage>
          ). Springer International Publishing (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>DL-Learner: learning concepts in description logics</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>10</volume>
          ,
          <fpage>2639</fpage>
          -
          <lpage>2642</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Yahia</surname>
          </string-name>
          , Nora and Mokhtar, Sahar A and Ahmed, AbdelWahab.
          <article-title>Automatic generation of OWL ontology from XML data source</article-title>
          .
          <source>arXiv preprint arXiv:1206.0570</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Krejcie</surname>
          </string-name>
          , Robert V. and Morgan, Daryle W.
          <article-title>Determining sample size for research activities</article-title>
          .
          <source>Educ Psychol Meas</source>
          (
          <year>1970</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>Jacob.</given-names>
          </string-name>
          <article-title>A coefficient of agreement for nominal scales</article-title>
          .
          <source>Educational and Psychological Measurement</source>
          ,
          <volume>20</volume>
          (
          <issue>1</issue>
          ),
          <fpage>37</fpage>
          -
          <lpage>46</lpage>
          (
          <year>1960</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Pilehvar</surname>
          </string-name>
          , Mohammad Taher and Navigli, Roberto.
          <article-title>A robust approach to aligning heterogeneous lexical resources</article-title>
          .
          <source>AP A 1</source>
          (
          <year>2014</year>
          ):
          <fpage>c2</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>