         OD2WD: From Open Data to Wikidata
                 through Patterns

    Muhammad Faiz, Gibran M.F. Wisesa, Adila Krisnadhi, and Fariz Darari

        Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia
    {muhammad.faiz52,gibran.muhammad}@ui.ac.id, {adila,fariz}@cs.ui.ac.id



        Abstract. We present OD2WD, a semi-automatic pattern-based frame-
        work for populating the Wikidata knowledge graph with (originally tab-
        ular) data from Open Data portals. The motivation is twofold. One,
        our framework can further enrich Wikidata as a central RDF-oriented
        knowledge graph with a large amount of data coming from public or
        government sources. Two, our framework can help the discovery of data
        from such Open Data portals and its integration with Linked Data with-
        out forcing the Open Data portals to provide Linked Data infrastructure
        by themselves, as the latter may not always be feasible due to vari-
        ous technical, budgetary, or policy reasons. Throughout the transforma-
        tion process, we identify several reengineering and alignment patterns.
        We implement the framework as an online service and API accessible
        at http://od2wd.id, currently tailored for three Indonesian Open Data
        portals: Satu Data Indonesia, Jakarta Open Data, and Bandung Open
        Data.

        Keywords: Open Data · Wikidata · Reengineering Pattern · Alignment
        Pattern

        Copyright © 2019 for this paper by its authors. Use permitted under Creative
        Commons License Attribution 4.0 International (CC BY 4.0).


1     Introduction
The Open Data initiative has been adopted to various degrees by many countries
around the world [15] due to a number of benefits it offers, including increasing
citizen participation and boosting economic growth. However, challenges remain
in the findability as well as the usability of such data, particularly due to poor
data publishing practices. Even when data publishing is facilitated by a national
or regional Open Data portal, it is done in a tabular format such as the comma-
separated value (CSV) format, which does not easily support interlinking and
integration. Viewing this from the 5-star rating model of Open Data [1], a huge
amount of data has only a 3-star rating rather than a 5-star rating.
    Fig. 1 shows an example of tabular data about public schools in Jakarta
stored in the Jakarta Open Data portal at http://data.jakarta.go.id/. One
could conceivably think that the portal also possesses related CSV files con-
taining, e.g., the number of students of the schools mentioned in Fig. 1. Since
CSV or similar formats do not support interlinking, the integration of such data
becomes more complicated.


Figure 1: Tabular data about public schools in Jakarta with columns representing school
name, address, sub-district, phone number, and school type.

    A solution to improve this situation involves the cooperation of Open Data
providers (i.e., governments and public institutions) to provide Linked Data in-
frastructure beyond just putting CSV files online. However, technical, budgetary,
or policy reasons may prevent this from being realized. So, we advocate an alternative
approach in which we employ an already existing public Linked Data (or knowl-
edge graph) repository to host a Linked Data version of those CSV files. Provided
that the data licensing policies of the Open Data portals and the repository are
not in conflict, this approach then only requires, in principle, a technological so-
lution of transforming the data from Open Data portals into data in the public
Linked Data repository. The benefits of this approach have been pointed out by
van der Waal et al. [14], namely discoverability, harvesting, interoperability, and
community engagement.

    We thus implement OD2WD, a semi-automated framework for transforming
and publishing tabular, CSV-formatted data in Open Data portals to Wikidata.
We chose Wikidata as the target repository not just because it is one of the most
prominent Linked Data repositories, but more importantly, because it allows the
public to add and edit its data. From a Wikidata perspective, our effort can also be
viewed as enriching Wikidata content. For example, from the public school data
in Fig. 1, we can obtain connections from entities in it to existing entities such
as the sub-districts. We describe the challenges in realizing such a transformation
framework in Section 2 and detail the transformation workflow in Section 3. We
also discover a few recurring patterns, described in Section 4, in different phases
of the transformation framework, namely two reengineering patterns in datatype
detection and protagonist column (i.e., main subject column) detection phases,
as well as four alignment patterns in property mapping, entity linking, class
linking, and instance typing phases. Such patterns can be useful in other similar
scenarios of data conversion from a tabular form to Linked Data. For the last
two sections of the paper, we report some performance evaluation regarding
different phases of the transformation workflow (Section 5) and finish with a
brief conclusion of the paper (Section 6).

2   Tabular Data to Wikidata Graph: Challenges
The challenges faced by our approach do not solely arise from the conversion
process of tabular data to plain RDF data. Rather, we wish to integrate such
data into Wikidata. There are a number of tools to solve the former problem,
for example, Tarql [4] and RDF123 [7]. In addition, W3C has published a suite
of recommendations aimed at generating RDF data from tabular data sources
(e.g., Tandy et al. [12]). Our use case, however, requires consideration for the tar-
geted platform (i.e., Wikidata) and its vocabulary. In particular, if a CSV table
contains an entity that is semantically equivalent to an existing Wikidata item,
then that item and its identifier should be used as the basis for enrichment. In
contrast, inventing new identifiers would overpopulate the Wikidata knowledge
graph with distinct identifiers that actually refer to the same real-world entity.
The D2RQ mapping language [2] and the W3C recommendation R2RML [11]
both provide mapping languages into RDF, yet they require the data sources to
be relational databases, and the mapping languages do not address the vocabulary
mapping and entity linking aspects. Karma [8] is a semi-automated
tool to map structured sources to ontologies and generate RDF data accordingly.
However, the tool is not tailored toward Open Data portals with Wikidata as
the target knowledge graph. The aforementioned tools are not appropriate for
our setting, while the W3C recommendations do not technically specify a recipe
that can be used to solve our problem.
    Specifically, there are two key challenges in our setting. First, tabular data
does not have a strict form that allows immediate determination of the entity
to be used as a subject, a predicate, or an object. Note that tabular data can
have different forms [3]: vertical listings, horizontal listings, matrices, and enu-
merations. Among those, data in Open Data portals usually takes the form of
vertical listings, whose rows are similar entities with attributes expressed by the
columns. The second challenge is the alignment aspect: every term we extract
from a table must be aligned to a vocabulary term in Wikidata. Oftentimes, we
find entities with the same label but different contexts. For example, the term
“Depok” in Indonesia may refer to a city in West Java, a district in Cirebon
Regency, a district in Sleman Regency, and many more. To obtain good quality
of data conversion into Wikidata, we need to resolve this ambiguity challenge.

3   Conversion Flow
The OD2WD system converts and republishes CSV data from Open Data portals
to Wikidata. CSV serves as a canonical format for tabular data as other tabu-
lar formats such as XLS and ODS can be immediately exported as CSV. The
conversion process (Fig. 2) consists of two major parts, namely triple extraction
(via preprocessing and metadata extraction) and alignment to Wikidata terms
(via mapping and linking), followed by (re)publishing the triples to Wikidata.


                       Figure 2: Data Conversion Architecture

Triple Extraction. Our analysis of the data in the three Open Data portals we
use as the basis of evaluation – see Section 5 for further information about
the portals – did not find any table with a format other than vertical listings.
Moreover, more than 95% of those tables have exactly one column designated
as the subject or protagonist column. So, our current implementation assumes
the table to be converted is a vertical listing with one protagonist column.
    In a vertical listing table, each row represents a collection of information about
a single entity that can be represented as a set of triples. All triples from that set
share a subject obtained from the cell value under the protagonist column. Each
non-protagonist column header gives a predicate and the corresponding object is
the cell value under that column. Starting from a vertical listing table, the triple
extraction phase consists of preprocessing and implicit metadata extraction.
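
To make the extraction concrete, the short sketch below walks through a single
hypothetical row shaped like the table in Fig. 1 (the cell values are invented for
illustration), assuming the school-name column has already been detected as the
protagonist.

```python
# Illustration only: a hypothetical row shaped like the table in Fig. 1,
# with the school-name column assumed to be the detected protagonist.
header = ["school name", "address", "sub-district", "phone number", "school type"]
row = ["SMA Negeri 1", "Jalan Contoh 7", "Sawah Besar", "021-0000000", "SMA"]
protagonist_index = 0

# Every non-protagonist cell becomes one (subject, predicate, object) triple,
# all sharing the protagonist cell value as their subject.
subject = row[protagonist_index]
triples = [(subject, header[j], row[j])
           for j in range(len(header)) if j != protagonist_index]
# e.g. ('SMA Negeri 1', 'address', 'Jalan Contoh 7'), ...
```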

Preprocessing. This step consists of data cleaning such as case normalization of
column headers as well as removal of N/A values. The latter is realized by deleting
either the rows or the columns, depending on which would delete the lower number
of cells. For example, in a table of 50 rows and 4 columns, we would delete rows
that contain N/A values, but in a table of 15 columns and 4 rows, we would
delete columns that contain N/A values. We also delete the (artificial) index
column as it conveys no semantic information useful in our conversion process,
e.g., the column with header “No.” indicating row numbers.
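
The preprocessing rules above can be sketched as follows. This is not the OD2WD
implementation, just a minimal pandas-based illustration; the set of index-like
header names ("no", "no.", "nomor") is an assumption.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal preprocessing sketch: normalize headers, drop index-like
    columns, and remove N/A values along whichever axis loses fewer cells."""
    # Case normalization of column headers.
    df = df.rename(columns=lambda c: str(c).strip().lower())

    # Drop artificial index columns (assumed header names).
    df = df.drop(columns=[c for c in df.columns if c in {"no", "no.", "nomor"}],
                 errors="ignore")

    # Delete rows or columns containing N/A, whichever removes fewer cells.
    rows_with_na = int(df.isna().any(axis=1).sum())
    cols_with_na = int(df.isna().any(axis=0).sum())
    cells_if_drop_rows = rows_with_na * df.shape[1]
    cells_if_drop_cols = cols_with_na * df.shape[0]
    if cells_if_drop_rows <= cells_if_drop_cols:
        df = df.dropna(axis=0)
    else:
        df = df.dropna(axis=1)
    return df
```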

Implicit Metadata Extraction. This step consists of two substeps: datatype detec-
tion of each column and protagonist (column) detection. In datatype detection,
we go through each column and guess the datatype of each cell value under that
column. The latter is done by matching the cell value to the regular expressions
listed in Table 1, following the flow given by Fig. 3. The regular expressions
used are our own, with two exceptions: “Is Globe Coordinate”1 and “Is URL”2.
The datatype for the column is decided by a majority rule over the datatypes
detected on all cell values under that column.
1
    https://stackoverflow.com/questions/3518504/regular-expression-for-matching-latitude-longitude-coordinates
2
    https://stackoverflow.com/questions/11724663/regex-for-simple-urls


                            Figure 3: Datatype Detection

               Table 1: Regular Expressions for Datatype Detection

Evaluation          Regular Expression
Is Quantity         [-+.,()0-9]+
Is Time             ^([0-2][0-9]|(3)[0-1])([\/,-])(((0)[0-9])|((1)[0-2]))([\/,-])\d{4}$
Is Globe Coordinate ^[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$
Is URL              ^[a-zA-Z0-9_\-\@]+\.[a-zA-Z0-9]_\-\.
Is Literal String   [\.\,\!\?\>\<\/\\\)\(\-\_\+\=\*\&\^\%\$\#\@\!\:\;\~]
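
A simplified sketch of this detection step is shown below. The regular expressions
are approximations in the spirit of Table 1 (not the exact patterns used by
OD2WD), and the decision order only mimics the flow of Fig. 3.

```python
import re
from collections import Counter

# Simplified patterns; approximations of the rules in Table 1.
IS_QUANTITY = re.compile(r"^[-+.,()0-9]+$")
IS_TIME = re.compile(r"^([0-2][0-9]|3[0-1])[/,-]((0)?[0-9]|1[0-2])[/,-]\d{4}$")
IS_COORDINATE = re.compile(
    r"^[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*"
    r"[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$")
IS_URL = re.compile(r"^https?://\S+$")  # simplification of the "Is URL" rule

def cell_datatype(value: str) -> str:
    """Guess the Wikidata datatype of a single cell value."""
    if IS_COORDINATE.match(value):
        return "GlobeCoordinate"
    if IS_TIME.match(value):
        return "Time"
    if IS_QUANTITY.match(value):
        return "Quantity"
    if IS_URL.match(value):
        return "Url"
    # Values with special characters are plain strings (cf. "Is Literal String"),
    # otherwise they are candidate Wikidata items.
    if re.search(r"[.,!?<>/\\()\-_+=*&^%$#@:;~]", value):
        return "String"
    return "WikibaseItem"

def column_datatype(values: list[str]) -> str:
    """Majority vote over the detected datatypes of all cell values."""
    return Counter(cell_datatype(v) for v in values).most_common(1)[0][0]
```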
    The protagonist detection step aims to determine which column acts as the
subject of all triples induced by a row. We employ the following heuristic: the
column with the highest number of unique cell values is taken as the protagonist
column. Should a tie occur, the leftmost column takes precedence. Note that
any index column will not be taken as protagonist since it is already removed
in the preprocessing step. Our experiment (Section 5) indicates that a tie only
occurs in less than 7% of the datasets we work with, and the above tie-breaking
rule shows 100% accuracy in the evaluation. Of course, the heuristic itself may
yield an inaccurate prediction of the protagonist because there are tables with
either zero or more than one protagonist column. Such tables are, however, in
the minority (less than 5% of the datasets). Handling such tables involves special steps, e.g.,
creation of synthetic identifiers, or merging multiple columns into one compound
protagonist column, which we leave for future work.
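
A compact sketch of the heuristic described above is given below; restricting the
candidates to WikibaseItem-typed columns is consistent with the error analysis
in Section 5.2 but is otherwise an assumption about the implementation.

```python
import pandas as pd

def detect_protagonist(df: pd.DataFrame, column_datatypes: dict) -> str:
    """Pick the column with the most unique values among the candidate
    (WikibaseItem-typed) columns. max() keeps the first, i.e. leftmost,
    maximum, which implements the tie-breaking rule."""
    candidates = [c for c in df.columns
                  if column_datatypes.get(c) == "WikibaseItem"]
    return max(candidates, key=lambda c: df[c].nunique())
```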


Wikidata Alignment. The triple extraction step generates triples of the form
(protagonist cell value, column header, cell value); these are not yet RDF triples be-
cause no URI is assigned to the triple components. So, the next step is aligning
to Wikidata items and properties so that appropriate URIs can be used. There
are two substeps in the Wikidata alignment step, namely the mapping and linking
steps.


                          Figure 4: Mapping Architecture

Mapping. In this step, non-protagonist column headers are aligned to Wikidata
properties. Fig. 4 shows the architecture of our mapping process. We employ an
Elasticsearch component3 backed by an index formed of all Wikidata property
IDs, labels, descriptions, aliases, and ranges. Given a column header, this com-
ponent provides us with a list of candidate properties closely matching it. To
choose the semantically closest property, we employ Word2Vec [9] with an Indone-
sian Wikipedia dump4 as the underlying corpus. We then compute the cosine
similarity in the embedding space between the column header and the candidate
properties. The closest candidate property is taken as the alignment target for
the column header. Note that Elasticsearch may yield no candidate properties.
In this case, the triple is dropped, unless the Wikidata community5 approves the
creation of a new property based on the given column header or an appropriate
Wikidata property is found manually.
3
    https://www.elastic.co/products/elasticsearch
4
    https://dumps.wikimedia.org/idwiki/latest/
5
    https://www.wikidata.org/wiki/Wikidata:Property_proposal
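
A sketch of this mapping step is given below. It is not the OD2WD code: the
Elasticsearch index name and field names ("label", "alias", "description", "id")
are assumptions about how such a property index could be laid out, and averaging
word vectors is one simple way to embed multi-word headers.

```python
import numpy as np
from elasticsearch import Elasticsearch          # elasticsearch-py 8.x style calls
from gensim.models import KeyedVectors

es = Elasticsearch("http://localhost:9200")      # assumed local property index
wv = KeyedVectors.load("idwiki-word2vec.kv")     # assumed Word2Vec model trained
                                                 # on an Indonesian Wikipedia dump

def embed(text: str) -> np.ndarray:
    """Average the word vectors of the tokens (a simple phrase embedding)."""
    vecs = [wv[t] for t in text.lower().split() if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0

def map_column_to_property(column_header: str, index: str = "wikidata-properties"):
    """Retrieve candidate Wikidata properties for a column header from an
    Elasticsearch index and rank them by cosine similarity in the embedding
    space (cf. Fig. 4). Returns (property id, similarity) or None."""
    hits = es.search(index=index, size=10,
                     query={"multi_match": {
                         "query": column_header,
                         "fields": ["label", "alias", "description"]}})["hits"]["hits"]
    if not hits:
        return None  # no candidates: drop the triple or resolve manually
    header_vec = embed(column_header)
    best = max(hits, key=lambda h: cosine(header_vec, embed(h["_source"]["label"])))
    return best["_source"]["id"], cosine(header_vec, embed(best["_source"]["label"]))
```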

Linking. As illustrated in Fig. 5, here we perform: (i) entity linking, from cell
values to Wikidata items, and (ii) class linking, from the protagonist column header
to a Wikidata class.


           Figure 5: Entity Linking (a) and Class Linking (b) Architecture
    Given a cell value in the table whose type was detected as WikibaseItem,
entity linking aims to find an appropriate Wikidata item URI for it. Currently,
ambiguity is resolved by making use of the column name as context. As future work,
this can be expanded to include other context information. The process starts
with obtaining a list of candidate items for the cell value through the Wikidata
Entity Search API together with Wikidata classes the items belong to. To choose
which item is the most appropriate to link with the cell value, we look at a
combination of two cosine similarity scores computed in the embedding space
(via Word2Vec). The first one is the similarity between (the vector representations
of) the cell value and the item. The second one is between the context information
(column header) and the class(es) of the item. We can thus rule out the items
whose classes do not match the context. The item with the highest combined
score is chosen to link to the given cell value. If no linking is found (because the
score is too low), then the cell value is likely to be a new item to be created in
Wikidata.
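
The sketch below illustrates this entity linking step, reusing embed() and
cosine() from the mapping sketch above. The get_class_labels() helper, the
equal weighting of the two similarities, and the acceptance threshold are
assumptions; the paper only states that candidates and their classes are retrieved
and that two cosine similarities are combined.

```python
import requests

WD_API = "https://www.wikidata.org/w/api.php"
WDQS = "https://query.wikidata.org/sparql"

def search_entities(text: str, language: str = "id", limit: int = 10) -> list:
    """Candidate retrieval via the public wbsearchentities API module."""
    params = {"action": "wbsearchentities", "search": text, "language": language,
              "uselang": language, "format": "json", "limit": limit}
    return requests.get(WD_API, params=params).json().get("search", [])

def get_class_labels(qid: str) -> list:
    """Labels of the classes (P31 objects) of an item, fetched via SPARQL.
    Hypothetical helper; the paper only says candidates come with their classes."""
    query = ("SELECT ?classLabel WHERE { wd:%s wdt:P31 ?class . "
             "SERVICE wikibase:label { bd:serviceParam wikibase:language \"id,en\". } }"
             % qid)
    r = requests.get(WDQS, params={"query": query, "format": "json"})
    return [b["classLabel"]["value"] for b in r.json()["results"]["bindings"]]

def link_entity(cell_value: str, column_header: str, threshold: float = 0.5):
    """Combine (i) the similarity between the cell value and a candidate's label
    with (ii) the similarity between the column header (context) and the
    candidate's class labels (cf. Fig. 5a). Weighting and threshold are assumed."""
    cell_vec, ctx_vec = embed(cell_value), embed(column_header)
    best_id, best_score = None, 0.0
    for cand in search_entities(cell_value):
        label_sim = cosine(cell_vec, embed(cand.get("label", "")))
        class_sims = [cosine(ctx_vec, embed(c)) for c in get_class_labels(cand["id"])]
        score = 0.5 * label_sim + 0.5 * (max(class_sims) if class_sims else 0.0)
        if score > best_score:
            best_id, best_score = cand["id"], score
    # A score below the threshold suggests a new item should be created instead.
    return (best_id, best_score) if best_score >= threshold else (None, best_score)
```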
    Class linking is performed on the protagonist column header. The motivation is
that all protagonist entities belong to the same category or class, and information
from the protagonist column header can be exploited to obtain that class. The
process starts with getting candidate items that likely match the protagonist
column header via the Wikidata Entity Search API. Then, we filter out those items
that are not classes (using SPARQL query on Wikidata endpoint). A class is
an item occurring at the object position of the “instance of” or “subclass of”
relationships. Next, we select the class with the highest similarity score with the
protagonist column header, computed in the embedding space.
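
A matching sketch of class linking is shown below, reusing search_entities(),
embed(), and cosine() from the previous sketches. The ASK query encodes the
criterion stated above: a candidate counts as a class if it occurs at the object
position of an "instance of" (P31) or "subclass of" (P279) statement.

```python
import requests

def is_class(qid: str) -> bool:
    """True if the item occurs as the object of P31 or P279 in Wikidata."""
    query = "ASK { ?x wdt:P31|wdt:P279 wd:%s . }" % qid
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"})
    return bool(r.json().get("boolean"))

def link_class(protagonist_header: str):
    """Sketch of class linking (Fig. 5b): keep only class-type candidates and
    select the one most similar to the protagonist column header."""
    header_vec = embed(protagonist_header)
    candidates = [c for c in search_entities(protagonist_header) if is_class(c["id"])]
    if not candidates:
        return None
    best = max(candidates,
               key=lambda c: cosine(header_vec, embed(c.get("label", ""))))
    return best["id"]
```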


Publishing. The publishing phase aims to publish the conversion result, which
is in the form of protagonist-property-value triples, to Wikidata. Our current imple-
mentation uses the QuickStatements6 tool to import the result to Wikidata. This
phase makes use of the mapping and linking results from the previous steps, and
generates a QuickStatements serialization to be executed for Wikidata import-
ing. Not all results from the mapping and linking phases will be published, because
there is a possibility of errors in those phases. Hence, a final manual check can
be performed to resolve errors or false values in the mapping and linking.
6
    https://tools.wmflabs.org/quickstatements/


                          Figure 6: Publishing Architecture
    Fig. 6 shows the architecture for the publishing phase. This phase consists
of several steps. The first is loading the results of entity linking and property
mapping. After that we use the data from the source CSV to fill in the values
of literal properties and labels. We then apply the correct formatting based on
each column datatype according to the QuickStatements syntax. We also add
metadata for the import process, e.g., reference (i.e., Open Data portal links).
The final step is simply executing QuickStatements.
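
The serialization step can be illustrated with the small sketch below. It emits
tab-separated QuickStatements (V1) commands; the datatype formatting is
simplified, the identifiers in the usage comment are placeholders, and attaching
the portal page as a "reference URL" (S854) reference is one possible way to
record the provenance mentioned above.

```python
def quickstatements_line(subject_qid: str, prop_id: str, value, datatype: str,
                         source_url: str) -> str:
    """Serialize one statement as a tab-separated QuickStatements (V1) command.
    Simplified sketch: item values are Q-ids, string/URL values are quoted,
    quantities are plain numbers; Time and GlobeCoordinate are omitted here.
    S854 attaches the Open Data portal page as a 'reference URL' reference."""
    if datatype == "WikibaseItem":
        v = value                     # already a Q-id from entity linking
    elif datatype in ("String", "Url"):
        v = '"%s"' % value
    else:                             # Quantity (Time/GlobeCoordinate omitted)
        v = str(value)
    return "\t".join([subject_qid, prop_id, v, "S854", '"%s"' % source_url])

# Placeholder identifiers, for illustration only:
# quickstatements_line("Q12345", "P1234", "some value", "String",
#                      "http://data.jakarta.go.id/dataset/example")
```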


4     The Patterns

Ontology Design Patterns (ODPs) are modeling solutions to frequently occurring
ontology design problems [6]. Unlike the usual ontology modeling problem whose
aim is to design an ontology capturing a particular notion, our work is concerned
with patterns occurring during the transformation process. In the ODP typology,
these patterns are reengineering and alignment patterns. Within our data trans-
formation workflow, these patterns are apparent in the protagonist and datatype
detection phases as well as within the vocabulary alignment phases.


4.1   Reengineering patterns

Reengineering patterns operate on a source model to generate a new ontology
or knowledge graph (or its parts) as a target model via some transformation
procedure. The source model need not be an ontology; other types of resources
are possible, e.g., thesauri, data model patterns, or linguistic structures. In our
case, the source model is a tabular data model, while the target model is a graph
shape that is part of the Wikidata knowledge graph. Metamodel transformation
rules can be used to describe reengineering patterns [6, 13].

Given: Schema tuple T = (C_1, . . . , C_m), t = (c_1, . . . , c_m) is a row in table T, and
C_k = Prot(T).
Generate: Graph with the following form (written in Turtle syntax):

    Subj Pred_1 Obj_1 ; . . . ; Pred_{k-1} Obj_{k-1} ; Pred_{k+1} Obj_{k+1} ; . . . ; Pred_m Obj_m .

where Subj = LinkRes(c_k), the Wikidata entity corresponding to c_k (as obtained
according to the alignment pattern in Fig. 10), and for 1 ≤ j ≤ m, j ≠ k, we have:

    – Obj_j = LinkRes(c_j), the Wikidata entity corresponding to c_j;
    – Pred_j = MapRes(C_j), the Wikidata property corresponding to column header
      C_j according to the alignment pattern in Fig. 9.

Figure 7: Reengineering pattern for generating a graph shape from a table with its
protagonist column identified



    In our case here, the reengineering patterns emerge as we go along the trans-
formation process. They help us streamline our thought process and more im-
portantly, they capture the principles applicable to the transformation of virtu-
ally all tables in Open Data portals. As for the metamodel, however, there is no clear
consensus as to how exactly such patterns can be expressed formally. Thus, we
conveniently choose a very simple meta-rule expression to describe the patterns
as seen below.
    The first reengineering pattern here concerns the graph shape obtained from
the table by accounting for its protagonist. Let T be a table expressed as a
schema tuple T = (C_1, . . . , C_m) with column headers C_1, . . . , C_m. Then, the
table can contain up to N data tuples of the form t = (c_1, . . . , c_m) representing
a row of the table T where each c_j is a value of column C_j in that row. Let
Prot(T) be the column header C_k that is detected as the protagonist of T. Then,
the reengineering pattern specifies a graph shape according to Fig. 7.
    We also identify a reengineering pattern during datatype detection. Specifi-
cally, given a column header C_j of a table T and the corresponding values in that
column, we obtain the datatype corresponding to the column via the reengineering
pattern in Fig. 8. Note that procedurally, the datatype detection is done through
the flow given in Fig. 3.

Given: A column header C_j of table T containing N rows, and c_j^(i), 1 ≤ i ≤ N, the
N (not necessarily unique) values from each row of T at the j-th column.
Generate: A Wikidata datatype dt for column C_j if the majority of the c_j^(i) satisfy
the datatype pattern dtp, which is a Boolean combination of regular expressions
from Table 1 specified as follows:

     – if dtp is neither Quantity, URL, nor Literal String, then dt is WikibaseItem;
     – if dtp is neither Quantity nor URL, but is Literal String, then dt is String;
     – if dtp is not Quantity, but is URL, then dt is URL;
     – if dtp is Quantity, but not Date and not Globe Coordinate, then dt is Quantity;
     – if dtp is Quantity, not Date, and is Globe Coordinate, then dt is Globe Coordinate;
     – if dtp is Quantity and Date, then dt is Time.

                  Figure 8: Reengineering pattern for datatype detection


4.2     Alignment patterns

Alignment (or Mapping) patterns express semantic associations between two
vocabularies or ontologies [5, 10]. With alignment patterns, one can declaratively
express the associations between vocabularies. Such expressions can be captured
by some alignment language such as EDOAL.7 In our case, we slightly relax the
definition of vocabularies here to include not just Wikidata properties and items,
but also terms appearing as table headers and values. Thus, existing alignment

language like EDOAL cannot exactly capture the intended alignment. Instead,
we express an alignment as a set of RDF triples using our own vocabulary
whose intuitive meaning can be easily understood. This format also allows the
alignments to be shareable more easily.
    During the mapping, entity linking, and class linking phases, we identify a num-
ber of alignment patterns below, expressed declaratively as RDF graphs. Be-
low, we use the following URI prefixes: xsd for http://www.w3.org/2001/
XMLSchema#, wd for http://www.wikidata.org/entity/, wdt for http://www.
wikidata.org/prop/direct/, skos for http://www.w3.org/2008/05/skos#,
rdf for http://www.w3.org/1999/02/22-rdf-syntax-ns#, od2wd for http:
//od2wd.id/resource# and od2wd-prop for http://od2wd.id/property#.

Alignment pattern in mapping phase. The alignment pattern identified during
the mapping phase is between a non-protagonist column header of a table and a
Wikidata property. Let ColName be a column header and wdt:Y be the Wiki-
data property most likely associated with ColName according to the mapping
procedure in Fig. 4. Then, MapRes(ColName) = wdt:Y is expressed as a graph
structure in Fig. 9, which also includes the mapping relation information
(skos:broadMatch), the confidence score of the mapping, the URI of the mapping
procedure, and the time when the mapping was computed. We use skos:broadMatch
because column names in the source table tend to have a semantically narrower
meaning than the corresponding Wikidata properties. That is, ColName “has
broader concept” wdt:Y.

Alignment pattern during entity linking phase. During entity linking, we discover
an alignment pattern similar to the one in Fig. 9, this time between values in a
table and Wikidata entities (Fig. 10). Given a value in the table EntityName, the
linking phase described in Fig. 5a results in wd:Y, the most likely Wikidata entity
that matches EntityName. Note that this is only done for values of type
WikibaseItem. The provenance information is similar, but with skos:closeMatch
as the mapping relation.

_:link1 od2wd-prop:type skos:broadMatch ;
        od2wd-prop:from "ColName" ;
        od2wd-prop:to   wdt:Y ;
        od2wd-prop:confidence "Num"^^xsd:decimal ;
        od2wd-prop:generated_from od2wd:od2wdapi ;
        od2wd-prop:when "Time"^^xsd:dateTime .

     Figure 9: Alignment pattern for column headers and Wikidata properties.

_:link1 od2wd-prop:type skos:closeMatch ;
        od2wd-prop:from "EntityName" ;
        od2wd-prop:to   wd:Y ;
        od2wd-prop:confidence "Num"^^xsd:decimal ;
        od2wd-prop:generated_from od2wd:od2wdapi ;
        od2wd-prop:when "Time"^^xsd:dateTime .

        Figure 10: Alignment pattern for table values and Wikidata entities.

_:link1 od2wd-prop:type skos:closeMatch ;
        od2wd-prop:from "ColName" ;
        od2wd-prop:to   wd:Y ;
        od2wd-prop:confidence "Num"^^xsd:decimal ;
        od2wd-prop:generated_from od2wd:od2wdapi ;
        od2wd-prop:when "Time"^^xsd:dateTime .

Figure 11: Alignment pattern for protagonist column ColName and Wikidata class-type
entity wd:Y


Alignment pattern during class linking phase. Protagonist columns are not map-
ped to Wikidata properties since their values are the subject entities. Instead,
they are linked to class entities, i.e., those occurring as the target of instance
of or subclass of properties in Wikidata, via the procedure given in Fig. 5b.
Fig. 11 shows an alignment pattern between protagonist columns and Wikidata
class-type entities, expressed similarly to the earlier two alignment patterns.
    In addition to the alignment between protagonist columns and Wikidata
class-type entities, we also observe an alignment between values in a protagonist
column and its Wikidata class-type entity. That is, such values are viewed as
instances of the class-type entity. This is described in Fig. 12.


5   System Performance and Evaluation
We measure the accuracy of each conversion step by comparing the system re-
sults with human-created gold standards. The experiment was done using 50
CSV documents coming from several Indonesian Open Data portals: Indonesia
Satu Data (http://data.go.id), Jakarta Open Data (http://data.jakarta.go.id),
and Bandung Open Data (http://data.bandung.go.id).

_:link1 od2wd-prop:type rdf:type ;
        od2wd-prop:type_context "ColName" ;
        od2wd-prop:from wd:Z ;
        od2wd-prop:to   wd:Y ;
        od2wd-prop:generated_from od2wd:od2wdapi ;
        od2wd-prop:when "Time"^^xsd:dateTime .

Figure 12: Alignment pattern asserting wd:Z as instance of wd:Y where wd:Z is the
result of LinkRes(c) with c a value under protagonist column ColName.

                     Table 2: Datatype Detection Evaluation Result

 Number of Correct Columns  Total Number of Columns  Average Accuracy per Document  Total Accuracy
 286                        349                      83.5%                          81.9%



5.1   Evaluation Result

This section discusses the evaluation results of several conversion phases,
especially those that have been identified as following a pattern.


Datatype Detection Datatype detection predicts the datatype of each column
of the table; the rules are based on the Wikidata datatypes, and further infor-
mation about the datatype rules and this phase as a whole can be found in the
previous section. To measure the performance of this phase, we measure the
accuracy of the system's datatype predictions. We compare the system's predic-
tions with a gold standard created with the help of Wikidata-educated human
judges using a 3-judge system: for each column, three human judges evaluate
the system prediction and give a verdict on whether the prediction is accurate
or not. The results of the experiment are shown in Table 2.
   From the table above we can see that the system obtains quite satisfactory
performance, with a total accuracy of 81.9% and an average accuracy per docu-
ment of 83.5%.


Protagonist Detection To check the accuracy of the protagonist detection
heuristic, we use 50 CSV files, the same files used in the datatype detection
evaluation, and label each CSV with its respective protagonist column. The
labelling was done manually by a researcher and several evaluators using a
3-judge system similar to the one used in the datatype detection evaluation.
We then compare the heuristic's guess of each document's protagonist with the
label to rate the heuristic's accuracy.
   For the protagonist detection phase, we obtain an accuracy of 88%. Though
not perfect, this score is satisfactory.

                 Table 3: Linking and Mapping Evaluation Result

      Phase                            Mapping Entity Linking Class Linking
      Number of Predictions           279     890            100
      Number of Correct Predictions   221     787            70
      Accuracy                         79.21% 88.42%          70%



Mapping and Linking To evaluate mapping and entity linking, we asked
human evaluators to rate the correctness of the predicted mapping and entity
linking of 50 CSVs with 279 column names and 890 cell values. Each evaluator
is given the predicted mapping between column names from CSV and Wikidata
properties and the predicted entity linking between cell values from CSV and
Wikidata entities. Every prediction is evaluated by 3 different evaluators.
    The class linking is similarly evaluated. Here, we asked human evaluators to
check the correctness of the predicted Wikidata class for the protagonist column
of 100 CSVs.
    Table 3 summarizes the result of our evaluation. We achieve an accuracy of
79.21% for mapping, 88.42% for entity linking, and 70% for class linking.


5.2   Result Discussion

The results we obtained are quite satisfactory: the system achieves a good
accuracy score in all of its phases. However, there are a few cases that are not
handled well and thus result in inaccuracies.
     Datatype detection is an early phase in the system. In the current implemen-
tation, a set of regular expressions is matched against the cell values of each
column in sample rows to determine their datatype. With this method, we obtain
an accuracy of roughly 80%. Inaccuracies in this phase are caused by irregular-
ities in the values themselves. As an example, a cell value consisting only of
numbers is normally a Quantity, but there are columns whose values are only
numbers yet should not have the Quantity datatype, such as a column containing
the identification codes of cemeteries.
     We have observed that there are other factors that could potentially be used
in determining a column's datatype, especially the column header. Some column
header names correspond to certain datatypes; for example, “nama” (name) is
more likely to indicate a WikibaseItem column. The incorporation of such factors
is left for future work.
     Inaccuracies were also caused by nested structures in the tables. For example,
the table ‘data-tps-kota-bandung’, containing waste collection location data, has
the cell value ‘Kel. Campaka 2 Rw - Sukaraja 1Rw - Pangkalan Auri’ in the column
‘Sumber Sampah (RW - Kelurahan - Kecamatan)’. The cell actually contains
compound information, i.e., waste source, neighbourhood code, subdistrict, and
district. A special case of nested structure is where the cell values of a column
are composed of multiple datatypes. Nested structures occur in less than 6% of
the datasets, and their special handling is left for future work.

    The protagonist detection phase has also shown good performance, with 88%
accuracy. Most of the errors were carried over from the datatype detection phase,
in which the correct protagonist column was judged to have a datatype other
than WikibaseItem. Consequently, the column was incorrectly not considered to
be the protagonist.
    For the mapping and linking phases, many errors occur when our model fails
to map to the correct property or entity. For example, our prediction failed to
map the cell value “SMA Negeri 10” to the appropriate entity: it maps “SMA
Negeri 10” to SMA Negeri 10 Padang (Q7391091), although the CSV from which
we obtained that cell value is about high schools in Jakarta, not in Padang. We
can resolve such cases by adding more context to the mapping process. For
example, because that CSV comes from the Jakarta Open Data portal, we can
use that information as context to filter out any high school outside Jakarta.
The mapping and linking phases also depend on the previous phases, such as
the datatype detection and protagonist detection phases, hence any error that
occurs in those phases will affect the result of our mapping and linking. From
this observation, we see that one phase can influence another phase's perfor-
mance. Hence, it is important to seek to improve the performance of all phases,
especially the early phases of the conversion process.

6    Conclusions and Future Work
OD2WD is a tool to convert tabular data in CSV format to RDF and republish it
into the Wikidata knowledge graph. The main idea and challenges of this process
are twofold: extracting triples from the source table and aligning them with the
Wikidata vocabulary. We approach the problem using patterns, which serve as a
blueprint for the conversion process. The OD2WD system is constrained to vertical
listing tables; while it currently only supports the CSV format, adding support
for other data formats should not pose a big problem because the conversion
process is practically the same as long as the table falls into the vertical listing
category.
    To measure the system performance, each phase of the system has been eval-
uated using tabular data from the Indonesia Open Data Portal, Jakarta Open
Data Portal, and Bandung Open Data Portal. We achieve 81.9% accuracy on
datatype detection, 87.7% accuracy on protagonist detection, 79.21% accuracy on
property mapping, 70% on class linking, and 88.42% on entity linking. These
results show that there is some degree of risk of inserting wrong information into
Wikidata; this risk, however, is mitigated because the user can manually check
the conversion results before publishing them to Wikidata. By the end of the
research, we had published 20,256 statements to Wikidata as a result of converting
data from the Open Data portals.
    For future work, we suggest adding more context, such as metadata from the
Open Data portals, to the mapping and linking phases to increase the accuracy
of the mapping and linking results.

Acknowledgements. This work is supported by the 2019 PITTA B research grant
“Analysis and Enrichment of Wikidata Knowledge Graph” from Universitas In-
donesia and the Wikimedia Indonesia project “Peningkatan Konten Wikidata”.
We thank the students of the Faculty of Computer Science, Universitas Indonesia, for
their help in the evaluation part of this work, as well as Raisha Abdillah from Wiki-
media Indonesia for her assistance during final checking before deploying the
conversion results to Wikidata.


References
 1. Berners-Lee, T.: Linked Data (2006), https://www.w3.org/DesignIssues/
    LinkedData.html
 2. Bizer, C., Seaborne, A.: D2RQ – Treating Non-RDF Databases as Virtual RDF
    Graphs. In: ISWC (Posters) (2004)
 3. Crestan, E., Pantel, P.: Web-scale table census and classification. In: WSDM (2011)
 4. Cyganiak, R.: Tarql (SPARQL for tables) (2019), http://tarql.github.io
 5. Gangemi, A.: Ontology design patterns for semantic web content. In: Gil, Y.,
    Motta, E., Benjamins, V.R., Musen, M.A. (eds.) The Semantic Web - ISWC 2005,
    4th International Semantic Web Conference, ISWC 2005, Galway, Ireland, Novem-
    ber 6-10, 2005, Proceedings. Lecture Notes in Computer Science, vol. 3729, pp.
    262–276. Springer (2005)
 6. Gangemi, A., Presutti, V.: Ontology design patterns. In: Staab, S., Studer, R.
    (eds.) Handbook on Ontologies (2009)
 7. Han, L., Finin, T., Parr, C.S., Sachs, J., Joshi, A.: RDF123: from spreadsheets to
    RDF. In: ISWC (2008)
 8. Knoblock, C.A., Szekely, P.A., Ambite, J.L., Goel, A., Gupta, S., Lerman, K.,
    Muslea, M., Taheriyan, M., Mallick, P.: Semi-automatically Mapping Structured
    Sources into the Semantic Web. In: ESWC (2012)
 9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
    sentations in vector space. In: ICLR (2013)
10. Scharffe, F., Zamazal, O., Fensel, D.: Ontology alignment design patterns. Knowl.
    Inf. Syst. 40(1), 1–28 (2014). https://doi.org/10.1007/s10115-013-0633-y
11. Das, S., Sundara, S., Cyganiak, R. (eds.): R2RML: RDB to RDF Mapping Lan-
    guage. W3C Recommendation (27 September 2012), https://www.w3.org/TR/
    r2rml/
12. Tandy, J., Herman, I., Kellogg, G. (eds.): Generating RDF from Tabular Data on
    the Web. W3C Recommendation (17 December 2015), https://www.w3.org/TR/
    csv2rdf/
13. Villazón-Terrazas, B., Priyatna, F.: Building ontologies by using re-engineering
    patterns and R2RML mappings. In: Blomqvist, E., Gangemi, A., Hammar, K.,
    Suárez-Figueroa, M.C. (eds.) Proceedings of the 3rd Workshop on Ontology Pat-
    terns, Boston, USA, November 12, 2012. CEUR Workshop Proceedings, vol. 929.
    CEUR-WS.org (2012), http://ceur-ws.org/Vol-929/paper10.pdf
14. van der Waal, S., Wecel, K., Ermilov, I., Janev, V., Milosevic, U., Wainwright, M.:
    Lifting Open Data portals to the data web. In: Auer, S., Bryl, V., Tramp, S. (eds.)
    Linked Open Data - Creating Knowledge Out of Interlinked Data (2014)
15. World Wide Web Foundation: Open data barometer - leaders edition (2018),
    http://bit.ly/odbLeadersEdition