<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Wikipedia Infobox Type Prediction Using Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Russa Biswas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rima Turker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Farshad Bakhshandegan-Moghaddam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Koutraki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Karlsruhe Institute of Technology, Institute AIFB</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leibniz Institute for Information Infrastructure</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>Wikipedia, the multilingual, free-content encyclopedia, has evolved into the largest and most popular general reference work on the Internet. Since its inception, crowdsourcing of articles has been one of the most salient features of this open encyclopedia. An enormous amount of work and expertise goes into the creation of a self-contained article. However, the infobox type information in Wikipedia articles is often incomplete, incorrect, or missing, because articles are created and edited manually. Moreover, the infobox type plays a vital role in RDF type inference in Knowledge Graphs such as DBpedia. Hence, it is necessary to have correct infobox type information in Wikipedia articles. In this paper, we propose an approach for predicting Wikipedia infobox types using both word and network embeddings. Furthermore, we analyze the impact of using minimalistic information, such as the Table of Contents and the Named Entity mentions in the abstract of a Wikipedia article, on the prediction process.</p>
      </abstract>
      <kwd-group>
        <kwd>Wikipedia</kwd>
        <kwd>Infobox</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Since its inception, Wikipedia has emerged as one of the largest multilingual
encyclopedias available on the Internet. It is the most widely used general
reference work and a non-profit crowdsourcing project owned by the Wikimedia Foundation<sup>3</sup>.
A huge amount of expertise and effort is involved in the creation of Wikipedia
articles, which are an amalgamation of the information contributed by humans
in all the different segments of the article layout. A
typical Wikipedia article comprises both structured and unstructured data. The
unstructured data consists of the text describing the article content, whereas the
structured data is represented in the form of an infobox containing property</p>
    </sec>
    <sec id="sec-2">
      <title>3 https://en.wikipedia.org/wiki/Wikipedia</title>
      <p>
        value pairs summarizing the content of the article. An infobox is a fixed-format
table usually added to consistently present a summary of some unifying aspects
that the articles share and sometimes to improve navigation to other
interrelated articles<sup>4</sup>. Furthermore, the structured data present in the infoboxes of
Wikipedia articles is widely used in different Knowledge Graphs (KGs) such
as DBpedia, Google's Knowledge Graph, and Microsoft Bing's Satori [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The selection of the infobox type is determined collaboratively through
discussion and consensus among the editors. The infobox types or templates are
created and assigned based on the categorical type of the articles, i.e. the same
template should be assigned to similar articles. However, no integrity tests are
conducted to check the correctness of the infobox assignments, leading to
erroneous infobox types [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10,11,12</xref>
        ]. For instance, George H. W. Bush and George
W. Bush are both former Presidents of the USA, but they have different
infobox types assigned to their respective Wikipedia articles. The former has the
infobox type office holder, whereas the latter has the infobox type president.
Additionally, it is not mandatory to select an infobox type when creating
an article. As a result, about 70% of the Wikipedia articles do not contain an infobox.
It has been observed that infoboxes are missing for newer articles or articles on
less popular topics.
      </p>
      <p>
        Moreover, RDF type information in KGs such as DBpedia is derived directly
from Wikipedia infobox types by automated information extraction. Therefore,
the completeness and correctness of the infobox type information is of great
importance. Different studies [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10,11,12</xref>
        ] have confirmed that the infobox
type information is often noisy, incomplete, and incorrect. The infobox
type prediction problem in Wikipedia can be viewed as a text classification
problem with infobox types as labels.
      </p>
      <p>In this paper, we present a novel approach to predict Wikipedia infobox types
by using word embeddings on the text present in the Table of Contents (TOC) and the
article's abstract, and additionally network embeddings on the Named Entities
mentioned in the abstract of the article. The TOC consists of the headings and
the subheadings of the article text, each summarizing the content of the
section underneath. To the best of our knowledge, Wikipedia
infobox types have not previously been predicted using different types of embeddings
with the TOC as one of the features. We study the impact of using a minimalistic
yet informative feature such as the TOC in the classification process via classical
and neural-network-based classifiers. Additionally, we analyze the importance of
the Named Entities mentioned in the abstract of the article for determining the infobox
type.</p>
      <p>The rest of the paper is structured as follows. A review of
the related work is provided in Section 2, followed by a description of the
approach in Section 3. Section 4 outlines the experimental
setup, and Section 5 reports the results. Finally, an outlook on
future work is provided in Section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>4 https://en.wikipedia.org/wiki/Help:Infobox</title>
      <sec id="sec-3-1">
        <title>Related Work</title>
        <p>
          Scope. The aim of this work is to predict infobox types by leveraging the
Table of Contents, the abstract, and the Named Entity mentions in the abstract of
Wikipedia articles. This section presents prior related work on infobox type
prediction. RDF type information in DBpedia is derived from Wikipedia infobox
type information. Therefore, Wikipedia infobox type prediction is
closely related to RDF type prediction, which is covered first
below, followed by Wikipedia infobox type prediction.
        </p>
        <p>
          RDF Type Prediction. SDType, a statistical, heuristic, link-based type prediction
mechanism, was proposed by Paulheim et al. and evaluated on
DBpedia and OpenCyc [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Melo et al. studied RDF type prediction in KGs, where
the type prediction is performed via the
hierarchical SLCN algorithm using a set of incoming and outgoing relations as
features for classification [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Kliegr et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] proposed a supervised
hierarchical SVM classification approach for predicting the RDF types in DBpedia by
exploiting the contents of Wikipedia articles.
        </p>
        <p>These approaches infer the DBpedia RDF type information
of entities by taking into account properties present in DBpedia or Wikipedia
content. In contrast, we intend to predict Wikipedia infobox types
by considering the TOC, the abstract, and Named Entities. Hence, these approaches are
related to, but different from, the work proposed in this paper.</p>
        <p>
          Wikipedia Infobox Type Prediction. One of the initial works in this domain
was proposed by Wu et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. They presented KYLIN, a prototype that
automatically creates new infoboxes and updates existing incomplete ones. To
do so, KYLIN takes into account pages having similar infoboxes and determines the
common attributes in them to create training examples, followed by learning
a CRF extractor. KYLIN also automatically identifies missing links for proper
nouns on each page, resolving each to a unique identifier.
        </p>
        <p>
          Sultana et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] focus on automated Wikipedia infobox type prediction
by training an SVM classifier on TF-IDF features of the first k sentences
of an article as well as on categories and Named Entity mentions.
        </p>
        <p>
          Yus et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] introduced a tool based on Semantic Web technologies that
uses statistical information from Linked Open Data (LOD) to create, update, and
suggest infoboxes to users during the creation of a Wikipedia article. Bhuiyan et
al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] presented an automated, NLP-based, unsupervised infobox type prediction
approach that exploits the hyponyms and holonyms in Wikipedia articles.
        </p>
        <p>In contrast to the aforementioned works, we propose a classification-based
infobox type prediction approach that combines word embeddings and network
embeddings to generate feature vectors instead of TF-IDF. The proposed method
does not focus on creating new infobox templates; rather, it focuses on correcting
existing infobox types and adding missing ones to articles. Furthermore, unlike
prior works, the TOC is also considered as one of the features for the classification
process.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Infobox Type Prediction</title>
        <p>This section explains the workflow and the
methodologies used for the multi-label classification process to predict the infobox type
information. The workflow is illustrated in Figure 1.</p>
        <sec id="sec-3-2-1">
          <title>Features</title>
          <p>Three different features are extracted from the Wikipedia articles for the
classification task:</p>
          <list list-type="bullet">
            <list-item><p>TOC<sup>5</sup>, which is automatically generated from the section and subsection
headers of a Wikipedia article and summarizes the content of each section
in a single word or a short phrase.</p></list-item>
            <list-item><p>Abstract (A) of the Wikipedia article, i.e. the summary of the entire article
content.</p></list-item>
            <list-item><p>Named Entities (E) present in the abstract of the article, which are most
likely to be related to the article and are hence assumed to provide additional information
for the classification process. The internal hyperlinks within Wikipedia are
used to identify the Named Entities mentioned in the abstract.</p></list-item>
          </list>
        </sec>
        <sec id="sec-3-2-2">
          <title>Embeddings</title>
          <p>Both word and network embeddings are used to generate features for the
classifiers.</p>
          <p>
            Word2Vec. Word2vec [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] aims to learn distributed representations of words,
reducing the high-dimensional word representations while capturing the
semantic similarities between words in large samples of text. The semantic similarity
of linguistic terms, based on their distribution in the TOCs as well as in
the abstracts of different types of Wikipedia articles, is vital. In this paper, the
Google pre-trained word vectors<sup>6</sup> are used to generate word vectors for each
word present in the TOC and the abstract. The Google pre-trained word2vec model
includes word vectors for a vocabulary of three million words and phrases trained
on roughly 100 billion words from a Google News dataset. The vectors have
300 dimensions. Note that the information in the TOC
is treated as free text in this work.
          </p>
          <p>
            RDF2Vec. RDF2Vec [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] is an approach for learning latent representations of the entities of
a KG in a lower-dimensional feature space, with the property that semantically
similar entities appear closer to each other in that space. In this work,
the pre-trained RDF2Vec uniform model vectors from DBpedia<sup>7</sup> have been used
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5 https://en.wikipedia.org/wiki/Help:Section</title>
    </sec>
    <sec id="sec-5">
      <title>6 https://code.google.com/archive/p/word2vec/</title>
    </sec>
    <sec id="sec-6">
      <title>7 http://data.dws.informatik.uni-mannheim.de/rdf2vec/models/DBpedia/2016-04/</title>
      <p>to extract the vectors for the Named Entities mentioned in the abstract of the
Wikipedia article. These vectors have 200 dimensions.
Similar to the word2vec word vectors, they are generated by learning a
distributed representation of the entities and their properties in the underlying
KG. The intuition behind incorporating the vectors of the Named Entities
mentioned in the abstract of a Wikipedia article into the feature set is to bring the
features or properties of the different entities from the DBpedia KG into the
classification process.</p>
      <sec id="sec-6-1">
        <title>Feature Vectors</title>
        <p>The feature vectors for each of the Wikipedia articles are generated using the
following steps:</p>
        <p>Step 1: Extract word vectors for each word in the TOC as well as the abstract from
the Google pre-trained model. Also, extract entity vectors for the Named
Entities mentioned in the abstract from the RDF2Vec pre-trained model of
DBpedia version 2016-04.</p>
        <p>Step 2: Generate an abstract vector for each document by performing vector addition
on all the word vectors of the abstract and normalize by the total number
of words present in the abstract. Similarly, TOC vectors and entity vectors
are also generated.</p>
        <p>Step 3: Generate document vectors. Two sets of document vectors are generated
for the training of two classifiers. The abstract vector of each document
is concatenated separately with the TOC vector and with the entity vector of the
corresponding document to generate the document vectors.</p>
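        <p>Steps 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lookup tables, the function names mean_vector and document_vector, and the choice to skip out-of-vocabulary tokens (and therefore to divide by the number of tokens actually found) are assumptions made for illustration only.</p>
        <preformat>
```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(tokens: Sequence[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of all tokens found in `table` (Step 2)."""
    total = [0.0] * dim
    found = 0
    for token in tokens:
        vec = table.get(token)
        if vec is None:
            continue  # assumption: out-of-vocabulary tokens are skipped
        total = [t + v for t, v in zip(total, vec)]
        found += 1
    if found == 0:
        return total  # all-zero vector when nothing could be looked up
    return [t / found for t in total]


def document_vector(abstract_vec: Vector, other_vec: Vector) -> Vector:
    """Concatenate the abstract vector with a TOC or entity vector (Step 3)."""
    return list(abstract_vec) + list(other_vec)


# Toy low-dimensional lookup tables standing in for the pre-trained models
# (hypothetical values; the real vectors have 300 and 200 dimensions).
word_vectors = {"early": [1.0, 0.0], "life": [0.0, 1.0], "career": [1.0, 1.0]}
entity_vectors = {"dbr:Seattle": [0.5, 0.5, 0.0]}

abstract_vec = mean_vector(["early", "life", "unknown"], word_vectors, dim=2)
toc_vec = mean_vector(["career"], word_vectors, dim=2)
entity_vec = mean_vector(["dbr:Seattle"], entity_vectors, dim=3)

abstract_toc = document_vector(abstract_vec, toc_vec)        # [0.5, 0.5, 1.0, 1.0]
abstract_entity = document_vector(abstract_vec, entity_vec)  # [0.5, 0.5, 0.5, 0.5, 0.0]
```
        </preformat>
        <p>With the real models, the two concatenations would yield 600-dimensional (abstract + TOC) and 500-dimensional (abstract + entity) document vectors, which also illustrates the unequal vector lengths of the two embedding types.</p>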
      </sec>
      <sec id="sec-6-2">
        <title>Classification</title>
        <p>
          As already discussed, the Wikipedia infobox type prediction problem can be
reduced to a classification of the Wikipedia articles with word vectors and entity
vectors as features. In this work, we have trained
two classifiers on the Wikipedia articles: Random Forest (RF) and a multi-label Convolutional Neural
Network (CNN). For the Random Forest classifier, the aforementioned document
vectors coupled with the infobox type labels are used to train the
classifier. Random Forest, as an ensemble method, is less likely to overfit. Moreover,
training on bagged subsets of the training set reduces the effects of outliers
in the data, if any. For the CNN, the sentence-level
classification approach discussed in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] has been used for the classification
process. The Google pre-trained word vectors are used to generate the vectors in
the embedding layer, followed by a fully connected softmax layer whose
output is the probability distribution over the labels. A detailed description of the
experimental setup is provided in the following section.
        </p>
        <sec id="sec-6-2-1">
          <title>Experimental Setup</title>
          <p>
            This section describes the dataset and the
generation of the ground truth. For the Random Forest classifier, the Python scikit-learn<sup>8</sup>
library has been used. For the CNN model, which classifies the Wikipedia articles
based on the sentence classification concept described in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], TensorFlow
version 1.0 has been used to build the model<sup>9</sup>.
          </p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>Dataset</title>
        <p>The Wikipedia 2016<sup>10</sup> version and the RDF2Vec pre-trained model for DBpedia version
2016-04 have been used for this work. This Wikipedia version contains around 2000 different
Wikipedia infobox types. The frequency distribution
of the Wikipedia infobox types follows Zipf's law, as shown in Figure 2. The
x-axis represents the infobox types as numbers and the y-axis represents the count
of entities per infobox type.</p>
        <p>More than half of all Wikipedia pages do not contain an infobox. Also,
some Wikipedia articles contain more than one infobox type. However,
in this work we considered only articles having a single infobox type. The
statistics of the Wikipedia articles with infobox types are provided in Table 1,
which lists the number of Wikipedia articles containing a TOC, an abstract,
an infobox, and the combination of all three. Wikipedia redirect pages,
disambiguation pages, and list pages are excluded from the dataset.</p>
        <p>In this work, based on the popularity of the infoboxes, we have considered
the top 30 infobox types as labels with 5000 articles per label to train the classifiers.
These 5000 articles per infobox type are selected by random sampling without
replacement, and the experiments are carried out with three such datasets
of the same size. The experiments are designed
to study the impact of the TOC and the Named Entities separately as well as in
combination with the abstract.
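        </p>
        <p>[Figure 2: frequency distribution of the Wikipedia infobox types; y-axis: Frequency (entity per infobox type).]</p>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Features</th><th>#Wikipedia articles</th></tr>
            </thead>
            <tbody>
              <tr><td>TOC</td><td>9,959,830</td></tr>
              <tr><td>Abstract (A)</td><td>4,935,279</td></tr>
              <tr><td>Infobox (I)</td><td>2,626,841</td></tr>
              <tr><td>TOC + A + I</td><td>2,575,966</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For the Random Forest classifier, the experiments are carried out with both 5-fold
cross-validation (CV) and a split with 80% of the data as the training set and 20%
as the test set. Likewise, for the CNN, 80% of the data is used as the training set and 20%
as the test set. The results are discussed in the next section.</p>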
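        <p>The sampling and the 80/20 split described above can be sketched as follows. This is only an illustration on toy data: the function names, the seeds, and the use of a plain (non-stratified) random split are assumptions, since the text does not specify these details.</p>
        <preformat>
```python
import random
from typing import Dict, List, Tuple


def sample_per_label(articles_by_label: Dict[str, List[str]], per_label: int,
                     seed: int) -> List[Tuple[str, str]]:
    """Draw `per_label` articles per infobox type, without replacement."""
    rng = random.Random(seed)
    sampled = []
    for label in sorted(articles_by_label):
        chosen = rng.sample(articles_by_label[label], per_label)
        sampled.extend((article, label) for article in chosen)
    return sampled


def train_test_split(pairs, train_fraction: float, seed: int):
    """Shuffle the labelled pairs and cut them into a train and a test part."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy data: 2 labels with 10 articles each instead of 30 labels with 5000 each.
toy = {
    "person": [f"person_{i}" for i in range(10)],
    "settlement": [f"settlement_{i}" for i in range(10)],
}
dataset = sample_per_label(toy, per_label=5, seed=0)
train, test = train_test_split(dataset, train_fraction=0.8, seed=0)
```
        </preformat>
        <p>Calling sample_per_label with three different seeds would give three datasets of the same size, in the spirit of the three datasets mentioned above.</p>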
      </sec>
      <sec id="sec-6-4">
        <title>Ground Truth</title>
        <p>Although cross-validation techniques are adequate, testing the proposed approach on
unseen data reveals how well the model generalizes and how robust it is. To the
best of our knowledge, no benchmark exists in the field of Wikipedia infobox
type prediction. Therefore, a ground truth has been generated automatically for this
purpose. The manual creation of ground truth has the advantage of yielding
benchmarks that incorporate human knowledge on the topic. On the other hand, it
has the significant disadvantage of requiring huge manual effort, so that only a
very small amount of ground truth can be generated. Therefore, in this work, the focus
was to generate the ground truth automatically while preserving the characteristics of
a manually created one. To do so, first, all the articles without infoboxes were
extracted from Wikipedia version 2016<sup>11</sup>. Second, these articles were checked
against the latest Wikipedia version 2018<sup>12</sup> to find whether an infobox
type existed for them in the new version. This approach leads to a ground
truth comprising 32000 Wikipedia articles in total.</p>
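        <p>The two-step comparison of Wikipedia versions can be sketched as a lookup between two snapshots. The dictionaries below are hypothetical toy data, and the function name build_ground_truth is an assumption; parsing the actual dumps is out of scope here.</p>
        <preformat>
```python
from typing import Dict, Optional


def build_ground_truth(old_types: Dict[str, Optional[str]],
                       new_types: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Keep articles with no infobox in the old dump but one in the new dump."""
    ground_truth = {}
    for title, old_type in old_types.items():
        if old_type is not None:
            continue  # article already had an infobox in the older version
        new_type = new_types.get(title)
        if new_type is not None:
            ground_truth[title] = new_type
    return ground_truth


# Hypothetical toy snapshots: article title -> infobox type (None = no infobox).
snapshot_2016 = {"A": None, "B": "person", "C": None, "D": None}
snapshot_2018 = {"A": "settlement", "B": "person", "C": None}

ground_truth = build_ground_truth(snapshot_2016, snapshot_2018)
print(ground_truth)  # {'A': 'settlement'}
```
        </preformat>
        <p>Article B is excluded because it already had an infobox, C because it still has none, and D because it no longer appears in the newer snapshot.</p>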
      </sec>
      <sec id="sec-6-5">
        <title>Baseline</title>
        <p>
          TF-IDF is one of the most widely used methods to generate vectors for
text classification problems [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. As the infobox type prediction problem can be
reduced to a classification problem, TF-IDF is considered as a
baseline in this study.
        </p>
        <p>11 http://downloads.dbpedia.org/2016-10/core-i18n/en/pages articles en.xml.bz2</p>
        <p>12 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
        </p>
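        <p>For readers unfamiliar with the baseline, the following sketch computes a textbook TF-IDF weighting (relative term frequency times the logarithm of the inverse document frequency) on two toy TOCs. It is not necessarily the variant used in the experiments; library implementations often add smoothing and normalization. The function name tf_idf and the toy headings are assumptions, and documents are assumed to be non-empty.</p>
        <preformat>
```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Two toy TOCs: the shared heading word gets weight 0, the others do not.
tocs = [["early", "life", "career"], ["background", "family", "career"]]
vectors = tf_idf(tocs)
```
        </preformat>
        <p>In this representation every distinct heading word is its own dimension, so two headings with related but different words share no weight.</p>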
        <sec id="sec-6-5-1">
          <title>Results and Discussion</title>
          <p>The experiments establish that the text present in the TOC of a
Wikipedia article is minimal yet informative and that its contribution to the
classification process is not insignificant. With the CNN classifier, it reaches a
micro-F1 score of 76.5%, and around 65% with Random Forest. In
contrast, with the TF-IDF vectorizer the accuracy is very low. This is due
to the fact that TF-IDF captures how important a word is for a document,
whereas word embeddings capture the semantic similarity of the words in
the text. Since the TOC is generated from the headings provided by the authors
when creating the articles, and no guidelines are available, the
vocabulary of the headings varies from article to article. Hence, TF-IDF fails to
capture the semantic similarity between them. For instance, in the Wikipedia
article for Bill Gates, his early life is described under the section named `Early
Life', whereas for Steve Jobs, his early life is described under the section
`Background' and the subsections `Biological and adoptive family' and `Birth'. TF-IDF
fails because it treats them as different words. On the other hand, `family'
appears among the top-5 most similar words for `life' using the Google pre-trained vectors.
Hence, the semantic similarity of the words is taken into account. This also explains the
reason for using the pre-trained vectors from Google. Therefore, it can be
inferred that word embeddings over minimalistic yet informative keywords
from an article are capable of predicting Wikipedia infobox types.</p>
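          <p>The scores in this section are reported as micro-F1. As a reminder of what this measures, the sketch below pools true positives, false positives and false negatives over all labels. It assumes that every article carries exactly one true and one predicted infobox type, as in the dataset described above; under that assumption micro-F1 coincides with accuracy. The function name and the toy labels are illustrative.</p>
          <preformat>
```python
from typing import List


def micro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Micro-averaged F1 over all labels for single-label predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = ["person", "person", "settlement", "film"]
y_pred = ["person", "film", "settlement", "film"]
score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)  # 0.75 0.75
```
          </preformat>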
          <p>However, using the abstract as a feature to determine infobox types improves
the quality of the results both for TF-IDF and for word embeddings. A huge
improvement is observed for the TF-IDF approach, because longer texts
improve the quality of the TF-IDF vectors. Nevertheless, embeddings still
work better for the classification process, as they capture the semantic similarities in
the text.</p>
          <p>Combining both features, abstract and TOC, provides the best
result in all cases. The data is now less sparse, so TF-IDF works better
than when considering only the TOC. Also, compared with using only the abstract, the
micro-F1 score improves by 2% for Random Forest and by 1% for the CNN classifier.
It can be inferred that for articles with very little text
in the abstract but a well-sectioned body, the TOC
plays a vital role in the prediction process.</p>
          <p>Additionally, for all the feature set combinations listed in Table
2, the CNN performed better than Random Forest, which is consistent with the observation that CNN models
benefit from large datasets. We also carried out experiments with an SVM
for a couple of the aforementioned feature sets and obtained results similar to those
of the Random Forest classifier. However, multi-class SVM classification with the
one-vs-all approach was computationally expensive compared to Random
Forest for this task.</p>
          <p>The impact of the Named Entities on the classification process has also been
studied, as shown in Table 3. With only the Named Entities
in the abstract, the classification process is not as effective, achieving a micro-F1
score of around 45% with the Random Forest classifier and 62% with the CNN model.</p>
          <p>[Table 2: columns Feature Set, RF(CV), RF(Split), CNN.]</p>
          <p>[Table header: RF(CV), RF(Split).]</p>
          <p>However, adding the abstract word vectors to the entity vectors of an
article improves the score to up to 86%. Moreover, the combination of the
abstract, TOC, and Named Entity vectors from the word and network
embeddings improves the classification accuracy to 87% using Random Forest.
This is a considerable improvement over the TF-IDF score, which
is 83% with all the features combined. Furthermore,
classification with the Table of Contents achieves a better micro-F1 score
than with the Named Entities for both the Random Forest and the CNN approach.
Hence, it can be inferred that the word vectors for the words in the Table of
Contents capture more features relevant to the Wikipedia
infobox type prediction problem than the entity vectors extracted from
the RDF2Vec model. However, an experiment with the CNN model combining
both word and entity embeddings has not been performed because of
the unequal length of the pre-trained vectors.</p>
          <p>Lastly, the Wikipedia articles in the ground truth, i.e. the articles that had no infobox
type in the 2016 version, tend to be shorter and less informative for various reasons, such as being
relatively new, covering a less popular topic, or the contributor's lack of knowledge of the topic.
Therefore, the articles in the ground truth have different characteristics
compared to the training data. Predicting infobox types on the ground truth
for the articles having a TOC and an abstract using the trained Random Forest model
yields a micro-F1 score of 53.7%.</p>
        </sec>
        <sec id="sec-6-5-2">
          <title>Conclusion</title>
          <p>In this paper, a novel approach for Wikipedia infobox type prediction based on
different types of embeddings has been analyzed. Also, the impact of using the TOC
and Named Entities separately as features to predict an infobox type has been
studied. The results confirm that adding just the TOC to
the abstract in the feature set improves the accuracy of the classification process.
Entities, if used together with the abstract, also have
a positive impact on the classification process. Wikipedia infobox type
prediction is an important task, as KGs such as DBpedia are constructed by automatic
information extraction from Wikipedia infoboxes. Hence, cleaning and assigning
the correct Wikipedia infobox types indirectly leads to an improvement of the
DBpedia type information. Additionally, this method can be extended to any
number of Wikipedia infobox type classes for articles having an abstract and/or a
TOC. Next, we would like to train a CNN model combining both word and
network embeddings. Moreover, as network embedding models can be applied to
this task, we would like to train a network embedding model directly on the
Wikipedia articles, instead of using the Google vectors, and analyze its impact on the
classification process.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bhuiyan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>An Unsupervised Approach for Identifying the Infobox Template of Wikipedia Article</article-title>
          . In: CSE. pp.
          <fpage>334</fpage>
          –
          <lpage>338</lpage>
          . IEEE Computer Society (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text Categorization with Support Vector Machines: Learning with Many Relevant Features</article-title>
          .
          <source>In: Machine Learning: ECML-98, 10th European Conference on Machine Learning</source>
          , Chemnitz, Germany, April 21–23, 1998, Proceedings. pp.
          <fpage>137</fpage>
          –
          <lpage>142</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          .
          <source>In: EMNLP</source>
          . pp.
          <fpage>1746</fpage>
          –
          <lpage>1751</lpage>
          .
          ACL
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kliegr</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zamazal</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>LHD 2.0: A Text Mining Approach to Typing Entities in Knowledge Graphs</article-title>
          .
          <source>J. Web Sem</source>
          .
          <volume>39</volume>
          ,
          <fpage>47</fpage>
          –
          <lpage>61</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Kleef</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al.:
          <article-title>DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia</article-title>
          .
          <source>Semantic Web</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ),
          <fpage>167</fpage>
          –
          <lpage>195</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Melo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Volker, J.:
          <article-title>Type Prediction in RDF Knowledge Bases Using Hierarchical Multilabel Classi cation</article-title>
          .
          In:
          <source>WIMS</source>
          . p.
          <fpage>14</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR</source>
          abs/1301.3781 (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Type Inference on Noisy RDF Data</article-title>
          . In:
          <source>ISWC</source>
          . pp.
          <fpage>510</fpage>
          –
          <lpage>525</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>RDF2Vec: RDF Graph Embeddings for Data Mining</article-title>
          . In:
          <source>International Semantic Web Conference</source>
          . pp.
          <fpage>498</fpage>
          –
          <lpage>514</lpage>
          .
          <publisher-name>Springer</publisher-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sultana</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>Q.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biswas</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.H.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Infobox Suggestion for Wikipedia Entities</article-title>
          . In:
          <source>CIKM</source>
          . pp.
          <fpage>2307</fpage>
          –
          <lpage>2310</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Autonomously Semantifying Wikipedia</article-title>
          . In:
          <source>CIKM</source>
          . pp.
          <fpage>41</fpage>
          –
          <lpage>50</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Yus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mulwad</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mena</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Infoboxer: Using Statistical and Semantic Knowledge to Help Create Wikipedia Infoboxes</article-title>
          .
          In:
          <source>ISWC (Posters &amp; Demos)</source>
          .
          <series>CEUR Workshop Proceedings</series>
          , vol.
          <volume>1272</volume>
          , pp.
          <fpage>405</fpage>
          –
          <lpage>408</lpage>
          .
          <publisher-name>CEUR-WS.org</publisher-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>