Predicting Wikipedia Infobox Type Information
    using Word Embeddings on Categories

             Russa Biswas1,2, Maria Koutraki1,2, and Harald Sack1,2
        1 FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, Germany
        2 Karlsruhe Institute of Technology, Institute AIFB, Germany
                           {firstname.lastname}@kit.edu


        Abstract. Wikipedia has emerged as the largest multilingual, web-based
        general reference work on the Internet. A huge amount of human effort
        has been invested in creating and updating Wikipedia articles, which
        are ideally complemented by so-called infobox templates that define the
        type of the underlying article. The Wikipedia infobox type information
        is, however, often incomplete and inconsistent. At the same time, it
        plays a fundamental role for the RDF type information of Wikipedia-based
        Knowledge Graphs such as DBpedia, which creates a need for correct and
        complete infobox type information. In this work, we propose an approach
        to predict Wikipedia infobox types by using word embeddings on the
        categories of Wikipedia articles, and we analyze the impact of using
        minimal information from the Wikipedia articles in the prediction
        process.

        Keywords: Wikipedia · Infobox · Word Embeddings · Text Classification


1    Introduction

Wikipedia has become the largest and most widely used multilingual open
encyclopedia. A huge amount of human skill, expertise, and effort goes into the
creation of Wikipedia articles. An article comprises both structured data and
unstructured free text. The structured data is represented in the form of an
infobox containing property-value pairs that summarize the content of the
article. An infobox is a fixed-format table, usually added to consistently
present a summary of some unifying aspects that articles share, and sometimes
to improve navigation to other interrelated articles. Furthermore, infobox
information is widely used in different Knowledge Graphs (KGs) such as DBpedia.
    Wikipedia infobox templates are created and assigned based on the categorical
type of the article, i.e. articles belonging to a specific genre or type should
be assigned the same template. The infobox type of a Wikipedia article is
assigned based on discussions between the contributors and the editors of the
article. However, no integrity tests are conducted to determine the correctness
of the infobox assignment. This leads to the assignment of incomplete and/or
incorrect infobox type information, which eventually leads to erroneous RDF type
information in the KGs [2].
    The prediction of Wikipedia infobox types can be viewed as a text
classification problem [7] in which the infobox types are the classes. In this
paper, we classify Wikipedia articles by exploiting word embeddings on the
Wikipedia categories the articles belong to. Moreover, we study the impact of
using minimal information, such as only the first sentence of the abstract
instead of the whole abstract, in the classification process. This work is
inspired by the work on Wikipedia infobox type prediction in [3] and extends
that approach by making use of Wikipedia categories.

    Fig. 1. Feature extraction from the Wikipedia articles (best viewed in color)


2   Related Work

Wu et al. [8] proposed KYLIN, a method for automatically creating new infoboxes
and updating existing incomplete ones by learning a CRF extractor over
common attributes. Sultana et al. [6] proposed an automated Wikipedia infobox
type prediction that trains an SVM classifier over the first k sentences of
the articles as well as categories and named entities. Bhuiyan et al. [1]
focus on an automated NLP-based infobox type prediction system.
    The work presented in this paper is inspired by another work on Wikipedia
infobox type prediction by Biswas et al. [3], in which the prediction problem
is converted to a classification problem and word and graph embeddings are
applied to generate the feature set for various classifiers. In this work,
however, our aim is to predict Wikipedia infobox types by leveraging word
embeddings on Wikipedia categories, followed by a neural network based
classification process. The proposed method does not focus on the creation of
new infoboxes; rather, it helps to predict correct infobox types.


3   Infobox Type Prediction

In this work, the infobox type prediction problem is considered as a text classi-
fication problem, in which infobox types are regarded as classes.

Features. Three features from the Wikipedia articles are used for the
classification process.
  – Table of Contents (TOC) is the collection of section headers and
    subheaders of the Wikipedia article.
  – Abstract (A) is the abstract of the Wikipedia article, i.e. a summary of
    the entire content.
  – Categories (C) are the categories the article belongs to in Wikipedia’s
    main categorization system, which is intended to group together articles
    on similar subjects.
Feature Vector. Word2Vec [5] word embeddings are applied to generate the
feature vectors. Word2Vec learns distributed representations of words, i.e.
dense low-dimensional vectors that replace high-dimensional word
representations while preserving the linguistic contexts of words.
    In this paper, the Google pre-trained word vectors (see footnote 3), each of
length 300, are used to generate a word vector for each word present in the
TOC, the abstract, and the categories. The Google pre-trained word2vec model
includes word vectors for a vocabulary of three million words and phrases,
trained on roughly 100 billion words from a Google News dataset.
    For each Wikipedia article, an abstract vector, a TOC vector, and a category
vector are generated. Each of them is obtained by adding up the word vectors of
all the words in the respective feature and normalizing the sum by the total
number of words present in that feature. Finally, a document vector is
generated by concatenating these three vectors, as shown in Figure 1.
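The construction of the document vector can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the toy 4-dimensional embedding table, the whitespace tokenization, and the decision to skip out-of-vocabulary words (and to normalize by the number of words that were found) are all assumptions made here, and the function names are hypothetical.

```python
# Toy stand-in for the pre-trained Word2Vec model: a plain dict from word to
# vector. The words and the 4-dimensional values are made up for illustration;
# the paper uses 300-dimensional Google News vectors.
TOY_EMBEDDINGS = {
    "rock":    [1.0, 0.0, 0.0, 0.0],
    "album":   [0.0, 1.0, 0.0, 0.0],
    "track":   [0.0, 0.0, 1.0, 0.0],
    "listing": [0.0, 0.0, 0.0, 1.0],
}
DIM = 4


def average_vector(text, embeddings, dim):
    """Add the vectors of all known words in `text` and divide by their count."""
    # Assumption: whitespace tokenization, unknown words are skipped.
    words = [w for w in text.lower().split() if w in embeddings]
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for w in words:
        for i, x in enumerate(embeddings[w]):
            total[i] += x
    return [x / len(words) for x in total]


def document_vector(abstract, toc, categories, embeddings, dim):
    """Concatenate the abstract, TOC, and category vectors (3 * dim values)."""
    return (average_vector(abstract, embeddings, dim)
            + average_vector(toc, embeddings, dim)
            + average_vector(categories, embeddings, dim))


doc = document_vector("rock album", "track listing", "album album rock",
                      TOY_EMBEDDINGS, DIM)
print(len(doc))  # 3 * DIM
```

With 300-dimensional word vectors, the same concatenation would yield a 900-dimensional document vector.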
Classification. Two classifiers have been trained to predict the Wikipedia
infobox types. For the Random Forest (RF) classifier, the aforementioned
document vector is used as the feature vector. For the multi-label
convolutional neural network (CNN), categories and TOC are treated as free
text, and the sentence classification method of [4] is applied, with each
Wikipedia article considered as a sentence.
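One way to read "each Wikipedia article is considered as a sentence" is that the category names and section headers of an article are flattened into a single token sequence, which is then mapped to integer indices and padded to a fixed length before it is fed to the embedding layer of a sentence CNN. The sketch below shows only this input preparation. The reserved indices, the padding scheme, the maximum length, and all function names are assumptions for illustration and are not taken from the paper.

```python
PAD, UNK = 0, 1  # reserved indices for padding and unknown tokens (assumed)


def article_as_sentence(categories, toc_headers):
    """Flatten category names and section headers into one token list."""
    tokens = []
    for phrase in list(categories) + list(toc_headers):
        tokens.extend(phrase.lower().split())
    return tokens


def build_vocab(token_lists):
    """Assign an integer index to every token seen in the training articles."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tokens in token_lists:
        for t in tokens:
            vocab.setdefault(t, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to indices, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(t, UNK) for t in tokens[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Illustrative input for a single article.
tokens = article_as_sentence(["Pink Floyd albums", "1979 albums"],
                             ["Track listing", "Personnel"])
vocab = build_vocab([tokens])
encoded = encode(tokens, vocab, max_len=10)
print(encoded)
```

Each index would then be looked up in an embedding matrix (for example, one initialized from the pre-trained word vectors) to form the input matrix over which the convolution filters slide.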


4     Results

The classifiers have been trained on the 30 most popular infobox types, with 5000
articles for each type, taken from the 2016 version of Wikipedia. Features
generated using TF-IDF serve as a baseline.
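For comparison with the embedding features, a TF-IDF feature vector can be sketched as below. The paper does not state which TF-IDF variant was used, so the weighting here (relative term frequency times log(N / df), without smoothing) and the whitespace tokenization are assumptions, and `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[t] / len(tokens)) * math.log(n_docs / df[t]) if tokens else 0.0
            for t in vocab
        ])
    return vocab, vectors


# Toy documents standing in for the text of three articles.
vocab, vectors = tfidf_vectors(["rock album", "rock band", "football club"])
print(vocab)
print(vectors[0])
```

Unlike the embedding features, every distinct term is an independent dimension here, so two different terms with related meanings contribute nothing to the similarity of two articles.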
    The experiments show that the categories of Wikipedia articles play a vital
role in determining the infobox type. With the CNN, categories alone predict the
infobox types with a micro F1-score of 96.8%, which is 0.7 percentage points
better than our previous result obtained when the prediction was based on the
entire abstract and TOC combined, as shown in Table 1. Using categories only
also results in higher scores than using the entire abstract. Furthermore, in
all the experiments the word embeddings approach performs better than the TF-IDF
baseline, since word embeddings are able to capture semantic similarities. For
instance, the Wikipedia article on the album The Wall by Pink Floyd is mostly
assigned to categories containing the word album. Moreover, the prediction
results are slightly better if categories are considered together with the
first sentence of the abstract instead of the whole abstract, which means that
less information is sufficient to infer Wikipedia infobox types. However,
rather similar results are obtained when TOC and categories are combined with
the entire abstract and when they are combined with only the first sentence of
the abstract.

                                With Embedding              TF-IDF
    Feature Set             RF(CV)  RF(Split)  CNN      RF(CV)  RF(Split)
    TOC                     65%     65.8%      76.5%    38%     32.3%
    A(full)                 86%     86.4%      95.1%    80%     80.4%
    A(1stSent)              82.2%   82%        93.5%    70.4%   71%
    C                       88%     88.3%      96.8%    33%     34.4%
    TOC + C                 88.6%   89%        97.6%    81%     81.7%
    A(full) + TOC           88%     88%        96.1%    83%     83.9%
    A(1stSent) + TOC        82%     82.1%      95%      77.8%   78.2%
    A(full) + C             88.6%   89%        97.6%    82%     82.4%
    A(1stSent) + C          88.4%   89.3%      98%      80.8%   81%
    A(full) + TOC + C       89%     89.7%      98.3%    84.6%   85.3%
    A(1stSent) + TOC + C    86%     87.1%      98.2%    83.3%   84.2%

    Table 1. Performance of the classifiers (micro F1-score) for each feature set

3 https://code.google.com/archive/p/word2vec/
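The micro F1-score reported in Table 1 pools true positives, false positives, and false negatives over all classes before computing precision and recall. A minimal sketch of this computation is given below. The label names and predictions are made up for illustration, and `micro_f1` is a hypothetical helper, not code from the experiments.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this label, but the true label differs
            elif t == label:
                fn += 1  # true label missed by the prediction
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up infobox type labels for four articles.
y_true = ["album", "album", "person", "film"]
y_pred = ["album", "person", "person", "film"]
score = micro_f1(y_true, y_pred)
print(score)
```

When every article receives exactly one predicted type, each wrong prediction counts once as a false positive and once as a false negative, so the micro F1-score coincides with plain accuracy.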

5     Conclusion
In this paper, the achieved results confirm that Wikipedia categories as well
as minimal text play a vital role in the prediction of infobox types. In the
future, we intend to design a semi-supervised approach to correct existing
Wikipedia infobox types and to predict infobox types for newly created articles.

References
1. Bhuiyan, H., Oh, K., Hong, M., Jo, G.: An Unsupervised Approach for Identifying
   the Infobox Template of Wikipedia Article. In: CSE (2015)
2. Biswas, R., Koutraki, M., Sack, H.: Exploiting Equivalence to Infer Type Subsump-
   tion in Linked Graphs. In: European Semantic Web Conference (2018)
3. Biswas, R., Türker, R., Moghaddam, F.B., Koutraki, M., Sack, H.: Wikipedia
   Infobox Type Prediction Using Embeddings. In: DL4KGS@ESWC (2018)
4. Kim, Y.: Convolutional Neural Networks for Sentence Classification. In: EMNLP
   (2014)
5. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Repre-
   sentations in Vector Space. CoRR (2013)
6. Sultana, A., Hasan, Q.M., Biswas, A.K., Das, S., Rahman, H., Ding, C.H.Q., Li, C.:
   Infobox Suggestion for Wikipedia Entities. In: CIKM (2012)
7. Türker, R., Zhang, L., Koutraki, M., Sack, H.: "The Less Is More" for Text
   Classification. In: SEMANTiCS (2018)
8. Wu, F., Weld, D.S.: Autonomously Semantifying Wikipedia. In: CIKM (2007)