The Need of Structured Data: Introducing the OKgraph Project (Extended Abstract)

Maurizio Atzori
Department of Math/CS, University of Cagliari, Via Ospedale 72, 09124 Cagliari (Italy)
atzori@unica.it, https://git.io/atzori

Abstract. Although many computational problems can be approached using Deep Learning, in this position paper we argue that in the case of Information Retrieval tasks this is not mandatory and can even be detrimental whenever alternatives exist. Instead of learning (by training) how to solve the full problem, we suggest splitting it into two sub-problems: a) producing structured data (specifically knowledge graphs) out of the corpora, and b) providing usable tools (including natural language) to query such structured data. Motivated by this two-step approach and its need of structured data, we introduce the Open Knowledge Graph (OKgraph) project, an initiative recently funded by Regione Autonoma della Sardegna aiming at providing insights on the first part of the problem: a general way of generating knowledge graphs from text corpora, unsupervisedly.

Keywords: word embeddings, knowledge graphs, unsupervised learning, machine understanding

The Need of Structured Data in Information Retrieval

Information Retrieval (IR) is about satisfying the information needs of users. From a very wide perspective, it means providing humans with a simple way to pose questions, for instance through natural language, and then computing the corresponding answer out of some given data, usually unstructured (texts, images, sounds, etc.). Depending on the complexity required to compute the answer, IR can be very challenging, ranging from simple keyword lookup to something close to, or even harder than, passing the Turing test. Therefore, on one side we have an unstructured question (e.g., “how many countries in Europe?”); on the other side we have some mostly-unstructured big data to use in order to answer. How to compute answers from the data is the problem.

The most innovative way of dealing with virtually any problem, especially those where humans are somehow involved such as in IR, is to resort to deep learning. In deep learning, what to do is induced from a vast amount of training data. In the case at hand, supposing the answer is not explicitly available in the data, the model should explicitly or implicitly learn, among other things, the concept of being a country, associate it with the appropriate entities, separate those in Europe from the others, and then count them.

This research path is disruptive because it generates useful computable functions no programmer is able to code. For instance, a function that, given an image, can list the objects contained within, or provide a textual description for it. But the greatness of deep learning comes at a cost, actually two drawbacks: on one hand, a) the generated function is somewhat a black box, that is, in most cases it does not provide any insight or knowledge about the problem; on the other hand, b) it requires a huge amount of training data, and for some problems such data simply does not exist yet.

We argue that for the specific research objectives of IR there is a different research path: exploiting what we already understand well about the problem. Using the same example of counting the countries in Europe, we have that projections, filtering, and aggregates are well-studied (for instance in the Database community) and every programmer can code these functions.
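As a concrete illustration of this point, consider the following minimal sketch over a hypothetical toy set of (subject, predicate, object) triples (the entity and predicate names are ours, chosen only for illustration): once the knowledge is structured, the question above reduces to a filter plus an aggregate, with no learning involved.

    # Toy knowledge graph as (subject, predicate, object) triples.
    # Hypothetical data for illustration only.
    triples = {
        ("Italy", "instanceOf", "country"),
        ("France", "instanceOf", "country"),
        ("Japan", "instanceOf", "country"),
        ("Italy", "locatedIn", "Europe"),
        ("France", "locatedIn", "Europe"),
        ("Japan", "locatedIn", "Asia"),
    }

    # "How many countries in Europe?" = filter + intersection + count.
    countries = {s for (s, p, o) in triples
                 if p == "instanceOf" and o == "country"}
    in_europe = {s for (s, p, o) in triples
                 if p == "locatedIn" and o == "Europe"}
    print(len(countries & in_europe))  # prints 2

Here the hard part is not the query, which decades of database research have made routine, but producing the triples in the first place.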
In other words, we do not need to learn (using deep learning, with its associated drawbacks) what we already know from decades of research in Databases. What we really need is to start from what databases need: structured data.

Knowledge Graphs

Knowledge Graphs are a machine representation of what Semantic Memories are for humans: general world knowledge that we have accumulated throughout our lives. For instance, knowing that Rome is (an instance of) a settlement, and that it is also the capital of Italy, which is a country. This kind of common sense knowledge is of paramount importance in a number of human tasks such as question answering, disambiguation, understanding, translation, learning, etc. The very same is true for machines, which are able to perform those tasks with high accuracy and recall whenever knowledge graphs about the given context are available.

Many successful crowdsourced projects focused on building curated knowledge that can be read by machines: Wikidata, Freebase, DBpedia (machine-generated from the humanly-curated infoboxes of Wikipedia), Wordnet. Some other projects focused instead on (automatic) ontology alignment of specific curated knowledge bases, such as BabelNet, which produced a multilingual dictionary by aligning parts of Wordnet, Wikidata and other knowledge bases.

All these projects, and their produced knowledge (either “curated-only” or “augmented from curated data”), have their merits and very successful applications, but they are also limited by two main weaknesses:

1. they cover only specific (biased) areas of knowledge: very deep and narrow in a few cases (such as LinkedMDB, which knows about movie actors' names but not about the number of their children) or wide but shallow (for instance, none of the previous projects are knowledgeable on, e.g., mobile device features, or recent news, or stock markets);
2. they are pretty much static, that is, once created they do not vary a lot; given the high cost of curating structured data, they tend to avoid time-changing topics (such as news or stock markets).

Further, ontologies and automatic procedures (such as alignment) consist of ad-hoc, special-purpose code that is expensive to maintain.

As a consequence, we do not have ontologies and structured data to answer questions such as “what was the average rate of return of the Italian market in December”, or “what's new in the last Android release”, or “which deputy was the first to sign the divorce law in Italy”. These limitations also affect the quality of existing knowledge bases. For instance, according to DBpedia, Robbie Williams is a “musical artist” while Lady Gaga is only a “person”, thus providing no information on her occupation or musical genres. Our project SWiPE (Searching WikiPedia by Example) [2, 1], allowing for structured searches in a user-friendly way, has drawn attention to these weaknesses and inconsistencies, clearly showing the need for a more effective way of producing complete and accurate knowledge bases [3].

Therefore, we want to address a currently under-investigated problem: learning machine-readable knowledge graphs from scratch, i.e., without necessarily relying on bootstrapping from existing curated knowledge.
Just as a newborn can learn “from scratch” by listening to other people's talk, an empty knowledge graph can be automatically fed with word information coming from natural language texts (such as generalist encyclopedias, news feeds, scientific magazines), opening up a new world of applications which are currently unavailable or require too much effort to be effective.

The OKgraph Project

This project is focused on investigating the fundamental relationship between unstructured data (such as natural language text) and structured data (graphs representing knowledge), eventually leading to an autonomous way of inferring the latter from the former. The main outcome of our research will be a computer system, called OKgraph, that learns meaningful graph triples autonomously from scratch, that is, from non-annotated free text such as Wikipedia, WMT11 text data, UMBC WebBase and other available corpora, continuously updating a self-generated knowledge graph (that may resemble DBpedia, Wikidata or Freebase, plus statistics).

The approach we are going to follow is based on the exploitation of linear regularities in word vector representations (vectors in an R^N vector space, with N indicatively ranging from 300 to 1000) obtained by state-of-the-art word embedding systems. In particular, we are interested in the analogies that such word vectors can represent. For instance, assuming vec(X) is the vector representing the word “X” and vicinity is given by cosine similarity, we have that vec(“Rome”) - vec(“Italy”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector. Also, vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”). (A minimal code sketch of this analogy arithmetic is given at the end of this section.) This linear relationship among vectors implies that not only are words semantically correlated through cosine similarity, but information on the specific kind of correlation (concept relationships such as “being capital of” or “married to”) is hidden in the vector representation.

Also inspired by our successful SWiPE system, where structured knowledge graphs (in particular entity-property-value relationships) are merged within text/HTML data in order to be easily queried through a QBE-like interface [4], we propose to extract entity-property-value data from the word vectors, through techniques to be developed within this project. Vector linearity over different dimensions (in the vector space model) makes our approach very promising.

While current mainstream research efforts focus either on crowdsourcing (that is, humanly-curated data, as in Wikidata) or on writing specific parsers for each attribute of interest (DBpedia), using word embeddings to generate knowledge graphs is a novel approach that is expected to be more scalable and more general than existing approaches, with possible disruptive outcomes for the scientific research area.

Learning graphs is a process that will be conducted using both very large corpora (such as the English Wikipedia and books from the Gutenberg Project) and medium to small corpora. For the latter, among others, we want to investigate the use of the Sardinian Wikipedia project (https://sc.wikipedia.org/), currently counting about 5000 entries. We expect many positive outcomes from the project, with a possibly large impact on society. Having machine-learned open knowledge graphs to populate Wikidata and Wikipedia infoboxes could have a positive impact on less-used languages such as Sardinian or other local languages.
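As promised above, the following minimal sketch reproduces the analogy arithmetic with off-the-shelf tooling; it uses gensim with any pre-trained embedding in word2vec format (the file name below is an assumption for illustration, not a project deliverable).

    # Minimal sketch of word-vector analogy arithmetic via gensim.
    # Assumes a word2vec-format embedding file; the name is hypothetical.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # vec("Rome") - vec("Italy") + vec("France") should be closest
    # to "Paris", with vicinity measured by cosine similarity.
    print(vectors.most_similar(positive=["Rome", "France"],
                               negative=["Italy"], topn=1))

    # vec("Germany") + vec("capital") should be close to "Berlin".
    print(vectors.most_similar(positive=["Germany", "capital"], topn=1))

OKgraph aims to go in the opposite direction of this query-style usage: rather than checking known analogies, it would mine such regularities from the vectors to emit candidate entity-property-value triples.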
More generally, we remark that beyond being useful as an interpretable instrument for humans (in contrast with other knowledge representations such as artificial neural networks), knowledge graphs are also beneficial in many areas such as factoid and general question answering, disambiguation, search engines, etc., therefore addressing the need of structured data for IR tasks discussed in the first part of the paper.

References

1. M. Atzori, S. Gao, G. M. Mazzeo, and C. Zaniolo. Answering end-user questions, queries and searches on Wikipedia and its history. IEEE Data Eng. Bull., 39(3):85-96, 2016.
2. M. Atzori and C. Zaniolo. SWiPE: searching Wikipedia by example. In WWW 2012, pages 309-312, 2012. Also featured on New Scientist, see https://www.newscientist.com/article/dn21625-new-search-tool-to-unlock-wikipedia.
3. M. Atzori and C. Zaniolo. Expressivity and accuracy of by-example structured queries on Wikipedia. In 24th IEEE WETICE 2015, pages 239-244, 2015.
4. M. M. Zloof. Query-by-example: the invocation and definition of tables and forms. In Proceedings of the 1st International Conference on Very Large Data Bases, VLDB '75, pages 1-24, New York, NY, USA, 1975. ACM.