From Sartre to Frege in Three Steps:
A* Search for Enriching Semantic Text Similarity Measures
Davide Colla, Marco Leontino, Enrico Mensa, Daniele P. Radicioni
University of Turin, Computer Science Department
davide.colla@unito.it, marco.leontino@unito.it, enrico.mensa@unito.it, daniele.radicioni@unito.it

Abstract

English. In this paper we illustrate a preliminary investigation on semantic text similarity. In particular, the proposed approach is aimed at complementing and enriching the categorization results obtained by employing standard distributional resources. We found that the paths connecting entities and concepts from the documents at stake provide interesting information on the connections between document pairs. Such a semantic browsing device enables further semantic processing, aimed at unveiling contexts and hidden connections (possibly not explicitly mentioned in the documents) between text documents.1

1 Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In the last few years many efforts have been spent to extract information contained in text documents, and a large number of resources have been developed that allow exploring domain-based knowledge, defining a rich set of specific semantic relationships between nodes (Vrandecic and Krötzsch, 2014; Auer et al., 2007; Navigli and Ponzetto, 2012). Being able to extract and to make available the semantic content of documents is a challenging task, with beneficial impact on different applications, such as document categorisation (Carducci et al., 2019), keyword extraction (Colla et al., 2017), question answering, text summarisation, semantic text comparison, building explanations/justifications for similarity judgements (Colla et al., 2018), and more. In this paper we present an approach aimed at extracting meaningful information contained in text documents, also based on background information contained in an encyclopedic resource such as Wikidata (Vrandecic and Krötzsch, 2014).

Although our approach has been devised on a specific application domain (PhD theses in philosophy), we argue that it can be easily extended to further application settings. The approach focuses on the ability to extract relevant pieces of information from text documents, and to map them onto the nodes of a knowledge graph, obtained from semantic networks representing encyclopedic and lexicographic knowledge. In this way it is possible to compare different documents based on their graph-based description, which has a direct anchoring to their semantic content.

We propose a system to assess the similarity between textual documents, hybridising the propositional approach (such as traditional statements expressed through RDF triples) with a distributional description (Harris, 1954) of the nodes contained in the knowledge graph, which are represented with word embeddings (Mikolov et al., 2013; Camacho-Collados et al., 2015; Speer et al., 2017). This step allows us to obtain similarity measures (based on vector descriptions and on path-finding algorithms) and explanations (represented as paths over a semantic network) that are more focused on the semantic definition of the concepts and entities involved in the analysis.

2 Related Work

Surveying the existing approaches requires briefly introducing the most widely used resources along with their main features.

Resources

BabelNet (BN) is a wide-coverage multilingual semantic network, originally built by integrating
WordNet (Miller, 1995) and Wikipedia (Navigli and Ponzetto, 2010). NASARI is a vectorial resource whose senses are represented as vectors associated to BabelNet synsets (Camacho-Collados et al., 2015). Wikidata is a knowledge graph based on Wikipedia, whose goal is to overcome problems related to information access by creating new ways for Wikipedia to manage its data on a global scale (Vrandecic and Krötzsch, 2014).

2.1 Approaches to semantic text similarity

Most literature on computing semantic similarity between documents can be arranged into three main classes.

Word-based similarity. Word-based metrics are used to compute the similarity between documents based on their terms; examples of the features analysed are common morphological structures (Islam and Inkpen, 2008) and word overlap (Huang et al., 2011) between the texts. In one of the most popular theories of similarity (Tversky's contrast model) the similarity of a word pair is defined as a direct function of their common traits (Tversky, 1977). This notion of similarity has recently been adjusted to model human similarity judgments for short texts in the Symmetrical Tversky Ratio Model (Jimenez et al., 2013), and employed to compute semantic similarity between word- and sense-pairs (Mensa et al., 2017; Mensa et al., 2018).

Corpus-based similarity. Corpus-based measures try to identify the degree of similarity between words using information derived from large corpora (Mihalcea et al., 2006; Gomaa and Fahmy, 2013).

Knowledge-based similarity. Knowledge-based measures try to estimate the degree of semantic similarity between documents by using information drawn from semantic networks (Mihalcea et al., 2006). In most cases only the hierarchical structure of the information contained in the network is considered, disregarding the types of the relations between nodes (Jiang and Conrath, 1997; Richardson et al., 1994); some authors consider the "is-a" relation (Resnik, 1995), but leave the more domain-dependent ones unexploited. Moreover, usually only concepts are considered, omitting Named Entities.

An emerging paradigm is that of knowledge graphs. Knowledge graph extraction is a challenging task, particularly popular in recent years (Schuhmacher and Ponzetto, 2014). Several approaches have been developed, e.g., aimed at extracting knowledge graphs from textual corpora, attaining a network focused on the type of documents at hand (Pujara et al., 2013). Such approaches may be affected by scalability and generalisation issues. In recent years many resources representing knowledge in a structured form have been proposed that build on encyclopedic resources (Auer et al., 2007; Suchanek et al., 2007; Vrandecic and Krötzsch, 2014).

As regards semantic similarity, a framework has been proposed based on entity extraction from documents, providing mappings to knowledge graphs in order to compute semantic similarities between documents (Paul et al., 2016). Their similarity measures are mostly based on the network structure, without introducing other instruments, such as embeddings, that are largely acknowledged as relevant in semantic similarity. Hecht et al. (2012) propose a framework endowed with explanatory capabilities, deriving explanations from similarity measures based on relations between Wikipedia pages.

3 The System

In this Section we illustrate the generation process of the knowledge graph from Wikidata, which will be instrumental to building paths across documents. Such paths are then used, at a later time, to enrich the similarity scores computed during the classification.

3.1 Knowledge Graph Extraction

The first step consists of the extraction of a knowledge graph related to the given reference domain. Wikidata is searched for concepts and entities related to the domain being analysed. Starting from the extracted elements, which constitute the basic nodes of the knowledge graph, we again consider Wikidata and look for relevant semantic relationships towards other nodes, not necessarily extracted in the previous step. The types of relevant relationships depend on the treated domain. Considering the philosophical domain, we selected a set of 30 relations relevant to comparing the documents. For example, we considered the relation movement, that represents the literary, artistic, scientific or philosophical movement; the relation studentOf, that represents the person who has taught the considered philosopher; and the relation influencedBy, that represents the person by whose ideas the considered philosopher has been influenced. In this way, we obtain a graph where each node is a concept or entity extracted from Wikidata; such nodes are connected by edges labeled with specific semantic relations.
[Figure 1 appears here: a graph fragment linking Aleksei Losev (hasInfluenced) to Immanuel Kant (isInfluencedBy) to Baruch Spinoza, together with Christian Jakob Kraus (hasAwardReceived), Rationalism (hasMovement / isMovementOf) and René Descartes, all anchored to document excerpts.]

Figure 1: A small portion of the knowledge graph extracted from Wikidata, related to the philosophical domain; nodes represent BabelNet synsets (concepts or NEs), rectangles represent documents.


The obtained graph is then mapped onto BabelNet. At the end of this first stage, the knowledge graph represents the relevant domain knowledge (Figure 1), encoded through BabelNet nodes that are connected through the rich set of relations available in Wikidata. Each text document can be linked to the knowledge graph, thereby making it possible to draw semantic comparisons by analysing the paths connecting document pairs.

Without loss of generality, we considered the philosophical domain, and extracted a knowledge graph containing 22,672 nodes and 135,910 typed edges; Wikidata entities were mapped onto BabelNet in approximately 90% of cases.
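As a concrete illustration of this extraction step, the sketch below queries the public Wikidata SPARQL endpoint for influencedBy edges among philosophers. It is a minimal sketch under our own assumptions: the identifiers (P737 for "influenced by", P106/Q4964182 for occupation philosopher), the restriction to a single relation, and the helper names are illustrative, not the paper's exact pipeline, which gathers 30 relation types.

```python
# Hedged sketch of one extraction step over the public Wikidata SPARQL
# endpoint; the real pipeline covers 30 relation types, not just one.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

# Assumed identifiers: P106 = occupation, Q4964182 = philosopher,
# P737 = influenced by.
QUERY = """
SELECT ?philLabel ?inflLabel WHERE {
  ?phil wdt:P106 wd:Q4964182 .
  ?phil wdt:P737 ?infl .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

def fetch_influenced_by_edges():
    """Return (subject, 'influencedBy', object) labelled triples."""
    resp = requests.get(WIKIDATA_SPARQL,
                        params={"query": QUERY, "format": "json"},
                        headers={"User-Agent": "kg-extraction-sketch/0.1"})
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return [(r["philLabel"]["value"], "influencedBy", r["inflLabel"]["value"])
            for r in rows]

if __name__ == "__main__":
    for subj, rel, obj in fetch_influenced_by_edges()[:10]:
        print(f"{subj} --{rel}--> {obj}")
```

Each returned triple becomes a typed edge of the knowledge graph; repeating such queries for the other selected relations and merging the results yields node and edge sets of the kind described above.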
3.2 Information extraction and semantic similarity

The second step consists in connecting the documents to the obtained knowledge graph. We harvested a set of 475,383 UK doctoral theses in several disciplines through the Electronic Theses Online Service (EThOS) of the British National Library.2 First, concepts and entities related to the reference domain were extracted from the considered documents, with a special focus on two different types of information: concepts and Named Entities. Concepts are keywords or multi-word expressions representing meaningful items related to the domain (such as, e.g., 'philosophy-of-mind', 'Rationalism', etc.), while Named Entities are persons, places or organisations (mostly universities, in the present setting) strongly related to the considered domain. Named Entities are extracted using the Stanford CoreNLP NER module (Manning et al., 2014), improved with extraction rules based on morphological and syntactical patterns, considering for example sequences of words starting with a capital letter or associated with a particular Part-Of-Speech pattern. Similarly, we extract relevant concepts based on particular PoS patterns (such as NOUN-PREPOSITION-NOUN, thereby recognizing, for example, philosophy of mind).
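To make the pattern-based concept extraction concrete, here is a small hedged sketch matching the NOUN-PREPOSITION-NOUN pattern over POS-tagged text. NLTK is used only to keep the example self-contained (the paper's pipeline relies on Stanford CoreNLP), and the Penn Treebank tag names are an assumption of the example.

```python
# Hedged sketch: extract candidate concepts matching NOUN-PREPOSITION-NOUN.
# Requires the NLTK tokenizer and POS-tagger models to be downloaded first.
import nltk

def noun_prep_noun_concepts(text):
    """Return surface forms matching NOUN PREP NOUN, e.g. 'philosophy of mind'."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))  # Penn Treebank tags
    concepts = []
    for (w1, t1), (w2, t2), (w3, t3) in zip(tagged, tagged[1:], tagged[2:]):
        # NN* covers singular/plural nouns; IN covers prepositions.
        if t1.startswith("NN") and t2 == "IN" and t3.startswith("NN"):
            concepts.append(f"{w1} {w2} {w3}")
    return concepts

print(noun_prep_noun_concepts("This thesis addresses the philosophy of mind."))
# -> ['philosophy of mind']
```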
We are aware that we are not considering the problem of word sense disambiguation (Navigli, 2009; Tripodi and Pelillo, 2017). The underlying assumption is that as long as we are concerned with a narrow domain, this is a less severe problem: e.g., if we recognise the person Kant in a document related to philosophy, the person cited is probably the philosopher named Immanuel Kant (please refer to Figure 1), rather than the less philosophical Gujarati poet, playwright and essayist Kavi Kant.3

By mapping the concepts and Named Entities found in a document onto the graph, we gain a set of access points to the knowledge graph. Once the access points have been acquired for a pair of documents, we can compute the semantic similarity between the documents by analysing the paths that connect them.

3.3 Building Paths across Documents

The developed framework is used to compute paths between pairs of senses and/or entities featuring in two given documents. Each edge in the knowledge graph has an associated semantic relation type (such as, e.g., "hasAuthor", "influencedBy", "hasMovement"). Each path intervening between two documents has the form

DOC1 --ACCESS--> SaulKripke --influencedBy--> LudwigWittgenstein --influencedBy--> BertrandRussell --influencedBy--> BaruchDeSpinoza <--ACCESS-- DOC2

2 https://ethos.bl.uk.
3 https://tinyurl.com/y3s9lsp7.
In this case we can argue in favor of the relatedness of the two documents based on the chain of relationships illustrating that Saul Kripke (from document d1) has been influenced-by Ludwig Wittgenstein, who has been influenced-by Bertrand Russell, who in turn has been influenced-by Baruch De Spinoza, mentioned in d2. The whole set of paths connecting elements from a document d1 to a document d2 can be thought of as a form of evidence of the closeness of the two documents: documents with numerous shorter paths connecting them are intuitively more related. Importantly enough, such paths over the knowledge graph do not contain general information (e.g., Kant was a man); rather, they are highly domain-specific (e.g., Oskar Becker had Jürgen Habermas as a doctoral student).
A* Search

The computation of the paths is performed via a modified version of the A* algorithm (Hart et al., 1968). In particular, paths among access nodes are returned in order, from the shortest to the longest one. Given the huge dimension of the network, and since we are guaranteed to retrieve the shortest paths first, we stop the search after one second of computation time.
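The heuristic of this modified A* is not detailed here; the sketch below therefore assumes unit edge costs and a null heuristic, under which the procedure reduces to uniform-cost search, which still pops candidate paths in order of increasing length and respects the one-second budget. The function names and data layout are our illustrative assumptions.

```python
# Hedged sketch of time-bounded, shortest-first path enumeration between
# access nodes. Assumptions: unit edge costs and a null heuristic, so this
# is uniform-cost search rather than the (unspecified) modified A*.
import heapq
import time

def enumerate_paths(graph, sources, targets, budget_s=1.0):
    """Yield paths, shortest first, from any source to any target access node.
    `graph` maps node -> iterable of (relation, neighbour);
    a path is a list of (node, relation-that-led-here) pairs."""
    deadline = time.monotonic() + budget_s
    frontier = [(0, [(s, "ACCESS")]) for s in sources]
    heapq.heapify(frontier)
    targets = set(targets)
    while frontier and time.monotonic() < deadline:
        cost, path = heapq.heappop(frontier)
        node = path[-1][0]
        if node in targets and cost > 0:
            yield path
            continue
        for relation, nxt in graph.get(node, ()):
            if all(nxt != seen for seen, _ in path):  # avoid revisiting nodes
                heapq.heappush(frontier, (cost + 1, path + [(nxt, relation)]))

# Toy usage mirroring the Kripke -> Spinoza chain above (names from the
# running example; the real graph has ~22k nodes and ~136k typed edges).
graph = {
    "SaulKripke": [("influencedBy", "LudwigWittgenstein")],
    "LudwigWittgenstein": [("influencedBy", "BertrandRussell")],
    "BertrandRussell": [("influencedBy", "BaruchDeSpinoza")],
}
for p in enumerate_paths(graph, ["SaulKripke"], ["BaruchDeSpinoza"]):
    print(p)
```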
4 Experimentation

In this Section we report the results of a preliminary experimentation: given a dataset of PhD theses, we first explore the effectiveness of standard distributional approaches in computing the semantic similarity between document pairs; we then elaborate on how such results can be complemented and enriched through the computation of paths between the entities therein.

Experimental setting. We extracted 4 classes of documents (100 for each class) from the EThOS dataset. For each record we retrieved the title and abstract fields, which were used for subsequent processing. We selected documents containing 'Antibiotics', 'Molecular', 'Hegel' or 'Ethics' either in their title (15 documents per class) or in their abstract (15 documents per class). Each class features on average 163.5 tokens (standard deviation σ = 39.3), including both title and abstract. The underlying rationale has been that of selecting documents from two broad areas, each one composed of two different sets of data, having to do with medical disciplines and molecular biology in the former case, and with Hegelianism and the broad theme of ethics in the latter case. Intra-domain classes (that is, both 'Antibiotics'-'Molecular' and 'Hegel'-'Ethics') are not supposed to be linearly separable, as mostly occurs in real problems. Of course, this feature makes the categorization problem more interesting. The dataset was used to compute some descriptive statistics (such as inverse document frequency) characterizing the whole collection of considered documents.

From the aforementioned set of 400 documents we randomly chose a subset of 20 documents, 5 documents for each of the 4 classes, from those containing the terms (either 'Antibiotics', 'Molecular', 'Hegel' or 'Ethics') in the title. This selection strategy was aimed at selecting more clearly individuated documents, exhibiting a higher similarity degree within classes than across classes.4

4.1 Investigation on Text Similarity with Standard Distributional Approaches

GloVe and Word Embedding Similarity

The similarity scores were computed for each document pair with a Word Embedding Similarity approach (Agirre et al., 2016). In particular, each document d has been provided with a vector description averaging the GloVe embeddings (Pennington et al., 2014) of all terms in the title and abstract:

    \vec{N}_d = \frac{1}{|T_d|} \sum_{t_i \in T_d} \vec{t}_i ,    (1)

where each \vec{t}_i is the GloVe vector for the term t_i. Considering two documents d_1 and d_2, each one associated with a particular vector \vec{N}_{d_i}, we compare them using the cosine similarity metric:

    sim(\vec{N}_{d_1}, \vec{N}_{d_2}) = \frac{\vec{N}_{d_1} \cdot \vec{N}_{d_2}}{\|\vec{N}_{d_1}\| \, \|\vec{N}_{d_2}\|} .    (2)

The obtained similarities between each document pair are reported in Figure 2(a).5

4 In future work we will verify such assumptions by involving domain experts in order to validate and/or refine the heuristics employed in the document selection.
5 The plot was computed using the corrplot package in R.
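A minimal sketch of Equations (1) and (2) follows; the loading of pretrained GloVe vectors and the tokenisation of title and abstract are assumed, and the toy three-dimensional "embeddings" are ours for illustration only.

```python
# Hedged sketch of Eq. (1) (mean of GloVe term vectors) and Eq. (2) (cosine).
import numpy as np

def document_vector(terms, glove):
    """Eq. (1): average the embeddings of the document's in-vocabulary terms."""
    vecs = [glove[t] for t in terms if t in glove]
    return np.mean(vecs, axis=0) if vecs else None

def cosine_similarity(u, v):
    """Eq. (2): cosine of the angle between two document vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d vectors standing in for real pretrained GloVe embeddings.
glove = {"kant":      np.array([1.0, 0.2, 0.0]),
         "ethics":    np.array([0.8, 0.4, 0.1]),
         "molecular": np.array([0.0, 0.1, 1.0])}
d1 = document_vector(["kant", "ethics"], glove)
d2 = document_vector(["molecular"], glove)
print(round(cosine_similarity(d1, d2), 3))
```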
[Figure 2 appears here: four document-by-document correlation matrices over A1-A5, M1-M5, E1-E5, H1-H5, with a colour scale from 0 to 1; panels: (a) GloVe Embeddings, (b) One-Hot Vector, (c) NASARI Embeddings, (d) NASARI Embeddings with connectivity and idf.]

Figure 2: Comparison between correlation scores. Documents have a scientific subject ('A' for 'Antibiotics', 'M' for 'Molecular' biology) or a philosophical subject ('E' for 'Ethics', 'H' for 'Hegel').

The computed distances show that overall this approach is sufficient to discriminate the scientific doctoral theses from the philosophical ones. In particular, the top green triangle shows the correlation scores among the antibiotics documents, while the bottom triangle reports the correlation scores among the philosophical documents.


The red square graphically illustrates the poor correlation between the two classes of documents. On the other hand, the subclasses (Hegelianism-Ethics and Antibiotics-Molecular) could not be separated. Given that word embeddings are known to conflate all senses in the description of each term (Camacho-Collados and Pilehvar, 2018), this approach performed surprisingly well in comparison to a baseline based on a one-hot vector representation, dealing only with term-based features (Figure 2(b)).

NASARI and Sense Embedding Similarity

We then explored the hypothesis that semantic knowledge can be beneficial for better separating documents: after performing word sense disambiguation (the BabelFy service was employed (Moro et al., 2014)), we used the embedded version of NASARI to compute the vector \vec{N}_d as the average of all vectors associated with the senses contained in S_d, basically employing the same formula as in Equation 1. We then computed the similarity matrix, displayed in Figure 2(c). It clearly emerges that NASARI, too, is well suited to solving a classification task when domains are well separated. However, also in this case the adopted approach does not seem to discriminate well within the two main classes: for instance, the square with vertices E1-H1, E5-H1, E5-H5, E1-H5 should be reddish, indicating a lower average similarity between documents pertaining to the Hegel and Ethics classes. We experimented with a set of widely varied conditions and parameters, obtaining slightly better similarity scores by weighting the NASARI vectors with sense idf and sense connectivity (c, obtained from BabelNet):

    \vec{N}_d = \frac{1}{|S_d|} \sum_{s_i \in S_d} \vec{s}_i \cdot \log\left(\frac{|S_d|}{H(s_i)}\right) \cdot \left(1 - \frac{1}{c}\right) ,    (3)

where H(s_i) is the number of documents containing the sense s_i.
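The reweighting of Equation (3) can be sketched as follows; the data structures (a sense-to-vector map for NASARI, a document-frequency table for H, and a connectivity table) are assumptions made only to keep the example self-contained.

```python
# Hedged sketch of Eq. (3): each sense vector is scaled by an idf-like factor
# log(|S_d| / H(s_i)) and by a connectivity penalty (1 - 1/c) before averaging.
import numpy as np

def weighted_document_vector(senses, nasari, doc_freq, connectivity):
    """senses: the senses S_d of document d; nasari: sense -> vector;
    doc_freq: sense -> H(s_i); connectivity: sense -> c (from BabelNet)."""
    dim = len(next(iter(nasari.values())))
    total = np.zeros(dim)
    for s in senses:
        idf = np.log(len(senses) / doc_freq[s])
        hub_penalty = 1.0 - 1.0 / connectivity[s]  # downweights hub senses
        total += nasari[s] * idf * hub_penalty
    return total / len(senses)
```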
The resulting similarity scores are provided in Figure 2(d). Documents are in fact too close, and presumably the adopted representation (merging all senses in each document) is not as precise as needed. In this setting, we tried to investigate the document similarity based on the connections between their underlying sets of senses. Such connections were computed on the aforementioned graph.

4.2 Enriching Text Similarity with Paths across Documents

In order to examine the connections between the considered documents we focused on the philosophical portion of our dataset, and exploited the knowledge graph described in Section 3. The computed paths are not presently used to refine the similarity scores, but only as a suggestion to characterize possible connections between document pairs. The extracted paths contain precious information that can be easily integrated into downstream applications, by providing specific information that can be helpful for domain experts to achieve their objectives (e.g., in semantically browsing text documents, in order to find influence relations across different philosophical schools).
                                                          longing to different topics, however, our prelimi-
   As anticipated, building paths among the fun-
                                                          nary experimentation showed that they are not able
damental concepts of the documents allows grasp-
                                                          to capture the subtle aspects characterizing docu-
ing important ties between the documents top-
                                                          ments in close areas. As we have argued, exploit-
ics. For instance, one of the extracted paths (be-
                                                          ing paths over graphs to explore connections be-
tween the author ‘Hegel’ and the work ‘Sense
                                                          tween document pairs may be beneficial in making
and Reference’ (Frege, 1948)) shows the con-
                                                          explicit domain-specific links between documents.
nections between the entities at stake as follows.
                                                             As a future work, we could refine the methodol-
G.W.F. Hegel hasMovement Continental Philoso-
                                                          ogy related to the extraction of the concepts in the
phy, which is in turn the movementOf H.L. Berg-
                                                          Knowledge Graph, defining approaches based on
son, who has been influencedBy G. Frege, who fi-
                                                          specific domain-related ontologies. Two relevant
nally hasNotableWork Sense and Reference. The
                                                          works, to these ends, are the PhilOnto ontology,
semantic specificity of this information provides
                                                          that represents the structure of philosophical lit-
precious insights that allow for a proper considera-
                                                          erature (Grenon and Smith, 2011), and the InPho
tion of the relevance of the second document w.r.t.
                                                          taxonomy (Buckner et al., 2007), combining auto-
the first one. It is worth noting that the fact that
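Rendering such a path as a human-readable explanation is straightforward; the sketch below is our illustration, with the node and relation labels taken from the running example rather than from the system's actual output format.

```python
# Hedged sketch: format an extracted path as a human-readable explanation.
def explain(path):
    """`path` is a list of (node, relation-that-led-into-the-node) pairs."""
    pieces = [path[0][0]]
    for node, relation in path[1:]:
        pieces.append(f"--{relation}--> {node}")
    return " ".join(pieces)

hegel_to_frege = [("G.W.F. Hegel", "ACCESS"),
                  ("Continental Philosophy", "hasMovement"),
                  ("H.L. Bergson", "isMovementOf"),
                  ("G. Frege", "influencedBy"),
                  ("Sense and Reference", "hasNotableWork")]
print(explain(hegel_to_frege))
# G.W.F. Hegel --hasMovement--> Continental Philosophy --isMovementOf--> ...
```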
The illustrated approach allows the uncovering of insightful and specific connections between document pairs. However, this preliminary study also pointed out some issues. One key problem is the scarcity of named entities contained in the considered documents (e.g., E5 only has one access point, while E3 has none). Another issue has to do with the inherently high connectivity of some nodes of the knowledge graph (hubness). For instance, the nodes Philosophy, Plato and Aristotle are highly connected, which results in the extraction of some trivial and uninteresting paths between the specific documents. The first issue could be tackled by also considering the main concepts of a document when no entity can be found, whilst the second one could be mitigated by taking the connectivity of the nodes into account as a negative parameter while computing the paths.

5 Conclusions

In this paper we have investigated the possibility of enriching semantic text similarity measures via symbolic and human-readable knowledge. We have shown that distributional approaches allow for a satisfactory classification of documents belonging to different topics; however, our preliminary experimentation showed that they are not able to capture the subtle aspects characterizing documents in close areas. As we have argued, exploiting paths over graphs to explore connections between document pairs may be beneficial in making explicit domain-specific links between documents.

As future work, we could refine the methodology related to the extraction of the concepts in the Knowledge Graph, defining approaches based on specific domain-related ontologies. Two relevant works, to these ends, are the PhilOnto ontology, which represents the structure of philosophical literature (Grenon and Smith, 2011), and the InPho taxonomy (Buckner et al., 2007), combining automated information retrieval methods with knowledge from domain experts. Both resources will be employed in order to extract a more concise, meaningful and discriminative Knowledge Graph.

Acknowledgments

The authors are grateful to the EThOS staff for their prompt and kind support. Marco Leontino has been supported by the REPOSUM project, BONG CRT 17 01, funded by Fondazione CRT.
References

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497-511.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722-735. Springer.

Cameron Buckner, Mathias Niepert, and Colin Allen. 2007. InPhO: the Indiana Philosophy Ontology. APA Newsletters, Newsletter on Philosophy and Computers, 7(1):26-28.

Jose Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research, 63:743-788.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. NASARI: a novel approach to a semantically-aware representation of items. In Proceedings of NAACL, pages 567-577.

Giulio Carducci, Marco Leontino, Daniele P. Radicioni, Guido Bonino, Enrico Pasini, and Paolo Tripodi. 2019. Semantically aware text categorisation for metadata annotation. In Italian Research Conference on Digital Libraries, pages 315-330. Springer.

Davide Colla, Enrico Mensa, and Daniele P. Radicioni. 2017. Semantic measures for keywords extraction. In Conference of the Italian Association for Artificial Intelligence, pages 128-140. Springer.

Davide Colla, Enrico Mensa, Daniele P. Radicioni, and Antonio Lieto. 2018. Tell me why: Computational explanation of conceptual similarity judgments. In Proceedings of the 17th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU), Special Session on Advances on Explainable Artificial Intelligence, Communications in Computer and Information Science (CCIS), Cham. Springer International Publishing.

Gottlob Frege. 1948. Sense and reference. The Philosophical Review, 57(3):209-230.

Wael H. Gomaa and Aly A. Fahmy. 2013. A survey of text similarity approaches. International Journal of Computer Applications, 68(13):13-18.

Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4(2):100-107.

Brent Hecht, Samuel H. Carton, Mahmood Quaderi, Johannes Schöning, Martin Raubal, Darren Gergle, and Doug Downey. 2012. Explanatory semantic relatedness and explicit spatialization for exploratory search. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415-424. ACM.

Cheng-Hui Huang, Jian Yin, and Fang Hou. 2011. A text similarity measurement combining word semantic information with tf-idf method. Jisuanji Xuebao (Chinese Journal of Computers), 34(5):856-864.

Aminul Islam and Diana Inkpen. 2008. Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(2):10.

Jay J. Jiang and David W. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008.

Sergio Jimenez, Claudia Becerra, Alexander Gelbukh, Av Juan Dios Bátiz, and Av Mendizábal. 2013. Softcardinality-core: Improving text overlap with distributional measures for semantic textual similarity. In Proceedings of *SEM 2013, volume 1, pages 194-201.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60.

Enrico Mensa, Daniele P. Radicioni, and Antonio Lieto. 2017. MERALI at SemEval-2017 task 2 subtask 1: a cognitively inspired approach. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 236-240, Vancouver, Canada. Association for Computational Linguistics.

Enrico Mensa, Daniele P. Radicioni, and Antonio Lieto. 2018. COVER: a linguistic resource combining common sense and lexicographic information. Language Resources and Evaluation, 52(4):921-948.

Rada Mihalcea, Courtney Corley, Carlo Strapparava, et al. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of AAAI, volume 6, pages 775-780.
  Computer Applications, 68(13):13–18.                        measures of text semantic similarity. In AAAI, vol-
                                                              ume 6, pages 775–780.
Pierre Grenon and Barry Smith. 2011. Foundations of
   an ontology of philosophy. Synthese, 182(2):185–         Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor-
   204.                                                       rado, and Jeff Dean. 2013. Distributed representa-
                                                              tions of words and phrases and their compositional-
Zellig S Harris. 1954. Distributional structure. Word,        ity. In Advances in neural information processing
  10(2-3):146–162.                                            systems, pages 3111–3119.
George A Miller.      1995.  WordNet: a lexical         Amos Tversky. 1977. Features of similarity. Psycho-
  database for English. Communications of the ACM,       logical review, 84(4):327.
  38(11):39–41.
                                                        Denny Vrandecic and Markus Krötzsch. 2014. Wiki-
Andrea Moro, Alessandro Raganato, and Roberto Nav-        data: A free collaborative knowledge base. Commu-
  igli. 2014. Entity linking meets word sense disam-      nications of the ACM, 57(10).
  biguation: a unified approach. Transactions of the
  Association for Computational Linguistics, 2:231–
  244.
Roberto Navigli and Simone Paolo Ponzetto. 2010.
  BabelNet: Building a very large multilingual se-
  mantic network. In Proceedings of the 48th Annual
  Meeting of the Association for Computational Lin-
  guistics, pages 216–225. Association for Computa-
  tional Linguistics.
Roberto Navigli and Simone Paolo Ponzetto. 2012.
  BabelNet: The automatic construction, evaluation
  and application of a wide-coverage multilingual se-
  mantic network. Artif. Intell., 193:217–250.
Roberto Navigli. 2009. Word sense disambiguation: A
  survey. ACM Computing Surveys (CSUR), 41(2):10.
Christian Paul, Achim Rettinger, Aditya Mogadala,
  Craig A Knoblock, and Pedro Szekely. 2016. Effi-
  cient graph-based document similarity. In European
  Semantic Web Conference, pages 334–349. Springer.
Jeffrey Pennington, Richard Socher, and Christopher
   Manning. 2014. Glove: Global vectors for word
   representation. In Proceedings of the 2014 confer-
   ence on empirical methods in natural language pro-
   cessing (EMNLP), pages 1532–1543.
Jay Pujara, Hui Miao, Lise Getoor, and William Co-
   hen. 2013. Knowledge graph identification. In In-
   ternational Semantic Web Conference, pages 542–
   557. Springer.
Philip Resnik. 1995. Using information content to
  evaluate semantic similarity in a taxonomy. arXiv
  preprint cmp-lg/9511007.
Ray Richardson, A Smeaton, and John Murphy. 1994.
  Using wordnet as a knowledge base for measuring
  semantic similarity between words.
Michael Schuhmacher and Simone Paolo Ponzetto.
  2014. Knowledge-based graph document modeling.
  In Proceedings of the 7th ACM international con-
  ference on Web search and data mining, pages 543–
  552. ACM.
Robert Speer, Joshua Chin, and Catherine Havasi.
  2017. Conceptnet 5.5: An open multilingual graph
  of general knowledge. In AAAI, pages 4444–4451.
Fabian M Suchanek, Gjergji Kasneci, and Gerhard
  Weikum. 2007. Yago: a core of semantic knowl-
  edge. In Proceedings of the 16th international con-
  ference on World Wide Web, pages 697–706. ACM.
Rocco Tripodi and Marcello Pelillo. 2017. A game-
  theoretic approach to word sense disambiguation.
  Computational Linguistics, 43(1):31–70.