<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toward Selectivity-Based Keyword Extraction for Croatian News</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, University of Rijeka</institution>
          ,
          <addr-line>Radmile Matejcic 2, 51000 Rijeka</addr-line>
          ,
          <country country="HR">Croatia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a novel network measure, the node selectivity, for the task of keyword extraction. The node selectivity is defined as the average strength of the node. First, we show that selectivity-based keyword extraction slightly outperforms extraction based on the standard centrality measures: in-degree, out-degree, betweenness, and closeness. Furthermore, from the data set of Croatian news we extract keyword candidates and expand the extracted nodes to word-tuples ranked by the highest in/out selectivity values. The obtained sets are evaluated on manually annotated keywords: for the set of extracted keyword candidates the average F1 score is 24.63% and the average F2 score is 21.19%; for the extracted word-tuple candidates the average F1 score is 25.9% and the average F2 score is 24.47%. Selectivity-based extraction does not require linguistic knowledge, since it is derived purely from statistical and structural information of the network.</p>
      </abstract>
      <kwd-group>
        <kwd>keyword extraction</kwd>
        <kwd>complex network</kwd>
        <kwd>centrality measures</kwd>
        <kwd>selectivity</kwd>
        <kwd>Croatian news texts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The task of keyword extraction is to automatically identify a set of terms that
best describe the document [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Automatic keyword extraction establishes a
foundation for various natural language processing applications: information
retrieval, the automatic indexing and classification of documents, automatic
summarization, high-level semantic description, etc.
      </p>
      <p>
        State-of-the-art keyword extraction approaches are based on statistical
methods which require learning from hand-annotated data sets. In the last decade
the focus of research has shifted toward unsupervised methods, mainly toward
network- or graph-enabled keyword extraction. In network-enabled keyword
extraction the document representation may vary from very simple (words are
nodes and their co-occurrence is represented by links) to representations that
incorporate sophisticated linguistic knowledge such as syntactic [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or semantic relations [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
Typically, the source (document, text, data) for keyword extraction is modelled
as a single network. This way, both the statistical properties (frequencies) and
the structure of the source text are captured by one formal representation:
a complex network.
      </p>
      <p>
        Network-enabled (or graph-enabled, since the number of words in isolated
documents is limited) keyword extraction exploits different measures for the task
of identifying and ranking the most representative features of the source: the
keywords. Keyword extraction powered by network measures can operate on the node,
network or subnetwork level. Measures on the node level are: degree, strength,
centrality [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]; on the network level: coreness, clustering coefficient, PageRank-motivated
ranking score or HITS-motivated hub and authority score [
        <xref ref-type="bibr" rid="ref10 ref11 ref14">10, 11,
14</xref>
        ]; on the subnetwork level: communities [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Most of the research was
motivated by various centrality measures: degree, betweenness, closeness and
eigenvector centrality [9-11, 13-15].
      </p>
      <p>
        Our research aims at proposing a novel selectivity-based method for
unsupervised keyword extraction from the network of Croatian texts. Croatian
is a highly flective Slavic language, so the source text usually needs substantial
preprocessing (lemmatization, i.e. morphological normalization, stopword removal,
part-of-speech (POS) annotation, morphosyntactic description (MSD) tagging,
etc.); we therefore design our approach with little or no linguistic knowledge. A new
network measure, the node selectivity, originally proposed by Masucci and Rodgers
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] (which can distinguish a real network from a shuffled one), is applied to automatic
keyword extraction. Selectivity is defined as the average weight distribution on
the links of a single node. In our previous work, the node selectivity measure
proved useful for differentiating between original and shuffled Croatian
texts [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], and for differentiating between blog and literature text genres [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In
this work we explore the potential of selectivity for keyword extraction
from Croatian news articles. To the best of our knowledge, the node selectivity
measure has not been applied to the keyword extraction task before.
      </p>
      <p>Section 2 presents an overview of related work on automatic keyword
extraction. In Section 3 we present the definitions of the measures used for network
structure analysis. In Section 4 we present the construction of co-occurrence
networks from the collection of texts used. The methods used for network-based keyword
extraction are explained in Section 5. The evaluation of the obtained keywords and
the results are given in Section 6. In the final section, we elaborate upon the selectivity
method and draw conclusions regarding future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Lahiri et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] extract keywords and keyphrases from co-occurrence networks of
words and from noun-phrase collocation networks. Eleven measures (degree,
strength, neighbourhood size, coreness, clustering coefficient, structural
diversity index, PageRank, HITS hub and authority score, betweenness, closeness and
eigenvector centrality) are used for keyword extraction from directed/undirected
and weighted networks. The results obtained on four data sets suggest that
centrality measures outperform the baseline term frequency/inverse document
frequency (tf-idf) model, and that simpler measures like degree and strength outperform
computationally more expensive centrality measures like coreness and
betweenness.
      </p>
      <p>
        Boudin [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] compares various centrality measures for graph-based keyphrase
extraction. Experiments on standard data sets of English and French show that
simple degree centrality achieves results comparable to the widely used
TextRank algorithm, and that closeness centrality obtains the best results on short
documents. Undirected and weighted co-occurrence networks are constructed
from syntactically parsed (only nouns and adjectives are kept) and lemmatized text
using a co-occurrence window. Degree, closeness, betweenness and eigenvector
centrality are compared to PageRank as proposed by Mihalcea in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] as a
baseline. Degree centrality achieves performance similar to that of the much more
complex TextRank. Closeness centrality outperforms TextRank on short documents
(abstracts of scientific papers).
      </p>
      <p>
        Litvak and Last [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] compare supervised and unsupervised approaches for
keyword identification in the task of extractive summarization. The approaches
are based on the graph-based syntactic representation of text and web
documents. On a set of summarized documents, the HITS algorithm
performed comparably to supervised methods (Naive Bayes, J48, Support
Vector Machines). The authors suggest that simple degree-based rankings from the
first iteration of HITS, rather than running it to convergence, should be
considered.
      </p>
      <p>
        Grineva et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] use community detection techniques for key term
extraction on Wikipedia texts, modelled as a graph of semantic relationships
between terms. The results showed that the terms related to the main topics
of the document tend to form a community, a thematically cohesive group of
terms. Community detection allows the effective processing of multiple topics in
a document and efficiently filters out noise. The results were achieved on weighted
and directed networks built from semantically linked, morphologically expanded and
disambiguated n-grams from the articles' titles. Additionally, to test
stability against noise, they repeated the experiment on different multi-topic web
pages (news, blogs, forums, social networks, product reviews), which confirmed
that community detection outperforms the tf-idf model.
      </p>
      <p>
        Palshikar [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] proposes a hybrid structural and statistical approach to extract
keywords from a single document. The undirected co-occurrence network, whose edge
weight is a dissimilarity measure between two words calculated from the frequency of
their co-occurrence in the preprocessed and lemmatized document,
was shown to be appropriate for the centrality-based approach
to keyword extraction.
      </p>
      <p>
        Mihalcea and Tarau [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] report seminal research which introduced the
state-of-the-art TextRank model. TextRank is derived from PageRank and introduced
to graph-based text processing, keyword and sentence extraction. The abstracts
are modelled as undirected or directed and weighted co-occurrence networks
using a co-occurrence window of variable size (2 to 10). Lexical units are
preprocessed: stopwords are removed and words are restricted with POS syntactic filters (open
class words, nouns and adjectives, nouns). The PageRank-motivated score of the
importance of a node, derived from the importance of its neighbouring nodes,
is used for keyword extraction. The obtained TextRank performance compares
favorably with the supervised machine learning n-gram-based approach.
      </p>
      <p>
        Matsuo et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] present early research in which a text document is
represented as an undirected and unweighted co-occurrence network. Based on
the network topology, the authors proposed an indexing system called KeyWorld,
which extracts important terms (pairs of words) by measuring their contribution
to small-world properties. The contribution of a node is based on closeness
centrality, calculated as the difference in the small-world properties of the network
when the node is temporarily eliminated, combined with inverse document
frequency (idf).
      </p>
      <p>
        Erkan and Radev [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] introduce a stochastic graph-based method for
computing the relative importance of textual units on the problem of text summarization
by extracting the most important sentences. LexRank calculates sentence
importance based on the concept of eigenvector centrality in a graph representation
of sentences. A connectivity matrix based on intra-sentence cosine similarity is
used as the adjacency matrix of the graph representation of sentences. LexRank
is shown to be quite insensitive to the noise in the data.
      </p>
      <p>
        Mihalcea in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] presents an extension to earlier work [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], where the
TextRank algorithm is applied for the text summarization task powered by sentence
extraction. On this task TextRank performed on a par with the supervised and
unsupervised summarization methods, which motivated the new branch of
research based on the graph-based extracting and ranking algorithms.
      </p>
      <p>
        Tsatsaronis et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] present SemanticRank, a network-based ranking
algorithm for keyword and sentence extraction from text. The semantic relation is based
on a calculated knowledge-based measure of semantic relatedness between
linguistic units (keywords or sentences). For keyword extraction from the Inspec
abstracts, the reported results show a favorable performance of SemanticRank over
state-of-the-art counterparts: weighted and unweighted variations of PageRank and
HITS.
      </p>
      <p>
        Huang et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] propose an automatic keyphrase extraction algorithm using
an unsupervised method based on connectedness and betweenness centrality.
      </p>
      <sec id="sec-2-1">
        <title>Related Work on the Croatian Language</title>
        <p>
          The keyphrase extraction for the Croatian language has been addressed in both
supervised [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and unsupervised [20-22] settings. Ahel et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] use a Naive
Bayes classifier combined with tf-idf (term frequency/inverse document
frequency), [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] utilizes part-of-speech (POS) and morphosyntactic description (MSD)
tag filtering followed by tf-idf ranking, and [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] exploits the distributional
semantics to build topically related word clusters, from which they extract
keywords and expand them to keyphrases. Bekavac et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] propose a genetic
programming approach, GPKEX, to keyphrase extraction for the Croatian language
on the same data set. GPKEX can evolve simple and interpretable keyphrase
scoring measures that perform comparably to other machine learning methods
for Croatian. The reported research on the extraction of Croatian keywords uses a data
set composed of Croatian news articles from the Croatian News Agency (HINA),
with keywords hand-annotated by human experts.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The Complex Network Analysis</title>
      <p>
        This section describes the basic network measures that are necessary for
understanding our approach. More details about these measures can be found in [
        <xref ref-type="bibr" rid="ref24 ref8">8,
24</xref>
        ]. In the network, N is the number of nodes and K is the number of links.
In weighted language networks every link connecting two nodes i and j has an
associated weight <inline-formula><tex-math>w_{ij}</tex-math></inline-formula>, which is a positive integer.
      </p>
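      <p>The node degree <inline-formula><tex-math>k_i</tex-math></inline-formula> is defined as the number of edges incident upon a node.
The in-degree and out-degree <inline-formula><tex-math>k_i^{in/out}</tex-math></inline-formula> of node i are defined as the number of its in-
and out-neighbours.</p>
      <p>Degree centrality of the node i is the degree of that node. It can be normalised
by dividing it by the maximum possible degree <inline-formula><tex-math>N - 1</tex-math></inline-formula>:</p>
      <disp-formula id="eq1"><label>(1)</label><tex-math>dc_i = \frac{k_i}{N - 1}.</tex-math></disp-formula>
      <p>Analogously, the in/out degree centralities are defined as the in/out degree of a
node:</p>
      <disp-formula id="eq2"><label>(2)</label><tex-math>dc_i^{in/out} = \frac{k_i^{in/out}}{N - 1}.</tex-math></disp-formula>
      <p>Closeness centrality is defined as the inverse of farness, i.e. the sum of the
shortest distances between a node and all the other nodes. Let <inline-formula><tex-math>d_{ij}</tex-math></inline-formula> be the shortest
path between nodes i and j. The normalised closeness centrality of a node i is
given by:</p>
      <disp-formula id="eq3"><label>(3)</label><tex-math>cc_i = \frac{N - 1}{\sum_{j \neq i} d_{ij}}.</tex-math></disp-formula>
      <p>Betweenness centrality quantifies the number of times a node acts as a bridge
along the shortest path between two other nodes. Let <inline-formula><tex-math>\sigma_{jk}</tex-math></inline-formula> be the number of the
shortest paths from node j to node k and let <inline-formula><tex-math>\sigma_{jk}(i)</tex-math></inline-formula> be the number of those paths
that pass through the node i. The normalised betweenness centrality of a node
i is given by:</p>
      <disp-formula id="eq4"><label>(4)</label><tex-math>bc_i = \frac{\sum_{j \neq i \neq k} \sigma_{jk}(i) / \sigma_{jk}}{(N - 1)(N - 2)}.</tex-math></disp-formula>
      <p>The strength of the node i is the sum of the weights of all the links incident
with the node i:</p>
      <disp-formula id="eq5"><label>(5)</label><tex-math>s_i = \sum_{j} w_{ij}.</tex-math></disp-formula>
      <p>The measures given so far do not take link weights into account, but language networks
are weighted; therefore, the weights should be considered. In the directed
network, the in/out strength <inline-formula><tex-math>s_i^{in/out}</tex-math></inline-formula> of the node i is defined as the sum of the weights of its
incoming and outgoing links, that is:</p>
      <disp-formula id="eq6"><label>(6)</label><tex-math>s_i^{in/out} = \sum_{j} w_{ji/ij}.</tex-math></disp-formula>
      <p>
        The selectivity measure is introduced in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It is the average strength
of a node. For the node i the selectivity is calculated as the ratio of the node
strength to the node degree:
      </p>
      <disp-formula id="eq7"><label>(7)</label><tex-math>e_i = \frac{s_i}{k_i}.</tex-math></disp-formula>
      <p>In the directed network, the in/out selectivity of the node i is defined as:</p>
      <disp-formula id="eq8"><label>(8)</label><tex-math>e_i^{in/out} = \frac{s_i^{in/out}}{k_i^{in/out}}.</tex-math></disp-formula>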
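      <p>As an illustration of the in/out strength and in/out selectivity defined above, the following minimal Python sketch computes them from a dictionary of directed weighted links. The function name in_out_measures, the toy links, and the convention of assigning selectivity 0 to a node that has no incoming (or outgoing) links are assumptions made for this example; they are not taken from the original implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def in_out_measures(weighted_links):
    """Return in/out degree, strength and selectivity for every node.

    weighted_links maps a directed link (source, target) to its weight.
    """
    k_in, k_out = defaultdict(int), defaultdict(int)
    s_in, s_out = defaultdict(int), defaultdict(int)
    for (source, target), weight in weighted_links.items():
        k_out[source] += 1       # one more out-neighbour of the source
        s_out[source] += weight  # out-strength: sum of outgoing weights
        k_in[target] += 1        # one more in-neighbour of the target
        s_in[target] += weight   # in-strength: sum of incoming weights

    nodes = set(k_in) | set(k_out)
    measures = {}
    for node in nodes:
        measures[node] = {
            "k_in": k_in[node],
            "k_out": k_out[node],
            "s_in": s_in[node],
            "s_out": s_out[node],
            # Selectivity = strength / degree; 0.0 when the degree is 0
            # (a convention assumed for this sketch only).
            "e_in": s_in[node] / k_in[node] if k_in[node] else 0.0,
            "e_out": s_out[node] / k_out[node] if k_out[node] else 0.0,
        }
    return measures


# Toy network: four directed links with made-up weights.
links = {("a", "b"): 3, ("c", "b"): 1, ("b", "d"): 2, ("a", "d"): 2}
m = in_out_measures(links)
print(m["b"])
```
]]></preformat>
      <p>In this toy network node b has two incoming links with a total weight of 4, so its in-selectivity is 2, while node a has two outgoing links with a total weight of 5, so its out-selectivity is 2.5.</p>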
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <sec id="sec-4-1">
        <title>Data</title>
        <p>
          For the network based keyword extraction we use the data set composed of
Croatian news articles [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The data set contains 1020 news articles from the
Croatian News Agency (HINA), with keywords (key phrases) manually annotated by
human experts. The set is divided into 960 annotated documents for training
supervised methods and 60 documents for testing. The test set of 60 documents is
annotated by 8 different experts; the inter-annotator agreement in terms
of F2 score (see Section 5) is 46% on average (between 29.3% and 66.1%).
        </p>
        <p>We selected the first 30 texts from the HINA collection for our experiment.
The texts required some preprocessing: parsing only the textual part and the title,
excluding annotations; cleaning of diacritics and symbols (w instead of vv, !
instead of l, etc.); and lemmatization. Non-standard word forms (numbers, dates,
acronyms, abbreviations, etc.) remain in the text, since the method should preferably be
resistant to the noise present in the data source.</p>
        <p>The selected 30 texts vary in length from very short (60 tokens) up to 800
tokens (318 tokens on average). The number of keywords per document varies
between 9 and 42 (24 on average). One annotator annotated 10
keywords per document on average.</p>
      </sec>
      <sec id="sec-4-2">
        <title>The Construction of Co-occurrence Networks</title>
        <p>
          Text can be represented as a complex network of linked words: each individual
word is a node and interactions amongst words are links. Co-occurrence networks
exploit a simple neighbour relation: two words are linked if they are adjacent in a
sentence [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The weight of the link is proportional to the overall co-occurrence
frequencies of the corresponding word pairs within a corpus.
        </p>
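        <p>The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation (which lemmatizes the text and uses NetworkX): the function name cooccurrence_links, the whitespace tokenisation with lower-casing, and the use of raw co-occurrence counts as link weights are all assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def cooccurrence_links(sentences):
    """Count directed links between adjacent words within each sentence."""
    links = Counter()
    for sentence in sentences:
        # Naive tokenisation (an assumption): lower-case, split on whitespace.
        words = sentence.lower().split()
        for left, right in zip(words, words[1:]):
            # Directed link left -> right; weight = number of co-occurrences.
            links[(left, right)] += 1
    return links


sentences = [
    "the news agency reported",
    "the news agency confirmed the news",
]
links = cooccurrence_links(sentences)
for (left, right), weight in sorted(links.items()):
    print(left, "->", right, weight)
```
]]></preformat>
        <p>Links are counted only inside a sentence, so the last word of one sentence is never linked to the first word of the next. The resulting dictionary of weighted directed links can be loaded into any graph library for further analysis.</p>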
        <p>From the documents in the HINA data set we construct directed and weighted
co-occurrence networks: one from the text of each document and an integral one
from the texts of all documents, 31 in total.</p>
        <p>
          Network construction and analysis were implemented in the Python
programming language using the NetworkX software package developed for the
creation, manipulation, and study of the structure, dynamics and functions of
complex networks [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Keyword Extraction</title>
      <sec id="sec-5-1">
        <title>Centrality Motivated Keyword Extraction</title>
        <p>Network-based keyword extraction methods exploit different measures for the
task of identifying and ranking the most representative features of the source:
the keywords. The first part of our research compares the performance of different
centrality-motivated network measures (in/out degree, closeness and
betweenness) with the performance of the proposed selectivity measure. The second part
develops a selectivity-based method for keyword extraction with a comparative
analysis of unsupervised (non-network-enabled) approaches.</p>
        <p>The degree (Eq. 1 and 2) of a node (word) is the number of neighbouring
nodes (different neighbouring words). Typically, the nodes with the highest
degree in the network are hubs; analogously, the words with the highest degree are
expected to be stopwords. The closeness (Eq. 3) of a node (word) is related to the
farness of the word from all other words in the text. The betweenness (Eq. 4)
of a node (word) measures how many shortest paths between all other
node pairs traverse the node. The words with the highest values of
betweenness centrality are considered to be important for the information flow as
well. Selectivity is a local (node-level) network measure, defined as the ratio of
the node strength to the node degree. In weighted and directed co-occurrence
networks one can consider the in- and out-links to obtain the in/out selectivity of
the node (Eq. 8). Computing a node's selectivity is less complex
and less expensive than computing closeness and betweenness values.</p>
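        <p>To make the difference in cost concrete, the sketch below computes the normalised closeness of one node with a breadth-first search, which has to visit the whole network, whereas selectivity needs only the links of the node itself. The function name closeness_centrality, measuring distances along outgoing links, and returning 0 when some node is unreachable are assumptions of this example rather than details taken from the original experiments.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import deque


def closeness_centrality(successors, node):
    """Normalised closeness: (N - 1) / sum of shortest-path lengths.

    successors maps every node of the network to the list of nodes it
    links to. Distances are hop counts along outgoing links.
    """
    distance = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search from the given node
        current = queue.popleft()
        for neighbour in successors.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)

    n = len(successors)
    # With a single node, or with unreachable nodes (infinite farness),
    # this sketch returns 0.0 by convention.
    if n < 2 or len(distance) != n:
        return 0.0
    return (n - 1) / sum(distance.values())


# Toy directed network in which every node can reach every other node.
graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
print(closeness_centrality(graph, "c"))
print(closeness_centrality(graph, "a"))
```
]]></preformat>
        <p>Node c reaches both other nodes in one step and therefore obtains the maximum closeness of 1, while node a needs one step to b and two steps to c, which gives 2/3.</p>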
        <p>From the network constructed from all the texts in the HINA news data set
we calculate in/out degree, closeness, betweenness and in/out selectivity. Based
on the obtained values we rank the top 10 or the top 24 keyword candidates from
the network and evaluate them on the set of manually annotated keywords, as
presented in Table 1. The top 10 and the top 24 are chosen to match
the average number of human-assigned keywords: on average 10 keywords from
one annotator and 24 keywords from all 8 annotators per document.
We evaluate the performance of each network measure based on standard recall
(R), precision (P) and F1 score. The F1 score is the harmonic mean of precision and
recall: F1 = 2PR/(P + R). Besides the standard F1 score we also calculate the
F2 score, which gives twice as much importance to recall as to precision:
F2 = 5PR/(4P + R).</p>
        <p>The results in Table 1 are in favour of selectivity over the standard
centrality network measures. Selectivity can efficiently differentiate between two
basic types of nodes (words). Nodes with high strength and high degree
values have low selectivity; they are usually closed-class words (e.g. stopwords,
conjunctions, prepositions). Nodes with high strength and low degree have
high selectivity values. Typically, the nodes with the highest selectivity are
open-class words, which are preferred keyword candidates (nouns, adjectives, verbs), or
even parts of collocations, keyphrases, names, etc. On the other hand, the
highest-ranked words by in/out degree, closeness and betweenness are stopwords,
which are not suitable keyword candidates. For example, the top 10 ranked words
according to in-degree centrality are: to be, and, in, on, which, for, but, this, self,
of; according to betweenness they are: to be, and, in, on, self, this, which, for,
Croatian, but; according to in/out selectivity they are: Bratislava, area, Tuesday,
inland, revolution, verification, decade, Balkan, freedom, Universe.</p>
        <p>In short, it seems that selectivity is insensitive to stopwords (the most
frequent function words, which do not carry strong semantic properties but are
needed for the syntax of the language) and can therefore efficiently detect
semantically rich open-class words in the network and extract better keyword
candidates.</p>
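        <p>The contrast between degree-based and selectivity-based ranking can be illustrated with a small Python sketch. The degree and strength values below are made up purely for illustration and do not come from the HINA data; the variable names are likewise hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# word: (degree, strength); invented values, for illustration only.
profile = {
    "and": (40, 44),
    "in": (35, 37),
    "Bratislava": (1, 3),
    "revolution": (2, 5),
    "agency": (6, 7),
}


def selectivity(word):
    """Selectivity of a word: strength divided by degree."""
    degree, strength = profile[word]
    return strength / degree


# Rank the same words once by degree and once by selectivity.
by_degree = sorted(profile, key=lambda word: profile[word][0], reverse=True)
by_selectivity = sorted(profile, key=selectivity, reverse=True)

print("top by degree:     ", by_degree[:2])
print("top by selectivity:", by_selectivity[:2])
```
]]></preformat>
        <p>With these invented values the function words "and" and "in" lead the degree ranking, because they link to many different neighbours with a weight close to one per link, whereas "Bratislava" and "revolution" lead the selectivity ranking, because their few links are used repeatedly.</p>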
      </sec>
      <sec id="sec-5-2">
        <title>Selectivity-Based Keyword Extraction</title>
        <p>The second part of our research develops a selectivity-based method for keyword
extraction. In order to compare selectivity-based extraction to non-network-based
approaches (unsupervised machine learning methods) we construct 30
networks (directed and weighted) from the 30 texts in the HINA data set and
evaluate the results against the manually annotated keyword sets.</p>
        <p>For each of the 30 networks we compute the in/out selectivity of all nodes. The nodes
are ranked by their in/out selectivity values, keeping those above a threshold
value. Preserving the same threshold value ( 1) in all documents resulted in
a different number of nodes (one-word keyword candidates) extracted from
each network. The obtained set of one-word keyword candidates is denoted
SET1.</p>
        <p>Then, for every filtered node we detect neighbouring nodes: for
in-selectivity we isolate the one neighbour node with the highest outgoing weight; for
out-selectivity we isolate the one neighbour node with the highest ingoing weight.
The result of in/out selectivity extraction is a set of ranked word-tuples, SET2.
Word-tuples are two-word sequences of keyword candidates. From the
obtained tuples we filtered out those containing stopwords in order to compare
them with the manually annotated evaluation set.</p>
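        <p>One possible reading of the two steps above is sketched below in Python. It is an interpretation under stated assumptions, not the original code: a node is kept when its in- or out-selectivity is strictly greater than the threshold parameter; for in-selectivity the partner is taken to be the predecessor with the heaviest link into the node, and for out-selectivity the successor with the heaviest link out of the node; ranking is omitted for brevity. The function name extract_candidates and the toy data are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def extract_candidates(links, threshold, stopwords=frozenset()):
    """Return (set1, set2) from directed weighted links {(u, v): weight}.

    set1 holds one-word candidates, set2 holds two-word tuples.
    """
    stopwords = set(stopwords)
    predecessors, successors = {}, {}
    for (u, v), weight in links.items():
        successors.setdefault(u, {})[v] = weight
        predecessors.setdefault(v, {})[u] = weight

    def selectivity(neighbours):
        # Average weight per neighbour: strength divided by degree.
        return sum(neighbours.values()) / len(neighbours)

    set1, set2 = set(), set()
    # In-selectivity: pair the node with its heaviest predecessor.
    for node, neighbours in predecessors.items():
        if selectivity(neighbours) > threshold:
            set1.add(node)
            best = max(neighbours, key=neighbours.get)
            set2.add((best, node))  # the predecessor precedes the node
    # Out-selectivity: pair the node with its heaviest successor.
    for node, neighbours in successors.items():
        if selectivity(neighbours) > threshold:
            set1.add(node)
            best = max(neighbours, key=neighbours.get)
            set2.add((node, best))  # the successor follows the node

    # Drop tuples that contain a stopword.
    set2 = {pair for pair in set2 if not (set(pair) & stopwords)}
    return set1, set2


links = {
    ("news", "agency"): 3,
    ("the", "news"): 2,
    ("the", "agency"): 1,
    ("agency", "reported"): 1,
}
set1, set2 = extract_candidates(links, threshold=1, stopwords={"the"})
print(sorted(set1))
print(sorted(set2))
```
]]></preformat>
        <p>In this toy example the stopword filter is applied only to the tuples, as in the description above, so a stopword can still appear among the one-word candidates.</p>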
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation and Results</title>
      <p>For the keyword extraction task the strategy "more is better" can be used,
since there is no objective judgement on keywords. Hence, it is preferable to
extract more keywords, which involves a trade-off between the precision and recall of
the methods. The second contested issue in keyword extraction is that shorter
keywords are more general, whereas longer ones are more accurate. Motivated by
these open arguments, and by the approach of other authors, we decided to
follow the same principle: to extract as many keyword candidates as possible
and evaluate them on the basis of recall (R) and F2 score, besides the standard
precision (P) and F1 score.</p>
      <p>Evaluation, the final part of the experiment, is based on the intersection of
the obtained sets of keyword candidates, SET1 and SET2, with the union of
the keywords of all 8 annotators. The results in terms of precision and recall are in
Figures 1 and 2 respectively, and in terms of F1 and F2 scores in Figures 3
and 4 respectively. The obtained average F1 score for SET1 is 24.63%, and
the average F2 score is 21.19%. The expansion of the obtained candidates to SET2
increased the average F1 score to 25.9% and the F2 score to 24.47%.</p>
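      <p>The evaluation of a single document can be sketched as follows. The sketch assumes exact string matching between candidates and annotated keywords and uses the general F-beta formula, which reduces to F1 = 2PR/(P + R) for beta = 1 and to F2 = 5PR/(4P + R) for beta = 2; the function name evaluate and the toy keyword sets are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def evaluate(candidates, annotator_sets):
    """Score candidates against the union of all annotators' keywords.

    Returns (precision, recall, F1, F2) using exact string matching.
    """
    candidates = set(candidates)
    gold = set().union(*annotator_sets)  # union over all annotators
    hits = len(candidates & gold)        # size of the intersection
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(gold) if gold else 0.0

    def f_beta(beta):
        # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
        denominator = beta ** 2 * precision + recall
        if denominator == 0:
            return 0.0
        return (1 + beta ** 2) * precision * recall / denominator

    return precision, recall, f_beta(1), f_beta(2)


extracted = {"news", "agency", "Zagreb", "minister"}
annotators = [
    {"news", "agency"},
    {"agency", "government", "Zagreb"},
    {"economy"},
]
p, r, f1, f2 = evaluate(extracted, annotators)
print(p, r, round(f1, 4), round(f2, 4))
```
]]></preformat>
      <p>Here three of the four candidates appear in the union of five annotated keywords, so precision is 0.75 and recall is 0.6. Averaging such per-document scores over all documents gives figures of the kind reported above.</p>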
      <p>
        All supervised and unsupervised methods reported for keyphrase extraction
from the HINA data set incorporate linguistic knowledge (POS, MSD, etc.) of
Croatian. Mijic et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] initially extracted the list of keyword candidates (as
a comprehensive list of all words without stopwords), which was expanded into
longer n-gram sequences up to a length of four. In [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], a keyphrase extraction
system developed for a large-scale Croatian news production system, the tf-idf
ranking model was used to extract n-grams of up to length four, which were
lemmatized and filtered by POS and MSD. For evaluation the manually annotated
key phrases from 60 documents were used. The evaluation set was reduced to
keywords suggested only by the 3 top annotators (those having the highest inter-annotator
agreement among all 8 annotators). The results indicate that the performance
is comparable to that of the human annotators. Ahel et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] for the one-word
long keywords reported precision of 22% and recall of 3.4%.
      </p>
      <p>
        We designed our method purely from statistical and structural information
encompassed in the source text, which is reflected in the structure of the network.
On SET1 our method achieved an average recall of 19.53% and precision of 39.1%.
Expansion to the word-tuples in SET2 increased the average recall to 23.87% and
decreased precision to 32.23%. The obtained results are comparable to [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], but with a slightly different evaluation setup.
      </p>
      <p>The obtained selectivity-based results are promising and have the potential to
improve in several directions, which are elaborated at the end of the next
section. An additional remark regarding the results is that besides keyword candidates
our method captures personal names and entities, which were not marked as
keyphrases and thus lowered the score. Capturing names and entities can be of high
relevance for tasks such as named-entity recognition, text summarization, etc.</p>
      <p>
        Keyword annotation is an extremely subjective task, as even human experts
have difficulties agreeing upon keyphrases (inter-annotator agreement of around 40%).
Croatian is a morphologically rich language, which adds another
challenge to the task, since annotators freely choose whichever morphological
word form seems appropriate at the moment as a tag. Additionally, there
was no predefined index or keyword list, so annotators could make up
their own; even worse, in some cases it seemed appropriate to annotate with
keywords which were not present in the original article (out-of-vocabulary words).
In [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] the number of out-of-vocabulary keywords in the whole HINA
data set is estimated to be as high as 57%. Since our method is derived purely from
text statistics, it cannot capture all the possible subjective variations
of the annotators or out-of-vocabulary words. Still, it is close to the range of the
achieved inter-annotator agreement.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>
        This research on selectivity-based keyword extraction for Croatian news (HINA
data set) describes an unsupervised method which extracts nodes from a complex
network as keyword candidates. We build our approach on a new network
measure, the node selectivity (defined as the average weight distribution on the
links of a single node). The node selectivity value is used for extracting and
ranking the keyword candidates. Initially, we compare selectivity-based extraction to
standard centrality-motivated measures, and propose the selectivity measure for
keyword extraction.
      </p>
      <p>The selectivity-based keyword extraction method comprises the
extraction of the seed keyword set (words with the highest in/out selectivity) and
its expansion to word-tuples with the highest in/out selectivity values. The
obtained average F1 score for the set of extracted keyword candidates is 24.63%,
and the average F2 score is 21.19%. The expansion of the obtained candidates
to word-tuples increased the average F1 score to 25.9% and the F2 score to 24.47%,
which is comparable to the results achieved on the same data set by
supervised and unsupervised methods, and is close to the range of the
achieved inter-annotator agreement. Selectivity-based extraction does not require
linguistic knowledge, as it is derived purely from statistical and structural information
encompassed in the source text, which is reflected in the structure of the network.</p>
      <p>Our results imply that the structure of the network can be applied to the
Croatian keyword extraction task, with many possible improvements. This should
be thoroughly examined in future work, which will cover: a) evaluation:
considering all flective word forms and various matching strategies (exact,
fuzzy, part-of-match); b) text types: considering texts of varying length, genre
and topic; c) multitopic: comparing isolated document extraction vs. multitopic
extraction; d) other languages: testing on standard English (and other) data
sets; e) longer keyword candidate sets: constructing keyword sequences up to a
length of 3; f) entity extraction: testing whether entities can be extracted from
complex networks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Hagberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Swart</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Schult</surname>
          </string-name>
          .
          <article-title>Exploring network structure, dynamics, and function using NetworkX</article-title>
          .
          <source>Technical report</source>
          , Los Alamos National Laboratory (LANL) (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <article-title>What role does syntax play in a language network?</article-title>
          <source>EPL (Europhysics Letters)</source>
          ,
          <volume>83</volume>
          (
          <issue>1</issue>
          ):
          <elocation-id>18002</elocation-id>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D.</given-names>
            <surname>Margan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Martincic-Ipsic</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mestrovic</surname>
          </string-name>
          .
          <article-title>Preliminary report on the structure of Croatian linguistic co-occurrence networks</article-title>
          .
          <source>5th International Conference on Information Technologies and Information Society (ITIS)</source>
          ,
          Slovenia
          ,
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D.</given-names>
            <surname>Margan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Martincic-Ipsic</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mestrovic</surname>
          </string-name>
          .
          <article-title>Network Differences Between Normal and Shuffled Texts: Case of Croatian</article-title>
          .
          <source>Studies in Computational Intelligence</source>
          , Complex Networks V. Vol.
          <volume>549</volume>
          . Italy, pp.
          <fpage>275</fpage>
          -
          <lpage>283</lpage>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D.</given-names>
            <surname>Margan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mestrovic</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Martincic-Ipsic</surname>
          </string-name>
          .
          <article-title>Complex Networks Measures for Differentiation between Normal and Shuffled Croatian Texts</article-title>
          .
          <source>IEEE MIPRO</source>
          <year>2014</year>
          , Croatia, pp.
          <fpage>1819</fpage>
          -
          <lpage>1823</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Sisovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Martincic-Ipsic</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mestrovic</surname>
          </string-name>
          .
          <article-title>Comparison of the language networks from literature and blogs</article-title>
          .
          <source>IEEE MIPRO</source>
          <year>2014</year>
          , Croatia, pp.
          <fpage>1824</fpage>
          -
          <lpage>1829</lpage>
          , (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Masucci</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Rodgers</surname>
          </string-name>
          .
          <article-title>Network properties of written human language</article-title>
          .
          <source>Physical Review E</source>
          ,
          <volume>74</volume>
          (
          <issue>2</issue>
          ):
          <elocation-id>026102</elocation-id>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Masucci</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Rodgers</surname>
          </string-name>
          .
          <article-title>Differences between normal and shuffled texts: structural properties of weighted networks</article-title>
          .
          <source>Advances in Complex Systems</source>
          ,
          <volume>12</volume>
          (
          <issue>01</issue>
          ):
          <fpage>113</fpage>
          -
          <lpage>129</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Lahiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.R.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Caragea</surname>
          </string-name>
          .
          <article-title>Keyword and Keyphrase Extraction Using Centrality Measures on Collocation Networks</article-title>
          .
          <source>arXiv preprint arXiv:1401.6571</source>
          , (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>F.</given-names>
            <surname>Boudin</surname>
          </string-name>
          .
          <article-title>A comparison of centrality measures for graph-based keyphrase extraction</article-title>
          .
          <source>International Joint Conference on Natural Language Processing (IJCNLP)</source>
          , pp.
          <fpage>834</fpage>
          -
          <lpage>838</lpage>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Litvak</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Last</surname>
          </string-name>
          .
          <article-title>Graph-based keyword extraction for single-document summarization</article-title>
          .
          <source>ACM Workshop on Multi-source Multilingual Information Extraction and Summarization</source>
          . pp.
          <fpage>17</fpage>
          -
          <lpage>24</lpage>
          , (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>M.</given-names>
            <surname>Grineva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grinev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lizorkin</surname>
          </string-name>
          .
          <article-title>Extracting key terms from noisy and multitheme documents</article-title>
          .
          <source>ACM 18th conference on World Wide Web</source>
          , pp.
          <fpage>661</fpage>
          -
          <lpage>670</lpage>
          , (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Palshikar</surname>
          </string-name>
          .
          <article-title>Keyword extraction from a single document using centrality measures</article-title>
          .
          <source>Pattern Recognition and Machine Intelligence</source>
          , pp.
          <fpage>503</fpage>
          -
          <lpage>510</lpage>
          , (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tarau</surname>
          </string-name>
          .
          <article-title>TextRank: Bringing order into texts</article-title>
          .
          <source>ACL Empirical Methods in Natural Language Processing-EMNLP04</source>
          ,
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ohsawa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ishizuka</surname>
          </string-name>
          .
          <article-title>KeyWorld: Extracting keywords from document's small world</article-title>
          .
          <source>Discovery Science</source>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>281</lpage>
          , (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>G.</given-names>
            <surname>Erkan</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Radev</surname>
          </string-name>
          .
          <article-title>LexRank: Graph-based lexical centrality as salience in text summarization</article-title>
          .
          <source>Journal of Artificial Intelligence Research (JAIR)</source>
          , vol.
          <volume>22</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>457</fpage>
          -
          <lpage>479</lpage>
          , (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          .
          <article-title>Graph-based ranking algorithms for sentence extraction, applied to text summarization</article-title>
          .
          <source>Proc. ACL</source>
          <year>2004</year>
          , pp.
          <fpage>20</fpage>
          , (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Varlamis</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Nørvåg</surname>
          </string-name>
          .
          <article-title>SemanticRank: ranking keywords and sentences using semantic graphs</article-title>
          .
          <source>ACL 23rd International Conference on Computational Linguistics</source>
          , pp.
          <fpage>1074</fpage>
          -
          <lpage>1082</lpage>
          , (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.X.</given-names>
            <surname>Ling</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Keyphrase extraction using semantic networks structure analysis</article-title>
          .
          <source>IEEE International Conference on Data Mining</source>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>284</lpage>
          , (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>J.</given-names>
            <surname>Mijic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dalbelo-Basic</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Snajder</surname>
          </string-name>
          .
          <article-title>Robust keyphrase extraction for a large-scale Croatian news production system</article-title>
          .
          <source>FASSBL</source>
          <year>2010</year>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>66</lpage>
          , (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>M.</given-names>
            <surname>Bekavac</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Snajder</surname>
          </string-name>
          .
          <article-title>GPKEX: Genetically Programmed Keyphrase Extraction from Croatian Texts</article-title>
          .
          <source>ACL</source>
          <year>2013</year>
          , pp.
          <fpage>43</fpage>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>J.</given-names>
            <surname>Saratlija</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Snajder</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Dalbelo-Basic</surname>
          </string-name>
          .
          <article-title>Unsupervised topic-oriented keyphrase extraction and its application to Croatian</article-title>
          .
          <source>Text, Speech and Dialogue</source>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>347</lpage>
          , (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>R.</given-names>
            <surname>Ahel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dalbelo-Basic</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Snajder</surname>
          </string-name>
          .
          <article-title>Automatic keyphrase extraction from Croatian newspaper articles</article-title>
          .
          <source>The Future of Information Sciences, Digital Resources and Knowledge Sharing</source>
          , pp.
          <fpage>207</fpage>
          -
          <lpage>218</lpage>
          , (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>M.E.J.</given-names>
            <surname>Newman</surname>
          </string-name>
          .
          <source>Networks: An Introduction</source>
          . Oxford University Press (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>