<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Modeling of Lexical Relations Between Topics Retrieved from DBLP Journals</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lukas Hlavacek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Vymola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, FEI, VSB - Technical University of Ostrava</institution>
          ,
          <addr-line>17. listopadu 15, 708 33, Ostrava-Poruba, Czech Republic</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>153</fpage>
      <lpage>160</lpage>
      <abstract>
        <p>In this paper we present a method for extracting topics from the aims and scopes of DBLP journals and for the subsequent construction of their hierarchical order. We focus on semantic relations between topics. Our method is fully automatic, although manual cleaning of the topic database would lead to much better accuracy. Another purpose of our work is to provide a system similar to the ACM classification system; in contrast to ACM, we want to provide a better, newer, and fully automatic system.</p>
      </abstract>
      <kwd-group>
        <kwd>DBLP</kwd>
        <kwd>ACM</kwd>
        <kwd>Lexical acquisition</kwd>
        <kwd>Text mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper describes a method for organizing semantically retrieved topics
into a hierarchy: a graph where the nodes are topics and the edges (also called
links) represent relationships between them, and where levels in the graph
correspond to levels in the hierarchy. We first present how we retrieve data from
plain text, then the algorithm itself, and in the last part testing and result
analysis. Methods for automatic topic retrieval from plain text have not yet been
well presented. Our technique is, of course, applicable to many other problems
requiring word categorization with semantic influence. Existing methods focus
primarily on the semantics of ordinary words; we mention some of them in the
next chapter. In this paper, however, we focus especially on the semantics of
topics, because topics have special semantics.</p>
      <p>
        Our motivation was Widdows' method described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], but we apply his technique to topics retrieved from DBLP. For testing we
could not use the WordNet database as Widdows did, because that database
contains only information about simple words (synonyms), whereas topics mostly
consist of several words (2-3). That is why we chose the ACM classification
(latest version, 1998).
      </p>
      <p>ACM is widely recognized as the premier membership organization for
computing professionals, delivering resources that advance computing as a science
and a profession, enable professional development and promote policies and
research that benefit society.</p>
      <p>ACM hosts the computing industry’s leading Digital Library and Guide to
Computing Literature, and serves its global members and the computing
profession with journals and magazines, conferences, workshops, electronic forums,
and Online Books and Courses.</p>
      <p>
        ACM’s first classification system for the computing field was published in
1964. Then, in 1982, the ACM published an entirely new system. New versions
based on the 1982 system followed, in 1983, 1987, 1991, and 1998 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The ACM
classification is now more than 10 years old and does not contain many newer
topics, for instance social networks or wireless networks.
      </p>
      <p>The DBLP data source, which is representative of conventional database
applications, is maintained by a single source. It is one of the best formatted
and organized bibliography datasets. DBLP covers approximately 400,000
researchers who have publications in major Computer Science publication venues.
Bibliographic datasets have been used for social network analysis, such as
studying the structure and the spread of influence in scientific communities.
</p>
    </sec>
    <sec id="sec-2">
      <title>Previous work</title>
      <p>In this paper we continue D. Widdows' work. He presented a method for
assembling semantic knowledge from a part-of-speech tagged corpus using graph
algorithms. His graph model is built by linking pairs of words which participate
in particular syntactic relationships.</p>
      <p>
        In this part we present some of the most important previous methods. Most
work on automatic lexical acquisition has been based at some point on the
notion of semantic similarity. Roark and Charniak described a generic algorithm
for extracting such lists of similar words using the notion of semantic similarity
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Roark and Charniak reported accuracies of 17% and 35%. These early
results have been improved upon by Riloff and Jones in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], where a mutual
bootstrapping approach is used to extract words in particular semantic categories
and expression patterns for recognizing relationships between these words for
the purposes of information extraction. The accuracy achieved in this
experiment was about 78%.
      </p>
      <p>
        Another way to obtain word senses directly from corpora is to use clustering
algorithms on feature vectors. A general problem for such clustering techniques
lies in the question of how many clusters one should have, i.e. how many senses
are appropriate for a particular word in a given domain. Lin's approach to this
problem in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is to build a similarity tree (using what is in effect a hierarchical
clustering method) of words related to a target word. Another approach was
described by T. P. Martin and M. Azmi-Murad in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Widdows described a new incremental algorithm for extracting categories
of similar words. He defined the following method for constructing a hierarchy
of words, capturing how they depend on each other.</p>
      <p>
        Definition 1. Let A be a set of nodes and let N(A) be the neighbours of A,
i.e. the nodes linked to any a ∈ A (so N(A) = ∪_{a∈A} N(a)). The best new
node is taken to be the node b ∈ N(A) \ A with the highest proportion of links
to N(A). More precisely, for each u ∈ N(A) \ A, let the affinity between u and
A be given by the ratio af(u, A) = |N(u) ∩ N(A)| / |N(u)|.
If N(u) = ∅ then af(u, A) = 0. The best new node b ∈ N(A) \ A is the node
which maximises this affinity score [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
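      <p>Definition 1 can be sketched in a few lines of Python (a minimal illustration of the affinity selection, not Widdows' original implementation; the adjacency-set graph representation and the function names are our own):</p>

```python
def neighbours(graph, nodes):
    """N(A): the union of the neighbour sets of all nodes in A."""
    result = set()
    for a in nodes:
        result |= graph.get(a, set())
    return result

def best_new_node(graph, A):
    """Return the node b in N(A) \\ A maximising the affinity
    af(u, A) = |N(u) ∩ N(A)| / |N(u)| (taken as 0 if N(u) is empty).
    Returns None when there is no candidate node."""
    NA = neighbours(graph, A)
    candidates = NA - set(A)

    def affinity(u):
        Nu = graph.get(u, set())
        return len(Nu & NA) / len(Nu) if Nu else 0.0

    return max(candidates, key=affinity, default=None)
```

      <p>Representing neighbour lists as sets makes the intersection in the affinity ratio a single operation.</p>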
      <p>The result here depends on which seed topic is taken first. For a particular
topic A, this algorithm may find some other node B as the best node; but it does
not have to find topic A as the best node for topic B when B is the seed topic.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>This chapter describes our approach to retrieving topics from text and to
constructing the topic hierarchy. The source of our data is manually extracted
aims and scopes from DBLP journals; automatic extraction of particular topics
from these texts follows.</p>
      <sec id="sec-3-1">
        <title>The algorithm for creating the dictionary</title>
        <p>Let R be the set of abstracts, where ri ∈ R is the abstract of a journal. Then
let N be the set of improper words (the negative dictionary), where nm ∈ N is
an improper word in the negative dictionary. We also define δ, which represents
the maximal number of words in a topic; for example, if δ = 3 then a topic
contains at most three words. The function split is the standard split function
of the C# programming language and splits text by a conjunction.</p>
        <p>The algorithm is based on searching for specific important conjunctions in
sentences (and, or, comma, semicolon) and for words within a short distance of
these conjunctions. These conjunctions have various priorities, of which the most
important is and. Other special characters were removed. In the case of the and
conjunction it is necessary to analyze the topic correctly, which we do in the
following way:</p>
        <p>Fig. 1 shows an example of a syntactic compound topic with and. Arrows
show which word needs to be added and where, for instance software and
hardware architecture, or, as another example, software architecture and hardware.
Fig. 1 suggests an algorithm for topics consisting of two words; a similar
algorithm exists for 3-word topics. When creating such rules it is important to
proceed from the lexical structure of multi-word conjunctions.</p>
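        <p>The two-word expansion rules of Fig. 1 can be illustrated as follows (a hedged reconstruction from the two examples in the text; the function name and the exact rule shapes are our assumptions):</p>

```python
def expand_and_compound(phrase):
    """Expand a two-word topic compound joined by 'and' into two full topics.

    Two patterns, reconstructed from the Fig. 1 examples:
      'A and B C' -> ['A C', 'B C']   e.g. software and hardware architecture
      'A B and C' -> ['A B', 'C B']   e.g. software architecture and hardware
    Any other shape is returned unchanged.
    """
    words = phrase.split()
    if len(words) == 4 and words[1] == 'and':
        a, _, b, c = words
        return [f'{a} {c}', f'{b} {c}']
    if len(words) == 4 and words[2] == 'and':
        a, b, _, c = words
        return [f'{a} {b}', f'{c} {b}']
    return [phrase]
```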
        <p>It might happen that this method divides a topic into two new topics which
were not supposed to be divided, e.g. data terminals and printers. Other
conjunctions usually do not have this issue, and for the semantics of topics they all
have the same priority. The next step is to save the parsed topics into an
incidence matrix, which creates relations between particular nodes. Then we can
easily set the level of importance according to the frequency of their occurrence
in the text.</p>
        <p>Algorithm 1 The algorithm for creating the dictionary
Require: δ = 3, R = set of abstracts, N = set of improper words
for all ri ∈ R do
  S = split(ri, '.') {where S is the set of sentences for ri ∈ R}
  for all sj ∈ S do
    W = split(sj, conjunction) {where W is the set of potential topics for sj ∈ S}
    for all wk ∈ W do
      if |wk| &lt;= δ and wk ∉ N then
        Add wk to T {where T is the set of topics}
      end if
    end for
  end for
end for</p>
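        <p>Algorithm 1 translates almost directly into Python (a sketch under our assumptions: regular-expression splitting stands in for the C# split call, and the conjunction set is reduced to and, or, comma, and semicolon):</p>

```python
import re

def build_dictionary(abstracts, improper_words, delta=3):
    """Algorithm 1: extract candidate topics from journal abstracts.

    abstracts      -- list of plain-text aims-and-scopes strings (R)
    improper_words -- set of phrases to reject (negative dictionary N)
    delta          -- maximal number of words allowed in a topic (δ)
    """
    topics = set()  # the topic dictionary T
    for abstract in abstracts:                      # for all r_i in R
        for sentence in abstract.split('.'):        # S = split(r_i, '.')
            # W = split(s_j, conjunction): split on and/or, commas, semicolons
            candidates = re.split(r'\band\b|\bor\b|[,;]', sentence)
            for cand in candidates:                 # for all w_k in W
                topic = cand.strip().lower()
                words = topic.split()
                # keep non-empty topics of at most delta words
                # that are not in the negative dictionary
                if 0 < len(words) <= delta and topic not in improper_words:
                    topics.add(topic)
    return topics
```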
      </sec>
      <sec id="sec-3-2">
        <title>The algorithm for creating the graph of topics</title>
        <p>Let T be the set of topics, where ti ∈ T is a topic in the dictionary of topics.
Let R be the set of abstracts, where ri ∈ R is the abstract of a journal. We also
use an incidence matrix Im×n, which represents the links between words in the
dictionary. We further define a threshold that represents the required strength
of the relationship between two words in the incidence matrix.</p>
        <p>In the first phase we analyze each journal separately and search for one-,
two-, or three-word topics.</p>
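        <p>The incidence-matrix step can be sketched as a co-occurrence count over abstracts (a simplified illustration; we store the matrix sparsely as a pair-to-count dictionary, and the threshold parameter plays the role of the required relationship strength):</p>

```python
from collections import defaultdict
from itertools import combinations

def build_incidence(abstracts, topics, threshold=2):
    """Count how often pairs of topics co-occur in the same abstract;
    keep only links at least as strong as the threshold."""
    counts = defaultdict(int)
    for abstract in abstracts:
        text = abstract.lower()
        # topics from the dictionary that appear in this abstract
        present = [t for t in topics if t in text]
        for a, b in combinations(sorted(present), 2):
            counts[(a, b)] += 1
    # keep only sufficiently strong links
    return {pair: n for pair, n in counts.items() if n >= threshold}
```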
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Testing</title>
      <p>We identified about 750 journals in DBLP. For testing we used 500 aims
and scopes from these journals, containing about 90,000 words. This text was
parsed by Algorithm 1 into about 4000 topics. (Algorithm 2 is the algorithm for
searching for similar topics.) During testing we found that, for our topic-database
size, we obtain the best results if we take the first 1000 topics with the highest
frequency of occurrence. About 3000 topics are thus left out, but their rating is
too low to affect the result significantly.</p>
      <p>We achieved this result in the following way: first we took 400 topics, but
with this setting we found only a few related topics. We therefore doubled the
number of topics, and continued doubling up to 2000 topics while observing the
generated groups. We also focused on the reliability of topics in particular groups,
so that a group does not contain topics unrelated to its seed topic. We set a limit
of up to 10 good topics in each group where possible; for some topics, however,
the algorithm finished with a lower number. From all results we found that for
our database it is optimal to take the first 1000 topics with the highest number
of occurrences. We compared our results to the ACM hierarchy.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Most previous works used the WordNet database for testing. Because topics
are semantically different we could not use it; instead we used the ACM hierarchy
database, which is a database of hierarchically ordered topics. It can also be
downloaded as XML, so it is very easy to work with. Even though we are not
able to measure accuracy as a single number, we can assess conformity to ACM.</p>
      <p>The result is strongly affected by the age of the last release of the ACM
hierarchy (1998), when many topics, for instance social networks or business
intelligence, did not yet exist. Our algorithm found about 5000 automatically
generated topics from 500 journals; approximately 10% of them were manually
found not to be good topics. The latest ACM database consists of 1200 topics.
We chose some topics common to ACM and our results and compared them.
Since we found only about 20 percent of topics the same in both databases, it
does not make much sense to compare them directly. One difference between the
databases is that, for instance, ACM has the single topic graphic, whereas our
algorithm generates computer graphic and 8 other topics containing the word
graphic. This is also why our hierarchy is so much larger.</p>
      <p>Figure 4 shows how the algorithm adds each new topic into the hierarchy.
First, the topics directly connected to the seed topic are added as level 1; then
the topics with the highest affinity ratio to the level-1 topics are added to level 2.
The algorithm iterates in the same way for the next levels.</p>
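      <p>The level-by-level growth described above can be sketched as a breadth-first expansion (a simplified illustration: here each level simply takes all unplaced neighbours, whereas the paper's actual selection additionally ranks candidates by the affinity ratio of Definition 1):</p>

```python
def build_levels(graph, seed, depth=2):
    """Level-by-level hierarchy construction around a seed topic:
    level 1 = direct neighbours of the seed; each further level =
    neighbours of the previous level that are not yet placed."""
    placed = {seed}
    levels = []
    frontier = {seed}
    for _ in range(depth):
        nxt = set()
        for node in frontier:
            nxt |= graph.get(node, set())
        nxt -= placed          # only topics not already in the hierarchy
        if not nxt:
            break
        levels.append(nxt)
        placed |= nxt
        frontier = nxt
    return levels
```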
      <p>In Table 1 we present results for a few randomly chosen seed topics. We can
see that our method generates topics correctly related to the seed topic. However,
some topics may not be related to the seed word; these topics are marked in the
table in italic font. Compared to ACM we obtained much better related topics
in the groups. The more aims and scopes we have, the better results we get.</p>
      <p>Table 1. Seed topics and the related topics generated for them:
artificial intelligence - information retrieval, digital library, mechanical engineering, data warehousing, logic programming, machine learning, evolutionary computation, quantum computing, artificial life, learning system, neural network
wireless - spread-spectrum system, cellular system, ip, adaptive antenna, network protocol, telecommunications network, computer communication, lan, man, satellite network, wireless computing
computer graphic - image processing, computer vision, speech recognition, data mining, machine learning, digital library, evolutionary computation, artificial life, learning system, data warehousing
mathematical method - computational method, numerical mathematic, computational mathematic, stochastic process, set theory, probability theory, matroid theory, computer scientist, coding theory, differential equation
process engineering - production engineering, electronics engineering, unmanned system, electrical engineering, chemical engineering, mechanical engineering, traffic engineering, applied mathematic
language research - language technology, human-computer interaction, decision support, computational linguistic, spatial reasoning, qualitative reasoning, global warming, semantic web, object-oriented programming
combinatorial optimization - randomized algorithm, mathematical programming, graph algorithm, indexing information, data analysis, computer vision, information fusion, database design, web mining, risk analysis</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Widdows</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beate Dorow</surname>
          </string-name>
          :
          <article-title>A Graph Model for Unsupervised Lexical Acquisition</article-title>
          .
          <source>COLING</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Widdows</surname>
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ferraro</surname>
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Semantic vectors: A scalable open source package and online technology management application</article-title>
          .
          <source>In Proceedings of the sixth international conference on Language Resources and Evaluation</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Widdows</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Orthogonal negation in vector spaces for modeling word-meanings and document retrieval</article-title>
          .
          <source>In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rieffel</surname>
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Certainty and uncertainty in quantum information processing</article-title>
          . In Bruza et al. (
          <year>2007</year>
          ),
          <fpage>134</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Widdows</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bruza P.</surname>
          </string-name>
          :
          <article-title>Quantum information dynamics and open world science</article-title>
          . In Bruza et al. (
          <year>2007</year>
          ),
          <fpage>126</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Ellen</given-names>
            <surname>Riloff</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rosie</given-names>
            <surname>Jones</surname>
          </string-name>
          :
          <article-title>Learning dictionaries for information extraction by multi-level bootstrapping</article-title>
          .
          <source>In Proceedings of the Sixteenth National Conference on Artificial Intelligence</source>
          ,
          <year>1999</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Brian</given-names>
            <surname>Roark</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Charniak</surname>
          </string-name>
          :
          <article-title>Noun-phrase co-occurrence statistics for semiautomatic semantic lexicon construction</article-title>
          .
          <source>In COLING-ACL</source>
          ,
          <year>1998</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Trevor P.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Masrah</given-names>
            <surname>Azmi-Murad</surname>
          </string-name>
          :
          <article-title>An Incremental Algorithm to find Asymmetric Word Similarities for Fuzzy Text Mining</article-title>
          .
          <source>WSTST</source>
          <year>2005</year>
          ,
          <fpage>838</fpage>
          -
          <lpage>847</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Communications of the ACM, New York, NY, USA</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Dekang</given-names>
            <surname>Lin</surname>
          </string-name>
          :
          <article-title>Automatic retrieval and clustering of similar words</article-title>
          ,
          <source>In COLING-ACL</source>
          , Montreal, Athens
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>