<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Thematically Related Words toward Creative Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eiko Yamamoto</string-name>
          <email>eiko@mech.kobe-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hitoshi Isahara</string-name>
          <email>isahara@nict.go.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Engineering, Kobe University</institution>
          ,
          <addr-line>1-1 Rokkodai-cho, Nada-ku, Kobe, Hyogo, 657-8501</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Information and Communications Technology</institution>
          ,
          <addr-line>3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We introduce a mechanism that provides key words which can enrich human-computer interaction in the course of information retrieval, using natural language processing technology and a mathematical measure for calculating the degree of inclusion. We show what type of word should be added to the current query, i.e., to the keywords input so far, in order to make human-computer interaction more creative. We extract related word sets from documents by employing case-marking particles derived from syntactic analysis. We then verify which kind of related word is more useful as an additional word for retrieval support.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>ACM Classification Keywords
H5.2. INFORMATION INTERFACES AND PRESENTATION (e.g., HCI): User Interfaces – Natural Language; H.3.3. INFORMATION STORAGE and RETRIEVAL: Information Search and Retrieval.
INTRODUCTION
Nowadays, we can access a huge amount of text data available on the Web. This increase in data quantity causes a paradigm shift in Web retrieval; rhetorically speaking, we can now take a walk among the huge body of text. The retrieval support we need in this novel situation is neither simple query expansion nor a record of our (or someone else's) previously input keywords; rather, we need interfaces that interact with people in new ways. What is crucial for such an interface is not its construction, i.e., how each part of the interface is arranged on the screen, but what information is presented to interact with users.</p>
      <p>New ideas pop into one’s head when strolling in a library, a bookstore, or even around town. We need retrieval support that enables us to expand such creativity. Making the computer smart enough to automatically extract the “correct” retrieval result is only one way of developing support systems for information retrieval. How a user carries out the next retrieval after seeing the advice a computer provides is one of the most important viewpoints for future intelligent user interfaces. We need a technology that enables computers to understand huge amounts of text data and makes it possible to expand users’ ways of thinking.</p>
      <p>In this paper, we introduce a mechanism that provides key words which can enrich human-computer interaction (HCI) during information retrieval, using natural language processing technology and a mathematical measure for calculating the degree of inclusion. Concretely, we show what type of word should be added to the current query, i.e., to the keywords input so far, in order to make HCI more creative.</p>
      <p>
        RELATION BETWEEN WORDS
Researchers in natural language processing have developed many methodologies for extracting various relations from corpora. Methods exist for extracting relations such as “is-a” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], “part-of” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], causal [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and entailment [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] relations. Moreover, methods for learning patterns that extract relations between words have been presented [
        <xref ref-type="bibr" rid="ref4 ref8">4, 8</xref>
        ]. Such related words can be used to support retrieval and lead users to high-quality information. One simple method is to provide additional key words related to the key words users have already input. This raises a question: what kinds of relations between the previous key words and the additional word are effective for information retrieval?
      </p>
      <p>
        As for the relations among words, at least two kinds exist: the taxonomical relation and the thematic relation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].1 The former represents physical resemblance among objects, such as “cow” and “animal,” and is typically a semantic relation; the latter is a non-taxonomical relation among objects through a thematic scene, such as “milk” and “cow” as recollected in the scene “milking a cow,” and includes causal and entailment relations. Taxonomically related words are generally used for query expansion, and it is comparatively easy to identify taxonomical relations from linguistic resources such as dictionaries and thesauri. On the other hand, it is difficult to identify thematic relations because they are rarely maintained in linguistic resources.
Most previous research on information retrieval support has focused on improving recall through query expansion. Our aim, however, is to direct users to information that is informative for them through query suggestion. Because users sometimes do not realize their real retrieval intention, we would like them to find their hidden needs via an interaction in which the system shows them suggestive terms.
      </p>
      <p>In this paper, we extract related word sets from documents in Japanese by employing case-marking particles derived from syntactic analysis. We then compare the results retrieved with words related only taxonomically and those retrieved with words that include a word related non-taxonomically to the others, in order to verify what kind of relation makes human-computer interaction more creative.</p>
      <p>
        WORD SET EXTRACTION METHOD
To derive word sets that direct users to information, we applied a method based on the Complementary Similarity Measure (CSM), which can estimate inclusive relations between two vectors [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This measure was originally developed for recognizing degraded machine-printed text [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Estimating Inclusive Relations between Words
We first extract word pairs whose appearance patterns stand in an inclusive relation by calculating CSM values. An appearance pattern expresses a kind of co-occurrence relation as an n-dimensional binary feature vector, where each dimension corresponds to a co-occurring word, a document, or a sentence. When Vi = (vi1, ..., vin) is the vector for word wi and Vj = (vj1, ..., vjn) is the vector for word wj, CSM(Vi, Vj) is defined by the following formula:
1 The taxonomical relation, as provided for example by WordNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], corresponds to the “classical” relation of Morris and Hirst [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and the thematic relation corresponds to their “non-classical” relation.
      </p>
      <p>CSM(Vi, Vj) = (ad − bc) / √((a + c)(b + d)), where
a = Σ_{k=1}^{n} vik · vjk,
b = Σ_{k=1}^{n} vik · (1 − vjk),
c = Σ_{k=1}^{n} (1 − vik) · vjk,
d = Σ_{k=1}^{n} (1 − vik) · (1 − vjk).</p>
      <p>CSM is an asymmetric measure because its denominator is asymmetric; therefore, CSM(Vi, Vj) usually differs from CSM(Vj, Vi), in which Vi and Vj are exchanged. For example, when Vi is 1110010111 and Vj is 1000110110, the parameters for CSM(Vi, Vj) are a = 4, b = 3, c = 1, and d = 2, and CSM(Vi, Vj) is greater than CSM(Vj, Vi). Owing to this asymmetry, we can estimate whether the appearance pattern of wi includes the appearance pattern of wj. If wi is “animal” and wj is “tiger,” CSM would estimate that “animal” is a hypernym of “tiger.”
Word pairs extracted on the basis of their appearance patterns are expressed by a tuple &lt;wi, wj&gt;, a directed pair of words. Tuple &lt;wi, wj&gt; indicates that CSM(Vi, Vj) is greater than CSM(Vj, Vi), where Vi and Vj are the binary vectors representing the appearance patterns of wi and wj. We call wi the “left word” and wj the “right word.”
Constructing Related Word Sets
We next connect word pairs whose CSM values are greater than a certain threshold and construct word sets. If we adopted a simpler mechanism such as co-occurrence frequency, which extracts only co-occurrence relations between words, two tuples extracted from different sentences could not be merged easily. A feature of our method is that, because we use the CSM to calculate the degree of inclusion of appearance patterns between all combinations of words in the whole collection of texts, we can connect word pairs consistently. That is to say, we can extract not only pairs of related words but also sets of related words; our CSM-based method draws not only on information within a sentence or a document but also on information from a wider context. Thus, once we obtain two tuples &lt;A, B&gt; and &lt;B, C&gt;, even though the tuples have been extracted from different sentences or documents, we can obtain the ordered word set {A, B, C}.</p>
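      <p>The formula and the worked example above can be transcribed directly into code. The following is a minimal sketch, assuming the standard CSM numerator (ad - bc); the function name is ours, and the vectors are the ones from the example:</p>

```python
from math import sqrt

def csm(vi, vj):
    # Parameters of the Complementary Similarity Measure for binary vectors.
    a = sum(x * y for x, y in zip(vi, vj))              # dimensions where both are 1
    b = sum(x * (1 - y) for x, y in zip(vi, vj))        # vi is 1, vj is 0
    c = sum((1 - x) * y for x, y in zip(vi, vj))        # vi is 0, vj is 1
    d = sum((1 - x) * (1 - y) for x, y in zip(vi, vj))  # dimensions where both are 0
    return (a * d - b * c) / sqrt((a + c) * (b + d))

vi = [int(ch) for ch in "1110010111"]
vj = [int(ch) for ch in "1000110110"]
# For these vectors the parameters come out as in the example:
# a = 4, b = 3, c = 1, d = 2.
```

      <p>Running the sketch on the example vectors reproduces the parameter counts a = 4, b = 3, c = 1, d = 2 given above.</p>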
      <p>Suppose we have tuples &lt;A, B&gt;, &lt;B, C&gt;, &lt;Z, B&gt;, &lt;C, D&gt;, &lt;C, E&gt;, and &lt;C, F&gt;, word pairs whose CSM values are greater than the threshold (TH), listed in descending order of their values, and let &lt;B, C&gt; give the initial word set {B, C}. We create a word set as follows.</p>
      <p>1. We find the tuple with the greatest CSM value among the tuples in which the word at the tail of the current word set — for example, C in {B, C} — is the left word, and connect the right word of that tuple to the tail of the current word set. In this example, word “D” is connected to {B, C} because &lt;C, D&gt; has the greatest CSM value among the three tuples &lt;C, D&gt;, &lt;C, E&gt;, and &lt;C, F&gt;, making the current word set {B, C, D}.
2. This process is repeated until no tuple with a CSM value greater than TH can be chosen.
3. We find the tuple with the greatest CSM value among the tuples in which the word at the head of the current word set — for example, B in {B, C, D} — is the right word, and connect the left word of that tuple to the head of the current word set. In this example, word “A” is connected to the head of {B, C, D} because &lt;A, B&gt; has a CSM value greater than that of &lt;Z, B&gt;, making the current word set {A, B, C, D}.
4. This process is repeated until no tuple with a CSM value greater than TH can be chosen.</p>
      <p>In this example, we obtained the word set {A, B, C, D}
beginning with tuple &lt;B, C&gt; as the initial word set {B, C}.
In this way, we construct all word sets by beginning with
each tuple, using tuples whose CSM values are greater than
TH. Then from the word sets obtained, we remove word
sets that are embedded in other word sets.</p>
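      <p>Steps 1–4 can be sketched as a small routine. The CSM values below are hypothetical, chosen only to reproduce the ordering of the example tuples:</p>

```python
def build_word_set(seed, scored_tuples):
    # scored_tuples: dict mapping a directed pair (left, right) to its CSM value;
    # only tuples whose value exceeds the threshold TH are included.
    chain = list(seed)
    while True:  # steps 1-2: grow the tail with the best-scoring right word
        cands = [(v, r) for (l, r), v in scored_tuples.items()
                 if l == chain[-1] and r not in chain]
        if not cands:
            break
        chain.append(max(cands)[1])
    while True:  # steps 3-4: grow the head with the best-scoring left word
        cands = [(v, l) for (l, r), v in scored_tuples.items()
                 if r == chain[0] and l not in chain]
        if not cands:
            break
        chain.insert(0, max(cands)[1])
    return chain

# Hypothetical CSM values ordered as in the example:
# A,B over B,C over Z,B over C,D over C,E over C,F (all above TH).
tuples = {("A", "B"): 0.9, ("B", "C"): 0.8, ("Z", "B"): 0.7,
          ("C", "D"): 0.6, ("C", "E"): 0.5, ("C", "F"): 0.4}
```

      <p>With the initial word set {B, C}, the routine first appends D (the best continuation of C), then prepends A (preferred over Z at the head), yielding {A, B, C, D} as in the example.</p>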
      <p>Setting TH to a low value makes it possible to obtain lengthy word sets. When TH is too low, however, the number of tuples that must be considered becomes overwhelming and the reliability of the measure decreases. Consequently, we set TH experimentally.</p>
      <p>Extracting Word Sets with a Thematic Relation
Finally, we use a thesaurus to extract word sets with a thematic relation: we remove word sets with taxonomical relations from the whole collection of word sets we extracted and keep the remainder as word sets with thematic (or at least non-taxonomical) relations. The heading words in a thesaurus are categorized so as to represent taxonomical relationships. If a word set extracted by the CSM-based method exhibits a taxonomical relation among its words, those words will be classified into one category of the thesaurus; that is, if an extracted word set agrees with the thesaurus, we can conclude that a taxonomical relation exists among its words. We therefore remove word sets with a taxonomical relation by examining the distribution of their words over the categories. The remaining word sets have a non-taxonomical relation — including a thematic relation — among their words. We extract those word sets that do not agree with the thesaurus and identify them as word sets with a thematic relation, that is, thematically related word sets.</p>
      <p>LINGUISTIC DATA
We extract word sets by utilizing inclusive relations between the appearance patterns of words, based on modifiee/modifier relationships in documents. The Japanese language has case-marking particles that indicate the semantic relation between two elements in a dependency relation, which is a kind of modifiee/modifier relationship. For our experiment, we used such particles and extracted the data from the documents we gathered.</p>
      <p>First, we parsed the sentences with KNP.2 From the results, we collected dependency relations matching one of five patterns of case-marking particles, with A, B, P, Q, R, and S standing for nouns (including compound words), V for a verb, and &lt;X&gt; for a case-marking particle with its role in parentheses. Suppose we have the sentence “Chloe ha Mike ga Judy ni bara no hanataba wo okutta to kiita (Chloe heard that Mike had given Judy a rose bouquet).” From this sentence, we can extract the following five dependency relations between words:
bara (rose) &lt;no (of)&gt; hanataba (bouquet)
hanataba (bouquet) &lt;wo (object)&gt; okutta (had given)
Mike &lt;ga (subject)&gt; okutta
Judy &lt;ni (dative)&gt; okutta
Chloe &lt;ha (topic)&gt; kiita (heard)</p>
    </sec>
    <sec id="sec-3">
      <sec id="sec-3-1">
        <p>From this set of dependency relations, we compiled the following three types of experimental data.3</p>
        <p>NN-data, based on co-occurrence between nouns. For each sentence in our document collection, we gathered the nouns followed by any of the five case-marking particles we used and the nouns preceded by &lt;no&gt;, that is, A, B, P, Q, R, and S. For the sentence above, we gather Chloe, Mike, Judy, bara, and hanataba. The number of data items equals the number of sentences in the documents.</p>
        <p>2 KNP is a Japanese parser developed at Kyoto University.</p>
        <p>3 Japanese case-marking particles define not deep semantics but surface syntactic relations between words/phrases; we therefore used not the semantic relations between words but the classifications given by the case-marking particles. Consequently, the method proposed in this paper is applicable to any language for which a syntactic analyzer exists that classifies the relations between elements, such as subject, direct object, and indirect object. For example, from the output of an English parser we could compile the necessary linguistic data, such as Wo-data from collocations between a verb and its direct object, Ga-data from collocations between a verb and its subject, Ni-data from collocations between a verb and its indirect object, and SO-data from collocations between the subject and the object of a verb.</p>
        <p>NV-data, based on dependency relations between nouns and verbs. We gathered the nouns P, Q, R, and S followed by the case-marking particles &lt;wo&gt;, &lt;ga&gt;, &lt;ni&gt;, and &lt;ha&gt;, respectively, for each verb V. We named these Wo-data (20,234 gathered data items), Ga-data (15,924), Ni-data (14,215), and Ha-data (15,896). For the verb okutta in the sentence above, the Wo-data is hanataba, the Ga-data is Mike, and so on. The number of data items equals the number of kinds of verbs.</p>
        <p>SO-data, based on collocations between subjects and objects. For each object P followed by the case-marking particle &lt;wo&gt;, we gathered the subject Q followed by the case-marking particle &lt;ga&gt; that depends on the same verb V. For the example above, we gather the subject Mike for the object hanataba because we have the dependency relations Mike &lt;ga&gt; okutta and hanataba &lt;wo&gt; okutta. The number of data items equals the number of kinds of objects that co-occur with a subject in a sentence and depend on the same verb as the subject (4,437).</p>
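        <p>Under the assumption that the parser output has been reduced to (noun, particle, verb) triples, compiling an NV-data appearance pattern might look like the following sketch; the function name and the tiny triple list (from the example sentence) are illustrative only:</p>

```python
# Dependency triples from the example sentence, as (noun, particle, verb).
deps = {("hanataba", "wo", "okutta"), ("Mike", "ga", "okutta"),
        ("Judy", "ni", "okutta"), ("Chloe", "ha", "kiita")}
verbs = sorted({v for (_, _, v) in deps})  # dimensions of the NV vectors

def nv_vector(noun, particle):
    # Binary appearance pattern of the noun over all verbs, for one particle;
    # e.g., Wo-data uses particle "wo" (verb / direct-object collocations).
    return [1 if (noun, particle, v) in deps else 0 for v in verbs]
```

      <p>On real data the triple set would come from KNP's output over the whole collection, and each such vector would feed directly into the CSM calculation.</p>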
        <p>When we represent the experimental data with binary vectors, each vector corresponds to the appearance pattern of a noun, and the parameters for calculating the CSM value are counted over the dimensions in each setting. Figure 1 illustrates the appearance pattern expressed by the binary vector for each type of data. The number of dimensions equals the number of data items in each experimental data set. For NN-data, each dimension corresponds to a sentence: an element of the vector is 1 if the noun appears in the sentence and 0 if it does not. Similarly, for NV-data, each dimension corresponds to a verb. For SO-data, we represent the appearance pattern of each subject with a binary vector whose dimensions correspond to objects.</p>
        <p>[Figure 1. Appearance patterns as binary vectors: for NN-data, a noun over n sentences (e.g., 0001110100.........10); for NV-data, a noun over n kinds of verbs (e.g., 1001101001.........01); for SO-data, a subject over n kinds of objects (e.g., 0101110000.........10).]</p>
        <sec id="sec-3-1-2">
          <p>Therefore, when we calculate the CSM value between Vector A and Vector B, the parameters a, b, c, and d in the CSM formula explained above correspond to the following counts:
a: the number of dimensions in which both Vector A and Vector B have 1;
b: the number of dimensions in which Vector A has 1 but Vector B has 0;
c: the number of dimensions in which Vector A has 0 but Vector B has 1;
d: the number of dimensions in which both Vector A and Vector B have 0.</p>
          <p>EXPERIMENT
In our experiment, we used domain-specific Japanese documents from the medical domain, gathered from the Web pages of a medical school. The documents totaled 225,402 sentences (10,144 pages, 37 MB). In applying the CSM-based method, we represented the experimental data for medical terms with binary vectors as explained above. We used descriptors from the 2005 Medical Subject Headings (MeSH) thesaurus4 translated into Japanese; 2,557 Japanese terms appear in this experiment. We constructed word sets consisting of these medical terms and selected the word sets comprising three or more terms. Figures 2 and 3 show examples of word sets constructed with the CSM-based method. Note that we obtained word sets comprising Japanese medical terms that appear in the Japanese-language medical documents we used; for explanatory purposes, in the remainder of this paper we use the English terms from the MeSH thesaurus.</p>
          <p>Figure 2 (examples of word sets constructed with the CSM-based method):
data - causation - depression - reduction - platelet count - bone marrow examination
neonate - patent ductus arteriosus - necrotizing enterocolitis
secretion - gastric acid - gastric mucosa - duodenal ulcer
skin - atopic dermatitis - herpes viruses - antiviral drugs
fatigue - uterine muscle - pregnancy toxemia
water - oxygen - hydrogen - hydrogen ion
person - nicotiana - smoke - oxygen deficiencies</p>
          <p>Figure 3 (further examples of word sets constructed with the CSM-based method):
latency period - erythrocyte - hepatic cell
snow - school - gas
variation - death - limb
hospitalist - corneal opacities - triazolam
cross reaction - apoptoses - injuries
research - survey - altered taste - rice
environment - state interest - water - meat - diarrhea
rights - energy generating resources - cordia - education - deforestation</p>
          <p>4 The U.S. National Library of Medicine creates, maintains, and provides the Medical Subject Headings (MeSH®) thesaurus.</p>
          <p>Then, to obtain the thematically related word sets from the word sets extracted by the CSM-based method, we use the MeSH thesaurus. The MeSH headings are organized into 15 categories, and the MeSH trees are hierarchical arrangements of headings with their associated tree numbers, which include information about the category. Note that some headings are classified into more than one category.</p>
          <p>We examined the distribution of terms over the MeSH categories for each word set and extracted the word sets that do not agree with the MeSH thesaurus as word sets with a thematic relation. Table 1 shows the numbers of word sets that agree and disagree with the MeSH thesaurus. As an exceptional case, we obtained the word set “tree - forest - orangutan” from NN-data. “Tree” is classified into the two categories “Organisms (B)” and “Technology and Food and Beverages (J)”; “forest” is classified into “J” and “orangutan” into “B.” In this case, we consider that a relation exists between “forest” and “orangutan” via “tree,” and we treat this word set as being distributed in one category.</p>
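          <p>The agreement test, including the “tree - forest - orangutan” exception, can be sketched as follows. The helper name is ours, and the category assignments are the ones quoted above:</p>

```python
def in_one_category(word_set, categories):
    # categories: dict term -> set of MeSH category codes (a term may have several).
    # A word set counts as distributed in one category if all of its terms can be
    # linked into a single group through shared categories.
    terms = list(word_set)
    linked = {terms[0]}
    cats = set(categories[terms[0]])
    grew = True
    while grew:
        grew = False
        for t in terms:
            if t not in linked and cats.intersection(categories[t]):
                linked.add(t)
                cats.update(categories[t])
                grew = True
    return len(linked) == len(terms)

# Category assignments from the exceptional case above.
mesh = {"tree": {"B", "J"}, "forest": {"J"}, "orangutan": {"B"}}
```

          <p>Here “forest” (J) and “orangutan” (B) share no category directly, but both link to “tree” (B, J), so the whole set is treated as distributed in one category; without “tree” the two terms would disagree with the thesaurus.</p>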
          <p>From Table 1 we find that, for NN-data and NV-data, the ratio of CSM-based word sets that agreed with the MeSH thesaurus was between 7.5% and 29.1%, with Wo-data providing the highest agreement ratio. The apparent reason is that the object case, represented by the case-marking particle &lt;wo&gt;, restricts nouns more stringently than the others do. Comparing the results for NN-data and NV-data, we also find that the word sets extracted from NV-data agreed with the MeSH thesaurus to a greater degree than those extracted from NN-data, suggesting that NV-data yielded more word sets with taxonomical relations among their words than NN-data did.</p>
          <p>SO-data is based on a collocation between subject and object; that is, the word sets obtained comprise subjects, followed by the case-marking particle &lt;ga&gt;, that depend on the same verb as objects followed by the case-marking particle &lt;wo&gt;. For example, given “ningen (person) &lt;ga&gt; hon (book) &lt;wo&gt; yomu (read),” which means “a person reads a book,” and “nezumi (mouse) &lt;ga&gt; hon (book) &lt;wo&gt; kajiru (gnaw),” which means “a mouse gnaws a book,” we estimate the relation between the words ningen and nezumi with CSM. We can therefore expect the information obtained from this data to disagree with a general thesaurus, because we do not limit the verbs on which the subjects and objects depend. Indeed, the word sets we obtained from SO-data agreed little with the MeSH thesaurus.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <p>Table 1. Numbers of word sets that agree and disagree with the MeSH thesaurus:
Data | No. of word sets | No. of agreed word sets (%) | No. of disagreed word sets
NN | 594 | 45 (7.5) | 549
NV(Wo) | 199 | 58 (29.1) | 141
NV(Ga) | 62 | 14 (22.6) | 48
NV(Ni) | 37 | 6 (16.2) | 31
NV(Ha) | 85 | 7 (8.2) | 78</p>
        <p>Figure 4 shows examples of taxonomically related word sets, that is, sets that agree with the MeSH thesaurus because all of their component terms are classified into one category. The symbol in brackets indicates the type of data from which each word set was obtained:
skin - abdomen - cervix - cavitas oris - chest [NN]
cardiovascular disease - coronary artery disease - bronchitis - thrombophlebitides - flatulence - hyperuricemia - lower back pain - ulnar nerve palsies - brain hemorrhage - obstructive jaundice [NV(Wo)]
extrasystole - bronchospasm - acute renal failure - colitides - diabetic coma - pancreatitides [NV(Ga)]
hand - mouth - ear - finger [NV(Ni)]
snake - praying mantis - scorpion [NV(Ha)]
As a result, we obtained the remaining 847 word sets as word sets with a thematic relation, that is, thematically related word sets.</p>
        <p>VERIFICATION
To verify the capability of our word sets to retrieve Web pages, we examined whether they could help limit search results to more informative Web pages, using Google as the search engine. From the word sets with a thematic relation, we used the 294 word sets in which one term is classified into one category and the rest are classified into another. Figure 5 shows examples of such word sets; the underlined terms are the ones in a different category.
ovary - spleen - palpation [NN]
variation - cross reactions - outbreaks - secretion [NV(Wo)]
bleeding - pyrexia - hematuria - consciousness disorder - vertigo - high blood pressure [NV(Ga)]
space flight - insemination - immunity [NV(Ni)]
cough - fetus - bronchiolitis obliterans organizing pneumonia [NV(Ha)]
We used the terms composing each word set as the key words input to the search engine and retrieved Web pages. We created three types of search-term sets from a word set. Suppose the word set is {X1, …, Xn, Y}, where each Xi is classified into one category and Y is classified into another. Type 1 uses all terms except the one classified into a category different from the others: {X1, …, Xn}, removing Y. Type 2 uses all terms except one in the same category as the rest: {X1, …, Xk-1, Xk+1, …, Xn}, removing Xk and Y; in our verification, we removed the term Xk with the highest or the lowest frequency among the Xi. Type 3 uses the terms of Type 2 plus Y, the term in another category: {X1, …, Xk-1, Xk+1, …, Xn, Y}. If we regard Type 2 as the base key words, Type 1 adds one term with the highest or lowest frequency among the terms in the same category; i.e., the additional term Xk has a frequency-related feature and is taxonomically related to the other terms. Type 3 adds one term in a category different from those of the other component terms; i.e., the additional term Y can be regarded as thematically related to the other terms.</p>
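        <p>The three types of search-term sets can be written down directly. The sketch below uses a schematic word set {X1, ..., Xn, Y} with hypothetical strings; the function name is ours:</p>

```python
def query_types(xs, y, k):
    # xs: the terms sharing one category; y: the term from another category;
    # k: index of the X term with the highest or lowest frequency among xs.
    type1 = list(xs)             # all X terms; Y removed
    type2 = xs[:k] + xs[k + 1:]  # X terms without Xk; Y and Xk removed
    type3 = type2 + [y]          # Type 2 plus the cross-category term Y
    return type1, type2, type3

t1, t2, t3 = query_types(["X1", "X2", "X3"], "Y", 1)
```

        <p>Relative to the base Type 2, Type 1 adds back the taxonomically related term Xk, while Type 3 adds the thematically related term Y, which is the comparison made in Figures 6 and 7.</p>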
        <p>The retrieval results are shown in Figures 6 and 7, which include the results for the highest-frequency and the lowest-frequency terms, respectively. The horizontal axis is the number of pages retrieved with Type 2; the vertical axis is the number of pages retrieved with Type 1 or Type 3, in which a term Xk or Y is added to Type 2. Circles show the results with Type 1 and crosses show the results with Type 3. The diagonal line indicates that adding one term to Type 2 does not change the number of Web pages retrieved.</p>
        <p>As shown in Figure 6, most crosses fall well below the line. This indicates that adding a search term related non-taxonomically tends to make a bigger difference than adding a high-frequency, taxonomically related term. That is, adding a non-taxonomically related term to the key words is crucial for retrieving informative pages; such terms are informative in themselves.</p>
        <p>Table 2. Number of cases in which the term in a different category (Type 3) decreased the number of hit pages more than the high-frequency term (Type 1) did:
Data | No. of word sets for verification | No. of cases in which Type 3 defeated Type 1
NN | 175 | 108
NV(Wo) | 43 | 37
NV(Ga) | 23 | 15
NV(Ni) | 13 | 12
NV(Ha) | 26 | 18</p>
        <p>From Table 2 we found that most of the additional high-frequency terms contributed less than the additional non-taxonomically related terms to decreasing the number of Web pages retrieved. This means that, in comparison with the high-frequency terms, which may not be very informative in themselves, the terms in the other category, related non-taxonomically, are effective for retrieving useful Web pages.</p>
      </sec>
      <sec id="sec-3-6">
        <p>Table 3. Number of cases in which the term in a different category (Type 3) decreased the number of hit pages more than the low-frequency term (Type 1) did:
Data | No. of word sets for verification | No. of cases in which Type 3 defeated Type 1
NN | 175 | 61
NV(Wo) | 43 | 18
NV(Ga) | 23 | 7
NV(Ni) | 13 | 6
NV(Ha) | 26 | 13</p>
        <p>Consistently, in Figure 7 most circles fall well below the line. This indicates that adding a taxonomically related term with low frequency tends to make a bigger difference than adding a term with high frequency. Indeed, additional low-frequency terms can be informative even though they are related taxonomically, because they may be rare terms on the Internet. Thus, taxonomically related terms with low frequencies are as quantitatively effective for information retrieval as the non-taxonomically related terms.</p>
        <p>Table 3 shows the number of cases in which the term in a different category decreased the number of hit pages more than the low-frequency term did. Comparing these numbers, we also found that the additional low-frequency term helped reduce the number of Web pages retrieved regardless of the kind of relation it had with the other terms. Thus, low-frequency terms are quantitatively effective when used for retrieval.</p>
        <p>However, if we consider the contents of the results retrieved with Type 1 and Type 3, clear differences emerge. For example, consider “latency period - erythrocyte - hepatic cell,” obtained from SO-data, in Figure 3. “Latency period” is classified into a category different from the other terms, and “hepatic cell” has the lowest frequency in this word set. When we used all three terms, we obtained pages related to “malaria” at the top of the results, and the title of the top page was “What is malaria?” in Japanese. With “latency period” and “erythrocyte,” we again obtained the same page at the top, although it was not at the top when we used “erythrocyte” and “hepatic cell,” which have a taxonomical relation.
As shown above, terms that have thematic relations with the other search terms are effective at directing users to informative pages. Quantitatively, high-frequency terms are not effective at reducing the number of pages retrieved; qualitatively, low-frequency terms may not be effective at directing users to informative pages.</p>
        <p>CONCLUSION
We introduced a mechanism that provides key words which can enrich human-computer interaction (HCI), using natural language processing technology and a mathematical measure for calculating the degree of inclusion. We showed what type of word should be added to the current query, i.e., to the keywords input so far, in order to make HCI more creative.</p>
        <p>We extracted related word sets from documents by
employing case-marking particles derived from syntactic
analysis. Then, we verified which kind of related word is
more useful as an additional word for retrieval support.
By comparing the results retrieved with words related only
taxonomically against those retrieved with word sets that
include a word related non-taxonomically to the other words,
we found that an additional term thematically related to the
other terms is effective at retrieving informative pages.
This suggests that words with a thematic relation can be
useful for making HCI more active.</p>
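        <p>The extraction of related word sets via case-marking particles can be sketched as follows: for each particle, co-occurring nouns are grouped by the verb they attach to. This is an illustrative sketch only; the triples below are toy stand-ins, not output of our actual syntactic analyzer.</p>

```python
from collections import defaultdict

# Dependency triples (noun, case-marking particle, verb), as would be
# produced by a Japanese syntactic analyzer; illustrative data only.
triples = [
    ("erythrocyte", "ga", "mature"),
    ("parasite", "ga", "mature"),
    ("erythrocyte", "wo", "destroy"),
    ("hepatic cell", "wo", "destroy"),
]

# For each particle, map each verb to the set of nouns attached to it;
# nouns sharing many verbs under the same particle are candidates for
# a related word set.
cooc = defaultdict(lambda: defaultdict(set))
for noun, particle, verb in triples:
    cooc[particle][verb].add(noun)

for particle, verbs in cooc.items():
    for verb, nouns in verbs.items():
        print(particle, verb, sorted(nouns))
```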
        <p>As for the future directions of this work, one of the most
crucial issues is evaluation. We will evaluate the effectiveness
of our method from human-centered viewpoints, possibly by
human judgement.</p>
        <p>In the future, more advanced natural language processing
technology will allow us to understand the contents of huge
text collections and to develop a system that can expand
users’ ways of thinking.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>WordNet: An electronic lexical database</article-title>
          . Cambridge, Mass.: The MIT Press, (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Geffet</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dagan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>The distributional inclusion hypotheses and lexical entailment</article-title>
          .
          <source>In Proc. ACL</source>
          <year>2005</year>
          , (
          <year>2005</year>
          ),
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Girju</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>Automatic detection of causal relations for question answering</article-title>
          .
          <source>In Proc. ACL Workshop on Multilingual summarization and question answering</source>
          , (
          <year>2003</year>
          ),
          <fpage>76</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Girju</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badulescu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Moldovan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Automatic discovery of part-whole relations</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>32</volume>
          (
          <issue>1</issue>
          ), (
          <year>2006</year>
          ),
          <fpage>83</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hagita</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sawaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Robust recognition of degraded machine-printed characters using complementary similarity measure and error-correction learning</article-title>
          .
          <source>In Proc. SPIE - The International Society for Optical Engineering</source>
          ,
          <volume>2442</volume>
          , (
          <year>1995</year>
          ),
          <fpage>236</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <article-title>Automatic acquisition of hyponyms from large text corpora</article-title>
          ,
          <source>In Proc. Coling</source>
          <volume>92</volume>
          , (
          <year>1992</year>
          ),
          <fpage>539</fpage>
          -
          <lpage>545</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Morris</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Non-classical lexical semantic relations</article-title>
          . Workshop on Computational Lexical Semantics,
          <source>In Proc. Human Language Technology Conference of the NAACL</source>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pantel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pennacchiotti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Espresso: Leveraging generic patterns for automatically harvesting semantic relations</article-title>
          .
          <source>In Proc. ACL</source>
          <year>2006</year>
          , (
          <year>2006</year>
          ),
          <fpage>113</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wisniewski</surname>
            ,
            <given-names>E. J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bassok</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>What makes a man similar to a tie?</article-title>
          <source>Cognitive Psychology</source>
          ,
          <volume>39</volume>
          , (
          <year>1999</year>
          ),
          <fpage>208</fpage>
          -
          <lpage>238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yamamoto</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanzaki</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>Extraction of hierarchies based on inclusion of co-occurring words with frequency information</article-title>
          .
          <source>In Proc. IJCAI2005</source>
          , (
          <year>2005</year>
          ),
          <fpage>1166</fpage>
          -
          <lpage>1172</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>