=Paper=
{{Paper
|id=Vol-323/paper-2
|storemode=property
|title=Thematically Related Words toward Creative Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-323/paper02.pdf
|volume=Vol-323
|dblpUrl=https://dblp.org/rec/conf/iui/YamamotoI08
}}
==Thematically Related Words toward Creative Information Retrieval==
Eiko Yamamoto
Graduate School of Engineering, Kobe University
1-1 Rokkodai-cho, Nada-ku, Kobe, Hyogo, 657-8501, Japan
eiko@mech.kobe-u.ac.jp

Hitoshi Isahara
National Institute of Information and Communications Technology
3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan
isahara@nict.go.jp
ABSTRACT
We introduce a mechanism that provides key words which can increase human-computer interaction in the course of information retrieval, using natural language processing technology and a mathematical measure for calculating the degree of inclusion. We show what type of words should be added to the current query, i.e., the keywords input so far, in order to make human-computer interaction more creative. We extract related word sets from documents by employing case-marking particles derived from syntactic analysis, and then verify which kind of related word is more useful as an additional word for retrieval support.

Author Keywords
Natural Language Processing, retrieval support, related words, thematic relation, taxonomical relation.

ACM Classification Keywords
H.5.2. INFORMATION INTERFACES AND PRESENTATION (e.g., HCI): User Interfaces - Natural Language; H.3.3. INFORMATION STORAGE AND RETRIEVAL: Information Search and Retrieval.

INTRODUCTION
Nowadays, we can access a huge amount of text data available on the web. The increase in data quantity causes a paradigm shift for web retrieval: rhetorically speaking, we can take a walk among the huge text data. The web retrieval supports we need in this novel situation are neither simple query expansion nor a record of previously input keywords; rather, we need interfaces which interact with people in new ways. What is crucial for such an interface is not the construction of the interface, i.e., how each part of the interface is arranged on the screen, but what information is presented to interact with users.

New ideas pop into one's head when strolling in a library, a bookstore, or even around town. We need retrieval supports which enable us to expand such creativity. Making the computer smart enough to automatically extract the "correct" retrieval result is a one-sided way of developing support systems for information retrieval. How a user, seeing the advice provided by the computer, carries out the next retrieval is one of the most important viewpoints for the future intelligent user interface. We need a technology that enables the computer to understand huge text data and makes it possible to expand the users' ways of thinking.

In this paper, we introduce a mechanism that provides key words which can increase human-computer interaction (HCI) during information retrieval, using natural language processing technology and a mathematical measure for calculating the degree of inclusion. Concretely, we show what type of words should be added to the current query, i.e., the keywords input so far, in order to make HCI more creative.

RELATION BETWEEN WORDS
Many researchers in natural language processing have developed methodologies for extracting various relations from corpora. Several methods exist for extracting relations such as "is-a" [6], "part-of" [4], causal [3], and entailment [2] relations. Moreover, methods to learn patterns for extracting relations between words have been presented [4, 8]. Such related words can be used to support retrieval in order to lead users to high-quality information. One simple method is to provide additional key words related to the key words users have input. Here we face a question: what kinds of relations between the previous key words and the additional word are effective for information retrieval?
© 2008 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners.

As for the relations among words, at least two kinds of relations exist: the taxonomical relation and the thematic
relation [9].¹ The former is a relation representing the physical resemblance among objects, such as "cow" and "animal," and is typically a semantic relation; the latter is a non-taxonomical relation among objects through a thematic scene, such as "milk" and "cow" as recollected in the scene "milking a cow," and includes causal and entailment relations. Taxonomically related words are generally used for query expansion, and it is comparatively easy to identify taxonomical relations from linguistic resources such as dictionaries and thesauri. On the other hand, it is difficult to identify thematic relations because they are rarely maintained in linguistic resources.

¹ The taxonomical relation, as provided for example by WordNet [1], corresponds to the "classical" relation of Morris and Hirst [7], and the thematic relation corresponds to their "non-classical" relation.

Most previous research on information retrieval support has tended to focus on improving recall by query expansion. Our aim, however, is to direct users to information that is informative for them by query suggestion. Since users sometimes do not realize their real retrieval intention, we would like them to find their hidden needs via interaction in which the system shows them suggestive terms.

In this paper, we try to extract related word sets from documents in Japanese by employing case-marking particles derived from syntactic analysis. Then, we compare the results retrieved with words related only taxonomically and those retrieved with words that include a word related non-taxonomically to the other words, in order to verify what kind of relation makes human-computer interaction more creative.

WORD SET EXTRACTION METHOD
In order to derive word sets that direct users to obtain information, we applied a method based on the Complementary Similarity Measure (CSM), which can estimate inclusive relations between two vectors [10]. This measure was developed as a means of recognizing degraded machine-printed text [5].

Estimating Inclusive Relation between Words
We first extract word pairs having an inclusive relation of the appearance patterns between the words by calculating the CSM values. An appearance pattern expresses a kind of co-occurrence relation by an n-dimensional binary feature vector, where each dimension corresponds to a co-occurring word, a document, or a sentence. When Vi = (vi1, ..., vin) is a vector for word wi and Vj = (vj1, ..., vjn) is a vector for word wj, CSM(Vi, Vj) is defined by the following formula:

  CSM(Vi, Vj) = (ad - bc) / ((a + b)(c + d)),

where, summing over k = 1, ..., n,

  a = Σ vik · vjk,          b = Σ vik · (1 - vjk),
  c = Σ (1 - vik) · vjk,    d = Σ (1 - vik) · (1 - vjk).

CSM is an asymmetric measure because the denominator is asymmetric. Therefore, CSM(Vi, Vj) usually differs from CSM(Vj, Vi), in which Vi and Vj are exchanged. For example, when Vi is 1110010111 and Vj is 1000110110, the parameters for CSM(Vi, Vj) are a = 4, b = 3, c = 1, and d = 2, and CSM(Vi, Vj) is greater than CSM(Vj, Vi). Owing to this asymmetric feature, we can estimate whether the appearance pattern of wi includes the appearance pattern of wj. If wi is "animal" and wj is "tiger," CSM would estimate that "animal" is a hypernym of "tiger."

Word pairs extracted on the basis of the appearance patterns are expressed by a tuple <wi, wj>, which is a directed pair of words. The tuple <wi, wj> represents that CSM(Vi, Vj) is greater than CSM(Vj, Vi), where Vi and Vj are the binary vectors representing the appearance patterns of wi and wj. We call wi the "left word" and wj the "right word."

Constructing Related Word Sets
We next connect such word pairs with CSM values greater than a certain threshold and construct word sets. If we adopted a simpler mechanism such as co-occurrence frequency, which extracts only the co-occurrence relations between words, two tuples extracted from different sentences could not be merged easily. A feature of our method is that, because we use the CSM to calculate the degree of inclusion of appearance patterns between all combinations of words in the whole collection of texts, we can connect word pairs consistently. That is to say, we can extract not only pairs of related words but also sets of related words. In other words, our CSM-based method draws not only on information within a sentence or a document, but also on information from a wider context. Once we obtain the two tuples <A, B> and <B, C>, even though they have been extracted from different sentences or documents, we can obtain the word set {A, B, C} in order.

Suppose we have six tuples, which are word pairs having CSM values greater than the threshold (TH), listed in the order of those values, and let the tuple <B, C> give an initial word set {B, C}. We create a word set as follows.

1. We find the tuple with the greatest CSM value among the tuples in which the word at the tail of the current word set (for example, C in {B, C}) is a left word, and connect the right word of that tuple to the tail of the current word set. In this example, word "D" is connected to {B, C} because <C, D> has the greatest
CSM value among the three tuples whose left word is C, making the current word set {B, C, D}.

2. This process is repeated until no tuples with a CSM value greater than TH can be chosen.

3. We find the tuple with the greatest CSM value among the tuples in which the word at the head of the current word set (for example, B in {B, C, D}) is the right word, and connect the left word of that tuple to the head of the current word set. In this example, word "A" is connected to the head of {B, C, D} because <A, B> has a CSM value greater than that of the other tuple whose right word is B, making the current word set {A, B, C, D}.

4. This process is repeated until no tuples with a CSM value greater than TH can be chosen.

In this example, we obtained the word set {A, B, C, D} beginning with the tuple <B, C> as the initial word set {B, C}. In this way, we construct all word sets by beginning with each tuple, using the tuples whose CSM values are greater than TH. Then, from the word sets obtained, we remove the word sets that are embedded in other word sets.

If we set TH to a low value, it is possible to obtain lengthy word sets. However, when TH is too low, the number of tuples that must be considered becomes overwhelming and the reliability of the measurement decreases. Consequently, we set TH experimentally.

Extracting Word Sets with Thematic Relation
Finally, we use a thesaurus to extract word sets with a thematic relation. We remove the word sets with taxonomical relations from the whole set of word sets we extracted, and keep the remainder as word sets with thematic (at least non-taxonomical) relations. The heading words in a thesaurus are categorized so as to represent taxonomical relationships. If a word set extracted by the CSM-based method demonstrates a taxonomical relation among its words, the words in the CSM-based word set are classified into one category in the thesaurus; that is, if an extracted word set agrees with the thesaurus, we can conclude that a taxonomical relation exists among the words. Through this approach, we remove the word sets with a taxonomical relation by examining the distribution of words in the categories. The remaining word sets have a non-taxonomical relation, including a thematic relation, among their words. We then extract the word sets that do not agree with the thesaurus, having identified them as word sets with a thematic relation, that is, thematically related word sets.

LINGUISTIC DATA
We extract word sets by utilizing inclusive relations of the appearance patterns between words, based on a modifiee/modifier relationship in documents. The Japanese language has case-marking particles that indicate the semantic relation between two elements in a dependency relation, which is a kind of modifiee/modifier relationship. For our experiment, we used such particles and extracted the data from the documents we gathered.

First, we parsed sentences with KNP.² From the results, we collected dependency relations matching one of the following five patterns of case-marking particles. With A, B, P, Q, R, and S as nouns (including compound words), V as a verb, and each case-marking particle given with its role in parentheses, the five patterns are as follows:

- A no (genitive) B
- P wo (direct object) V
- Q ga (subject) V
- R ni (indirect object) V
- S ha (topic) V

² KNP is a Japanese parser developed at Kyoto University.

Suppose we have the sentence "Chloe ha Mike ga Judy ni bara no hanataba wo okutta to kiita" ("Chloe heard that Mike had given Judy a rose bouquet."). From this sentence, we can extract five dependency relations between words, as follows:

- bara (rose) no hanataba (bouquet)
- hanataba (bouquet) wo okutta (had presented)
- Mike ga okutta
- Judy ni okutta
- Chloe ha kiita (heard)

From this set of dependency relations, we compiled the following types of experimental data:³

³ Japanese case-marking particles define not deep semantics but rather surface syntactic relations between words/phrases; we therefore utilized not semantic relations between words but classifications by case-marking particles. The method proposed in this paper is thus applicable to other languages for which a syntactic analyzer exists that classifies the relations between elements such as subject, direct object, and indirect object. For example, from the output of an English parser, we could compile the necessary linguistic data, such as Wo-data from collocations between a verb and its direct object, Ga-data from collocations between a verb and its subject, Ni-data from collocations between a verb and its indirect object, and SO-data from collocations between the subject and object of a verb.

- NN-data, based on co-occurrence between nouns. For each sentence in our document collection, we gathered the nouns followed by any of the five case-marking particles we used and the nouns preceded by no; that is, A, B, P, Q, R, and S. For the above sentence, we can gather Chloe,
Mike, Judy, bara, and hanataba. The number of data items equals the number of sentences in the documents.

- NV-data, based on a dependency relation between a noun and a verb. We gathered the nouns P, Q, R, and S followed by the case-marking particles wo, ga, ni, and ha, respectively, for each verb V. We named these Wo-data (with 20,234 gathered data items), Ga-data (15,924), Ni-data (14,215), and Ha-data (15,896). For the verb okutta in the above sentence, the Wo-data is hanataba, the Ga-data is Mike, and so on. The number of data items equals the number of kinds of verbs.

- SO-data, based on a collocation between subject and object. We gathered the subject Q followed by the case-marking particle ga that depends on the same verb V as the object P, for each object followed by the case-marking particle wo. For the above example, we can gather the subject Mike for the object hanataba, because we have the dependency relations Mike ga okutta and hanataba wo okutta. The number of data items equals the number of kinds of objects, where each of them co-occurs with a subject in a sentence and depends on the same verb as the subject (4,437).

When we represent the experimental data with binary vectors, each vector corresponds to the appearance pattern of a noun, and the parameters for calculating the CSM value correspond to the numbers of dimensions in particular situations. Figure 1 illustrates the appearance pattern expressed by the binary vector for each data item. The number of dimensions equals the number of data items for each type of experimental data. For NN-data, each dimension corresponds to a sentence; the element of the vector is 1 if the noun appears in the sentence and 0 if it does not. Similarly, for NV-data, each dimension corresponds to a verb. For SO-data, we represent the appearance pattern for each subject with a binary vector whose dimensions correspond to objects.

  NN-data (n sentences):       noun     0001110100 ... 10
  NV-data (n kinds of verb):   noun     1001101001 ... 01
  SO-data (n kinds of object): subject  0101110000 ... 10

Figure 1. Appearance patterns of a binary vector for a noun in each type of experimental data.

Therefore, if we calculate the CSM value between Vector A and Vector B, each of the parameters a, b, c, and d used in the CSM formula explained above corresponds to one of the following counts:

- the number of dimensions in which both Vector A and Vector B have 1 as the element;
- the number of dimensions in which Vector A has 1 but Vector B has 0;
- the number of dimensions in which Vector B has 1 but Vector A has 0;
- the number of dimensions in which both Vector A and Vector B have 0.

EXPERIMENT
In our experiment, we used domain-specific documents in Japanese from the medical domain, gathered from the Web pages of a medical school. The Japanese documents we used totaled 225,402 sentences (10,144 pages, 37 MB).

In applying the CSM-based method, we represented the experimental data for medical terms with binary vectors as explained above. We used the descriptors in the 2005 Medical Subject Headings (MeSH) thesaurus⁴ and translated them into Japanese. The number of Japanese terms appearing in this experiment is 2,557. We constructed word sets consisting of these medical terms and chose the word sets consisting of three or more terms. Figures 2 and 3 show examples of word sets constructed with the CSM-based method. Note that we obtained word sets comprising Japanese medical terms that appear in the Japanese-language medical documents we used; for explanatory purposes, in the remainder of this paper we use the English terms obtained from the MeSH thesaurus.

  data - causation - depression - reduction - platelet count - bone marrow examination
  neonate - patent ductus arteriosus - necrotizing enterocolitis
  secretion - gastric acid - gastric mucosa - duodenal ulcer
  skin - atopic dermatitis - herpes viruses - antiviral drugs
  fatigue - uterine muscle - pregnancy toxemia
  water - oxygen - hydrogen - hydrogen ion
  person - nicotiana - smoke - oxygen deficiencies

Figure 2. Examples of word sets extracted from NN-data.

⁴ The U.S. National Library of Medicine created, maintains, and provides the Medical Subject Headings (MeSH®) thesaurus.
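As an illustration, the appearance-pattern parameters and the CSM computation described above can be sketched as follows. This is hypothetical code, not the authors' implementation; it reuses the worked example Vi = 1110010111 and Vj = 1000110110 given earlier in the paper.

```python
# Sketch of the CSM computation over binary appearance-pattern vectors.
# Not the authors' code; names are illustrative.

def csm_params(vi, vj):
    """Return (a, b, c, d): co-presence, Vi-only, Vj-only, co-absence counts."""
    a = sum(1 for x, y in zip(vi, vj) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(vi, vj) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(vi, vj) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(vi, vj) if x == 0 and y == 0)
    return a, b, c, d

def csm(vi, vj):
    """CSM(Vi, Vj) = (ad - bc) / ((a + b)(c + d)); asymmetric in Vi and Vj."""
    a, b, c, d = csm_params(vi, vj)
    return (a * d - b * c) / ((a + b) * (c + d))

# Worked example from the paper: Vi = 1110010111, Vj = 1000110110.
vi = [int(ch) for ch in "1110010111"]
vj = [int(ch) for ch in "1000110110"]
print(csm_params(vi, vj))         # (4, 3, 1, 2), as stated in the paper
print(csm(vi, vj) > csm(vj, vi))  # True: Vi's pattern tends to include Vj's
```

Note that the numerator ad - bc is symmetric in the two vectors, so the asymmetry that signals the inclusion direction comes entirely from the denominator, as the paper observes.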
  latency period - erythrocyte - hepatic cell
  snow - school - gas
  variation - death - limb
  hospitalist - corneal opacities - triazolam
  cross reaction - apoptoses - injuries
  research - survey - altered taste - rice
  environment - state interest - water - meat - diarrhea
  rights - energy generating resources - cordia - education - deforestation

Figure 3. Examples of word sets extracted from SO-data.

Then, to obtain the thematically related word sets from the word sets extracted by the CSM-based method, we use the MeSH thesaurus. The MeSH headings are organized into 15 categories, and the MeSH trees are hierarchical arrangements of headings with their associated tree numbers, which include information about the category. Notice that some headings are classified into more than one category.

We examined the distribution of terms in the MeSH categories for each word set and extracted the word sets that do not agree with the MeSH thesaurus as word sets with a thematic relation. Table 1 shows the numbers of word sets that agree and that disagree with the MeSH thesaurus. By way of exception, for example, we obtained the word set "tree - forest - orangutan" from NN-data. "Tree" is classified into the two categories "Organisms (B)" and "Technology and Food and Beverages (J)"; "forest" is classified into "J" and "orangutan" into "B." In this case, we consider that a relation exists between "forest" and "orangutan" via "tree," and we treat this word set as being distributed in one category.

  Type of data                    NN     NV(Wo)  NV(Ga)  NV(Ni)  NV(Ha)
  No. of word sets                594    199     62      37      85
  No. of agreed word sets (%)     45     58      14      6       7
                                  (7.5)  (29.1)  (22.6)  (16.2)  (8.2)
  No. of disagreed word sets      549    141     48      31      78

Table 1. The numbers of word sets that agreed/disagreed with the MeSH thesaurus.

In Table 1 we found that, for NN-data and NV-data, the ratio of CSM-based word sets that agreed with the MeSH thesaurus was between 7.5% and 29.1%. Of the CSM-based word sets, Wo-data provided the highest agreement ratio. The apparent reason for this is that the object case represented by the case-marking particle wo restricts nouns more stringently than the others do. Also, comparing the results of NN-data and NV-data, we found that the word sets extracted from NV-data agreed with the MeSH thesaurus to a greater degree than those extracted from NN-data did. This suggests that we obtained more word sets having taxonomical relations among words from NV-data than from NN-data.

SO-data is based on a collocation between subject and object; that is, the word sets obtained comprise subjects followed by the case-marking particle ga that depend on the same verb as the object, for each object followed by the case-marking particle wo. For example, when we have "ningen (person) ga hon (book) wo yomu (read)," which means "a person reads a book," and "nezumi (mouse) ga hon (book) wo kajiru (gnaw)," which means "a mouse gnaws a book," we estimate the relation between the words ningen and nezumi with CSM. Therefore, we can surmise that the information we obtain from this data will not agree with a general thesaurus, because we do not limit the verbs that the subjects and objects depend on. Actually, the word sets we obtained from SO-data agreed little with the MeSH thesaurus.

Figure 4 shows examples of taxonomically related word sets, which agree with the MeSH thesaurus; that is, all the terms composing such a word set are classified into one category. The symbol in brackets represents the type of data from which each word set was obtained.

  skin - abdomen - cervix - cavitas oris - chest [NN]
  cardiovascular disease - coronary artery disease - bronchitis - thrombophlebitides
   - flatulence - hyperuricemia - lower back pain - ulnar nerve palsies
   - brain hemorrhage - obstructive jaundice [NV(Wo)]
  extrasystole - bronchospasm - acute renal failure - colitides - diabetic coma
   - pancreatitides [NV(Ga)]
  hand - mouth - ear - finger [NV(Ni)]
  snake - praying mantis - scorpion [NV(Ha)]

Figure 4. Examples of taxonomically related word sets.

As the result, we obtained the remainder as word sets with a thematic relation, that is, thematically related word sets; there are 847 such word sets.

VERIFICATION
In verifying the capability of our word sets to retrieve Web pages, we examined whether our word sets could help limit the search results to more informative Web pages, with Google as the search engine. To do this, from our obtained word sets with a thematic relation, we used the 294 word sets in which one of the terms is classified into one category and the rest are classified into another category. Figure 5 shows examples of these word sets. The underlined terms indicate the ones in a different category.
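One plausible implementation of the thesaurus-agreement test described above might look as follows; it reads the "tree - forest - orangutan" exception as linking any two terms that share a category, so a word set "agrees" when all of its terms form one connected group. This is hypothetical code, and the category labels are toy stand-ins, not actual MeSH data.

```python
# Sketch (not the authors' code) of the category-distribution check:
# a word set counts as taxonomically related ("agrees" with the thesaurus)
# when its terms can be placed in one category, treating terms that share
# a category as linked.

def agrees_with_thesaurus(words, categories):
    """True if the words form one connected group via shared categories."""
    if not words:
        return True
    # Breadth-first search over words, linking words that share a category.
    seen = {words[0]}
    queue = [words[0]]
    while queue:
        w = queue.pop()
        for other in words:
            if other not in seen and categories[w] & categories[other]:
                seen.add(other)
                queue.append(other)
    return len(seen) == len(words)

cats = {                   # toy category assignments
    "tree": {"B", "J"},    # Organisms (B); Technology and Food and Beverages (J)
    "forest": {"J"},
    "orangutan": {"B"},
    "milk": {"J"},
}
print(agrees_with_thesaurus(["tree", "forest", "orangutan"], cats))  # True
print(agrees_with_thesaurus(["milk", "orangutan"], cats))            # False
```

Under this reading, "tree - forest - orangutan" is treated as distributed in one category, exactly as in the paper's exception, while a set whose terms share no category at all disagrees and is kept as thematically related.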
  ovary - spleen - palpation [NN]
  variation - cross reactions - outbreaks - secretion [NV(Wo)]
  bleeding - pyrexia - hematuria - consciousness disorder - vertigo
   - high blood pressure [NV(Ga)]
  space flight - insemination - immunity [NV(Ni)]
  cough - fetus - bronchiolitis obliterans organizing pneumonia [NV(Ha)]

Figure 5. Examples of word sets used for verification.

We used the terms composing such word sets as the key words to input into the search engine and retrieved Web pages. We created three types of search terms from a word set. Suppose the word set is {X1, ..., Xn, Y}, where each Xi is classified into one category and Y is classified into another. The first type uses all terms except the one classified into a category different from the others, i.e., {X1, ..., Xn}, removing Y. The second type uses all terms except Y and one term in the same category as the rest, i.e., {X1, ..., Xk-1, Xk+1, ..., Xn}, removing Xk and Y; in our verification, we removed the term Xk with the highest or the lowest frequency among the Xi. The third type uses the terms in Type 2 plus Y, the term in another category, i.e., {X1, ..., Xk-1, Xk+1, ..., Xn, Y}. When we consider Type 2 as the base key words, Type 1 is a set of key words with the addition of one term having the highest or lowest frequency among the terms in the same category; i.e., the additional term Xk has a feature related to frequency and is taxonomically related to the other terms. Type 3 is a set of key words with the addition of one term in a category different from those of the other component terms; i.e., the additional term Y is presumably thematically related to the other terms.

The retrieval results are shown in Figures 6 and 7, for the highest-frequency and the lowest-frequency cases, respectively. The horizontal axis is the number of pages retrieved with Type 2, and the vertical axis is the number of pages retrieved with Type 1 or Type 3, i.e., when a certain term Xk or Y is added to Type 2. The circles show the results with Type 1 and the crosses show the results with Type 3. The diagonal line in each graph indicates that adding one term to Type 2 does not affect the number of Web pages retrieved.

[Figure 6: log-log scatter plot. Horizontal axis: number of Web pages retrieved with Type 2 (base key words), 1 to 10^9; vertical axis: number of Web pages retrieved when a term is added to Type 2, 1 to 10^8.]

Figure 6. Fluctuation of the number of Web pages retrieved by adding the high-frequency term in the same category (Type 1) and the term in a different category (Type 3).

  Type of data                                  NN   NV(Wo)  NV(Ga)  NV(Ni)  NV(Ha)
  No. of word sets for verification             175  43      23      13      26
  No. of cases in which Type 3 defeated
  Type 1 in retrieval                           108  37      15      12      18

Table 2. The number of cases in which the term in a different category decreased the number of Web pages retrieved more than the high-frequency term did.

[Figure 7: the same style of log-log scatter plot for the low-frequency case; horizontal axis 1 to 10^10, vertical axis 1 to 10^8.]

Figure 7. Fluctuation of the number of Web pages retrieved by adding the low-frequency term in the same category (Type 1) and the term in a different category (Type 3).

As shown in Figure 6, most crosses fall further below the line. This graph indicates that adding a search term related non-taxonomically tends to make a bigger difference than adding a high-frequency term related taxonomically. This means that adding a term related non-taxonomically to the key words is crucial for retrieving informative pages; i.e., such terms are informative terms in themselves. Table 2 shows the number of cases in which the term in a different category decreased the number of hit pages more than the high-frequency term did. Here we found that most of the additional terms with high frequency contributed less than the additional terms related non-taxonomically.
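The construction of the three query types used in this verification can be sketched as follows. This is hypothetical code, and the terms and frequency counts are illustrative, not values taken from the experiment.

```python
# Sketch (not the authors' code) of the three query types built from a
# word set {X1, ..., Xn, Y}, where the Xi share a thesaurus category and
# Y belongs to a different one.

def make_query_types(xs, y, freq, use_highest=True):
    """Return (type1, type2, type3) key-word lists; Xk is the removed term."""
    # Xk: the same-category term with the highest (or lowest) frequency.
    if use_highest:
        xk = max(xs, key=lambda w: freq[w])
    else:
        xk = min(xs, key=lambda w: freq[w])
    rest = [w for w in xs if w != xk]
    type1 = rest + [xk]   # Type 2 plus the same-category term Xk
    type2 = rest          # base key words
    type3 = rest + [y]    # Type 2 plus the different-category term Y
    return type1, type2, type3

freq = {"bleeding": 900, "pyrexia": 700, "hematuria": 50}   # toy counts
t1, t2, t3 = make_query_types(["bleeding", "pyrexia", "hematuria"], "vertigo", freq)
print(t2)  # ['pyrexia', 'hematuria']             -> base key words
print(t3)  # ['pyrexia', 'hematuria', 'vertigo']  -> base plus the thematic term
```

Comparing the hit counts of Type 1 against Type 3 on a search engine, with Type 2 as the baseline, is then exactly the comparison reported in Tables 2 and 3.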
That is, the non-taxonomically related terms did more to decrease the number of Web pages retrieved. This means that, in comparison to the high-frequency terms, which might not be so informative in themselves, the terms in the other category, related non-taxonomically, are effective for retrieving useful Web pages.

In contrast, in Figure 7, most circles fall further below the line. This indicates that adding a low-frequency term related taxonomically tends to make a bigger difference than adding a high-frequency term does. Certainly, additional terms with low frequency would be informative terms, even though they are related taxonomically, because they may be rare terms on the Internet. Thus, the taxonomically related terms with low frequencies are as quantitatively effective for information retrieval as the non-taxonomically related terms.

Table 3 shows the number of cases in which the term in a different category decreased the number of hit pages more than the low-frequency term did. In comparing these numbers, we also found that the additional term with low frequency helped to reduce the number of Web pages retrieved, regardless of the kind of relation the term had with the other terms. Thus, the terms with low frequencies are quantitatively effective when used for retrieval.

  Type of data                                  NN   NV(Wo)  NV(Ga)  NV(Ni)  NV(Ha)
  No. of word sets for verification             175  43      23      13      26
  No. of cases in which Type 3 defeated
  Type 1 in retrieval                           61   18      7       6       13

Table 3. The number of cases in which the term in a different category decreased the number of Web pages retrieved more than the low-frequency term did.

However, if we consider the contents of the results retrieved with Type 1 and Type 3, it is clear that big differences exist between them. For example, consider "latency period - erythrocyte - hepatic cell," obtained from SO-data, in Figure 3. "Latency period" is classified into a category different from the other terms, and "hepatic cell" has the lowest frequency in this word set. When we used all three terms, we obtained pages related to "malaria" at the top of the results, and the title of the top page was "What is malaria?" in Japanese. With "latency period" and "erythrocyte," we again obtained the same page at the top, although it was not at the top when we used "erythrocyte" and "hepatic cell," which have a taxonomical relation.

As we showed above, terms with thematic relations to the other search terms are effective at directing users to informative pages. Quantitatively, terms with a high frequency are not effective at reducing the number of pages retrieved; qualitatively, low-frequency terms may not be effective at directing users to informative pages.

CONCLUSION
We introduced a mechanism that provides key words which can increase human-computer interaction (HCI), using natural language processing technology and a mathematical measure for calculating the degree of inclusion. We showed what type of words should be added to the current query, i.e., the keywords input so far, in order to make HCI more creative.

We extracted related word sets from documents by employing case-marking particles derived from syntactic analysis. Then, we verified which kind of related word is more useful as an additional word for retrieval support. That is, by comparing the results retrieved with words related only taxonomically and those retrieved with words that included a word related non-taxonomically to the other words, we found that an additional term which is thematically related to the other terms is effective at retrieving informative pages. This suggests that words with a thematic relation can be useful for making HCI more active.

As for the future directions of this work, one of the most crucial issues is evaluation. We will evaluate the effectiveness of our method from human-centered viewpoints, possibly by human judgement.

In the future, we hope to understand the contents of huge text data with more advanced natural language processing technology and to develop a system which makes it possible to expand the users' ways of thinking.

REFERENCES
1. Fellbaum, C. WordNet: An electronic lexical database. Cambridge, Mass.: The MIT Press, (1998).
2. Geffet, M. and Dagan, I. The distributional inclusion hypotheses and lexical entailment. In Proc. ACL 2005, (2005), 107-114.
3. Girju, R. Automatic detection of causal relations for question answering. In Proc. ACL Workshop on Multilingual summarization and question answering, (2003), 76-114.
4. Girju, R., Badulescu, A., and Moldovan, D. Automatic discovery of part-whole relations. Computational Linguistics, 32(1), (2006), 83-135.
5. Hagita, N. and Sawaki, M. Robust recognition of degraded machine-printed characters using complementary similarity measure and error-correction learning. In Proc. SPIE - The International Society for Optical Engineering, 2442, (1995), 236-244.
6. Hearst, M. A. Automatic acquisition of hyponyms from large text corpora. In Proc. Coling 92, (1992), 539-545.
7. Morris, J. and Hirst, G. Non-classical lexical semantic relations. In Proc. Workshop on Computational Lexical Semantics, Human Language Technology Conference of the NAACL, (2004).
8. Pantel, P. and Pennacchiotti, M. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proc. ACL 2006, (2006), 113-120.
9. Wisniewski, E. J. and Bassok, M. What makes a man similar to a tie? Cognitive Psychology, 39, (1999), 208-238.
10. Yamamoto, E., Kanzaki, K., and Isahara, H. Extraction of hierarchies based on inclusion of co-occurring words with frequency information. In Proc. IJCAI 2005, (2005), 1166-1172.