=Paper=
{{Paper
|id=Vol-323/paper-2
|storemode=property
|title=Thematically Related Words toward Creative Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-323/paper02.pdf
|volume=Vol-323
|dblpUrl=https://dblp.org/rec/conf/iui/YamamotoI08
}}
==Thematically Related Words toward Creative Information Retrieval==
Eiko Yamamoto
Graduate School of Engineering, Kobe University
1-1 Rokkodai-cho, Nada-ku, Kobe, Hyogo, 657-8501, Japan
eiko@mech.kobe-u.ac.jp

Hitoshi Isahara
National Institute of Information and Communications Technology
3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan
isahara@nict.go.jp
ABSTRACT
We introduce a mechanism that provides key words which can increase human-computer interaction in the course of information retrieval, using natural language processing technology and a mathematical measure for calculating the degree of inclusion. We show what type of words should be added to the current query, i.e., the keywords input so far, in order to make human-computer interaction more creative. We extract related word sets from documents by employing case-marking particles derived from syntactic analysis, and then verify which kind of related word is more useful as an additional word for retrieval support.

Author Keywords
Natural Language Processing, retrieval support, related words, thematic relation, taxonomical relation.

ACM Classification Keywords
H.5.2. INFORMATION INTERFACES AND PRESENTATION (e.g., HCI): User Interfaces - Natural Language; H.3.3. INFORMATION STORAGE AND RETRIEVAL: Information Search and Retrieval.

INTRODUCTION
Nowadays, we can access a huge amount of text data available on the web. The increase in data quantity causes a paradigm shift for web retrieval: rhetorically speaking, we can take a walk among the huge text data. The web retrieval supports we need in this novel situation are neither simple query expansion nor a record of previously input keywords; rather, we need interfaces which interact with people in new ways. What is crucial for such an interface is not the construction of the interface, i.e., how each part of the interface is arranged on the screen, but what information is presented to interact with users.

New ideas pop into one's head when strolling in a library, a bookstore, or even around town. We need retrieval supports which enable us to expand such creativity. Making the computer smart enough to automatically extract the "correct" retrieval result is a one-sided way of developing support systems for information retrieval. How a user, seeing the advice provided by the computer, carries out the next retrieval is one of the most important viewpoints for the future intelligent user interface. We need a technology that enables the computer to understand huge text data and makes it possible to expand the users' ways of thinking.

In this paper, we introduce a mechanism that provides key words which can increase human-computer interaction (HCI) during information retrieval, using natural language processing technology and a mathematical measure for calculating the degree of inclusion. Concretely, we show what type of words should be added to the current query, i.e., the keywords input so far, in order to make HCI more creative.

RELATION BETWEEN WORDS
Many researchers in natural language processing have developed methodologies for extracting various relations from corpora. Several methods exist for extracting relations such as "is-a" [6], "part-of" [4], causal [3], and entailment [2] relations. Moreover, methods to learn patterns for extracting relations between words have been presented [4, 8]. Such related words can be used to support retrieval in order to lead users to high-quality information. One simple method is to provide additional key words related to the key words users have input. Here we face a question: what kinds of relations between the previous key words and the additional word are effective for information retrieval?
© 2008 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners.

As for the relations among words, at least two kinds of relations exist: the taxonomical relation and the thematic
relation [9].¹ The former is a relation representing the physical resemblance among objects, such as "cow" and "animal," and is typically a semantic relation; the latter is a non-taxonomical relation among objects through a thematic scene, such as "milk" and "cow" as recollected in the scene "milking a cow," and includes causal and entailment relations. Taxonomically related words are generally used for query expansion, and it is comparatively easy to identify taxonomical relations from linguistic resources such as dictionaries and thesauri. On the other hand, it is difficult to identify thematic relations because they are rarely maintained in linguistic resources.

¹ The taxonomical relation, as provided for example by WordNet [1], corresponds to the "classical" relation of Morris and Hirst [7], and the thematic relation corresponds to their "non-classical" relation.

Most previous research on information retrieval support has tended to focus on improving recall by query expansion. Our aim, however, is to direct users to information that is informative for them by query suggestion. Since users sometimes do not realize their real retrieval intention, we would like them to find their hidden needs via interaction in which the system shows them suggestive terms.

In this paper, we try to extract related word sets from documents in Japanese by employing case-marking particles derived from syntactic analysis. Then, we compare the results retrieved with words related only taxonomically and those retrieved with words that include a word related non-taxonomically to the other words, in order to verify what kind of relation makes human-computer interaction more creative.

WORD SET EXTRACTION METHOD
In order to derive word sets that direct users to obtain information, we applied a method based on the Complementary Similarity Measure (CSM), which can estimate inclusive relations between two vectors [10]. This measure was developed as a means of recognizing degraded machine-printed text [5].

Estimating Inclusive Relation between Words
We first extract word pairs having an inclusive relation of the appearance patterns between the words by calculating the CSM values. An appearance pattern expresses a kind of co-occurrence relation by an n-dimensional binary feature vector, where each dimension corresponds to a co-occurring word, a document, or a sentence. When Vi = (vi1, ..., vin) is a vector for word wi and Vj = (vj1, ..., vjn) is a vector for word wj, CSM(Vi, Vj) is defined by the following formula:

  CSM(Vi, Vj) = (ad - bc) / ((a + b)(c + d)),

where, summing over k = 1, ..., n,

  a = Σ vik · vjk,          b = Σ vik · (1 - vjk),
  c = Σ (1 - vik) · vjk,    d = Σ (1 - vik) · (1 - vjk).

CSM is an asymmetric measure because the denominator is asymmetric. Therefore, CSM(Vi, Vj) usually differs from CSM(Vj, Vi), in which Vi and Vj are exchanged. For example, when Vi is 1110010111 and Vj is 1000110110, the parameters for CSM(Vi, Vj) are a = 4, b = 3, c = 1, and d = 2, and CSM(Vi, Vj) is greater than CSM(Vj, Vi). Owing to this asymmetric feature, we can estimate whether the appearance pattern of wi includes the appearance pattern of wj. If wi is "animal" and wj is "tiger," CSM would estimate that "animal" is a hypernym of "tiger."

Word pairs extracted on the basis of the appearance patterns are expressed by a tuple <wi, wj>, which is a directed pair of words. The tuple <wi, wj> represents that CSM(Vi, Vj) is greater than CSM(Vj, Vi), where Vi and Vj are the binary vectors representing the appearance patterns of wi and wj. We call wi the "left word" and wj the "right word."

Constructing Related Word Sets
We next connect such word pairs with CSM values greater than a certain threshold and construct word sets. If we adopted a simpler mechanism such as co-occurrence frequency, which extracts only the co-occurrence relations between words, two tuples extracted from different sentences could not be merged easily. A feature of our method is that, because we use the CSM to calculate the degree of inclusion of appearance patterns between all combinations of words in the whole collection of texts, we can connect word pairs consistently. That is to say, we can extract not only pairs of related words but also sets of related words. In other words, our CSM-based method draws not only on information within a sentence or a document, but also on information from a wider context. Once we obtain the two tuples <A, B> and <B, C>, even though they have been extracted from different sentences or documents, we can obtain the word set {A, B, C} in order.

Suppose we have six tuples, which are word pairs having CSM values greater than the threshold (TH), listed in the order of those values, and let the tuple <B, C> give an initial word set {B, C}. We create a word set as follows.

1. We find the tuple with the greatest CSM value among the tuples in which the word at the tail of the current word set (for example, C in {B, C}) is a left word, and connect the right word of that tuple to the tail of the current word set. In this example, word "D" is connected to {B, C} because <C, D> has the greatest
CSM value among the three tuples whose left word is C, making the current word set {B, C, D}.

2. This process is repeated until no tuples with a CSM value greater than TH can be chosen.

3. We find the tuple with the greatest CSM value among the tuples in which the word at the head of the current word set (for example, B in {B, C, D}) is the right word, and connect the left word of that tuple to the head of the current word set. In this example, word "A" is connected to the head of {B, C, D} because <A, B> has a CSM value greater than that of the other tuple whose right word is B, making the current word set {A, B, C, D}.

4. This process is repeated until no tuples with a CSM value greater than TH can be chosen.

In this example, we obtained the word set {A, B, C, D} beginning with the tuple <B, C> as the initial word set {B, C}. In this way, we construct all word sets by beginning with each tuple, using the tuples whose CSM values are greater than TH. Then, from the word sets obtained, we remove the word sets that are embedded in other word sets.

If we set TH to a low value, it is possible to obtain lengthy word sets. However, when TH is too low, the number of tuples that must be considered becomes overwhelming and the reliability of the measurement decreases. Consequently, we set TH experimentally.

Extracting Word Sets with Thematic Relation
Finally, we use a thesaurus to extract word sets with a thematic relation. We remove the word sets with taxonomical relations from the whole set of word sets we extracted, and keep the remainder as word sets with thematic (at least non-taxonomical) relations. The heading words in a thesaurus are categorized so as to represent taxonomical relationships. If a word set extracted by the CSM-based method demonstrates a taxonomical relation among its words, the words in the CSM-based word set are classified into one category in the thesaurus; that is, if an extracted word set agrees with the thesaurus, we can conclude that a taxonomical relation exists among the words. Through this approach, we remove the word sets with a taxonomical relation by examining the distribution of words in the categories. The remaining word sets have a non-taxonomical relation, including a thematic relation, among their words. We then extract the word sets that do not agree with the thesaurus, having identified them as word sets with a thematic relation, that is, thematically related word sets.

LINGUISTIC DATA
We extract word sets by utilizing inclusive relations of the appearance patterns between words, based on a modifiee/modifier relationship in documents. The Japanese language has case-marking particles that indicate the semantic relation between two elements in a dependency relation, which is a kind of modifiee/modifier relationship. For our experiment, we used such particles and extracted the data from the documents we gathered.

First, we parsed sentences with KNP.² From the results, we collected dependency relations matching one of the following five patterns of case-marking particles. With A, B, P, Q, R, and S as nouns (including compound words), V as a verb, and each case-marking particle given with its role in parentheses, the five patterns are as follows:

- A no (genitive) B
- P wo (direct object) V
- Q ga (subject) V
- R ni (indirect object) V
- S ha (topic) V

² KNP is a Japanese parser developed at Kyoto University.

Suppose we have the sentence "Chloe ha Mike ga Judy ni bara no hanataba wo okutta to kiita" ("Chloe heard that Mike had given Judy a rose bouquet."). From this sentence, we can extract five dependency relations between words, as follows:

- bara (rose) no hanataba (bouquet)
- hanataba (bouquet) wo okutta (had presented)
- Mike ga okutta
- Judy ni okutta
- Chloe ha kiita (heard)

From this set of dependency relations, we compiled the following types of experimental data:³

³ Japanese case-marking particles define not deep semantics but rather surface syntactic relations between words/phrases; we therefore utilized not semantic relations between words but classifications by case-marking particles. The method proposed in this paper is thus applicable to other languages for which a syntactic analyzer exists that classifies the relations between elements such as subject, direct object, and indirect object. For example, from the output of an English parser, we could compile the necessary linguistic data, such as Wo-data from collocations between a verb and its direct object, Ga-data from collocations between a verb and its subject, Ni-data from collocations between a verb and its indirect object, and SO-data from collocations between the subject and object of a verb.

- NN-data, based on co-occurrence between nouns. For each sentence in our document collection, we gathered the nouns followed by any of the five case-marking particles we used and the nouns preceded by no; that is, A, B, P, Q, R, and S. For the above sentence, we can gather Chloe,
Mike, Judy, bara, and hanataba. The number of data items equals the number of sentences in the documents.

- NV-data, based on a dependency relation between a noun and a verb. We gathered the nouns P, Q, R, and S followed by the case-marking particles wo, ga, ni, and ha, respectively, for each verb V. We named these Wo-data (with 20,234 gathered data items), Ga-data (15,924), Ni-data (14,215), and Ha-data (15,896). For the verb okutta in the above sentence, the Wo-data is hanataba, the Ga-data is Mike, and so on. The number of data items equals the number of kinds of verbs.

- SO-data, based on a collocation between subject and object. We gathered the subject Q followed by the case-marking particle ga that depends on the same verb V as the object P, for each object followed by the case-marking particle wo. For the above example, we can gather the subject Mike for the object hanataba, because we have the dependency relations Mike ga okutta and hanataba wo okutta. The number of data items equals the number of kinds of objects, where each of them co-occurs with a subject in a sentence and depends on the same verb as the subject (4,437).

When we represent the experimental data with binary vectors, each vector corresponds to the appearance pattern of a noun, and the parameters for calculating the CSM value correspond to the numbers of dimensions in particular situations. Figure 1 illustrates the appearance pattern expressed by the binary vector for each data item. The number of dimensions equals the number of data items for each type of experimental data. For NN-data, each dimension corresponds to a sentence; the element of the vector is 1 if the noun appears in the sentence and 0 if it does not. Similarly, for NV-data, each dimension corresponds to a verb. For SO-data, we represent the appearance pattern for each subject with a binary vector whose dimensions correspond to objects.

  NN-data (n sentences):       noun     0001110100 ... 10
  NV-data (n kinds of verb):   noun     1001101001 ... 01
  SO-data (n kinds of object): subject  0101110000 ... 10

Figure 1. Appearance patterns of a binary vector for a noun in each type of experimental data.

Therefore, if we calculate the CSM value between Vector A and Vector B, each of the parameters a, b, c, and d used in the CSM formula explained above corresponds to one of the following counts:

- the number of dimensions in which both Vector A and Vector B have 1 as the element;
- the number of dimensions in which Vector A has 1 but Vector B has 0;
- the number of dimensions in which Vector B has 1 but Vector A has 0;
- the number of dimensions in which both Vector A and Vector B have 0.

EXPERIMENT
In our experiment, we used domain-specific documents in Japanese from the medical domain, gathered from the Web pages of a medical school. The Japanese documents we used totaled 225,402 sentences (10,144 pages, 37 MB).

In applying the CSM-based method, we represented the experimental data for medical terms with binary vectors as explained above. We used the descriptors in the 2005 Medical Subject Headings (MeSH) thesaurus⁴ and translated them into Japanese. The number of Japanese terms appearing in this experiment is 2,557. We constructed word sets consisting of these medical terms and chose the word sets consisting of three or more terms. Figures 2 and 3 show examples of word sets constructed with the CSM-based method. Note that we obtained word sets comprising Japanese medical terms that appear in the Japanese-language medical documents we used; for explanatory purposes, in the remainder of this paper we use the English terms obtained from the MeSH thesaurus.

  data - causation - depression - reduction - platelet count - bone marrow examination
  neonate - patent ductus arteriosus - necrotizing enterocolitis
  secretion - gastric acid - gastric mucosa - duodenal ulcer
  skin - atopic dermatitis - herpes viruses - antiviral drugs
  fatigue - uterine muscle - pregnancy toxemia
  water - oxygen - hydrogen - hydrogen ion
  person - nicotiana - smoke - oxygen deficiencies

Figure 2. Examples of word sets extracted from NN-data.

⁴ The U.S. National Library of Medicine created, maintains, and provides the Medical Subject Headings (MeSH®) thesaurus.
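As an illustration, the appearance-pattern parameters and the CSM computation described above can be sketched as follows. This is hypothetical code, not the authors' implementation; it reuses the worked example Vi = 1110010111 and Vj = 1000110110 given earlier in the paper.

```python
# Sketch of the CSM computation over binary appearance-pattern vectors.
# Not the authors' code; names are illustrative.

def csm_params(vi, vj):
    """Return (a, b, c, d): co-presence, Vi-only, Vj-only, co-absence counts."""
    a = sum(1 for x, y in zip(vi, vj) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(vi, vj) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(vi, vj) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(vi, vj) if x == 0 and y == 0)
    return a, b, c, d

def csm(vi, vj):
    """CSM(Vi, Vj) = (ad - bc) / ((a + b)(c + d)); asymmetric in Vi and Vj."""
    a, b, c, d = csm_params(vi, vj)
    return (a * d - b * c) / ((a + b) * (c + d))

# Worked example from the paper: Vi = 1110010111, Vj = 1000110110.
vi = [int(ch) for ch in "1110010111"]
vj = [int(ch) for ch in "1000110110"]
print(csm_params(vi, vj))         # (4, 3, 1, 2), as stated in the paper
print(csm(vi, vj) > csm(vj, vi))  # True: Vi's pattern tends to include Vj's
```

Note that the numerator ad - bc is symmetric in the two vectors, so the asymmetry that signals the inclusion direction comes entirely from the denominator, as the paper observes.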
  latency period - erythrocyte - hepatic cell
  snow - school - gas
  variation - death - limb
  hospitalist - corneal opacities - triazolam
  cross reaction - apoptoses - injuries
  research - survey - altered taste - rice
  environment - state interest - water - meat - diarrhea
  rights - energy generating resources - cordia - education - deforestation

Figure 3. Examples of word sets extracted from SO-data.

Then, to obtain the thematically related word sets from the word sets extracted by the CSM-based method, we use the MeSH thesaurus. The MeSH headings are organized into 15 categories, and the MeSH trees are hierarchical arrangements of headings with their associated tree numbers, which include information about the category. Notice that some headings are classified into more than one category.

We examined the distribution of terms in the MeSH categories for each word set and extracted the word sets that do not agree with the MeSH thesaurus as word sets with a thematic relation. Table 1 shows the numbers of word sets that agree and that disagree with the MeSH thesaurus. By way of exception, for example, we obtained the word set "tree - forest - orangutan" from NN-data. "Tree" is classified into the two categories "Organisms (B)" and "Technology and Food and Beverages (J)"; "forest" is classified into "J" and "orangutan" into "B." In this case, we consider that a relation exists between "forest" and "orangutan" via "tree," and we treat this word set as being distributed in one category.

  Type of data                    NN     NV(Wo)  NV(Ga)  NV(Ni)  NV(Ha)
  No. of word sets                594    199     62      37      85
  No. of agreed word sets (%)     45     58      14      6       7
                                  (7.5)  (29.1)  (22.6)  (16.2)  (8.2)
  No. of disagreed word sets      549    141     48      31      78

Table 1. The numbers of word sets that agreed/disagreed with the MeSH thesaurus.

In Table 1 we found that, for NN-data and NV-data, the ratio of CSM-based word sets that agreed with the MeSH thesaurus was between 7.5% and 29.1%. Of the CSM-based word sets, Wo-data provided the highest agreement ratio. The apparent reason for this is that the object case represented by the case-marking particle wo restricts nouns more stringently than the others do. Also, comparing the results of NN-data and NV-data, we found that the word sets extracted from NV-data agreed with the MeSH thesaurus to a greater degree than those extracted from NN-data did. This suggests that we obtained more word sets having taxonomical relations among words from NV-data than from NN-data.

SO-data is based on a collocation between subject and object; that is, the word sets obtained comprise subjects followed by the case-marking particle ga that depend on the same verb as the object, for each object followed by the case-marking particle wo. For example, when we have "ningen (person) ga hon (book) wo yomu (read)," which means "a person reads a book," and "nezumi (mouse) ga hon (book) wo kajiru (gnaw)," which means "a mouse gnaws a book," we estimate the relation between the words ningen and nezumi with CSM. Therefore, we can surmise that the information we obtain from this data will not agree with a general thesaurus, because we do not limit the verbs that the subjects and objects depend on. Actually, the word sets we obtained from SO-data agreed little with the MeSH thesaurus.

Figure 4 shows examples of taxonomically related word sets, which agree with the MeSH thesaurus; that is, all the terms composing such a word set are classified into one category. The symbol in brackets represents the type of data from which each word set was obtained.

  skin - abdomen - cervix - cavitas oris - chest [NN]
  cardiovascular disease - coronary artery disease - bronchitis - thrombophlebitides
   - flatulence - hyperuricemia - lower back pain - ulnar nerve palsies
   - brain hemorrhage - obstructive jaundice [NV(Wo)]
  extrasystole - bronchospasm - acute renal failure - colitides - diabetic coma
   - pancreatitides [NV(Ga)]
  hand - mouth - ear - finger [NV(Ni)]
  snake - praying mantis - scorpion [NV(Ha)]

Figure 4. Examples of taxonomically related word sets.

As the result, we obtained the remainder as word sets with a thematic relation, that is, thematically related word sets; there are 847 such word sets.

VERIFICATION
In verifying the capability of our word sets to retrieve Web pages, we examined whether our word sets could help limit the search results to more informative Web pages, with Google as the search engine. To do this, from our obtained word sets with a thematic relation, we used the 294 word sets in which one of the terms is classified into one category and the rest are classified into another category. Figure 5 shows examples of these word sets. The underlined terms indicate the ones in a different category.
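One plausible implementation of the thesaurus-agreement test described above might look as follows; it reads the "tree - forest - orangutan" exception as linking any two terms that share a category, so a word set "agrees" when all of its terms form one connected group. This is hypothetical code, and the category labels are toy stand-ins, not actual MeSH data.

```python
# Sketch (not the authors' code) of the category-distribution check:
# a word set counts as taxonomically related ("agrees" with the thesaurus)
# when its terms can be placed in one category, treating terms that share
# a category as linked.

def agrees_with_thesaurus(words, categories):
    """True if the words form one connected group via shared categories."""
    if not words:
        return True
    # Breadth-first search over words, linking words that share a category.
    seen = {words[0]}
    queue = [words[0]]
    while queue:
        w = queue.pop()
        for other in words:
            if other not in seen and categories[w] & categories[other]:
                seen.add(other)
                queue.append(other)
    return len(seen) == len(words)

cats = {                   # toy category assignments
    "tree": {"B", "J"},    # Organisms (B); Technology and Food and Beverages (J)
    "forest": {"J"},
    "orangutan": {"B"},
    "milk": {"J"},
}
print(agrees_with_thesaurus(["tree", "forest", "orangutan"], cats))  # True
print(agrees_with_thesaurus(["milk", "orangutan"], cats))            # False
```

Under this reading, "tree - forest - orangutan" is treated as distributed in one category, exactly as in the paper's exception, while a set whose terms share no category at all disagrees and is kept as thematically related.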
  ovary - spleen - palpation [NN]
  variation - cross reactions - outbreaks - secretion [NV(Wo)]
  bleeding - pyrexia - hematuria - consciousness disorder - vertigo
   - high blood pressure [NV(Ga)]
  space flight - insemination - immunity [NV(Ni)]
  cough - fetus - bronchiolitis obliterans organizing pneumonia [NV(Ha)]

Figure 5. Examples of word sets used for verification.

We used the terms composing such word sets as the key words to input into the search engine and retrieved Web pages. We created three types of search terms from a word set. Suppose the word set is {X1, ..., Xn, Y}, where each Xi is classified into one category and Y is classified into another. The first type uses all terms except the one classified into a category different from the others, i.e., {X1, ..., Xn}, removing Y. The second type uses all terms except Y and one term in the same category as the rest, i.e., {X1, ..., Xk-1, Xk+1, ..., Xn}, removing Xk and Y; in our verification, we removed the term Xk with the highest or the lowest frequency among the Xi. The third type uses the terms in Type 2 plus Y, the term in another category, i.e., {X1, ..., Xk-1, Xk+1, ..., Xn, Y}. When we consider Type 2 as the base key words, Type 1 is a set of key words with the addition of one term having the highest or lowest frequency among the terms in the same category; i.e., the additional term Xk has a feature related to frequency and is taxonomically related to the other terms. Type 3 is a set of key words with the addition of one term in a category different from those of the other component terms; i.e., the additional term Y is presumably thematically related to the other terms.

The retrieval results are shown in Figures 6 and 7, for the highest-frequency and the lowest-frequency cases, respectively. The horizontal axis is the number of pages retrieved with Type 2, and the vertical axis is the number of pages retrieved with Type 1 or Type 3, i.e., when a certain term Xk or Y is added to Type 2. The circles show the results with Type 1 and the crosses show the results with Type 3. The diagonal line in each graph indicates that adding one term to Type 2 does not affect the number of Web pages retrieved.

[Figure 6: log-log scatter plot. Horizontal axis: number of Web pages retrieved with Type 2 (base key words), 1 to 10^9; vertical axis: number of Web pages retrieved when a term is added to Type 2, 1 to 10^8.]

Figure 6. Fluctuation of the number of Web pages retrieved by adding the high-frequency term in the same category (Type 1) and the term in a different category (Type 3).

  Type of data                                  NN   NV(Wo)  NV(Ga)  NV(Ni)  NV(Ha)
  No. of word sets for verification             175  43      23      13      26
  No. of cases in which Type 3 defeated
  Type 1 in retrieval                           108  37      15      12      18

Table 2. The number of cases in which the term in a different category decreased the number of Web pages retrieved more than the high-frequency term did.

[Figure 7: the same style of log-log scatter plot for the low-frequency case; horizontal axis 1 to 10^10, vertical axis 1 to 10^8.]

Figure 7. Fluctuation of the number of Web pages retrieved by adding the low-frequency term in the same category (Type 1) and the term in a different category (Type 3).

As shown in Figure 6, most crosses fall further below the line. This graph indicates that adding a search term related non-taxonomically tends to make a bigger difference than adding a high-frequency term related taxonomically. This means that adding a term related non-taxonomically to the key words is crucial for retrieving informative pages; i.e., such terms are informative terms in themselves. Table 2 shows the number of cases in which the term in a different category decreased the number of hit pages more than the high-frequency term did. Here we found that most of the additional terms with high frequency contributed less than the additional terms related non-taxonomically.
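The construction of the three query types used in this verification can be sketched as follows. This is hypothetical code, and the terms and frequency counts are illustrative, not values taken from the experiment.

```python
# Sketch (not the authors' code) of the three query types built from a
# word set {X1, ..., Xn, Y}, where the Xi share a thesaurus category and
# Y belongs to a different one.

def make_query_types(xs, y, freq, use_highest=True):
    """Return (type1, type2, type3) key-word lists; Xk is the removed term."""
    # Xk: the same-category term with the highest (or lowest) frequency.
    if use_highest:
        xk = max(xs, key=lambda w: freq[w])
    else:
        xk = min(xs, key=lambda w: freq[w])
    rest = [w for w in xs if w != xk]
    type1 = rest + [xk]   # Type 2 plus the same-category term Xk
    type2 = rest          # base key words
    type3 = rest + [y]    # Type 2 plus the different-category term Y
    return type1, type2, type3

freq = {"bleeding": 900, "pyrexia": 700, "hematuria": 50}   # toy counts
t1, t2, t3 = make_query_types(["bleeding", "pyrexia", "hematuria"], "vertigo", freq)
print(t2)  # ['pyrexia', 'hematuria']             -> base key words
print(t3)  # ['pyrexia', 'hematuria', 'vertigo']  -> base plus the thematic term
```

Comparing the hit counts of Type 1 against Type 3 on a search engine, with Type 2 as the baseline, is then exactly the comparison reported in Tables 2 and 3.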
That is, the non-taxonomically related terms did more to decrease the number of Web pages retrieved. This means that, in comparison to the high-frequency terms, which might not be so informative in themselves, the terms in the other category, related non-taxonomically, are effective for retrieving useful Web pages.

In contrast, in Figure 7, most circles fall further below the line. This indicates that adding a low-frequency term related taxonomically tends to make a bigger difference than adding a high-frequency term does. Certainly, additional terms with low frequency would be informative terms, even though they are related taxonomically, because they may be rare terms on the Internet. Thus, the taxonomically related terms with low frequencies are as quantitatively effective for information retrieval as the non-taxonomically related terms.

Table 3 shows the number of cases in which the term in a different category decreased the number of hit pages more than the low-frequency term did. In comparing these numbers, we also found that the additional term with low frequency helped to reduce the number of Web pages retrieved, regardless of the kind of relation the term had with the other terms. Thus, the terms with low frequencies are quantitatively effective when used for retrieval.

  Type of data                                  NN   NV(Wo)  NV(Ga)  NV(Ni)  NV(Ha)
  No. of word sets for verification             175  43      23      13      26
  No. of cases in which Type 3 defeated
  Type 1 in retrieval                           61   18      7       6       13

Table 3. The number of cases in which the term in a different category decreased the number of Web pages retrieved more than the low-frequency term did.

However, if we consider the contents of the results retrieved with Type 1 and Type 3, it is clear that big differences exist between them. For example, consider "latency period - erythrocyte - hepatic cell," obtained from SO-data, in Figure 3. "Latency period" is classified into a category different from the other terms, and "hepatic cell" has the lowest frequency in this word set. When we used all three terms, we obtained pages related to "malaria" at the top of the results, and the title of the top page was "What is malaria?" in Japanese. With "latency period" and "erythrocyte," we again obtained the same page at the top, although it was not at the top when we used "erythrocyte" and "hepatic cell," which have a taxonomical relation.

As we showed above, terms with thematic relations to the other search terms are effective at directing users to informative pages. Quantitatively, terms with a high frequency are not effective at reducing the number of pages retrieved; qualitatively, low-frequency terms may not be effective at directing users to informative pages.

CONCLUSION
We introduced a mechanism that provides key words which can increase human-computer interaction (HCI), using natural language processing technology and a mathematical measure for calculating the degree of inclusion. We showed what type of words should be added to the current query, i.e., the keywords input so far, in order to make HCI more creative.

We extracted related word sets from documents by employing case-marking particles derived from syntactic analysis. Then, we verified which kind of related word is more useful as an additional word for retrieval support. That is, by comparing the results retrieved with words related only taxonomically and those retrieved with words that included a word related non-taxonomically to the other words, we found that an additional term which is thematically related to the other terms is effective at retrieving informative pages. This suggests that words with a thematic relation can be useful for making HCI more active.

As for the future directions of this work, one of the most crucial issues is evaluation. We will evaluate the effectiveness of our method from human-centered viewpoints, possibly by human judgement.

In the future, we hope to understand the contents of huge text data with more advanced natural language processing technology and to develop a system which makes it possible to expand the users' ways of thinking.

REFERENCES
1. Fellbaum, C. WordNet: An electronic lexical database. Cambridge, Mass.: The MIT Press, (1998).
2. Geffet, M. and Dagan, I. The distributional inclusion hypotheses and lexical entailment. In Proc. ACL 2005, (2005), 107-114.
3. Girju, R. Automatic detection of causal relations for question answering. In Proc. ACL Workshop on Multilingual summarization and question answering, (2003), 76-114.
4. Girju, R., Badulescu, A., and Moldovan, D. Automatic discovery of part-whole relations. Computational Linguistics, 32(1), (2006), 83-135.
5. Hagita, N. and Sawaki, M. Robust recognition of degraded machine-printed characters using complementary similarity measure and error-correction learning. In Proc. SPIE - The International Society for Optical Engineering, 2442, (1995), 236-244.
6. Hearst, M. A. Automatic acquisition of hyponyms from large text corpora. In Proc. Coling 92, (1992), 539-545.
7. Morris, J. and Hirst, G. Non-classical lexical semantic relations. In Proc. Workshop on Computational Lexical Semantics, Human Language Technology Conference of the NAACL, (2004).
8. Pantel, P. and Pennacchiotti, M. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proc. ACL 2006, (2006), 113-120.
9. Wisniewski, E. J. and Bassok, M. What makes a man similar to a tie? Cognitive Psychology, 39, (1999), 208-238.
10. Yamamoto, E., Kanzaki, K., and Isahara, H. Extraction of hierarchies based on inclusion of co-occurring words with frequency information. In Proc. IJCAI 2005, (2005), 1166-1172.