Traitor: Associating Concepts using the World Wide Web
                                On-line demonstrator at http://evilgeniuses.ophanus.net
Wanno Drijfhout, Oliver Jundt, Lesley Wevers, Djoerd Hiemstra
University of Twente
{wrc.drijfhout,oliver.jundt}@gmail.com, {l.wevers,d.hiemstra}@utwente.nl
DIR 2013, April 26, 2013, Delft, The Netherlands.


Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic processing; H.3.3 [Information Search and Retrieval]: Query formulation

General Terms
Querying, Visualization, Experimentation

Keywords
Information Extraction, Question Answering, MapReduce
ABSTRACT
We use Common Crawl's 25TB data set of web pages to construct a database of associated concepts using Hadoop. The database can be queried through a web application with two query interfaces. A textual interface allows searching for similarities and differences between multiple concepts using a query language similar to set notation, and a graphical interface allows users to visualize similarity relationships of concepts in a force directed graph.
1. INTRODUCTION
What do Apple and Samsung have in common? What does Java have that PHP does not have? Which terms could refine the search query “lance armstrong”? What is the context of the sentence “Food, games and shelter are just as important as health”? You may know the answers to these questions or know where to look for them, but can a computer answer these questions for you too? This paper discusses Traitor, a tool that creates a database of associated concepts, and answers questions similar to our examples. Traitor was created for submission to the Norvig Web Data Science Award (http://norvigaward.github.com), a research challenge organized by Common Crawl and SURFsara. We won the award.
2. MINING CONCEPT ASSOCIATIONS
We use Common Crawl's 25TB data set of web pages to mine associated concepts. Other corpora (e.g., search logs[4], USENET[6] and the British National Corpus[2]) have been used for similar purposes before. To do this mining, we have implemented a map-reduce program in Java with Hadoop. For simplicity, the program assumes that one word represents one concept (and vice versa), and for every pair of words that co-occur in a sentence, the concepts they represent are associated.
In a few steps our program transforms the raw HTML responses from the Common Crawl data set to a list of key-value pairs (k, v), where the key k is a word pair (w1, w2) that co-occurred in some sentence on the page, and the count v denotes the number of pages on which this co-occurrence has been found. We consider the count of co-occurring word pairs a measure of the association's strength. A more sophisticated approach, such as a probabilistic[7] association measure[1, 8] based on the concept of mutual information[3], could have been used instead; the limited time to submit our solution to the Norvig Award jury pressured us to use simpler methods.
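As a point of reference, one common form of such a measure is the pointwise mutual information of a word pair estimated from page counts, in the spirit of [1, 3]. The formula below merely illustrates this alternative; it is not what Traitor computes, and c(.) and N are notation we introduce here for a page count and the total number of pages:

\[
  \mathrm{PMI}(w_1, w_2)
    = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}
    \approx \log \frac{N \cdot c(w_1, w_2)}{c(w_1)\,c(w_2)}
\]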
We chose sentences as the semantic unit (rather than the same paragraph, document or a sliding window[6]) for generating word pairs for two reasons. A technical reason is to constrain the result size: pairing every non-trivial word with every other produces results in the order O(n²). The second reason is based on human language semantics[5]: we suppose that words within a sentence are more likely to represent actual concept associations than words that are far apart in a document (or in different sentences).
2.1 Implementation
In the mapping phase we extract distinct word pairs from documents. We extract “interesting” text from the raw HTML documents using the BoilerPipe library; i.e., text in navigation menus, headers and footers is omitted. We then split each sentence in the text into a set of words and normalize these words by converting them to lowercase ASCII characters (e.g., á and Á become a). This reduces the number of generated word pairs and compensates for small notation differences between words. Moreover, we discard words from non-English sentences (our heuristic is rather crude: we check if a sentence contains at least two English ‘stop-words’), words containing non-alphabetic characters, and stop-words such as “the” and “that”. For each normalized and filtered sentence S, the mapper creates a word pair p for each pair of words (w1, w2) where w1, w2 ∈ S, w1 < w2 (lexicographically); we limit the number of pairs produced for excessively long sentences. Finally, for every web page, the mapper emits tuples (p, 1) for each distinct word pair p on that page.
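As an illustration of this mapping phase, the sketch below shows what such a mapper could look like with Hadoop's mapreduce API, assuming the input value already carries the BoilerPipe-extracted plain text of one page. The class name, the tiny stop-word list and the naive sentence splitting are our simplifications, and the English-sentence check and the pair limit for long sentences are omitted; this is not the actual Traitor code.

import java.io.IOException;
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch, not the actual Traitor source. The input value is assumed to be
// the BoilerPipe-extracted plain text of a single web page.
public class WordPairMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1);
    // Tiny illustrative stop-word list; the real filter would be much larger.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "that", "and", "are", "is", "of", "a"));

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Set<String> pairsOnPage = new HashSet<>();                 // distinct pairs per page
        for (String sentence : value.toString().split("[.!?]+")) { // naive sentence split
            TreeSet<String> words = new TreeSet<>();               // sorted, distinct words
            for (String token : sentence.toLowerCase().split("\\s+")) {
                String word = normalize(token);
                if (word.length() > 1 && !STOP_WORDS.contains(word)) {
                    words.add(word);
                }
            }
            for (String w1 : words) {                              // all pairs with w1 < w2
                for (String w2 : words.tailSet(w1, false)) {
                    pairsOnPage.add(w1 + "\t" + w2);
                }
            }
        }
        for (String pair : pairsOnPage) {                          // emit (pair, 1) once per page
            context.write(new Text(pair), ONE);
        }
    }

    // Fold accented characters to plain ASCII (e.g., á becomes a) and strip
    // any remaining non-alphabetic characters.
    private static String normalize(String token) {
        String folded = Normalizer.normalize(token, Normalizer.Form.NFD);
        return folded.replaceAll("[^a-z]", "");
    }
}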
In the reduction phase, we sum the counts of the word pairs produced by the mapper. Because the distribution of words follows a Zipf distribution, we find that the majority of the resulting pairs have a very low count. To reduce storage costs of the final output, we can discard pairs with a count less than some N; e.g., if N = 2, we discard all pairs that only occur once. Essentially, this allows us to reduce the output to any size that is practical for use in the presentation layer. Unfortunately, pairs containing rare words are cut indiscriminately due to this policy, even if these co-occurring words are still usable associations.
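The reduction phase is then little more than a sum with a cut-off. In the sketch below, the configuration key traitor.min.count for the threshold N is hypothetical; the code only illustrates the idea.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch: sums the per-page counts of a word pair and drops rare pairs.
public class WordPairReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private long minCount;

    @Override
    protected void setup(Context context) {
        // "traitor.min.count" is a hypothetical configuration key for the threshold N.
        minCount = context.getConfiguration().getLong("traitor.min.count", 2);
    }

    @Override
    protected void reduce(Text pair, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable count : counts) {
            sum += count.get();
        }
        if (sum >= minCount) {               // discard pairs occurring on fewer than N pages
            context.write(pair, new LongWritable(sum));
        }
    }
}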
3. QUERYING CONCEPT ASSOCIATIONS
The resulting tuples from the map-reduce step are imported into a database which can be queried through a web application with two query interfaces.
3.1 Textual interface
Users of the textual interface can search for similarities and differences between multiple concepts using a query language similar to set notation or boolean algebra. For each search term in the query, associated words and co-occurrence counts are fetched from the database, and a score is assigned to each associated word based on the structure of the query expression.
A sequence of words separated by whitespace denotes a conjunction and yields the words that are associated with all words in the sequence. A sequence of words separated by plus-symbols denotes a disjunction and yields the words that are associated with any word in the sequence. A word preceded by a minus-symbol denotes the complement. To illustrate: the query (a + b) -c yields all words that are associated with a or b, and not with c.
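The scoring algorithm itself is not specified above, so the following Java sketch is only one possible reading of these set-like semantics: each operand is the map from associated words to co-occurrence counts fetched for a term, a conjunction keeps words present in all operands (summing their counts as a score), a disjunction merges operands, and a minus-term removes its associations. The class and method names are ours; this is not Traitor's actual implementation.

import java.util.HashMap;
import java.util.Map;

// One possible interpretation of the query semantics (illustrative only): each operand
// is a map from associated word to a score (here, the raw co-occurrence count).
public class QueryAlgebra {

    // Conjunction (whitespace): keep words associated with both operands, summing scores.
    public static Map<String, Long> and(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> e : a.entrySet()) {
            Long other = b.get(e.getKey());
            if (other != null) {
                result.put(e.getKey(), e.getValue() + other);
            }
        }
        return result;
    }

    // Disjunction (+): keep words associated with either operand, summing their scores.
    public static Map<String, Long> or(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> result = new HashMap<>(a);
        for (Map.Entry<String, Long> e : b.entrySet()) {
            result.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return result;
    }

    // Complement (-): remove every word that is associated with the negated term.
    public static Map<String, Long> not(Map<String, Long> a, Map<String, Long> negated) {
        Map<String, Long> result = new HashMap<>(a);
        result.keySet().removeAll(negated.keySet());
        return result;
    }
}

Under this reading, the query (a + b) -c would evaluate to not(or(assoc("a"), assoc("b")), assoc("c")), with the remaining words ranked by their scores.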
3.2 Graphical interface
A graphical interface allows users to visualize similarity relationships of concepts in a force directed graph using D3.js. Users can enter a list of words which become labeled nodes in the graph. The graphical size of the nodes indicates the number of associations a concept has; the more associations a concept has, the bigger the node. For each word in the query the system retrieves the 50 strongest related concepts, which can be interpreted as an estimate of the concept's context. For each word pair we apply the RBO metric[9] to estimate the similarity. A link is created between two nodes if the similarity is more than 5%. The length and thickness of the link indicate the similarity. If two nodes are connected by a short thick line the corresponding concepts share a very similar top 50 ranking. Conversely, distant nodes and nodes without any link between them have very few (or no) concepts in common.
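For readers unfamiliar with rank-biased overlap, the sketch below computes a truncated form of the measure from [9] for two ranked lists. It evaluates only the base (lower-bound) score over the observed prefix, assumes each list contains no duplicate items, and the choice of the persistence parameter p is ours, not necessarily the value Traitor uses.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Truncated rank-biased overlap (RBO) for two duplicate-free ranked lists, following [9].
// Only the base (lower-bound) score over the observed prefix is computed.
public class RankBiasedOverlap {

    public static double rbo(List<String> ranking1, List<String> ranking2, double p) {
        int depth = Math.min(ranking1.size(), ranking2.size());
        Set<String> seen1 = new HashSet<>();
        Set<String> seen2 = new HashSet<>();
        double sum = 0.0;
        int overlap = 0;                               // size of the prefix intersection
        for (int d = 1; d <= depth; d++) {
            String item1 = ranking1.get(d - 1);
            String item2 = ranking2.get(d - 1);
            if (item1.equals(item2)) {
                overlap++;                             // same item enters both prefixes
            } else {
                if (seen2.contains(item1)) overlap++;  // item1 already in the other prefix
                if (seen1.contains(item2)) overlap++;  // item2 already in the other prefix
            }
            seen1.add(item1);
            seen2.add(item2);
            sum += Math.pow(p, d - 1) * overlap / d;   // agreement at depth d, weighted by p^(d-1)
        }
        return (1.0 - p) * sum;                        // e.g. p = 0.9 for top-50 lists (assumed)
    }
}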
4. RESULTS
Using our map-reduce program, we populated an association database with over 43 million distinct word pairs. We attempted to assess the quality of Traitor's query results by answering questions from this paper's introduction (and others). For the sake of brevity, Table 1 shows a few query results produced by Traitor. Additionally, to illustrate the disjunctive operator, we could ‘deduce the context’ of a sentence; the query “food + games + and + shelter + are + just + as + important + as + health” produces the union of each word's associations: care, insurance, information, play, good. The reader is encouraged to try Traitor on-line for more example queries (see the About page) and for a live demonstration of the visualization interface.

  apple           samsung            java
  iphone          phone              code
  ipod            galaxy             application(s)
  ipad            mobile             games
  store           battery            software
  mac             tv                 programming

  apple samsung   lance armstrong    political party
  phone           tour               information
  tv              cancer             parties
  mobile          france             policy
  battery         foundation         third
  iphone          team               government

  java -php       java coffee -php   wii -xbox
  applet(s)       cup                balance
  alden           bean               kart
  jvm             beans              nunchuk
  coffee          tea                resort
  marketplace     espresso           motionplus

Table 1: Query results produced by Traitor.
5. CONCLUSIONS AND FUTURE WORK
In about 8 hours, the SURFsara Hadoop cluster of circa 66 nodes reduced 25 terabytes of Common Crawl corpus data to about 10 gigabytes of uncompressed word associations by aggressive filtering of the input and dropping all pairs with a count less than 100. By means of a query language and a simple scoring algorithm, we can express and answer queries about the concepts and associations stored in this database. A visualization interface allows for comparison of concepts by the similarity of their associations.
Despite the simplistic methods used, Traitor can provide reasonable results thanks to the large data corpus. As discussed in Section 2, we expect that a probabilistic scoring model can further improve the quality of our results. Moreover, Traitor only supports ‘concepts’ described by single words; one could extract n-grams from sentences to identify concepts described by multiple words. Future work can include further normalization of words; e.g., equating plural and singular words, or applying word stemming.

6. REFERENCES
[1] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 1990.
[2] S. Evert. The statistics of word cooccurrences. PhD thesis, Stuttgart University, 2005.
[3] R. M. Fano. Transmission of Information: A Statistical Theory of Communication. The MIT Press, March 1961.
[4] Y. Hu, Y. Qian, H. Li, D. Jiang, J. Pei, and Q. Zheng. Mining query subtopics from search log data. In Proc. of the 35th int. ACM SIGIR conf. on IR, 2012.
[5] C. Lioma, B. Larsen, and W. Lu. Rhetorical relations for information retrieval. In Proc. of the 35th int. ACM SIGIR conf. on IR, 2012.
[6] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), June 1996.
[7] Y. Matsuo and M. Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(01), 2004.
[8] T. Sugimachi, A. Ishino, M. Takeda, and F. Matsuo. A method of extracting related words using standardized mutual information. In Discovery Science, 2003.
[9] W. Webber, A. Moffat, and J. Zobel. A similarity measure for indefinite rankings. ACM Trans. Inf. Syst., 28(4), November 2010.