=Paper=
{{Paper
|id=None
|storemode=property
|title=Traitor: Associating Concepts using the World Wide Web
|pdfUrl=https://ceur-ws.org/Vol-986/paper_29.pdf
|volume=Vol-986
|dblpUrl=https://dblp.org/rec/conf/dir/DrijfhoutJWH13
}}
==Traitor: Associating Concepts using the World Wide Web==
On-line demonstrator at http://evilgeniuses.ophanus.net
Wanno Drijfhout Oliver Jundt Lesley Wevers Djoerd Hiemstra
University of Twente
{wrc.drijfhout,oliver.jundt}@gmail.com {l.wevers,d.hiemstra}@utwente.nl
ABSTRACT
We use Common Crawl's 25TB data set of web pages to construct a database of associated concepts using Hadoop. The database can be queried through a web application with two query interfaces. A textual interface allows searching for similarities and differences between multiple concepts using a query language similar to set notation, and a graphical interface allows users to visualize similarity relationships of concepts in a force-directed graph.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic processing; H.3.3 [Information Search and Retrieval]: Query formulation

General Terms
Querying, Visualization, Experimentation

Keywords
Information Extraction, Question Answering, MapReduce

1. INTRODUCTION
What do Apple and Samsung have in common? What does Java have that PHP does not have? Which terms could refine the search query "lance armstrong"? What is the context of the sentence "Food, games and shelter are just as important as health"? You may know the answers to these questions, or know where to look for them; but can a computer answer these questions for you too? This paper discusses Traitor, a tool that creates a database of associated concepts and answers questions similar to our examples. Traitor was created for submission to the Norvig Web Data Science Award¹, a research challenge organized by Common Crawl and SURFsara. We won the award.

¹ http://norvigaward.github.com

2. MINING CONCEPT ASSOCIATIONS
We use Common Crawl's 25TB data set of web pages to mine associated concepts. Other corpora (e.g., search logs[4], USENET[6] and the British National Corpus[2]) have been used for similar purposes before. To do this mining, we have implemented a map-reduce program in Java with Hadoop. For simplicity, the program assumes that one word represents one concept (and vice versa), and for every pair of words that co-occur in a sentence, the concepts they represent are associated.

In a few steps our program transforms the raw HTML responses from the Common Crawl data set into a list of key-value pairs (k, v), where the key k is a word pair (w1, w2) that co-occurred in some sentence on the page, and the count v denotes the number of pages on which this co-occurrence has been found. We consider the count of co-occurring word pairs a measure of the association's strength. A more sophisticated approach, such as a probabilistic[7] association measure[1, 8] based on the concept of mutual information[3], could have been used instead; the limited time to submit our solution to the Norvig Award jury pressured us to use simpler methods.

We chose sentences as the semantic unit (rather than the same paragraph, the whole document, or a sliding window[6]) for generating word pairs for two reasons. A technical reason is to constrain the result size: pairing every non-trivial word with every other word produces results in the order O(n²). The second reason is based on human language semantics[5]: we suppose that words within a sentence are more likely to represent actual concept associations than words that are far apart in a document (or in different sentences).

2.1 Implementation
In the mapping phase we extract distinct word pairs from documents. We extract "interesting" text from the raw HTML documents using the BoilerPipe library; i.e., text in navigation menus, headers and footers is omitted. We then split each sentence in the text into a set of words and normalize these words by converting them to lowercase ASCII characters (e.g., á and Á become a). This reduces the number of generated word pairs and compensates for small notation differences between words. Moreover, we discard words from non-English sentences², words containing non-alphabetic characters, and stop-words such as "the" and "that". For each normalized and filtered sentence S, the mapper creates a word pair p for each³ pair of words (w1, w2) where w1, w2 ∈ S and w1 < w2 (lexicographically). Finally, for every web page, the mapper emits tuples (p, 1) for each distinct word pair p on that page.
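The per-sentence pair generation described above can be sketched in plain Java as follows. This is a simplified, Hadoop-free sketch: the class and method names are our own, the stop-word list is a toy stand-in for the real one, and the BoilerPipe extraction and sentence splitting are assumed to have happened already.

```java
import java.text.Normalizer;
import java.util.*;

public class PairMapperSketch {

    // Toy stop-word list; the real program uses a fuller one.
    static final Set<String> STOP_WORDS =
            Set.of("the", "that", "and", "are", "as", "just");

    // Lowercase, strip accents to plain ASCII (á and Á become a),
    // and reject stop-words and non-alphabetic tokens.
    static String normalize(String token) {
        String w = Normalizer.normalize(token.toLowerCase(Locale.ROOT),
                Normalizer.Form.NFD).replaceAll("\\p{M}", "");
        return (w.matches("[a-z]+") && !STOP_WORDS.contains(w)) ? w : null;
    }

    // For sentence S, emit the pair "w1,w2" for every w1, w2 in S
    // with w1 < w2 lexicographically.
    static Set<String> pairs(String sentence) {
        TreeSet<String> words = new TreeSet<>();
        for (String token : sentence.split("[^\\p{L}]+")) {
            String w = normalize(token);
            if (w != null) words.add(w);
        }
        List<String> sorted = new ArrayList<>(words);
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i < sorted.size(); i++)
            for (int j = i + 1; j < sorted.size(); j++)
                result.add(sorted.get(i) + "," + sorted.get(j));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
            pairs("Food, games and shelter are just as important as health"));
    }
}
```

In the real job each such pair would be emitted as a (p, 1) tuple, once per page, for the reducer to sum.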
² Our heuristic is rather crude: we check if a sentence contains at least two English 'stop-words'.
³ We limit the number of pairs produced for excessively long sentences.

DIR 2013, April 26, 2013, Delft, The Netherlands.

In the reduction phase, we sum the counts of the word pairs produced by the mapper. Because the distribution of words follows a Zipf distribution, we find that the majority of the resulting pairs have a very low count. To reduce
storage costs of the final output, we can discard pairs with a count less than some N; e.g., if N = 2, we discard all pairs that only occur once. Essentially, this allows us to reduce the output to any size that is practical for use in the presentation layer. Unfortunately, pairs containing rare words are cut indiscriminately by this policy, even if these co-occurring words are still usable associations.

apple: iphone, ipod, ipad, store, mac
samsung: phone, galaxy, mobile, battery, tv
java: code, application(s), games, software, programming
apple samsung: phone, tv, mobile, battery, iphone
lance armstrong: tour, cancer, france, foundation, team
political party: information, parties, policy, third, government
java -php: applet(s), alden, jvm, coffee, marketplace
java coffee -php: cup, bean, beans, tea, espresso
wii -xbox: balance, kart, nunchuk, resort, motionplus

Table 1: Query results produced by Traitor.

3. QUERYING CONCEPT ASSOCIATIONS
The resulting tuples from the map-reduce step are imported into a database which can be queried through a web application with two query interfaces.

3.1 Textual interface
Users of the textual interface can search for similarities and differences between multiple concepts using a query language similar to set notation or boolean algebra. For each search term in the query, associated words and co-occurrence counts are fetched from the database, and a
score is assigned to each associated word based on the structure of the query expression.

A sequence of words separated by whitespace denotes a conjunction and yields the words that are associated with all words in the sequence. A sequence of words separated by plus-symbols denotes a disjunction and yields the words that are associated with any word in the sequence. A word preceded by a minus-symbol denotes the complement. To illustrate: the query (a + b) -c yields all words that are associated with a or b, and not with c.

3.2 Graphical interface
A graphical interface allows users to visualize similarity relationships of concepts in a force-directed graph using D3.js. Users can enter a list of words, which become labeled nodes in the graph. The size of a node indicates the number of associations a concept has: the more associations a concept has, the bigger the node. For each word in the query the system retrieves the 50 strongest related concepts, which can be interpreted as an estimate of the concept's context. For each word pair we apply the RBO metric[9] to estimate the similarity. A link is created between two nodes if the similarity is more than 5%. The length and thickness of the link indicate the similarity: if two nodes are connected by a short, thick line, the corresponding concepts share a very similar top-50 ranking. Conversely, distant nodes and nodes without any link between them have very few (or no) concepts in common.

4. RESULTS
Using our map-reduce program, we populated an association database with over 43 million distinct word pairs. We attempted to assess the quality of Traitor's query results by answering questions from this paper's introduction (and others). For the sake of brevity, Table 1 shows only a few query results produced by Traitor. Additionally, to illustrate the disjunctive operator, we can 'deduce the context' of a sentence: the query "food + games + and + shelter + are + just + as + important + as + health" produces the union of each word's associations: care, insurance, information, play, good. The reader is encouraged to try Traitor on-line for more example queries (see the About page) and for a live demonstration of the visualization interface.

5. CONCLUSIONS AND FUTURE WORK
In about 8 hours, the SURFsara Hadoop cluster of circa 66 nodes reduced 25 terabytes of Common Crawl corpus data to about 10 gigabytes of uncompressed word associations by aggressively filtering the input and dropping all pairs with a count less than 100. By means of a query language and a simple scoring algorithm, we can express and answer queries about the concepts and associations stored in this database. A visualization interface allows for comparison of concepts by the similarity of their associations.

Despite the simplistic methods used, Traitor can provide reasonable results thanks to the large data corpus. As discussed in Section 2, we expect that a probabilistic scoring model can further improve the quality of our results. Moreover, Traitor only supports 'concepts' described by single words; one could extract n-grams from sentences to identify concepts described by multiple words. Future work can include further normalization of words, e.g., equating plural and singular words, or applying word stemming.

6. REFERENCES
[1] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 1990.
[2] S. Evert. The statistics of word cooccurrences. PhD thesis, Stuttgart University, 2005.
[3] R. M. Fano. Transmission of Information: A Statistical Theory of Communication. The MIT Press, March 1961.
[4] Y. Hu, Y. Qian, H. Li, D. Jiang, J. Pei, and Q. Zheng. Mining query subtopics from search log data. In Proc. of the 35th Int. ACM SIGIR Conf. on IR, 2012.
[5] C. Lioma, B. Larsen, and W. Lu. Rhetorical relations for information retrieval. In Proc. of the 35th Int. ACM SIGIR Conf. on IR, 2012.
[6] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), June 1996.
[7] Y. Matsuo and M. Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(01), 2004.
[8] T. Sugimachi, A. Ishino, M. Takeda, and F. Matsuo. A method of extracting related words using standardized mutual information. In Discovery Science, 2003.
[9] W. Webber, A. Moffat, and J. Zobel. A similarity measure for indefinite rankings. ACM Trans. Inf. Syst., 28(4), November 2010.
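The set semantics of the textual query language described in Section 3.1 can be sketched with plain set operations. This is a simplified model under stated assumptions: the association lists below are made-up stand-ins for the mined database, and Traitor's count-based scoring of each associated word is omitted.

```java
import java.util.*;

public class QuerySemanticsSketch {

    // Hypothetical association lists; in Traitor these come from the
    // mined co-occurrence database.
    static final Map<String, Set<String>> ASSOC = Map.of(
            "a", Set.of("x", "y"),
            "b", Set.of("y", "z"),
            "c", Set.of("z"));

    static Set<String> lookup(String word) {
        return ASSOC.getOrDefault(word, Set.of());
    }

    // "a b"  : conjunction -> words associated with all terms (intersection)
    static Set<String> and(Set<String> s, Set<String> t) {
        Set<String> r = new TreeSet<>(s); r.retainAll(t); return r;
    }

    // "a + b": disjunction -> words associated with any term (union)
    static Set<String> or(Set<String> s, Set<String> t) {
        Set<String> r = new TreeSet<>(s); r.addAll(t); return r;
    }

    // "-c"   : complement -> remove everything associated with c (difference)
    static Set<String> not(Set<String> s, Set<String> t) {
        Set<String> r = new TreeSet<>(s); r.removeAll(t); return r;
    }

    public static void main(String[] args) {
        // The query (a + b) -c from the text: associated with a or b, not with c.
        Set<String> result = not(or(lookup("a"), lookup("b")), lookup("c"));
        System.out.println(result);
    }
}
```

With the toy data above, (a + b) yields {x, y, z} and removing c's associations leaves {x, y}.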
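The RBO comparison of two top-50 lists used by the graphical interface can be sketched as the truncated rank-biased overlap sum of Webber et al.[9]. This is a sketch under assumptions: the paper does not state Traitor's persistence parameter p or extrapolation choices, so the values and rankings below are illustrative.

```java
import java.util.*;

public class RboSketch {

    // Truncated rank-biased overlap of two rankings, evaluated to depth k:
    // (1 - p) * sum over d of p^(d-1) * A_d, where A_d is the overlap of the
    // top-d prefixes divided by d. p weights how deep the comparison looks.
    static double rbo(List<String> s, List<String> t, double p, int k) {
        Set<String> seenS = new HashSet<>(), seenT = new HashSet<>();
        double sum = 0.0;
        for (int d = 1; d <= k; d++) {
            if (d <= s.size()) seenS.add(s.get(d - 1));
            if (d <= t.size()) seenT.add(t.get(d - 1));
            Set<String> overlap = new HashSet<>(seenS);
            overlap.retainAll(seenT);
            sum += Math.pow(p, d - 1) * overlap.size() / d;
        }
        return (1 - p) * sum;
    }

    public static void main(String[] args) {
        // Illustrative top-5 association lists in the spirit of Table 1.
        List<String> apple = List.of("iphone", "ipod", "ipad", "store", "mac");
        List<String> samsung = List.of("phone", "galaxy", "mobile", "battery", "tv");
        System.out.println(rbo(apple, apple, 0.9, 5));   // similar rankings score high
        System.out.println(rbo(apple, samsung, 0.9, 5)); // disjoint rankings score 0
    }
}
```

In Traitor's graph, a score above the 5% threshold for a pair of nodes would create a link whose length and thickness reflect this similarity value.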