<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dewey Decimal Classification Based Concept Visualization for Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jae-wook Ahn</string-name>
          <email>jaewook.ahn@drexel.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xia Lin</string-name>
          <email>linx@drexel.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Khoo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Computing and Informatics Drexel University</institution>
          ,
          <addr-line>Philadelphia, PA</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Visual knowledge maps utilizing concepts have great potential to support interactive information retrieval. Unlike keyword-based visual information retrieval, concept-based knowledge maps can make the visualization easier to comprehend and manipulate. In this paper, we introduce our novel visual search interface based on Dewey Decimal Classification concept annotations. The web browser based interface visualizes search results initialized from user queries. Main functions of the interface include interactive manipulation, exploration, and filtering of concepts and links in different levels from overview to details. The visualization connects related concepts not apparent in conventional tree-like hierarchical representations and it can promote discovery of novel concepts during the visual exploration of search space. A real use-case scenario is presented to highlight the advantages of the approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept Visualization</kwd>
        <kwd>Knowledge Map</kwd>
        <kwd>Dewey Decimal Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Dewey Decimal Classification (DDC) is a popular document classification
system. Many traditional libraries use it extensively to give users an
effective way to browse and search for library resources. It supports
a multi-level hierarchy of concept classes that can express the
associations and hierarchical relationships among the entire set of resources within any
given library. A resource classified with DDC is given a number composed of
three or more digits (class, division, and section) that describe the nature of the
resource from broader to narrower categories. It is perceived as an efficient way
to organize resources not only in traditional library settings but also in modern
networked settings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        On the other hand, knowledge structure visualization has been actively
studied as a way to create knowledge maps that support interactive search of internet
resources. Various approaches have been studied in the literature to present the
actual knowledge structure as precisely as possible [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: (1) visualizing existing
knowledge structures such as tree-like hierarchies (e.g., TreeMap [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) or ontologies (e.g.,
OntoViz [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]), (2) visualizing knowledge structures that need to be extracted and
learned using various text mining techniques such as automatic thesaurus
construction [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or clustering multi-word terms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and (3) visualizing knowledge
structures through visual metaphors (e.g., [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
      <p>[Figure 1: Visualization process. Steps shown: User query; Solr index; Retrieve top N docs; Extract DDC from top N docs; Calculate similarity between DDC (doc co-occurrence); Create a graph (DDC as nodes); Visualize.]</p>
      <p>This paper introduces a novel method that exploits DDC to visualize the
knowledge structure of search results. It makes use of an existing knowledge structure
that is derived dynamically from live searches against three digital library resources. We
define four design goals for the knowledge structure visualization: (1)
visualize an overview of the topics in the search results, (2) visualize concept groups or
clusters within the visualization, (3) use DDC to represent concepts, and (4)
support discovery of new knowledge using the DDC and the visualization. In the
following, we discuss how these goals are achieved, using a real use case of
the implemented visualization system.
</p>
      <p>2 Visualizing Concepts using DDC</p>
      <p>
        The visualization task of this study is based on the Dewey Decimal
Classification<sup>1</sup> classes assigned to a large number of digital library records. We collected 263,550
records from three digital libraries (Internet Public Library,
Intute, and NSDL)<sup>2</sup> and used weighted keywords from their title, description, and
subject metadata. The resources were matched to multiple DDC classes
by calculating the similarity between the resource keyword vectors and the DDC
description keyword vectors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As a result, the 263,550 digital library resources
were each assigned multiple DDC classes that represent the relevant concepts of
their content at three levels: class, division, and section. For example, a web site
stored in one of the participating digital libraries is titled "Olympic History"
and presents information about the history of the Olympic games. The
automatic DDC class assignment system [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] gave it 10 DDC classes: 796, 943,
945, 942, 949, 941, 940, 948, 944, and 947. The first digits (classes) 7 and 9
represent "Arts &amp; recreation" and "History &amp; geography" respectively. It is clear that
the system understands the resource to be about sports (i.e., recreation)
and history. The remaining two digits (divisions and sections) specify the
detailed topics. For example, the DDC class 796 is labeled "Athletic &amp; outdoor
sports &amp; games," which combines three concepts hierarchically: 7 (class: Arts &amp;
recreation), 9 (division: History &amp; geography), and 6 (section: Technology).
      </p>
      <p>1 http://dewey.info</p>
      <p>2 http://www.ipl.org, http://intute.ac.uk, http://nsdl.org</p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Visualization Process</title>
      <p>
By using the DDC classes assigned to the 263,550 records, we implemented
a search system that visualizes the knowledge structure and concept relationships
within search results. Figure 1 depicts the visualization process. We indexed all
the documents using the Apache Solr information retrieval system<sup>3</sup> so that users
can instantly retrieve documents that match their queries. From the retrieved
documents, the top N documents are selected to calculate the DDC relationships
among them. Each document was assigned 10 DDC classes following
the procedure described above, so at most N * 10 DDC classes (fewer when
documents share classes) are retrieved from the database. Then the similarity
values between all DDC class pairs are calculated. The Jaccard coefficient is used
to calculate the similarity between two DDC classes by counting the
documents that the classes are assigned to.</p>
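      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathrm{Sim}_{\mathrm{Jaccard}}(\mathit{ClassA}, \mathit{ClassB}) = \frac{|\mathit{ClassA}_{\mathit{Doc}} \cap \mathit{ClassB}_{\mathit{Doc}}|}{|\mathit{ClassA}_{\mathit{Doc}} \cup \mathit{ClassB}_{\mathit{Doc}}|}</tex-math>
        </disp-formula>
      </p>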
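      <p>The co-occurrence computation can be illustrated with a minimal Python sketch. It is not the implementation used in the system: the function names, the in-memory dictionary standing in for the class database, and the toy document identifiers are all assumptions made for illustration. The sketch inverts the document-to-class assignments of the top N documents and applies the Jaccard coefficient to every unordered class pair.</p>
      <preformat><![CDATA[
```python
from itertools import combinations


def invert_assignments(doc_classes):
    """Map each DDC class to the set of documents it is assigned to."""
    class_docs = {}
    for doc_id, classes in doc_classes.items():
        for ddc in classes:
            class_docs.setdefault(ddc, set()).add(doc_id)
    return class_docs


def jaccard(docs_a, docs_b):
    """Jaccard coefficient of two document sets (0.0 if both are empty)."""
    union = docs_a | docs_b
    if not union:
        return 0.0
    return len(docs_a & docs_b) / len(union)


def pairwise_similarity(doc_classes):
    """Return {(class_a, class_b): similarity} for every unordered class pair."""
    class_docs = invert_assignments(doc_classes)
    return {
        (a, b): jaccard(class_docs[a], class_docs[b])
        for a, b in combinations(sorted(class_docs), 2)
    }


# Hypothetical toy data: three retrieved documents and their DDC classes.
top_n = {
    "doc1": ["796", "940", "930"],
    "doc2": ["796", "940"],
    "doc3": ["796", "613"],
}
similarities = pairwise_similarity(top_n)
```
]]></preformat>
      <p>In this toy example, classes 796 and 940 share two of the three documents that carry either class, so their similarity is 2/3, while 613 and 940 never co-occur and receive a similarity of 0.</p>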
      <p>
        Finally, a graph is constructed by connecting the DDC classes (nodes) whose
similarity is higher than a threshold, so that similar DDC classes are linked
within the graph. The resulting graph is visualized in the live search
interface using a JavaScript-based network visualization library called Sigma js<sup>4</sup>.
The ForceAtlas2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] force-directed placement graph layout algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is used to
calculate the node locations and the shape of the entire graph.
      </p>
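      <p>The thresholding step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the system's code: the threshold value of 0.4, the similarity values, and the generic node and edge dictionaries are hypothetical. The paper does not specify the threshold or the exact data format passed to the visualization library, and the layout computation is not reproduced here.</p>
      <preformat><![CDATA[
```python
def build_graph(similarities, threshold):
    """Link DDC classes whose similarity is strictly above the threshold.

    Returns (nodes, edges): nodes is a sorted list of every class seen,
    edges is a list of (class_a, class_b, similarity) tuples.
    """
    nodes = set()
    edges = []
    for (a, b), sim in similarities.items():
        nodes.update((a, b))
        if sim > threshold:
            edges.append((a, b, sim))
    return sorted(nodes), edges


def to_json_ready(nodes, edges):
    """Arrange the graph as plain dicts and lists that json.dumps can serialize."""
    return {
        "nodes": [{"id": n, "label": n} for n in nodes],
        "edges": [
            {"id": f"{a}-{b}", "source": a, "target": b, "weight": w}
            for a, b, w in edges
        ],
    }


# Hypothetical similarity values for four DDC classes.
sims = {
    ("796", "940"): 0.67,
    ("796", "613"): 0.33,
    ("613", "940"): 0.0,
    ("790", "796"): 0.5,
}
nodes, edges = build_graph(sims, threshold=0.4)
graph = to_json_ready(nodes, edges)
```
]]></preformat>
      <p>Every class remains a node even when none of its similarities passes the threshold, so isolated concepts stay visible; only the links are filtered.</p>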
      <p>Figure 2 shows an overview visualization of the 694 DDC classes extracted from
all documents in the collection. In Figure 3, a user enters the query "olympic AND
history" to search for documents that contain both keywords. The 168
unique DDC classes associated with the search results are visualized. The nodes
(discs) represent the 168 DDC classes, and similar nodes (those whose similarity is
above a threshold) are linked to each other by curves. The node color reflects
the DDC class (first digit), and the legend is located on the left of the screen.
Because of the links, similar nodes are clustered together, and it can
be easily seen that nodes of the same class (same color) form clusters. For
example, the large cluster at the top of the screen is about "social
science (DDC=300)." All the search, DDC class retrieval, graph calculation, and
visualization functions are performed on the fly when a query is entered.</p>
      <p>3 http://lucene.apache.org/solr/</p>
      <p>4 http://sigmajs.org</p>
    </sec>
    <sec id="sec-3">
      <title>2.2 Additional Visualization Features for Concept Exploration</title>
      <p>
Several visual features complement the basic concept visualization
and enable more efficient DDC-based search and exploration.</p>
      <p>Interactive Dynamic Visualization. Using a mouse, users can pan the
graph across the screen and zoom in or out to see the details or the overview of
the graph. Clicking on a node displays a list of the documents associated with that concept
(Figure 5).</p>
      <p>Adaptive Concept Legend: DDC class-division level. Above the static
DDC class legend, which shows the 10 DDC classes and their associated colors, there is
an adaptive DDC class-division legend that is updated automatically as the
user pans and zooms. It counts the most frequent DDC
class-division pairs (e.g., DDC=94* with color code purple-pink) among the nodes currently
displayed on the screen. Because the distribution of the DDC class-division pairs
changes with panning and zooming, the legend shows dynamic information
about the area that the user is currently examining (Figure 5).</p>
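      <p>The counting behind such an adaptive legend might look like the following Python sketch. The rectangular viewport, the coordinate values, and the function names are assumptions made for illustration; the actual interface runs in the browser and its code is not shown in this paper. The sketch keeps only the nodes that fall inside the current view and tallies their first two digits (class and division).</p>
      <preformat><![CDATA[
```python
from collections import Counter


def in_viewport(pos, viewport):
    """True if an (x, y) position lies inside (x_min, y_min, x_max, y_max)."""
    x, y = pos
    x_min, y_min, x_max, y_max = viewport
    return x_min <= x <= x_max and y_min <= y <= y_max


def adaptive_legend(positions, viewport, top_k=5):
    """Count class-division prefixes (first two digits) of visible nodes.

    positions maps a three-digit DDC class string to its (x, y) location.
    Returns up to top_k (prefix, count) pairs, most frequent first.
    """
    counts = Counter(
        ddc[:2] for ddc, pos in positions.items() if in_viewport(pos, viewport)
    )
    return counts.most_common(top_k)


# Hypothetical layout coordinates for a handful of DDC classes.
positions = {
    "940": (0.1, 0.2), "941": (0.2, 0.2), "942": (0.3, 0.1),
    "796": (0.4, 0.4), "613": (0.9, 0.9),
}
legend = adaptive_legend(positions, viewport=(0.0, 0.0, 0.5, 0.5))
```
]]></preformat>
      <p>Recomputing the tally after each pan or zoom event is what makes the legend adaptive: with the toy viewport above, the 94* prefix dominates, whereas a view centered on the lower-right corner would list only 61*.</p>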
      <p>Filtering classes from legend. If a user clicks on a DDC class and division
in the legend, the system highlights in red the nodes that have that class and
division. This helps users search for a specific group of concepts in a
larger concept map and enables targeted examination of concepts (Figure 4).
</p>
      <p>2.3 DDC-based Visual Search Scenario: Search for "Olympic History"</p>
      <p>The design principle of this system is to promote search and browsing by
incorporating DDC concepts within a visual search environment. More specifically,
the following design goals are illustrated by an example search in this section.</p>
      <list list-type="order">
        <list-item><p>Visually show an overview of the concepts included in a search result.</p></list-item>
        <list-item><p>Help users search for specific concepts or concept groups easily within the
visualization.</p></list-item>
        <list-item><p>Help users find relevant documents by examining the DDC concepts or
concept groups mapped in the visualization.</p></list-item>
        <list-item><p>Support users in discovering new information by following links between concept
groups. The concept groups are formed by linking homogeneous concepts, but
heterogeneous groups are inter-linked as well.</p></list-item>
      </list>
      <p>We show how these goals are achieved in a real use-case information
retrieval scenario. Suppose a user is assigned the task of finding as many DDC
classes and documents as possible in order to write a report about "Olympic history."
Using the search interface, she enters the query "olympic AND history." The query
is immediately sent to the backend Solr index, which retrieves 103
documents from the entire collection (263,550 documents). From the 103 documents,
168 DDC classes are identified by looking up the database that stores the
precalculated DDC classes for all documents in the dataset. The system then
calculates the Jaccard similarity values of 14,028 DDC class pairs (168 * 167 / 2) from
the co-occurrence of the documents they are assigned to. The DDC
classes whose similarity values exceed a specific threshold are linked to
create a graph (Figure 3).</p>
      <p>As described before, the nodes are color-coded by their DDC classes, and the
clusters are easily identified on the map (Goals 1 and 2). When a user clicks on the
adaptive legend (top-left of the screen), the nodes of the selected DDC class-division (first and
second digits) are highlighted in the visualization. In Figure 4, the
adaptive legend shows that the most frequent DDC class-division pairs are
970: History of North America, 940: History of Europe, and so forth. When the user clicks on
DDC 940, the nodes starting with 94 are all highlighted in red and the
other nodes are de-highlighted. A node directly connected to the highlighted
cluster is not de-highlighted and retains its original color (purple, DDC=936);
it is a potentially relevant DDC class worth further examination (Goal 3).</p>
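      <p>The highlighting rule described above can be expressed as a small Python sketch. The state names, the function name, and the graph fragment are hypothetical and serve only to illustrate the rule: nodes matching the clicked class-division prefix are highlighted, nodes linked directly to them keep their original color, and all remaining nodes are de-highlighted.</p>
      <preformat><![CDATA[
```python
def highlight_states(nodes, edges, prefix):
    """Classify nodes after a legend click on a class-division prefix.

    "highlighted": the class starts with the prefix (drawn in red).
    "neighbor":    linked directly to a highlighted node (keeps its color).
    "dimmed":      everything else (de-highlighted).
    """
    selected = {n for n in nodes if n.startswith(prefix)}
    neighbors = set()
    for a, b in edges:
        if a in selected and b not in selected:
            neighbors.add(b)
        elif b in selected and a not in selected:
            neighbors.add(a)
    states = {}
    for n in nodes:
        if n in selected:
            states[n] = "highlighted"
        elif n in neighbors:
            states[n] = "neighbor"
        else:
            states[n] = "dimmed"
    return states


# Hypothetical graph fragment around the DDC 94x cluster.
nodes = ["940", "941", "936", "796", "613"]
edges = [("940", "941"), ("941", "936"), ("796", "613")]
states = highlight_states(nodes, edges, prefix="94")
```
]]></preformat>
      <p>Keeping the direct neighbors in their original color is the design choice that surfaces classes outside the selected division, such as 936 in this fragment, as candidates for further examination.</p>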
      <p>The visualization can recommend novel concepts that are difficult to discover
using conventional search and browsing methods. Figure 5 shows the
discovery of concepts that are relatively distant in the DDC hierarchy. It zooms
into the DDC=790: Recreation and performing arts area of the map, a category
under which one would intuitively expect to find sports-related information.
It shows 10 DDC=790 classes (DDC=790, 791, 792, 793, etc., in red)
that the user might have intended to examine in the first place. In addition
to them, there are several connected classes that are not under the 790 category.
The system can lead users to examine DDC=600 concepts (i.e., 613, 617, and
636 in green), which are under the broader technology class (DDC=600) but
connected to the initial 790 classes. By examining the documents under one of
the 600 concepts (Figure 5), it can be verified that one of the sites, "SR: Olympic
Sports," contains relevant information about Olympic games statistics and
history. Because DDC=613 is under the technology category (DDC=600), which may
seem unrelated to the search task, a user may not be
motivated to examine the category. Only after starting the visual exploration of
the chains of DDC classes from the easier ones (i.e., 700: Recreation) can the more difficult
and novel categories be discovered (Goal 4).</p>
      <p>
In this paper, we introduced a visual information retrieval approach based on
DDC classification. We annotated 263,550 records from three digital libraries
with automatically generated DDC classes and implemented an information
retrieval system featuring a graph visualization that connects similar DDC
concepts based on the document co-occurrence between them. We showed the
advantages of our approach through a real use-case scenario. The example demonstrated
that the approach can support interactive and dynamic knowledge
visualization and can promote the discovery of concept clusters and unknown concepts.
Our future research plans include a full-fledged user study to learn about the
advantages and disadvantages of the approach from real users.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Eades</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A heuristic for graph drawing</article-title>
          .
          <source>Congressus Numerantium</source>
          <volume>42</volume>
          ,
          <fpage>149</fpage>
          -
          <lpage>160</lpage>
          (
          <year>1984</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Grefenstette</surname>
          </string-name>
          , G.:
          <article-title>Explorations in automatic thesaurus discovery</article-title>
          . Springer (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jacomy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heymann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venturini</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bastian</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>Forceatlas2, a continuous graph layout algorithm for handy network visualization</article-title>
          .
          <source>Medialab center of research 560</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shneiderman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Tree-maps: a space-filling approach to the visualization of hierarchical information structures</article-title>
          .
          <source>In: VIS '91: Proceedings of the 2nd conference on Visualization '91</source>
          . pp.
          <fpage>284</fpage>
          -
          <lpage>291</lpage>
          . IEEE Computer Society Press, Los Alamitos, CA, USA (
          <year>1991</year>
          ), http://portal.acm.org/citation.cfm?id=949654
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Khoo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tudhope</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Binding</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abels</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Massam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Towards Digital Repository Interoperability: The Document Indexing and Semantic Tagging Interface for Libraries (DISTIL)</article-title>
          . In: Zaphiris,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Buchanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Rasmussen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Loizides</surname>
          </string-name>
          ,
          <string-name>
            <surname>F</surname>
          </string-name>
          . (eds.)
          <source>Theory and Practice of Digital Libraries, Lecture Notes in Computer Science</source>
          , vol.
          <volume>7489</volume>
          , pp.
          <fpage>439</fpage>
          -
          <lpage>444</lpage>
          . Springer Berlin Heidelberg (
          <year>2012</year>
          ), http://dx.doi.org/10.1007/978-3-642-33290-6_49
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kleiberg</surname>
          </string-name>
          , E., Van De Wetering, H.,
          <string-name>
            <surname>Van Wijk</surname>
            ,
            <given-names>J.J.:</given-names>
          </string-name>
          <article-title>Botanical visualization of huge hierarchies</article-title>
          .
          <source>In: Information Visualization</source>
          , IEEE Symposium on. pp.
          <fpage>87</fpage>
          -
          <lpage>87</lpage>
          . IEEE Computer Society (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahn</surname>
          </string-name>
          , J.:
          <article-title>Challenges of knowledge structure visualization</article-title>
          .
          <source>In: International UDC Seminar</source>
          <year>2013</year>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Saeed</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudhry</surname>
            ,
            <given-names>A.S.:</given-names>
          </string-name>
          <article-title>Using dewey decimal classification scheme (ddc) for building taxonomies for knowledge organisation</article-title>
          .
          <source>Journal of Documentation</source>
          <volume>58</volume>
          (
          <issue>5</issue>
          ),
          <fpage>575</fpage>
          -
          <lpage>583</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Ibekwe-SanJuan</surname>
          </string-name>
          , F.:
          <article-title>Text mining without document context</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>42</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1532</fpage>
          -
          <lpage>1552</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Sintek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : Ontoviz, http://protegewiki.stanford.edu/wiki/OntoViz
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>