Maxwell Fowler et al. MAICS 2016 pp. 55–62

Exploring Web-based Visual Interfaces for Searching Research Articles on Digital Library Systems

Maxwell Fowler, Chris Bellis, Christopher Perry, Beomjin Kim
Department of Computer Science
Indiana University-Purdue University
Fort Wayne, IN, U.S.A.
maxfwlr@gmail.com, kimb@ipfw.edu

Abstract

Previous studies that present information archived in digital libraries have used either document meta-data or document content. Current search mechanisms commonly return text-based results compiled from the meta-data without reflecting the underlying content. Visual analytics is a possible solution for improving searches by presenting a large amount of information, including document content alongside meta-data, in a limited screen space. This paper introduces a multi-tiered visual interface for searching research articles stored in Digital Library systems. The goals of this system are to allow users to find research papers about their interests in a large work space, to see how document content relates to search terms, and to refine their search queries using document content. The current pilot system, still under development, successfully presents graphical illustrations of search results produced from both meta-data and underlying content in an intuitive visual interface that will assist users’ search activities. With minor modification, the proposed system can be applied to a variety of other text-based data repositories.

Keywords - Digital libraries; Visualization; Unstructured text content; Visual analytics

Introduction

Academic paper writing leverages online corpora as one of the sources for references to prior work and to build upon previous results. Most corpora are hosted on services aimed at easing the search process; digital libraries such as the ACM Digital Library and the Library of Congress provide books, articles, and other forms of media, while services such as Google Scholar focus on journal papers. While valuable as knowledge repositories, these services are limited in their ability to present information in a way that leads to easier, more informed decisions about which academic papers to read and reference.

Current digital library systems suffer from limiting standards and provide only superficial information in their search results. Most archiving systems initially display the title of a work, its authors, the publication the work appeared in if applicable, and other basic meta-data. No profile of the underlying document content is provided, which can make finding the best sources a tedious task that requires reading through the plaintext directly. Further, some search systems do not adequately search document content, instead relying upon users to already know the document they wish to retrieve.

The lack of search depth caused by not searching document content is exacerbated by the use of non-intuitive, text based results. This is not an effective form of data representation. Displaying a large amount of text in a column does not provide an efficient way to traverse search results and pinpoint desired content. At best, text based searches can prioritize results on the title that best matches the desired search terms or upon a hidden document relevance score, which does not help a user see why a given paper is the best choice. Further, many text based search systems on digital libraries lack an intuitive way to determine the relationships between titles, the content in documents, and the relationships between different documents.

Visualizations allow data to be presented in manners that are more interconnected and readily processable. This is accomplished by leveraging users’ perceptual cognition. Studies have shown that such leveraging leads to faster data consumption and a higher quality of understanding (Card 1999, Veerasamy 1997). Such visualization work has already been applied to some forms of digital libraries. University of Maryland’s GRIDL, for example, presents digital libraries using two hierarchical axes with topics on one axis and publication years on another (Shneiderman 2000). The density of documents for each topic and publication year is then displayed as bar graphs, split between the different kinds of digital media in the library. Visualizations such as GRIDL allow large quantities of data to be displayed in a coherent format that is tailored for user ease and document content exploration.

Visualization has been used in the past to simplify searching document repositories. Most of these visual approaches have used some form of graphical representation to better show links between papers within a document and the overall document spread in a repository.

Some visualization work used a graphing approach with axes. ActiveGraph, developed by Marks et al., used scatter plots with customizable axes (Marks 2005). These axes, the X, Y, and Z axes, could be set to any of the kinds of meta-data discussed earlier. ActiveGraph took a repository wide approach; it did not get into underlying document content, but did allow an at-a-glance look at the entire repository based on specific meta-data.

Others used different graphical representations. Rushall et al. and Lin focused on self-organizing maps that could be directed at a document repository or single book to display the types of documents in a workspace or the topics contained in a repository (Rushall 1996, Lin 1996). These maps were useful for quickly searching for documents in a visual fashion. The search showed the contents of a workspace in a visual form, allowing the user to quickly parse out the kinds of documents provided. This system still lacked a link between the superficial meta-data and document content, though. While preferable to a text search, the work was still plagued by the limiting factor of judging a book by its cover: using the title, but not the actual content within the document.

To address the limitations of only leveraging meta-data and not document content, Short et al. developed a multi-tiered visual interface for digital libraries (Short 2014). This work used the indices of textbooks to index books based upon their overall content and content by chapter. The multiple tiers focused on different representations. The first tier compared books to each other based on desired search terms. This tier was similar to work such as ActiveGraph, leveraging a similar overall interface with a more regimented coordinate system rather than a scatterplot. This allowed for document screening based on meta-data as before. Clearly unrelated titles, works that were too old to be useful, or works with bad reviews could be safely ignored.

Other tiers took a content based approach to visualization. By leveraging the index of textbooks, Short et al. were able to directly allow exploration of document content. The visualizations showed the layout of the book’s index and the presence of search terms on a by-chapter basis. Searching for topics showed not only books on the subject, but also exposed the relevant content within. This allowed the search to be used to more easily select the best sources, based on how much they covered the desired search topic.

While the work already done is valuable, we see a place for future development. The current visualization work can be applied to other data domains, such as social networking data and unstructured text content. Unstructured content presents a number of issues. Such documents can have different layouts from one another. Even within a specific domain, such as research articles, the structure can be different. While most articles contain similar sections, such as an introduction and a method section, there is no guarantee articles use the same layout. Sacks-Davis et al. discussed the subject of structuring text content to be indexable and queryable, but did not consider visual approaches or building indexes for journal papers dynamically (Sacks-Davis 1997). Development in this field, utilizing visual analytic techniques, will assist researchers in finding references for their work.

To assist researchers, a visual search focused on research articles in Digital Library systems would be useful. This system requires indices to exist for the content in the papers. These indices need to be searched in a way that will help users make educated decisions on their paper selections. Papers do not tend to have indices, which means an index must be built for each paper in some fashion before it can reasonably be searched. This work has not been done before, as prior systems that used indices relied on pre-built ones, such as Short et al.’s work, and is a topic we need to address.

In addition, a good search system must allow users’ queries to undergo search refinement. A search should not only find documents related to given topics, but should allow the user to refine their search using different terms they discovered during the search. Short et al. approached this subject by showing chapter content from books. This is a limitation, as current search refinement focuses specifically on content that is searched for. Related content and words that may be synonymous with desired content are currently unexplored angles for search refinement.

We propose a system which brings the visual aspect and the automatic indexing aspect together, targeted at assisting researchers in searching text corpora and refining their searches through intermediate results. This paper introduces a system under ongoing development: a two-tiered visualization web application that displays research articles with titles and associated content in a graphical format. The first tier provides a high level profile of the kinds of documents in a repository and how related these documents are to desired search terms. These relationships also show, by proxy, a relationship between papers. The second tier has been designed with search refinement in mind. It displays the frequency of search terms in the paper, as well as synonyms, terms related to the search terms, and compound terms created by coexisting words. This paper presents the current prototype of the system developed to assist users’ searches on research articles in Digital Library systems.

Methodology

The prototype system consists of three major components: index generation, query processing, and visualization. The index generation module analyzes the underlying content of a Digital Library’s research papers and constructs an index for each of them. Query processing is an underlying process that connects the indices with the visualizations. The visualization itself is implemented in two tiers. Tier 1 presents an overview of the document base and the high level relationships among the documents and the query’s search terms to guide users’ selection of documents. Tier 2 provides a content analysis of a specific document from Tier 1, showing terms related to the search query for the sake of search refinement. The examples in this section are built around looking for information about thread-based programming and architectures. We used the terms “thread,” “process,” and “cpu” as our search terms.

Index Generation

Our indexing system was developed using the well known Lucene libraries and is not a major focus of our research (Apache 2015b). The documents are first extracted into plain text in order to ensure a consistent format. Using Lucene, common words and other characters deemed to be garbage are removed from the text. This prevents such words from impeding the index searching process. The text is then stored in a data structure which maintains a word count, as well as information on which sentences in each document contain which words. Together, these structures serve as a searchable index for the document base.

Tier 1 Visualization

The Tier 1 visualization provides a profile of the entire document base. The intent of Tier 1 is to show the best papers for a user’s search query in the digital library being used. Tier 1’s search is based upon title information and the indices of each document. The aspects considered for each document are the length of the document, the relevancy the document has to specific terms in a search query, and the relevancy the document has to the search as a whole.

The user’s search query is directly represented in the visualization. Each search query consists of three terms, with each term going into a different colored box. The colored boxes are red, green, and blue, the additive primary colors used in computer graphics. In our examples below, “thread” is the green term, “process” is the red term, and “cpu” is the blue term. Each search term is shown in the visualization in a circle of the term’s color. From this point forward, all shapes in the visualization are referred to as nodes.

The documents appear in the visualization as nodes as well. Only documents that include at least one of the aforementioned search terms are placed on the visualization. Node size is determined by document length on an absolute scale, with the largest document in a repository having the largest node and the smallest document the smallest node. Node size is capped at 30 pixels, with any documents that would have a larger size being set to 30. The number of documents displayed is set by the user with a slider, allowing more tightly focused or broader searches.

The nodes are positioned to show correlation between each document and the search terms in the query. A force directed graph is used for the layout, specifically d3’s implementation of Dwyer’s algorithm (Bostock 2011, Dwyer 2009). Each document has tension directed towards the search query nodes. The tension force is directly linked to the relation between a document and a term. A search term that a document has no relation to provides 0 tension. Documents that feature all three terms will tend to be pushed into the middle of the visualization, while documents that only feature two terms will appear between only those two terms and not appear in the middle. We also use a simple collision algorithm to prevent node overlap. Each document is placed in the triangle defined by the search term nodes. Documents in which all three search terms are equally weighted are placed in the middle, equidistant from the three term nodes. Papers more related to a specific term are placed closer to it, as they have a higher tension toward that search term than toward the others.

Document relevancy is determined using Lucene’s Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. TF-IDF is frequently used in data and text mining applications. The score for a term increases if the term appears often in a document or if that particular word is uncommon across the document base (Apache 2015a). Overall, documents are favored for having a large number of desired words. The score is used both for a document’s relevancy to a single search term, which drives node positioning, and for overall document relevancy.

Overall document relevancy combines the relevancy scores for all three search terms to give each document node a color. The most relevant paper, determined by having the highest overall score for all search terms, is black. Less relevant papers appear white, with papers in the middle falling somewhere on the grayscale in between. Black contrasts well with the lighter colored nodes around it, making it a good color to indicate the best papers. Our basis for this decision came from color theory and digital graphics design (Foley 1996).

When a node is selected on the visualization, the node is highlighted and the paper’s supplementary data is shown in a tooltip. Figure 2.1 shows a sample of the Tier 1 visualization with the best paper selected. Note that the best paper is not necessarily the largest, as in our sample query one of the smaller papers has the best overall search results. When a paper is selected, the title of the paper is shown, which acts as a link to the PDF. The author and conference are provided as well. Finally, a link to the second tier visualization is provided to connect the two tiers.

Figure 2.1 Sample view of Tier 1

Tier 2 Visualization

The Tier 2 visualization provides a closer look into specific documents. It relates the search terms to the content in the document itself. This way, the user can see precisely how prevalent a given term is in a document. This serves to increase user confidence that the document they have selected is useful. In addition to directly showing term prevalence, the system provides coexisting words, related terms, and synonyms for the search terms. The intent is to help users with the task of search refinement by selecting better words for their queries. Synonyms and compound words are a new consideration in this research. While prior studies did not consider them useful, both allow the user to phrase the same query in multiple ways to find the best results possible for their search.

Both related terms and compound terms use a scoring algorithm called Pointwise Mutual Information-Information Retrieval (PMI-IR). PMI-IR was introduced by Peter D. Turney as an unsupervised method for recognizing synonyms from search engine hit counts (Turney 2001). Our algorithm specifically implements PMI-IR 3, with some modifications:

\[
P(\text{potential} \mid \text{base}) = \frac{\text{Hits}(\text{potential AND base})}{\text{Hits}(\text{base})}
\]

The formula scores the probability of a potential term being related to a base term by dividing the number of hits the potential and the base have together by the number of hits for the base term in the document set. We describe each of the two term types that use this score in detail below.

Related terms are defined as non-synonyms that appear in the same sentence as a search query’s term. These terms can help refine a user’s search query by showing them words that commonly appear together. A related term can then be used in a new search to refine the documents returned in a specific direction. In order to determine related terms, all the documents in the database are first stripped down to just the sentences containing a specific term.

Each of the remaining words is scored as the potential, with the search term as the base, using PMI-IR 3. The higher the score, the more relevant a specific related term is deemed to be. The current system allows all related terms with a score higher than 0 to appear in the visualization.

Compound terms are similar to related terms, but are specifically terms made up of two words: a query term and either the word directly before or directly after the query term in a sentence. These terms are intended to expand a specific search term. For example, a search can be refined to use “cloud computing” rather than “cloud” after finding the former as a compound term of the latter. Given the query “machine,” one might get both “machine learning” and “autonomous machine.” The same PMI-IR 3 scoring is used on compound terms as is used on related terms.

Synonyms are generated by searching through a synonym database. Our algorithm uses WordNet for collecting synonyms (Fellbaum 2005). WordNet returns synsets of potential matches. As synonyms tend to be small in number, there is no threshold in place for limiting the number of synonyms displayed.

The visualization is consistent between Tier 1 and Tier 2. The same force graph rules still apply. However, data in this tier is only related to one term node. This means that document content nodes that tend toward the middle are weakly related to their term. Content with a high relatedness to a search term, though, appears close to the term’s node. Likewise, node size still encodes a magnitude, but here it is the count of specific terms rather than document size.

The size of each object, including the search terms themselves, reflects how relevant it is to the overall paper. This is the word count from Lucene’s index. It is possible to have a paper where a search term has relevance 0, which would make the shape have 0 size. Likewise, it is possible for related, compound, or synonym terms to be larger than the search query nodes if they appear in the current document more often than the query terms do. The largest nodes in Tier 2 are the terms that are most likely to help refine a search by replacing a search term.

Each of the term types is represented with its own node shape. Synonyms are given circular nodes, to show they are directly related in meaning to the search query terms. Related terms and compound terms, meanwhile, are squares and triangles respectively. This decision was made to draw a distinction between term types.

Figure 2.2 shows an example of a Tier 2 visualization, specifically for the best document from Figure 2.1. The size of the three search term nodes shows that all three are, in fact, prevalent in the paper. “Thread” is the most relevant, though, as shown by its size. We can see what the terms are by hovering over their nodes. The term will appear in a tooltip above the node.

Figure 2.2 Sample view of Tier 2

Discussion

The proposed system has made progress toward the goals set for it. The visualization successfully functions on unstructured text content, such as the journal papers used in this study, which, to our knowledge, has not been done in this way before. The Tier 1 visualization provides a visually accessible look at the entire document repository. It captures the legibility of previous systems while improving upon the visualization’s ability to aid in selecting documents. Further, the Tier 2 visualization makes strides towards helping users refine their search in meaningful ways.

The Tier 1 visualization is quite strong at this point. It is useful for finding papers that span multiple related domains, as shown in the methodology section. The overview is scalable, allowing users to search for a large number of papers or select only a small subset refined to be the best for a given search. Further, the white to black color scale for least to most relevant allows the most relevant paper to stand out easily, making it easy to find the best options in a search of any size.

Tier 1’s strength is apparent when we compare the visual search to a text based alternative. Figure 3.1 shows a search for the terms “simulate”, “transform”, and “automata”. The text based search provides the same information the visualization does, but the best paper’s relation to the terms is shown as numeric scores. This is less intuitive than the visualization’s black to white color scale and position algorithm.

Figure 3.1 Sample view of text based search designed for usability tests

Figure 3.2 shows two searches. The search on the left is the same as the search in Figure 3.1. It becomes readily apparent where the two best papers are and how they relate to the terms. It also becomes apparent that the word “simulate” contributes little to the search. Using the Tier 2 visualization, we refined the search to use “grammar” rather than “simulate,” giving us the image on the right. This search refinement gives documents of much higher quality according to the color scale and provides a better distribution in the visualization’s center. Further, the refinement is easier to make in the visual system than in a text based one, which would require reading the whole document to find useful terms.

Figure 3.2 The visual form of Figure 3.1 and a refined version

The Tier 2 visualization succeeds in the goal of providing potential search refinement. It shows all the potentially useful related terms each of the search results has. Figure 3.2 directly shows the benefit of search refinement, as previously discussed.

The choice of red, green, and blue for the search term nodes was retained for Tier 2 in order to allow RGB color combinations to show terms related to multiple search terms. This was abandoned in practice, in part because meaningful terms related to two other distinct terms were rare. This means the colorization here could be changed to represent different information if a better way to show term relation is found. Further, some collision can occur between Tier 2 term nodes, which needs to be addressed in future updates. This can be seen in Figure 3.3, especially around the term “thread”.

Figure 3.3 The best paper from 3.2 shows some Tier 2 overlap

Future Work

The proposed system makes good strides toward the goals set out in the introduction. Even so, future work remains to validate the system, improve it, and potentially apply it in other ways. Some discussion of those angles follows below.

The searches in this paper run on a paper database of our own creation, built from approximately 1600 of Google’s own freely accessible research papers. In the future, we will apply our visualization to a paid-for paper collection, such as the Text Retrieval Conference Proceedings (TREC). This will give our system a more robust collection to be tested against.

Usability tests are needed to demonstrate that the visualization above is better than a text based system. While we feel the visualization system stands on its own merit, usability tests will add credence to that claim. Our next task will use a custom made, text based search system and compare it to our visualization. We will use a metric based approach to judge effectiveness, as well as judge user preferences. This way, the visualization’s effectiveness compared to traditional text based interfaces can be measured.

The visualization is not without room for improvement. It would be ideal to take into account more than just the presence of search terms in Tier 1. Ideally, the document ranking algorithm can be altered to take into account all attributes of a document. This means the document’s title, reviews, word count, presence of search terms, and other data will all contribute to a document’s relevance score.

Currently, the search is restricted to three terms. This is left over from earlier work when we were considering using RGB color combinations for paper quality, rather than the current black and white scale. We retained the color usage for Tier 2, but did not find situations that would benefit from color combination. We are considering allowing an arbitrary N-gon, which will free up RGB colors to be used for different visual elements. The number of vertices of such an N-gon will be determined by the user, with a minimum of two so that the visualization remains fully featured.

The potential exists for a third tier: specific term expansion. This tier would allow the user to select a term and see more information about it, including all related terms to that term in the repository, synonyms both in the repository and outside of it, and definitional information. This has yet to be implemented as its usefulness is questionable. It may be sufficient to augment the Tier 2 visualization with word definitions and leave it at that.

Another area for improvement is the clustering algorithm, especially when it comes to the overlapping related terms and the large clusters of low use papers. Some form of blobbing algorithm, which combines closely related papers into one node that can then be expanded into the full node set, should be considered to make the visualization simpler and more user friendly in such instances.

Finally, future work could include applying our system to other domains. So long as an index can be constructed for the desired data, any form of text-based data could be searched and visualized using the above system. For example, social network posts could be used as documents to search. This would allow the system to search blogs, discussion forums, and other forms of social media for the sake of determining user consensus or gathering data for marketing purposes.

Conclusion

This paper proposed a visual search on Digital Library systems, specifically targeting journal papers and other research publications. The proposed system targets two goals. The first goal is a visualization of an entire document base, to help the user more easily see the best papers available for a given search. The second goal is to aid in search refinement, changing the original search to better suit the user’s needs. This previously unexplored element was addressed by providing a visualization for various forms of related terms. A preliminary indexing step allowed us to apply these visual elements to a collection of unstructured text data; while not our primary research focus, this was still an interesting element and is in contrast to prior visualizations of structured document content, which did not require an indexing step.

We developed a two tier system to meet our goals. Tier 1 provides a visualization over the document base for a specific set of three search terms. The papers are positioned and colored based on their relevance to individual terms and to the overall search respectively. Tier 2 provides a visualization of a document’s content. It shows how the search terms relate to the underlying content and shows other related terms for the sake of search refinement.

By providing these two tiers, we help users with multiple tasks. Tier 1 makes it easy to see if a given search is useful or if a given search is skewed too far toward one term. Tier 1 also makes it easy to find the best papers for a given search. Tier 2 allows us to confirm that the best paper shown includes the search terms with a high frequency. Tier 2 also lets us refine our searches, allowing users to turn bad searches into good searches by changing a search term or two.

By assisting users with these tasks, our system makes meaningful strides towards our goals. Our last step is fixing the problems mentioned in the discussion section and investigating the improvements mentioned in the future work section. Once this is accomplished, our system will become practical and serve users in their searching of unstructured content in digital libraries.

References

Apache Software Foundation (2015a). Class TFIDFSimilarity, Available: https://lucene.apache.org/core/5_2_1/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

Apache Software Foundation (2015b). Lucene 5.2.1 core API, Available: https://lucene.apache.org/core/5_2_1/core/overview-summary.html#overview_description

Bostock, M., Ogievetsky, V., Heer, J. (2011). D3: Data-Driven Documents in IEEE Trans. Visualization & Comp. Graphics.

Card, S. K., Mackinlay, J. D., and Shneiderman, B. (1999). Information visualization in Readings in Information Visualization: Using Vision to Think, pp. 1-34.

Dwyer, T. (2009). Scalable, Versatile and Simple Constrained Graph Layout in IEEE-VGTC Symposium on Visualization.

Fellbaum, C. (2005). What is WordNet. Princeton University.

Foley, J., van Dam, A., Feiner, S., Hughes, J. (1996). Computer Graphics: Principles and Practice, Addison-Wesley Publishing Company.

Lin, X. (1996). Graphical table of contents in Proc. of the first ACM Int. Conf. on Digital Libraries, pp. 45-53.

Marks, L., McMahon, T., and Luce, R. (2005). ActiveGraph: a digital library visualization tool in International Journal on Digital Libraries, vol. 5, no. 1, pp. 57-69.

Rushall, D., and Ilgen, M. (1996). A context vector-based self organizing map for information visualization in TIPSTER: Proc. of a Workshop held at Vienna, Virginia, pp. 159-166.

Sacks-Davis, R., Dao, T., Thom, J. A., Zobel, J. (1997). Indexing documents for queries on structure, content and attributes in Proc. of International Symposium on Digital Media Information Base (DMIB).

Shneiderman, B., Feldman, D., and Rose, A. (2000). Visualizing Digital Library Search Results with Categorical and Hierarchical Axes in Proc. 5th ACM International Conference on Digital Libraries, pp. 57-66.

Short, G., and Kim, B. (2014). Multi-tiered Visual Interfaces for Book Search with Digital Library Systems in Proceedings of the 6th International Conference on Multimedia, Computer Graphics and Broadcasting, pp. 21-24.

Turney, P. D. (2001). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL in Proc. of the 12th European Conference on Machine Learning (ECML '01), pp. 491-502.

Veerasamy, A., and Heikes, R. (1997). Effectiveness of a graphical display of retrieval results in Proc. of the 20th Annu. Int. ACM SIGIR Conf. on Research and Development of Information Retrieval, pp. 236-245.
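
The index generation step in the Methodology section stores, for each document, a word count and a record of which sentences contain which words. The following is a minimal Python sketch of such a structure, not the Lucene-based implementation the paper uses. The stop-word list, the sentence splitter, and the function name `build_index` are all illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the paper relies on Lucene's filtering instead.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def build_index(text):
    """Return (word_counts, word_to_sentence_ids) for one plain-text document."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = Counter()
    word_sentences = defaultdict(set)
    for sentence_id, sentence in enumerate(sentences):
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in STOP_WORDS:
                continue  # drop common words before indexing
            counts[word] += 1
            word_sentences[word].add(sentence_id)
    return counts, dict(word_sentences)


counts, where = build_index("A thread runs in a process. The process owns the thread.")
```

Keeping sentence ids next to the counts is what later lets a query be narrowed to only the sentences that contain a given term.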
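
The Tier 1 section scores document relevancy with TF-IDF, where a term scores higher when it is frequent in a document or rare across the document base. The sketch below uses the textbook form of TF-IDF to illustrate that behaviour. It is not Lucene's exact scoring formula, and the function names and the summed overall score are assumptions made for illustration.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF of `term` for one tokenized document in `corpus`."""
    if not doc_tokens:
        return 0.0
    tf = doc_tokens.count(term) / len(doc_tokens)
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        return 0.0
    # A term found in every document gets idf = log(1) = 0.
    return tf * math.log(len(corpus) / doc_freq)


def overall_relevance(terms, doc_tokens, corpus):
    """Assumed combination rule: sum the per-term scores."""
    return sum(tf_idf(t, doc_tokens, corpus) for t in terms)


corpus = [
    ["thread", "cpu", "thread"],
    ["process", "cpu"],
    ["cpu", "cache"],
]
thread_score = tf_idf("thread", corpus[0], corpus)
cpu_score = tf_idf("cpu", corpus[0], corpus)
```

In this toy corpus "cpu" appears in every document, so it contributes nothing to the ranking, while "thread" is both frequent in the first document and rare elsewhere.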
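
The Tier 1 layout pulls each document toward the three search term nodes with a tension proportional to its relevance to each term. The paper uses d3's force directed layout with collision handling. As a simplified stand-in, the sketch below computes the point where such tensions would balance if they behaved like zero-length springs, which is the relevance-weighted centroid of the term nodes. The anchor coordinates and the function name `place_document` are hypothetical.

```python
def place_document(anchors, weights):
    """Weighted centroid of the term anchors.

    anchors: list of (x, y) positions of the search term nodes.
    weights: per-term relevance of the document; 0 means no tension.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("document is unrelated to every search term")
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return x, y


# Hypothetical screen positions for the three term nodes.
anchors = [(0.0, 0.0), (100.0, 0.0), (50.0, 100.0)]
balanced = place_document(anchors, [1.0, 1.0, 1.0])   # all three terms equally
two_terms = place_document(anchors, [1.0, 1.0, 0.0])  # third term absent
```

With equal weights the document sits at the centre of the triangle, and with one weight at zero it falls on the edge between the other two terms, matching the placement behaviour the section describes. Collision avoidance is deliberately left out of this sketch.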
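
Tier 1 also encodes document length as node size, capped at 30 pixels, and overall relevance as a gray level from white (least relevant) to black (most relevant). The sketch below shows one way these two mappings could be written. The scale factor `words_per_px`, the linear interpolation, and both function names are assumptions, since the paper does not state the exact mapping.

```python
def node_size(word_count, words_per_px=200, cap=30):
    """Node size in pixels, on an absolute scale and capped at `cap`."""
    return min(cap, word_count / words_per_px)


def gray_level(score, lowest, highest):
    """8-bit gray value: 0 (black) for the best score, 255 (white) for the worst."""
    if highest == lowest:
        return 0  # every shown paper is equally relevant
    t = (score - lowest) / (highest - lowest)
    return round(255 * (1 - t))


sizes = [node_size(4000), node_size(10000)]
shades = [gray_level(4, 0, 4), gray_level(1, 0, 4), gray_level(0, 0, 4)]
```

A linear gray ramp is the simplest choice here; a perceptually uniform ramp could be substituted without changing the rest of the design.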
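
The related term score in the Tier 2 section divides the hits where a potential term and the base term occur together by the hits for the base term. The following sketch applies that ratio at sentence level, treating each sentence that contains a term as one hit. That unit of counting, the tokenized input format, and the name `related_scores` are assumptions, and the sketch does not filter out synonyms as the paper's definition of related terms requires.

```python
def related_scores(sentences, base):
    """Score every co-occurring word as Hits(word AND base) / Hits(base).

    sentences: list of token lists, one per sentence in the document set.
    """
    # Keep only the sentences that contain the base term.
    base_hits = [set(tokens) for tokens in sentences if base in tokens]
    if not base_hits:
        return {}
    candidates = set().union(*base_hits) - {base}
    return {
        word: sum(1 for tokens in base_hits if word in tokens) / len(base_hits)
        for word in candidates
    }


sentences = [
    ["thread", "shares", "memory"],
    ["thread", "uses", "cpu", "memory"],
    ["process", "owns", "memory"],
]
scores = related_scores(sentences, "thread")
```

Because candidates are drawn only from sentences that already contain the base term, every returned score is greater than 0, which is consistent with the paper showing all related terms that score above 0.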
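
Compound terms in the Tier 2 section pair a query term with the word directly before or directly after it in a sentence. A minimal sketch of that extraction step follows. It only collects and counts the candidate pairs and leaves the PMI-IR scoring aside, and the function name `compound_terms` and the tokenized input are illustrative assumptions.

```python
from collections import Counter


def compound_terms(sentences, base):
    """Count two-word terms formed by `base` and an adjacent word."""
    found = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            if word != base:
                continue
            if i > 0:
                found[(tokens[i - 1], word)] += 1  # word directly before
            if i + 1 < len(tokens):
                found[(word, tokens[i + 1])] += 1  # word directly after
    return found


sentences = [
    ["machine", "learning", "works"],
    ["an", "autonomous", "machine"],
]
pairs = compound_terms(sentences, "machine")
```

For the query "machine" this yields both "machine learning" and "autonomous machine", the two examples given in the section. In practice stop words would be removed first, as in the indexing step, so that pairs such as "the machine" do not appear.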