Knowledge Graph for Discovery and Navigation
Case of Interdisciplinary Ph.D. Program

Stanislava Gardasevic [0000-0002-5758-6968]

University of Hawai'i at Mānoa, Honolulu, HI, USA
gardasev@hawaii.edu

Abstract: This research proposes the development of a methodology for eliciting and formalizing relationships that should be organized in a knowledge graph, intended for improved resource discovery and collaboration opportunities in a Ph.D. program. Taking the case of an interdisciplinary Ph.D. program, the proposed steps include a participatory design method, text mining, and social network analysis, while reusing available models and vocabularies for the academic domain. The proposed analysis will be based on intellectual outputs, research profiles, information on activities, and other relevant data produced by the given community. The expected outcome would emphasize actors’ roles in a community, which should result in enhanced opportunities for quality cooperation.

Keywords: Knowledge Graph, Scholarly Data, Social Network Analysis, Topic Modeling, Ph.D. Research, Interdisciplinarity, Knowledge Discovery.

1 Introduction and Relevancy Statement

One of our professors would often ask us: What is the best dissertation? And by now, we all know and answer in unison: A done dissertation!

The focus of this paper is based on our shared experience; something that Ph.D. students applying for this doctoral consortium are facing now, and something that you, evaluating our applications, have already gone through (not to say survived). It is about creating a service that would facilitate information discovery and decision making for Ph.D. students. Considering the growth in the number of Ph.D. students around the world [1], this topic is relevant to a sizable population.
Not only that, but the proposed tool could provide similar opportunities to students pursuing other university degrees, as well as to other actors affiliated with a given program (professors, researchers, alumni, librarians, administration, etc.). Therefore, the background theme is: navigating academic space while improving possibilities for quality cooperation, as well as information/knowledge discovery.

The issue at hand becomes ever more complex in an interdisciplinary Ph.D. program consisting of over 100 alumni, 40 affiliated faculty members, 30 students, 4 schools, and 1 university. This is the case of the program that I am attending as a 2nd-year student: the Interdisciplinary Ph.D. in Communication and Information Sciences (CIS) at the University of Hawai'i at Mānoa. This program is taken as a case for examining and applying different methodologies and for developing the “pathfinder” tool. The intention here is to explore the problem of classification, and to create a knowledge organization system (KOS) by focusing on different and interesting relations that might be relevant to members of the community. The CIS program is taken here as an extreme case because it comprises 4 disciplines/3 schools: School of Communication (COM), Information and Computer Sciences (ICS), Information and Technology Management (School of Business) (ITM), and Library and Information Science (LIS). Not only is this interdisciplinary combination an interesting phenomenon for examining potential intersections in topical and people relations, but the results might be very relevant for science in general, considering that research is becoming more and more interdisciplinary [2].

Although my background is in LIS, each discipline of the CIS program contributes in its own way to the main goal of my work: facilitating information discovery and improving its relevance.
Being exposed to different ideas and paradigms is considered a great creativity amplifier, and that creative impulse is what I hope will drive my research, as well as my contribution to the ISWC 2018 Doctoral Consortium.

2 Problem Statement

Although most graduate students start with a web search as their first information-seeking activity, doctoral students often consult their faculty advisers, then librarians and peers [3]. People do play a significant role in all phases of Ph.D. research. But how can one find an appropriate person, one you can talk to and hopefully even work with? The collaborators one chooses for a thesis, especially the committee chair and members, can account for eventual problems or successes of the thesis research process, something that might influence an entire career.

Therefore, the problem this research is going to address is facilitating the discovery of relevant resources that are considered necessary for the success of a Ph.D. student, e.g. finding an appropriate supervisor, thesis committee members, collaborators, courses, projects, information on conferences, seminars, etc. Relevant information can come from many sources. This research aims to develop technology based on a knowledge graph, envisioned to help with connecting people: pointing us to those around us who can potentially provide valuable pieces of information, and thus help us with the decision-making process.

The research will attempt to address the problem of establishing a methodology for knowledge graph creation, based on combining methods already implemented in other solutions, and applying it to the case of the CIS program and its pertinent domains. Still, the intention is for the methodology to be reusable in any given academic program.
The research will address the issues of i) choosing methods that could be applied to extract interesting relations from data produced by a community, ii) the order in which they should be applied, and iii) discovering interesting relations that should be included in the graph/KOS by means of data science. The results of addressing these issues, checked with participants from the community, will be used to develop a particular application, which will then be tested.

3 Related Work

Academia and scholarly communication is an interesting area for developing recommendation-based systems. Due to its rather complex, yet relatively structured and well documented body of knowledge, it offers a great testbed for developments in different domains such as information retrieval (IR), LIS (sciento/bibliometrics, KOS & classification, academic librarianship), social network analysis (SNA) and visualization, and the Semantic Web, including numerous ontologies developed for this purpose, and many others. This research is intended to re-use and mix relevant, already validated solutions, methodologies and paradigms from these different domains.

For example, one of the rather comprehensive schemas in this area is the VIVO Ontology for Research Discovery¹. This model comes as part of the Semantic Web platform OpenVIVO, which is freely available for use and upload of data, whether by an institution or an individual researcher [4]. Not only is the VIVO ontology well elaborated, but the mentioned implementation allows for different explorations/navigations of data, e.g. author-topic connections through the Capability Map², the co-authorship network, etc. Another application, Rexplore [5], develops the possibility of scientific data exploration even further.
The solution proposed there combines many functionalities to facilitate expert search on a fine-grained level by treating research areas as semantic concepts rather than syntactic ones (keywords, as usually utilized in IR systems). Furthermore, the system offers an interesting exploration through the graph view, which can be interactively navigated based on different relations between authors, ranked based on various metrics, and filtered with respect to years, topics, and venues of publishing activity.

These are examples of good practice in facilitating scientific information discovery: VIVO with a focus on open and reusable scientific data, and Rexplore aiming at eventual business processes and usages. Still, both cover research data on the global scale. In contrast, the research presented in this paper is more locally focused. Being strongly grounded in a particular place (a geolocation, implying an organization and, even more precisely, the unit(s) within it) entails having local norms and requirements related to research topics and practices. These norms will be taken as paramount in the Topic class modeling effort, hopefully resulting in increased relevance and usage in that community. Furthermore, the graph view is intended to be used not only for visualization (sensemaking) purposes, but also for interactive navigation of the knowledge base.

Research on expert profiling and recommendation has been popular lately. One such endeavor has elicited a methodology that might be reused here. The STEP methodology [6] incorporates the extraction of concepts based on domain ontologies, and their consolidation by annotating lexically different but semantically similar entities, in order to create automatic and time-dependent expert profiles. That methodology was further extended with statistical methods, topic modeling and n-gram modeling, in an attempt to improve results.
Still, in cases where no semantic reasoning is applied as a method, and only probabilistic methods such as topic modeling [7] or author-topic modeling [8] are used, the network science-based methodology presented by Paranyushkin [9] could be utilized: one might run through it a particular document, or a subset of the corpus assigned to a particular topic, in order to validate results and/or name topics more adequately.

¹ https://bioportal.bioontology.org/ontologies/VIVO
² http://openvivo.org/vis/capabilitymap#

Finally, there has not been much application of deeper SNA methods in KOS design, beyond visualizing collaboration networks [4, 5] and recommendation systems based on similarity of user profiles [10]. The proposed research intends to explore this frontier further. Research presented by Kadriu [11] shows how network science metrics (in this case the centrality metrics degree, closeness, betweenness, and PageRank) can bring valuable insights into the state of topical expertise in an institution. Including such information in a KOS could be a valuable asset, since it could inform on the roles that certain people might play in the community (e.g. high betweenness centrality would point out people who would be best placed to spread information, as they connect those in disparate parts of the network). Besides centrality and the degree of separation (pointing out the connecting nodes), my plan is to apply (overlapping) community detection algorithms, assortativity, affiliation and other SNA algorithms for further analysis [12].

4 Research Questions

Tentative research questions behind this proposal are:

RQ1 What are the information needs of a Ph.D. community?
• What information is deemed relevant for successfully fulfilling a program's requirements?
• What types of social support aspects are lacking in current tools?
• How can people use novel technology to navigate the academic information space?
RQ2 How do you organize the domain information in a coherent way, by means of creating and navigating a knowledge graph?
RQ3 Which is the more appropriate method for knowledge discovery: the created knowledge graph or the existing ones?
• To what extent does the new KOS improve the information discovery experience/fulfillment of information needs for this community?

5 Hypotheses

Considering that the proposed research harnesses methodologies from different disciplines, including social sciences (participatory design) and IS (design science), it is not possible to answer all of the proposed questions by means of quantitative research methods. Still, several hypotheses can be posed in order to answer RQ3.

• The created knowledge graph is a more appropriate (faster, more relevant) method for information discovery than the already existing means (e.g. the CIS website).
• The created knowledge graph has a positive impact on the lives of CIS students (e.g. it helps in finding more relevant courses, projects, mentors, etc.).
• Overall satisfaction with information discovery possibilities is higher when using the created knowledge graph than with the already available means.

6 Approach

When designing an information tool, it is considered good practice to go to the community this tool should serve. Research has shown that through a participatory design approach, research participants become responsible agents, deemed partners rather than subjects of research [13]. Not only that, but such agency can potentially make the underlying values more visible, and thus facilitate establishing a more comprehensive rationale for cooperation. For that reason, participatory design methodology is considered the most appropriate for i) answering RQ1, ii) informing the design of the knowledge graph, as well as iii) evaluating and improving it.
The community that would be involved in this research consists of my CIS peers (a group of about 30 students) and the CIS committee (the core of 5 professors included in the decision-making processes). Several workshop sessions will be organized in order to elicit the valuable group experience. Also, an online questionnaire will be conducted in the same community, in order to capture the information that will be included in the social graph, i.e. important years in the program, classes/directed readings taken, committee members, topics of interest, estimated relationships with other students, research methodologies used, and other variables that might be interesting for the purpose of analysis, visualization and/or recommendation.

Participatory design is in sync with yet another interesting theoretical approach to knowledge organization, called domain analysis. By looking at discourse communities [14], the study of knowledge domains should take into consideration factors such as the structure of a knowledge organization, as well as its cooperation patterns, language and communication forms, information systems and other relevance criteria. This approach will inform the data collection and analysis, with the intention to inform RQ2. Besides the intellectual outputs (publications, posters, research data, theses, course material), the data should include the community members’ activities (research, projects, teaching, supervising, etc.), meta-information (research profiles) and other, often tacit and implied information that might be deemed relevant, yet is not equally available to all members of the community. The collected data will be analyzed by different means utilized in data science, including IR/text mining methods (LDA and author-topic modeling), SNA metrics and other methods that should allow for the formalization of certain information that is pertinent, yet not apparent.
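As an illustration of the SNA metrics mentioned above, the following minimal Python sketch computes degree centrality and (unnormalized) betweenness centrality on a small network. It uses Brandes' algorithm for unweighted, undirected graphs, which is one common way to obtain betweenness and not necessarily the tooling that the actual analysis will rely on. The node names and edges are entirely hypothetical and do not come from the CIS data.

```python
from collections import deque

# Hypothetical toy network (an assumption for illustration only):
# two tightly knit groups, {A, B, C} and {D, E, F}, bridged by person G.
network = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "G"],
    "G": ["C", "D"],
    "D": ["G", "E", "F"],
    "E": ["D", "F"],
    "F": ["D", "E"],
}


def degree_centrality(adj):
    """Share of the other nodes that each node is directly connected to."""
    n = len(adj)
    if n < 2:
        return {v: 0.0 for v in adj}
    return {v: len(neighbours) / (n - 1) for v, neighbours in adj.items()}


def betweenness_centrality(adj):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    betweenness = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {v: [] for v in adj}
        path_count = {v: 0 for v in adj}
        distance = {v: -1 for v in adj}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        dependency = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                betweenness[w] += dependency[w]
    # In an undirected graph every pair is visited from both ends.
    return {v: value / 2 for v, value in betweenness.items()}


degree = degree_centrality(network)
betweenness = betweenness_centrality(network)
for node in sorted(network, key=betweenness.get, reverse=True):
    print(f"{node}: degree={degree[node]:.2f}, betweenness={betweenness[node]:.1f}")
```

In this toy network, G has few direct connections but lies on every shortest path between the two groups, which is the kind of connecting role that betweenness centrality is meant to surface.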
Throughout this research, we will try to re-use proven methods for the analysis and combine them with other methods that are not so frequently used for this purpose. This should result in a novel approach to the stated problem.

6.1. Modeling

Much of the modeling effort in this research will rely on already established schemas, e.g. the People class will be largely informed by the one in the VIVO ontology, with the focus on people’s activities, such as publishing, co-authorship, mentorship, teaching/attending courses, projects, involvement in labs, etc.

However, the modeling of the Topic class is going to be tackled in a slightly different way. It should include not only the topic of interest, but also notions such as application area, methodology used, as well as domains of expertise, both sought and obtained (possibly indicated by courses taught and/or taken). Finally, epistemological studies are considered a crucial part of the domain analysis approach [15]; therefore, different traditions that are due to epistemological schools should be part of the modeling effort.

Such an approach is intended to support the needs of a particular local community, since this level of granularity is usually not available in present discovery tools. Also, wherever possible, existing vocabularies will be reused (e.g. a subset of the FAST thesaurus³ for the broader research domain), while keeping local trends in consideration.

6.2. Building the Graph

The data (relations) deemed relevant should be stored and organized in the graph database system Neo4j⁴, set up for this purpose. This NoSQL database is considered appropriate for capturing relations between entities, serving recommendations, as well as allowing for more dynamic knowledge representation and data update. This database was successfully applied in the scholarship domain in the Research Graph⁵ project [16].

6.3. Maintaining and Updating the Graph

The methodology for the development of this tool will accommodate the future maintenance and update of the graph, so that the process is mostly automated. Considering that topic modeling is dependent on the most recent publications, the actual implementation of such a tool would imply stricter compliance with institutional policies, such as uploading publications to the institutional repository (in this case ScholarSpace⁶), but also updating researcher profiles on the departmental website. While data for the paper co-authorship graphs can be automatically harvested and ingested from DBLP⁷, the metadata on theses and projects would need manual curation (by a program Teaching Assistant or a designated librarian). Furthermore, the design of the tool will aim to support interoperability with other systems, and thus the use of APIs for automatic ingest and update of data, as shown in the case of OpenVIVO [4].

³ https://www.oclc.org/research/themes/data-science/fast.html
⁴ https://neo4j.com/
⁵ http://researchgraph.org/
⁶ https://scholarspace.manoa.hawaii.edu/
⁷ http://dblp.uni-trier.de/

6.4. Evaluation Plan

In order to make sure the final product is indeed a useful tool, one or more means of evaluation will be utilized. As previously mentioned, the participatory design approach will be used to indicate whether the attempt is going in the right direction, and to advise on the design and possible functionalities of the tool. Also, the same group can be involved in order to answer RQ3. However, possibly the more appropriate way to test the hypotheses posed in section 5 would be the quasi-experimental method. Students, preferably newly admitted ones, would be asked to perform a set of tasks using both the new tool and the already existing one, the CIS website.
Metrics that could be used to measure the eventual improvement in this case are: speed of discovery, relevance of results and overall satisfaction level with the new tool.

7 Pilot Project and Preliminary Results

The pilot project was set up for the purpose of testing different methods of data analysis and establishing procedures and software solutions that will be used on the full dataset. The pilot project dataset comprises 95 publications: 20 theses produced by CIS Ph.D. candidates, 74 papers, and a single book. The chosen publications are the most recent ones available in full text and include intellectual outputs by 30 professors (on average 3 papers per professor), 20 alumni, and 3 current students.

Topic modeling (LDA) was performed on the corpus, using the R language with the tm and topicmodels libraries. A total of 45 topics was chosen as appropriate for this purpose, using the ldatuning library [17]. Interestingly, in 90% of cases, theses can indeed be considered interdisciplinary, because they were assigned to a topic disparate from the ones assigned to professors. Results have shown that most of the recent research in this community is related to civic activity in social networks. The results of this analysis will be further examined in order to inform the Topic class modeling effort (some of the attempts can be seen in the visualizations obtained with the Gephi⁸ software, e.g. those showing the assignment of words to topics⁹).

Furthermore, the same dataset was used for creating networks for analysis: the co-authorship network and the thesis-mentorship network, the latter made of the 20 most recent CIS theses. The co-authorship network shows cliques of co-authors, usually based on a departmental/school setting, but also cooperation between departments and with students (see visualization¹⁰).
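The construction of a co-authorship network of the kind described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the pilot: the publication records, the author names, and the `unit` affiliation mapping are all hypothetical, and the function names are introduced here only for the example. Every pair of co-authors on a publication becomes an edge, weighted by the number of joint publications, and edges whose endpoints belong to different units hint at cross-departmental cooperation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication records (assumed structure, not the pilot dataset).
publications = [
    {"title": "Paper 1", "authors": ["Prof A", "Prof B"]},
    {"title": "Paper 2", "authors": ["Prof B", "Prof A", "Student X"]},
    {"title": "Paper 3", "authors": ["Prof C", "Prof A"]},
]

# Hypothetical affiliation of each author (school or program).
unit = {"Prof A": "LIS", "Prof B": "LIS", "Prof C": "ICS", "Student X": "CIS"}


def coauthorship_edges(records):
    """Map each unordered author pair to its number of joint publications."""
    edges = Counter()
    for record in records:
        # Sorting makes (a, b) and (b, a) count as the same undirected edge.
        for pair in combinations(sorted(set(record["authors"])), 2):
            edges[pair] += 1
    return edges


def cross_unit_edges(edges, affiliation):
    """Keep only the edges that connect authors from different units."""
    return {
        pair: weight
        for pair, weight in edges.items()
        if affiliation[pair[0]] != affiliation[pair[1]]
    }


edges = coauthorship_edges(publications)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight} joint publication(s)")
print("Cross-unit ties:", sorted(cross_unit_edges(edges, unit)))
```

An edge list of this form can be exported to a tool such as Gephi for visualization, or used as input to further network metrics.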
Because of the small sample, not all of the relations are apparent, yet the method shows promising results for exploring community/interdisciplinary overlaps. Also, the thesis-mentorship network shows who the important actors are when it comes to chairing or participating in dissertation committees (see visualizations¹¹). These visualizations make the roles of individuals in the community more apparent than is currently possible to notice on the CIS website.

⁸ https://gephi.org/
⁹ https://stasha.net/vizualizations-topic-modeling-lda-results-2/
¹⁰ https://stasha.net/visualizations-co-authorship-network/
¹¹ https://stasha.net/visualization-thesis-mentorship-graph/

8 Reflections

The created knowledge graph could be utilized for visual navigation and discovery of information, including recommendations, where different dimensions/granularity levels of the data can be explored in an interesting and intuitive way. It should allow the possibility to navigate through the graph in various directions. For example, starting from a particular topic of interest, one might get to a professor who is working on it, see her collaborators (co-authors and potential committee members), her students (whom one might ask for advice), activities (classes and projects), and even publications. All of this can be done in a seamless way and without visiting several different web pages, which is the case now. Also, the highest-level graph view would show the gestalt view of the community and the activities in it.

Furthermore, one of the main issues when developing novel technologies is related to attracting a critical mass of users. Most of the state-of-the-art applications in this area, although having impressive technology and complex modeling, are arguably scarcely used by researchers (with the exception of Google Scholar and other corporate solutions).
This opacity, beyond the circle of scientists who are interested in developing similar technology, is partly because the service is buried in the Open Web among many other similar tools. Although one might argue that science knows no geographical boundaries and that wider coverage of the data is always better, the strategy of focusing on a much smaller scale (in this case the CIS program, or any given academic department) and offering such a service on the website of the program or library, where each student is bound to access it, would considerably influence the actual usability of the technology, but also facilitate the discovery of relevant knowledge that might be directly obtained from a senior researcher in one’s vicinity.

Finally, this research has a sociotechnical focus, aiming to influence changes and behavior that should be beneficial for academia by i) supporting the open access movement by enforcing the policies about submission of research papers and theses into the institutional repository; ii) encouraging the university librarians to be better connected with faculty and their activities; iii) facilitating valuable connections within the community by promoting collaboration opportunities, and thus facilitating the education process and its quality; iv) promoting the visibility of people’s activity, therefore encouraging others to be more active in collaboration/mentoring. The last point is a rather interesting one, since visibility often calls for accountability [18], and this feature can have a twofold benefit. Firstly, in the case of the thesis-mentorship graph, professors can see their peers as prominent nodes, which might inspire them to take on more mentoring themselves. Secondly, student mentoring activity can be a basis for an alternative metric that might help administrators evaluate the impact of a professor, rather than using only publication-based ones.
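One possible shape of such a mentoring-based metric is sketched below in Python. It is only an illustration under stated assumptions: the committee records are hypothetical, and the weighting of chairing a committee more heavily than membership (`chair_weight`, `member_weight`) is an arbitrary choice made for the example, not a proposal from this research.

```python
from collections import defaultdict

# Hypothetical thesis committee records (not the CIS thesis data).
committees = [
    {"thesis": "Thesis 1", "chair": "Prof A", "members": ["Prof B", "Prof C"]},
    {"thesis": "Thesis 2", "chair": "Prof A", "members": ["Prof C"]},
    {"thesis": "Thesis 3", "chair": "Prof B", "members": ["Prof A", "Prof C"]},
]


def mentoring_counts(records):
    """Count how often each professor chaired or sat on a committee."""
    counts = defaultdict(lambda: {"chair": 0, "member": 0})
    for record in records:
        counts[record["chair"]]["chair"] += 1
        for member in record["members"]:
            counts[member]["member"] += 1
    return dict(counts)


def mentoring_scores(counts, chair_weight=2.0, member_weight=1.0):
    """Combine the two roles into one score; the weights are assumptions."""
    return {
        person: chair_weight * c["chair"] + member_weight * c["member"]
        for person, c in counts.items()
    }


counts = mentoring_counts(committees)
scores = mentoring_scores(counts)
for person in sorted(scores, key=scores.get, reverse=True):
    print(f"{person}: {counts[person]} -> score {scores[person]:.1f}")
```

The raw counts correspond to the in-degree of a professor in the thesis-mentorship network, split by role; any aggregation into a single score would need to be agreed upon with the community.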
Professors who take part in the research of many students are investing their time and expertise in molding future scientists, educators, and science as such. Thus, this metric should be more prominent in the current education system.

Acknowledgments. I would like to thank Dr. Lipyeow Lim for his advising and support of my efforts in conquering new frontiers (IR), and for asking important and interesting questions that helped shape this proposal and envision the technology that should come out of it. Also, I would like to thank Dr. Dan Suthers for his fervor in teaching Network Science and his dedication to each individual student project, mine included. Finally, thanks to the CIS program for allowing me the opportunity to pursue my dreams and supporting me to attend the ISWC 2018 Doctoral Consortium. I hope this research will contribute to making the program even better for the students to come.

References:

1. Cyranoski, D., Gilbert, N., Ledford, H., Nayar, A., Yahia, M.: The PhD Factory. Nature 472, pp. 276–279 (2011). https://doi.org/10.1038/472276a
2. Porter, A. L., Rafols, I.: Is science becoming more interdisciplinary? Measuring and mapping six research fields over time. Scientometrics, 81(3), pp. 719–745 (2009). https://doi.org/10.1007/s11192-008-2197-2
3. Catalano, A.: Patterns of graduate students’ information seeking behavior: a meta-synthesis of the literature. Jour. of Doc., 69(2), pp. 243–274 (2013). https://doi.org/10.1108/00220411311300066
4. Ilik, V., Conlon, M., Triggs, G., Haendel, M. A., Holmes, K.: OpenVIVO: Transparency in Scholarship. Front. Res. Met. Ana., 2 (2017). https://doi.org/10.3389/frma.2017.00012
5. Osborne, F., Motta, E., Mulholland, P.: Exploring Scholarly Data with Rexplore. In: Alani H. et al. (eds) The Sem. Web – ISWC 2013. Lecture Notes in Com. Sci. 8218, Springer, Berlin, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41335-3_29
6. Ziaimatin, H., Groza, T., Hunter, J.: Semantic and Time-Dependent Expertise Profiling Models in Community-Driven Knowledge Curation Platforms. Fut. Int., 5(4), pp. 490–514 (2013). https://doi.org/10.3390/fi5040490
7. Blei, D. M.: Probabilistic topic models. Com. of ACM, 55(4), pp. 77–84 (2012). https://doi.org/10.1145/2133806.2133826
8. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The author-topic model for authors and documents. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 487–494. AUAI Press, Banff (2004). https://mimno.infosci.cornell.edu/info6150/readings/398.pdf
9. Paranyushkin, D.: Identifying the pathways for meaning circulation using text network analysis. Nodus Labs, Berlin (2011). https://noduslabs.com/research/pathways-meaning-circulation-text-network-analysis/
10. Thiagarajan, R., Manjunath, G., Stumptner, M.: Finding Experts by Semantic Matching of User Profiles. In: 3rd Expert Finder Workshop on Personal Identification and Collaborations: Knowledge Mediation and Extraction (PICKME 2008), Innsbruck, Austria (2008). http://ceur-ws.org/Vol-403/paper1.pdf
11. Kadriu, A.: Discovering value in academic social networks: A case study in ResearchGate. In: ITI 2013, pp. 57–62. IEEE Press, Cavtat/Dubrovnik (2013). doi:10.2498/iyi.2013.0566
12. Barabasi, A. L.: Network Science. Cambridge University Press, Cambridge UK (2016). http://barabasi.com/networksciencebook/
13. Carroll, J. M., Rosson, M. B.: Wild at Home: The Neighborhood as a Living Laboratory for HCI. ACM Tran. Com.-Hum. Int., 20(3), pp. 1–28 (2013). https://doi.org/10.1145/2491500.2491504
14. Hjørland, B., Albrechtsen, H.: Toward a new horizon in information science: Domain-analysis. Jour. Ame. Soc. Inf. Sci., 46(6), pp. 400–425 (1995). https://doi.org/10.1002/(SICI)1097-4571(199507)46:6<400::AID-ASI2>3.0.CO;2-Y
15. Hjørland, B.: Domain analysis in information science: Eleven approaches – traditional as well as innovative. Journal of Documentation, 58(4), pp. 422–462 (2002). https://doi.org/10.1108/00220410210431136
16. Aryani, A., Wang, J., Zhang, H., Xiang, A., Zhou, Z., Wang, K.: Visualising Research Graph using Neo4j and Gephi. In: Open Repositories Conference, Brisbane (2017). doi:10.4225/03/58c8e8cc8a1ec
17. Nikita, M.: Select number of topics for LDA model (2016). https://cran.r-project.org/web/packages/ldatuning/vignettes/topics.html
18. Erickson, T.: ‘Social’ systems: designing digital systems that support social intelligence. AI & Soc., 23(2), pp. 147–166 (2009). https://doi.org/10.1007/s00146-007-0140-3