From Keywords to Structured Summaries: Streamlining Scholarly Information Access Mahsa Shamsabadi, Jennifer D’Souza TIB Leibniz Information Centre for Science and Technology, Hannover, Germany Abstract This poster paper highlights the increasing importance of information retrieval (IR) engines in the scientific community, addressing the inefficiencies of traditional keyword-based search engines amid the growing volume of publications. Our proposed solution uses structured records, supported by advanced information technology (IT) tools such as visualization dashboards, to transform how researchers access and filter articles, moving away from a text-heavy approach. This vision is demonstrated through a proof of concept focused on the “reproductive number estimate of infectious diseases” research theme. We utilize a fine-tuned large language model (LLM) to automate the creation of structured records for a backend database, enhancing information access beyond simple keywords. The result is a next-generation information access system, available at https://orkg.org/usecases/r0-estimates. Keywords Structured scientific knowledge, Structured scientific information extraction (IE), Large Language Models, Visualization dashboards, Scientific information retrieval (IR) platforms 1. Introduction The rapid expansion of scientific literature necessitates a reevaluation of their information retrieval (IR) engines [1, 2]. Traditional keyword-based approaches are inadequate for tracking fast-paced scientific advancements. There is a growing demand for structured scientific content representations [3, 4] and advanced machine learning algorithms [5, 6] to enhance retrieval accuracy. Initiatives like the Open Research Knowledge Graph (ORKG) [7] drive this paradigm shift towards structured knowledge representations, enabling intelligent views and comparisons of research facets [8, 9]. Our goal is to simplify access to scientific articles and reduce cognitive load for researchers using information technology (IT). We propose dashboards as visual tools to represent structured scientific knowledge, enhancing research filtering and discovery processes [10]. Dashboards have been widely used, including during the Covid-19 pandemic, where they helped track cases, analyze trends, and support decision-making with data from sources like the WHO and Johns Hopkins [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]. In contrast, our approach focuses on applying IT to structure scientific knowledge itself, using information extraction (IE) mechanisms and large language models (LLMs) to power next-generation information systems. Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA † This work was supported by the German BMBF project SCINEXT (ID 01lS22070). Envelope-Open jennifer.dsouza@tib.eu (J. D’Souza) Orcid 0000-0002-6616-9509 (J. D’Souza) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings In pursuit of our vision, this poster paper presents a proof of concept (POC) using the ORKG-R0 semantic model [37] to structure articles on the ”reproductive number estimate of infectious diseases” theme [38]. The model captures essential properties like disease name, study location, date, 𝑅0 value, % confidence interval values, and computation method, enabling effective comparison across studies. Four research questions (RQs) guide article search and exploration: RQ1 identifies maximum 𝑅0 estimates, RQ2 examines study counts by disease and location, RQ3 analyzes 𝑅0 value ranges by location for selected diseases, and RQ4 maps study locations globally on the world map. These RQs are visualized in a dashboard to enhance article filtering and provide researchers with concise insights into research progress. 2. Next-Generation Scientific Information Retrieval (IR) We introduce a next-generation IR platform for “reproductive number estimates of infectious diseases,” enhancing scientific article access with IT and four visual charted summaries tailored to four specific RQs alluded to earlier. In the following subsections, we will detail the LLM-based IE method, article collection, and platform workflow. 2.1. The Scientific Information Extraction (IE) Large Language Model (LLM) We employ the ORKG-FLAN-T5 R0 LLM [39]. This model is an instruction fine-tuned variant of FLAN-T5 Large (780 M) using the instruction-tuning paradigm introduced as FLAN (Finetuned Language Net) [40, 41, 42, 43]. It processes a paper’s title and abstract to produce structured summaries based on six key properties: disease name, location, date, 𝑅0 value, % confidence interval (CI) values, and method, related to the 𝑅0 estimate [39]. Table 1 The top 20 infectious disease names (and number of papers) in our initial dataset. covid-19 (1002) mers-cov (21) measles (15) hepatitis c (8) dengue (41) cholera (18) hepatitis b (12) tuberculosis (8) influenza (29) zika (18) zika virus (12) monkeypox (8) hiv (23) african swine fever (17) ebola (11) west nile virus (7) sars (22) ebola (17) hand, foot, and mouth disease (8) malaria (7) 2.2. The Scholarly Articles Collection The initial set of articles in our collection was sourced from keyword-based searches in the PubMed database, with the most recent search conducted on September 13, 2023. The search query used was: (basic reproduction number[TIAB] OR basic reproductive number[TIAB] OR basic reproduction ratio[TIAB] OR basic reproductive rate[TIAB] OR R0[TIAB]) NOT (R0 resection OR cancer) , targeting papers with any synonyms of 𝑅0 in the title or abstract. This yielded 7,127 articles. We leveraged the ORKG-FLAN-T5 R0 LLM [39] to filter articles that did not report an 𝑅0 value as unanswerable; or otherwise provide structured JSON descriptions for articles with 𝑅0 estimates. Virology Dashboard Front-end (a) Virology Dashboard Back-end Request Analytical Query Services Database WEB API Virology Dashboard Query Response Database Answer Database Update Module (b) Process Scheduler structured LLM Update Data summaries (c) PubMed Documents Query (d) PubMed API Figure 1: (Left image) A visual analytical dashboard in our next-generation information retrieval (IR) platform provides charts (a), (b), (c), (d) in a dashboard to help researchers make informed article filtering decisions. (Right image) The backend workflow, managed by a web API, handles database interactions for frontend rendering. It incorporates a schedule for database updates programmed to run monthly, with LLM queries supplying structured scientific knowledge before each update. After filtering out unanswerables, 2,051 articles remained, yielding 2,736 structured summaries. The processed data was imported to a PostgreSQL 16 database, serving as the backend storage. The top 20 most represented infectious diseases in our initial database is shown in Table 1. Notably, the LLM’s high precision confirmed that the top reported diseases are indeed ascertained infectious diseases. Our database covers studies from all seven continents. 2.3. The Information Retrieval (IR) Platform Workflow The platform is accessible as a web application at the following URL: https://orkg.org/usecases/ r0-estimates. The visualization dashboard widget and underlying workflow are displayed in Figure 1. In this workflow, the frontend communicates with the backend through a Web API for database queries and data retrieval. A Python script scheduler, programmed to run monthly, periodically updates the database with new articles querying PubMed and following the LLM processing cycle before updating the database with structured summaries. Our workflow maximizes the use of cutting-edge technology, including an optimized next-generation LLM. Figure 2: Chart (a) in Figure 1 displays maximum 𝑅0 values by disease to enhance scholarly publication filtering. The y-axis shows max 𝑅0 values, and the x-axis lists diseases. Users can filter by 𝑅0 range, and clicking a bar reveals underlying publication details with links to PubMed articles. 2.3.1. Charting the data: collating, summarizing, and reporting Our IR platform includes three main components: 1) a statistics snapshot showing total papers, structured knowledge, infectious diseases, and locations, 2) a standard paper listing in a keyword- based table, filtered as needed, built with the ag-grid JavaScript library, and 3) a visual analytical dashboard with four charts addressing our research questions. This process involves collating relevant properties, selecting the best chart from the React chart library to summarize the response, and creating a query to report the visual summary. Each RQ is represented by a visual chart. E.g., RQ1, “What are the maximum 𝑅0 estimates reported for diseases in our database?” is illustrated with a bar chart that plots diseases on the x-axis against their maximum 𝑅0 values on the y-axis. Hovering over a bar displays the disease and its max 𝑅0. This interactive chart, which can be adjusted for specific 𝑅0 ranges, simplifies the comparison of 𝑅0 estimates across numerous studies. Clicking on a bar provides a direct link to the contributing article on PubMed, thereby enhancing scholarly information retrieval significantly beyond traditional methods. 3. Conclusion In this poster paper, we present a POC for a new scholarly IR engine that enhances access and reduces the cognitive load of traditional, keyword-based searches. We address the inefficiencies of manual paper filtering in traditional IR systems, exacerbated by rapidly increasing publication volumes. Our approach models key research aspects for machine processing, paving the way for next-generation visual assistants that streamline scholarly research access. References [1] S. Fortunato, C. T. Bergstrom, K. Börner, J. A. Evans, D. Helbing, S. Milojević, A. M. Petersen, F. Radicchi, R. Sinatra, B. Uzzi, et al., Science of science, Science 359 (2018) eaao0185. [2] L. Bornmann, R. Haunschild, R. Mutz, Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases, Humanities and Social Sciences Communications 8 (2021) 1–15. [3] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D’Souza, S. Kabongo, H. B. Giglou, Y. Zhang, S. Auer, J. Kamps, Clef 2024 simpletext track: Improving access to scientific texts for everyone, Springer-Verlag, Berlin, Heidelberg, 2024, p. 28–35. URL: https://doi.org/10.1007/978-3-031-56072-9_4. doi:10.1007/978- 3- 031- 56072- 9_4 . [4] P. Fontelo, A. Gavino, R. F. Sarmiento, Comparing data accuracy between structured abstracts and full-text journal articles: implications in their use for informing clinical decisions, BMJ Evidence-Based Medicine 18 (2013) 207–211. [5] W. Ammar, D. Groeneveld, C. Bhagavatula, I. Beltagy, M. Crawford, D. Downey, J. Dunkel- berger, A. Elgohary, S. Feldman, V. Ha, et al., Construction of the literature graph in semantic scholar, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), 2018, pp. 84–91. [6] D. Pride, M. Cancellieri, P. Knoth, Core-gpt: Combining open access research and large lan- guage models for credible, trustworthy question answering, in: International Conference on Theory and Practice of Digital Libraries, Springer, 2023, pp. 146–159. [7] S. Auer, A. Oelen, M. Haris, M. Stocker, J. D’Souza, K. E. Farfar, L. Vogt, M. Prinz, V. Wiens, M. Y. Jaradeh, Improving access to scientific literature with knowledge graphs, Bibliothek Forschung und Praxis 44 (2020) 516–529. [8] A. Oelen, M. Stocker, S. Auer, Smartreviews: towards human-and machine-actionable reviews, in: Linking Theory and Practice of Digital Libraries: 25th International Conference on Theory and Practice of Digital Libraries, TPDL 2021, Virtual Event, September 13–17, 2021, Proceedings 25, Springer, 2021, pp. 181–186. [9] A. Oelen, M. Y. Jaradeh, K. E. Farfar, M. Stocker, S. Auer, Comparing research contributions in a scholarly knowledge graph, in: CEUR workshop proceedings; 2526, volume 2526, Aachen: RWTH Aachen, 2019, pp. 21–26. [10] H. Santos, V. Dantas, V. Furtado, P. Pinheiro, D. L. McGuinness, From data to city indicators: A knowledge graph for supporting automatic generation of dashboards, in: The Semantic Web: 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28–June 1, 2017, Proceedings, Part II 14, Springer, 2017, pp. 94–108. [11] T. Khodaveisi, H. Dehdarirad, H. Bouraghi, A. Mohammadpour, F. Sajadi, M. Hosseini- ravandi, Characteristics and specifications of dashboards developed for the covid-19 pandemic: a scoping review, Journal of Public Health (2023) 1–22. [12] O. Lezhnina, G. Kismihók, M. Prinz, M. Stocker, S. Auer, A scholarly knowledge graph- powered dashboard: Implementation and user evaluation, Frontiers in Research Metrics and Analytics 7 (2022) 934930. [13] H. Santos, V. Dantas, V. Furtado, P. Pinheiro, D. L. McGuinness, From data to city indicators: A knowledge graph for supporting automatic generation of dashboards, in: The Semantic Web: 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28–June 1, 2017, Proceedings, Part II 14, Springer, 2017, pp. 94–108. [14] M. Salehi, M. Arashi, A. Bekker, J. Ferreira, D.-G. Chen, F. Esmaeili, M. Frances, A synergetic r-shiny portal for modeling and tracking of covid-19 data, Frontiers in public health 8 (2021) 623624. [15] D.-H. Yang, T.-W. Chien, Y.-T. Yeh, T.-Y. Yang, W. Chou, J.-K. Lin, Using the absolute advantage coefficient (aac) to measure the strength of damage hit by covid-19 in india on a growth-share matrix, European Journal of Medical Research 26 (2021) 1–11. [16] Z. Zhu, K. Meng, J. Caraballo, I. Jaradat, X. Shi, Z. Zhang, F. Akrami, H. Liao, F. Arslan, D. Jimenez, et al., A dashboard for mitigating the covid-19 misinfodemic, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2021. [17] D. Aristizábal-Torres, C. A. Peñuela-Meneses, A. M. Barrera-Rodríguez, An interactive web-based dashboard to track covid-19 in colombia. case study: five main cities, Revista de Salud Pública 22 (2023) 214–219. [18] L. E. Hodgson, T. Leckie, A. Hunter, N. Prinsloo, R. Venn, L. Forni, Covid-19 recognition and digital risk stratification, Future Healthcare Journal 7 (2020) e47. [19] R. Ravinder, S. Singh, S. Bishnoi, A. Jan, A. Sharma, H. Kodamana, N. A. Krishnan, An adaptive, interacting, cluster-based model for predicting the transmission dynamics of covid-19, Heliyon 6 (2020). [20] J. P. Ulahannan, N. Narayanan, N. Thalhath, P. Prabhakaran, S. Chaliyeduth, S. P. Suresh, M. Mohammed, E. Rajeevan, S. Joseph, A. Balakrishnan, et al., A citizen science initiative for open data and visualization of covid-19 outbreak in kerala, india, Journal of the American Medical Informatics Association 27 (2020) 1913–1920. [21] B. D. Wissel, P. Van Camp, M. Kouril, C. Weis, T. A. Glauser, P. S. White, I. S. Kohane, J. W. Dexheimer, An interactive online dashboard for tracking covid-19 in us counties, cities, and states in real time, Journal of the American Medical Informatics Association 27 (2020) 1121–1125. [22] A. S. Peddireddy, D. Xie, P. Patil, M. L. Wilson, D. Machi, S. Venkatramanan, B. Klahn, P. Porebski, P. Bhattacharya, S. Dumbre, et al., From 5vs to 6cs: Operationalizing epidemic data management with covid-19 surveillance, in: 2020 IEEE International Conference on Big Data (Big Data), IEEE, 2020, pp. 1380–1387. [23] Y. S. Bae, K. H. Kim, S. W. Choi, T. Ko, C. W. Jeong, B. Cho, M. S. Kim, E. Kang, Information technology–based management of clinically healthy covid-19 patients: lessons from a living and treatment support center operated by seoul national university hospital, Journal of medical Internet research 22 (2020) e19938. [24] H. Florez, S. Singh, Online dashboard and data analysis approach for assessing covid-19 case and death data, F1000Research 9 (2020). [25] I. Pathak, Y. Choi, D. Jiao, D. Yeung, L. Liu, Racial-ethnic disparities in case fatality ratio narrowed after age standardization: A call for race-ethnicity-specific age distributions in state covid-19 data, MedRxiv (2020). [26] H. Ibrahim, S. Sorrell, S. C. Nair, A. Al Romaithi, S. Al Mazrouei, A. Kamour, Rapid development and utilization of a clinical intelligence dashboard for frontline clinicians to optimize critical resources during covid-19, Acta Informatica Medica 28 (2020) 209. [27] A. Chande, S. Lee, M. Harris, Q. Nguyen, S. J. Beckett, T. Hilley, C. Andris, J. S. Weitz, Real-time, interactive website for us-county-level covid-19 event risk assessment, Nature human behaviour 4 (2020) 1313–1319. [28] V. Marivate, H. M. Combrink, Use of available data to inform the covid-19 outbreak in south africa: a case study, arXiv preprint arXiv:2004.04813 (2020). [29] A. Hohl, E. M. Delmelle, M. R. Desjardins, Y. Lan, Daily surveillance of covid-19 using the prospective space-time scan statistic in the united states, Spatial and spatio-temporal epidemiology 34 (2020) 100354. [30] R. Carroll, C. R. Prentice, Using spatial and temporal modeling to visualize the effects of us state issued stay at home orders on covid-19, Scientific Reports 11 (2021) 13939. [31] N. Marques da Costa, N. Mileu, A. Alves, Dashboard comprime_compri_mov: Multiscalar spatio-temporal monitoring of the covid-19 pandemic in portugal, Future Internet 13 (2021) 45. [32] M. Hyman, C. Mark, A. Imteaj, H. Ghiaie, S. Rezapour, A. M. Sadri, M. H. Amini, Data analytics to evaluate the impact of infectious disease on economy: Case study of covid-19 pandemic, Patterns 2 (2021). [33] F. Clement, A. Kaur, M. Sedghi, D. Krishnaswamy, K. Punithakumar, Interactive data driven visualization for covid-19 with trends, analytics and forecasting, in: 2020 24th International Conference Information Visualisation (IV), IEEE, 2020, pp. 593–598. [34] B. E. Dixon, S. J. Grannis, C. McAndrews, A. A. Broyles, W. Mikels-Carrasco, A. Wiensch, J. L. Williams, U. Tachinardi, P. J. Embi, Leveraging data visualization and a statewide health information exchange to support covid-19 surveillance and response: application of public health informatics, Journal of the American Medical Informatics Association 28 (2021) 1363–1373. [35] R. Arias-Carrasco, J. Giddaluru, L. E. Cardozo, F. Martins, V. Maracaja-Coutinho, H. I. Nakaya, Outbreak: a user-friendly georeferencing online tool for disease surveillance, Biological Research 54 (2021) 1–6. [36] R. Chauhan, P. Goel, V. Kumar, N. Soni, et al., Understanding covid-19 using data visualiza- tion, in: 2021 international conference on advance computing and innovative technologies in engineering (ICACITE), IEEE, 2021, pp. 555–559. [37] A. Oelen, J. D’Souza, M. Stocker, L. Vogt, K. E. Farfar, M. Haris, K. Fadel, M. Y. Jaradeh, V. Wiens, Covid-19 reproductive number estimates, 2020. URL: https://www.orkg.org/ orkg/comparison/R44930. doi:10.48366/R44930 . [38] L. Gordis, Epidemiology e-book, Elsevier Health Sciences, 2013. [39] M. Shamsabadi, J. D’Souza, S. Auer, Large Language Models for Scientific Information Extraction: An Empirical Study for Virology, in: Y. Graham, M. Purver (Eds.), Findings of the Association for Computational Linguistics: EACL 2024, Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 374–392. URL: https://aclanthology.org/2024. findings-eacl.26. [40] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551. [41] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, arXiv preprint arXiv:2109.01652 (2021). [42] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. De- hghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022). [43] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al., The flan collection: Designing data and methods for effective instruction tuning, arXiv preprint arXiv:2301.13688 (2023).