=Paper=
{{Paper
|id=Vol-3828/ISWC2024_paper_20
|storemode=property
|title=From Keywords to Structured Summaries: Streamlining Scholarly Information Access
|pdfUrl=https://ceur-ws.org/Vol-3828/paper20.pdf
|volume=Vol-3828
|authors=Mahsa Shamsabadi,Jennifer D'Souza
|dblpUrl=https://dblp.org/rec/conf/semweb/Shamsabadi024
}}
==From Keywords to Structured Summaries: Streamlining Scholarly Information Access==
From Keywords to Structured Summaries:
Streamlining Scholarly Information Access
Mahsa Shamsabadi, Jennifer D’Souza
TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
Abstract
This poster paper highlights the increasing importance of information retrieval (IR) engines in the
scientific community, addressing the inefficiencies of traditional keyword-based search engines amid the
growing volume of publications. Our proposed solution uses structured records, supported by advanced
information technology (IT) tools such as visualization dashboards, to transform how researchers access
and filter articles, moving away from a text-heavy approach. This vision is demonstrated through a
proof of concept focused on the “reproductive number estimate of infectious diseases” research theme.
We utilize a fine-tuned large language model (LLM) to automate the creation of structured records for a
backend database, enhancing information access beyond simple keywords. The result is a next-generation
information access system, available at https://orkg.org/usecases/r0-estimates.
Keywords
Structured scientific knowledge, Structured scientific information extraction (IE), Large Language Models,
Visualization dashboards, Scientific information retrieval (IR) platforms
1. Introduction
The rapid expansion of scientific literature necessitates a reevaluation of their information
retrieval (IR) engines [1, 2]. Traditional keyword-based approaches are inadequate for tracking
fast-paced scientific advancements. There is a growing demand for structured scientific content
representations [3, 4] and advanced machine learning algorithms [5, 6] to enhance retrieval
accuracy. Initiatives like the Open Research Knowledge Graph (ORKG) [7] drive this paradigm
shift towards structured knowledge representations, enabling intelligent views and comparisons
of research facets [8, 9]. Our goal is to simplify access to scientific articles and reduce cognitive
load for researchers using information technology (IT). We propose dashboards as visual tools to
represent structured scientific knowledge, enhancing research filtering and discovery processes
[10]. Dashboards have been widely used, including during the Covid-19 pandemic, where they
helped track cases, analyze trends, and support decision-making with data from sources like the
WHO and Johns Hopkins [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36]. In contrast, our approach focuses on applying IT to structure scientific
knowledge itself, using information extraction (IE) mechanisms and large language models
(LLMs) to power next-generation information systems.
Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA
†
This work was supported by the German BMBF project SCINEXT (ID 01lS22070).
Envelope-Open jennifer.dsouza@tib.eu (J. D’Souza)
Orcid 0000-0002-6616-9509 (J. D’Souza)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
In pursuit of our vision, this poster paper presents a proof of concept (POC) using the
ORKG-R0 semantic model [37] to structure articles on the ”reproductive number estimate of
infectious diseases” theme [38]. The model captures essential properties like disease name,
study location, date, 𝑅0 value, % confidence interval values, and computation method, enabling
effective comparison across studies. Four research questions (RQs) guide article search and
exploration: RQ1 identifies maximum 𝑅0 estimates, RQ2 examines study counts by disease
and location, RQ3 analyzes 𝑅0 value ranges by location for selected diseases, and RQ4 maps
study locations globally on the world map. These RQs are visualized in a dashboard to enhance
article filtering and provide researchers with concise insights into research progress.
2. Next-Generation Scientific Information Retrieval (IR)
We introduce a next-generation IR platform for “reproductive number estimates of infectious
diseases,” enhancing scientific article access with IT and four visual charted summaries tailored
to four specific RQs alluded to earlier. In the following subsections, we will detail the LLM-based
IE method, article collection, and platform workflow.
2.1. The Scientific Information Extraction (IE) Large Language Model (LLM)
We employ the ORKG-FLAN-T5 R0 LLM [39]. This model is an instruction fine-tuned variant of
FLAN-T5 Large (780 M) using the instruction-tuning paradigm introduced as FLAN (Finetuned
Language Net) [40, 41, 42, 43]. It processes a paper’s title and abstract to produce structured
summaries based on six key properties: disease name, location, date, 𝑅0 value, % confidence
interval (CI) values, and method, related to the 𝑅0 estimate [39].
Table 1
The top 20 infectious disease names (and number of papers) in our initial dataset.
covid-19 (1002) mers-cov (21) measles (15) hepatitis c (8)
dengue (41) cholera (18) hepatitis b (12) tuberculosis (8)
influenza (29) zika (18) zika virus (12) monkeypox (8)
hiv (23) african swine fever (17) ebola (11) west nile virus (7)
sars (22) ebola (17) hand, foot, and mouth disease (8) malaria (7)
2.2. The Scholarly Articles Collection
The initial set of articles in our collection was sourced from keyword-based searches
in the PubMed database, with the most recent search conducted on September
13, 2023. The search query used was: (basic reproduction number[TIAB] OR
basic reproductive number[TIAB] OR basic reproduction ratio[TIAB] OR basic
reproductive rate[TIAB] OR R0[TIAB]) NOT (R0 resection OR cancer) , targeting
papers with any synonyms of 𝑅0 in the title or abstract. This yielded 7,127 articles. We
leveraged the ORKG-FLAN-T5 R0 LLM [39] to filter articles that did not report an 𝑅0 value as
unanswerable; or otherwise provide structured JSON descriptions for articles with 𝑅0 estimates.
Virology Dashboard Front-end
(a) Virology Dashboard Back-end
Request
Analytical Query
Services Database
WEB API Virology
Dashboard
Query
Response Database
Answer
Database Update Module
(b)
Process
Scheduler structured
LLM Update Data
summaries
(c)
PubMed Documents
Query
(d)
PubMed API
Figure 1: (Left image) A visual analytical dashboard in our next-generation information retrieval
(IR) platform provides charts (a), (b), (c), (d) in a dashboard to help researchers make informed article
filtering decisions. (Right image) The backend workflow, managed by a web API, handles database
interactions for frontend rendering. It incorporates a schedule for database updates programmed to run
monthly, with LLM queries supplying structured scientific knowledge before each update.
After filtering out unanswerables, 2,051 articles remained, yielding 2,736 structured summaries.
The processed data was imported to a PostgreSQL 16 database, serving as the backend
storage. The top 20 most represented infectious diseases in our initial database is shown in
Table 1. Notably, the LLM’s high precision confirmed that the top reported diseases are indeed
ascertained infectious diseases. Our database covers studies from all seven continents.
2.3. The Information Retrieval (IR) Platform Workflow
The platform is accessible as a web application at the following URL: https://orkg.org/usecases/
r0-estimates. The visualization dashboard widget and underlying workflow are displayed in
Figure 1. In this workflow, the frontend communicates with the backend through a Web API for
database queries and data retrieval. A Python script scheduler, programmed to run monthly,
periodically updates the database with new articles querying PubMed and following the LLM
processing cycle before updating the database with structured summaries. Our workflow
maximizes the use of cutting-edge technology, including an optimized next-generation LLM.
Figure 2: Chart (a) in Figure 1 displays maximum 𝑅0 values by disease to enhance scholarly publication
filtering. The y-axis shows max 𝑅0 values, and the x-axis lists diseases. Users can filter by 𝑅0 range, and
clicking a bar reveals underlying publication details with links to PubMed articles.
2.3.1. Charting the data: collating, summarizing, and reporting
Our IR platform includes three main components: 1) a statistics snapshot showing total papers,
structured knowledge, infectious diseases, and locations, 2) a standard paper listing in a keyword-
based table, filtered as needed, built with the ag-grid JavaScript library, and 3) a visual analytical
dashboard with four charts addressing our research questions. This process involves collating
relevant properties, selecting the best chart from the React chart library to summarize the
response, and creating a query to report the visual summary. Each RQ is represented by a visual
chart. E.g., RQ1, “What are the maximum 𝑅0 estimates reported for diseases in our database?”
is illustrated with a bar chart that plots diseases on the x-axis against their maximum 𝑅0 values
on the y-axis. Hovering over a bar displays the disease and its max 𝑅0. This interactive chart,
which can be adjusted for specific 𝑅0 ranges, simplifies the comparison of 𝑅0 estimates across
numerous studies. Clicking on a bar provides a direct link to the contributing article on PubMed,
thereby enhancing scholarly information retrieval significantly beyond traditional methods.
3. Conclusion
In this poster paper, we present a POC for a new scholarly IR engine that enhances access and
reduces the cognitive load of traditional, keyword-based searches. We address the inefficiencies
of manual paper filtering in traditional IR systems, exacerbated by rapidly increasing publication
volumes. Our approach models key research aspects for machine processing, paving the way
for next-generation visual assistants that streamline scholarly research access.
References
[1] S. Fortunato, C. T. Bergstrom, K. Börner, J. A. Evans, D. Helbing, S. Milojević, A. M. Petersen,
F. Radicchi, R. Sinatra, B. Uzzi, et al., Science of science, Science 359 (2018) eaao0185.
[2] L. Bornmann, R. Haunschild, R. Mutz, Growth rates of modern science: a latent piecewise
growth curve approach to model publication numbers from established and new literature
databases, Humanities and Social Sciences Communications 8 (2021) 1–15.
[3] L. Ermakova, E. SanJuan, S. Huet, H. Azarbonyad, G. M. Di Nunzio, F. Vezzani, J. D’Souza,
S. Kabongo, H. B. Giglou, Y. Zhang, S. Auer, J. Kamps, Clef 2024 simpletext track: Improving
access to scientific texts for everyone, Springer-Verlag, Berlin, Heidelberg, 2024, p. 28–35.
URL: https://doi.org/10.1007/978-3-031-56072-9_4. doi:10.1007/978- 3- 031- 56072- 9_4 .
[4] P. Fontelo, A. Gavino, R. F. Sarmiento, Comparing data accuracy between structured
abstracts and full-text journal articles: implications in their use for informing clinical
decisions, BMJ Evidence-Based Medicine 18 (2013) 207–211.
[5] W. Ammar, D. Groeneveld, C. Bhagavatula, I. Beltagy, M. Crawford, D. Downey, J. Dunkel-
berger, A. Elgohary, S. Feldman, V. Ha, et al., Construction of the literature graph in
semantic scholar, in: Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume
3 (Industry Papers), 2018, pp. 84–91.
[6] D. Pride, M. Cancellieri, P. Knoth, Core-gpt: Combining open access research and large lan-
guage models for credible, trustworthy question answering, in: International Conference
on Theory and Practice of Digital Libraries, Springer, 2023, pp. 146–159.
[7] S. Auer, A. Oelen, M. Haris, M. Stocker, J. D’Souza, K. E. Farfar, L. Vogt, M. Prinz, V. Wiens,
M. Y. Jaradeh, Improving access to scientific literature with knowledge graphs, Bibliothek
Forschung und Praxis 44 (2020) 516–529.
[8] A. Oelen, M. Stocker, S. Auer, Smartreviews: towards human-and machine-actionable
reviews, in: Linking Theory and Practice of Digital Libraries: 25th International Conference
on Theory and Practice of Digital Libraries, TPDL 2021, Virtual Event, September 13–17,
2021, Proceedings 25, Springer, 2021, pp. 181–186.
[9] A. Oelen, M. Y. Jaradeh, K. E. Farfar, M. Stocker, S. Auer, Comparing research contributions
in a scholarly knowledge graph, in: CEUR workshop proceedings; 2526, volume 2526,
Aachen: RWTH Aachen, 2019, pp. 21–26.
[10] H. Santos, V. Dantas, V. Furtado, P. Pinheiro, D. L. McGuinness, From data to city indicators:
A knowledge graph for supporting automatic generation of dashboards, in: The Semantic
Web: 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28–June 1, 2017,
Proceedings, Part II 14, Springer, 2017, pp. 94–108.
[11] T. Khodaveisi, H. Dehdarirad, H. Bouraghi, A. Mohammadpour, F. Sajadi, M. Hosseini-
ravandi, Characteristics and specifications of dashboards developed for the covid-19
pandemic: a scoping review, Journal of Public Health (2023) 1–22.
[12] O. Lezhnina, G. Kismihók, M. Prinz, M. Stocker, S. Auer, A scholarly knowledge graph-
powered dashboard: Implementation and user evaluation, Frontiers in Research Metrics
and Analytics 7 (2022) 934930.
[13] H. Santos, V. Dantas, V. Furtado, P. Pinheiro, D. L. McGuinness, From data to city indicators:
A knowledge graph for supporting automatic generation of dashboards, in: The Semantic
Web: 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28–June 1, 2017,
Proceedings, Part II 14, Springer, 2017, pp. 94–108.
[14] M. Salehi, M. Arashi, A. Bekker, J. Ferreira, D.-G. Chen, F. Esmaeili, M. Frances, A synergetic
r-shiny portal for modeling and tracking of covid-19 data, Frontiers in public health 8
(2021) 623624.
[15] D.-H. Yang, T.-W. Chien, Y.-T. Yeh, T.-Y. Yang, W. Chou, J.-K. Lin, Using the absolute
advantage coefficient (aac) to measure the strength of damage hit by covid-19 in india on
a growth-share matrix, European Journal of Medical Research 26 (2021) 1–11.
[16] Z. Zhu, K. Meng, J. Caraballo, I. Jaradat, X. Shi, Z. Zhang, F. Akrami, H. Liao, F. Arslan,
D. Jimenez, et al., A dashboard for mitigating the covid-19 misinfodemic, in: Proceedings
of the 16th Conference of the European Chapter of the Association for Computational
Linguistics: System Demonstrations, 2021.
[17] D. Aristizábal-Torres, C. A. Peñuela-Meneses, A. M. Barrera-Rodríguez, An interactive
web-based dashboard to track covid-19 in colombia. case study: five main cities, Revista
de Salud Pública 22 (2023) 214–219.
[18] L. E. Hodgson, T. Leckie, A. Hunter, N. Prinsloo, R. Venn, L. Forni, Covid-19 recognition
and digital risk stratification, Future Healthcare Journal 7 (2020) e47.
[19] R. Ravinder, S. Singh, S. Bishnoi, A. Jan, A. Sharma, H. Kodamana, N. A. Krishnan, An
adaptive, interacting, cluster-based model for predicting the transmission dynamics of
covid-19, Heliyon 6 (2020).
[20] J. P. Ulahannan, N. Narayanan, N. Thalhath, P. Prabhakaran, S. Chaliyeduth, S. P. Suresh,
M. Mohammed, E. Rajeevan, S. Joseph, A. Balakrishnan, et al., A citizen science initiative
for open data and visualization of covid-19 outbreak in kerala, india, Journal of the
American Medical Informatics Association 27 (2020) 1913–1920.
[21] B. D. Wissel, P. Van Camp, M. Kouril, C. Weis, T. A. Glauser, P. S. White, I. S. Kohane, J. W.
Dexheimer, An interactive online dashboard for tracking covid-19 in us counties, cities,
and states in real time, Journal of the American Medical Informatics Association 27 (2020)
1121–1125.
[22] A. S. Peddireddy, D. Xie, P. Patil, M. L. Wilson, D. Machi, S. Venkatramanan, B. Klahn,
P. Porebski, P. Bhattacharya, S. Dumbre, et al., From 5vs to 6cs: Operationalizing epidemic
data management with covid-19 surveillance, in: 2020 IEEE International Conference on
Big Data (Big Data), IEEE, 2020, pp. 1380–1387.
[23] Y. S. Bae, K. H. Kim, S. W. Choi, T. Ko, C. W. Jeong, B. Cho, M. S. Kim, E. Kang, Information
technology–based management of clinically healthy covid-19 patients: lessons from a
living and treatment support center operated by seoul national university hospital, Journal
of medical Internet research 22 (2020) e19938.
[24] H. Florez, S. Singh, Online dashboard and data analysis approach for assessing covid-19
case and death data, F1000Research 9 (2020).
[25] I. Pathak, Y. Choi, D. Jiao, D. Yeung, L. Liu, Racial-ethnic disparities in case fatality ratio
narrowed after age standardization: A call for race-ethnicity-specific age distributions in
state covid-19 data, MedRxiv (2020).
[26] H. Ibrahim, S. Sorrell, S. C. Nair, A. Al Romaithi, S. Al Mazrouei, A. Kamour, Rapid
development and utilization of a clinical intelligence dashboard for frontline clinicians to
optimize critical resources during covid-19, Acta Informatica Medica 28 (2020) 209.
[27] A. Chande, S. Lee, M. Harris, Q. Nguyen, S. J. Beckett, T. Hilley, C. Andris, J. S. Weitz,
Real-time, interactive website for us-county-level covid-19 event risk assessment, Nature
human behaviour 4 (2020) 1313–1319.
[28] V. Marivate, H. M. Combrink, Use of available data to inform the covid-19 outbreak in
south africa: a case study, arXiv preprint arXiv:2004.04813 (2020).
[29] A. Hohl, E. M. Delmelle, M. R. Desjardins, Y. Lan, Daily surveillance of covid-19 using
the prospective space-time scan statistic in the united states, Spatial and spatio-temporal
epidemiology 34 (2020) 100354.
[30] R. Carroll, C. R. Prentice, Using spatial and temporal modeling to visualize the effects of
us state issued stay at home orders on covid-19, Scientific Reports 11 (2021) 13939.
[31] N. Marques da Costa, N. Mileu, A. Alves, Dashboard comprime_compri_mov: Multiscalar
spatio-temporal monitoring of the covid-19 pandemic in portugal, Future Internet 13
(2021) 45.
[32] M. Hyman, C. Mark, A. Imteaj, H. Ghiaie, S. Rezapour, A. M. Sadri, M. H. Amini, Data
analytics to evaluate the impact of infectious disease on economy: Case study of covid-19
pandemic, Patterns 2 (2021).
[33] F. Clement, A. Kaur, M. Sedghi, D. Krishnaswamy, K. Punithakumar, Interactive data
driven visualization for covid-19 with trends, analytics and forecasting, in: 2020 24th
International Conference Information Visualisation (IV), IEEE, 2020, pp. 593–598.
[34] B. E. Dixon, S. J. Grannis, C. McAndrews, A. A. Broyles, W. Mikels-Carrasco, A. Wiensch,
J. L. Williams, U. Tachinardi, P. J. Embi, Leveraging data visualization and a statewide
health information exchange to support covid-19 surveillance and response: application
of public health informatics, Journal of the American Medical Informatics Association 28
(2021) 1363–1373.
[35] R. Arias-Carrasco, J. Giddaluru, L. E. Cardozo, F. Martins, V. Maracaja-Coutinho, H. I.
Nakaya, Outbreak: a user-friendly georeferencing online tool for disease surveillance,
Biological Research 54 (2021) 1–6.
[36] R. Chauhan, P. Goel, V. Kumar, N. Soni, et al., Understanding covid-19 using data visualiza-
tion, in: 2021 international conference on advance computing and innovative technologies
in engineering (ICACITE), IEEE, 2021, pp. 555–559.
[37] A. Oelen, J. D’Souza, M. Stocker, L. Vogt, K. E. Farfar, M. Haris, K. Fadel, M. Y. Jaradeh,
V. Wiens, Covid-19 reproductive number estimates, 2020. URL: https://www.orkg.org/
orkg/comparison/R44930. doi:10.48366/R44930 .
[38] L. Gordis, Epidemiology e-book, Elsevier Health Sciences, 2013.
[39] M. Shamsabadi, J. D’Souza, S. Auer, Large Language Models for Scientific Information
Extraction: An Empirical Study for Virology, in: Y. Graham, M. Purver (Eds.), Findings of
the Association for Computational Linguistics: EACL 2024, Association for Computational
Linguistics, St. Julian’s, Malta, 2024, pp. 374–392. URL: https://aclanthology.org/2024.
findings-eacl.26.
[40] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
Exploring the limits of transfer learning with a unified text-to-text transformer, The
Journal of Machine Learning Research 21 (2020) 5485–5551.
[41] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le,
Finetuned language models are zero-shot learners, arXiv preprint arXiv:2109.01652 (2021).
[42] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. De-
hghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint
arXiv:2210.11416 (2022).
[43] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph,
J. Wei, et al., The flan collection: Designing data and methods for effective instruction
tuning, arXiv preprint arXiv:2301.13688 (2023).