ceur-ws.org/Vol-2399/paper07.pdf


       Biomedical Data Categorization and Integration using
                  Human-in-the-loop Approach

                                                          Priya Deshpande
                                                  Supervised by Dr. Alexander Rasin
                                                                    DePaul University
                                                                    Chicago, IL, USA
                                                            pdeshpa1@depaul.edu

ABSTRACT
A digitized world demands data integration systems that combine data repositories from multiple data sources. The vast amounts of existing clinical and biomedical research data are considered a primary force enabling data-driven research toward advancing health research and introducing efficiencies in healthcare delivery. Data-driven research may have many goals, including but not limited to improved diagnostic processes, novel biomedical discoveries, epidemiology, and education. However, finding and gaining access to relevant data remains an elusive goal. We identified different data integration challenges and developed an Integrated Radiology Image Search (IRIS) framework that could be a step toward aiding data-driven research. We propose building a biomedical data categorization and integration framework using human-in-the-loop and developing data bridges to support search and retrieval of relevant documents from the integrated repository.
   My research focuses on biomedical data integration, indexing systems, and providing relevance-ranked document retrieval from an integrated repository. Although we currently focus on integrating biomedical data sources (for medical professionals), we believe that our proposed framework and methodologies can be used in other domains as well.

PVLDB Reference Format:
Priya Deshpande. Biomedical Data Categorization and Integration using Human-in-the-loop Approach. PVLDB, 12(xxx): xxxx-yyyy, 2019.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

1.    INTRODUCTION
   A growing amount of available biomedical data poses new challenges in data management. Data re-usability is a highly desirable goal, both for advancing science and for replicating or validating results of previous studies. Recognizing this need, publishers and funding bodies may require researchers to submit data generated in their work and make it available to the research community. For example, the National Institutes of Health (NIH) is encouraging funded investigators to use cloud computing to conduct research and make their work accessible to larger audiences¹. However, in the healthcare domain, datasets are often not shared because of security concerns, lack of integration, or limitations of retrieval engines. A data integration framework should make data available and accessible, and should support fine-grained access control for different users [6]. It would also greatly reduce the need for manual curation of data sources and data repositories. Data integration alone is insufficient without associated information retrieval mechanisms that rank retrieved results by relevance. From our discussions with University of Chicago (UofC) radiologists, even the internal UofC commercial system lacks some Natural Language Processing (NLP) features (e.g., detecting synonyms and negation) and multimodal (text and image) search capabilities. We studied the publicly available radiology data sources MyPacs.net², EURORAD³, and RSNA Medical Imaging Resource Community (MIRC)⁴, which provide collections of clinical reports and associated images known as teaching files. Teaching files contain information such as patient history, findings, diagnosis, differential diagnosis, or discussion notes. While all of these public data sources are available, most of them provide only basic search capabilities, without NLP support or ranked retrieval mechanisms. Several studies highlighted the need to integrate clinical reports and images into databases with advanced search capabilities. Gutmark et al. [5] argued for building a system that reduces errors in radiological image interpretation using teaching file databases. Talanow et al. [12] described the use of reference radiological images for diagnosis, teaching, and research, and the resulting need for an advanced reference search engine.
   An integrated repository of teaching files can retrieve thousands of results for a text search. A search can thus become effectively useless without being able to show the most relevant results first. Publicly available radiology teaching file search engines do not provide text relevance ranking or combined text-and-image search. The lack of such systems motivated us to build Integrated Radiology Image Search (IRIS) and develop the ranking algorithm presented here. We presented IRIS at the annual Society for Imaging Informatics in Medicine (SIIM 2018) meeting (two posters: one focusing on search and another on data integration) and received feedback from doctors indicating that this work would be useful for medical domain practitioners.

¹ https://commonfund.nih.gov/strides/
² https://www.mypacs.net/
³ https://www.myesr.org/eurorad
⁴ http://mirc.rsna.org/query

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.
Proceedings of the VLDB Endowment, Vol. 12, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

2. BACKGROUND AND RELATED WORK
   In this section we discuss papers that addressed the need for data integration and retrieval systems, along with an overview of existing medical data retrieval systems. Several studies have highlighted
the need for integration of healthcare data [10]. Holzinger et al. [7] discussed knowledge discovery and interactive data mining techniques in bioinformatics, the challenges of integrating biomedical data, and open research directions. Li et al. [8] proposed a hybrid human-machine data integration approach that integrates records from databases with similar data types (e.g., iPhone user data). However, healthcare domain data integration needs to combine heterogeneous data sources with different categories of data types. Simpson et al. [11] proposed a multimodal image retrieval system that retrieves biomedical articles used in Open-i⁵. Ling et al. [9] designed GEMINI, an integrative healthcare analytics system, and studied problems related to healthcare data heterogeneity and data integration in that context. From this literature survey, we concluded that healthcare needs are not met by the current search engines. The limitations of existing systems motivated us to design and develop a radiology multimodal search engine. IRIS integrates two well-known public data sources, MIRC and MyPacs, and two medical ontologies, RadLex⁶ and the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT)⁷.
RSNA MIRC: A publicly available large repository with more than 2,500 teaching files and more than 12,000 images.
MyPacs.net: A publicly available teaching file resource with more than 35,000 cases and 200,000 images.
RadLex: An ontological system that provides a comprehensive lexicon for radiologists.
SNOMED CT: An ontology that provides a standardized, multilingual vocabulary of clinical terminology used by physicians and other healthcare providers for the electronic exchange of clinical health information.

⁵ https://openi.nlm.nih.gov/

3.    METHODOLOGY AND RESEARCH STEPS
   In this section, we discuss major biomedical data sources and significant goals that we identified as a part of my PhD proposal.

3.1    Datasets
   We currently focus on three types of data: a) electronic health records; b) radiology teaching files, or teaching files used by doctors and radiologists; c) research datasets.
Electronic Health Records (EHRs): An electronic health record is a digital version of a patient’s record. EHRs are maintained at hospitals and provide patient information such as patient history, medical test results, allergies, immunization details, radiology images, and clinical reports.
Medical Teaching Files: A radiology teaching file system is a collection of important cases for teaching and clinical follow-up. Teaching files share a similar overall structure, but significant variations exist even within the same data source. They can include information such as patient history, findings, diagnosis, discussion, comments, references, and images related to clinical reports.
Research datasets: From our survey of datasets at different research institutes, we observed that most of the data in the healthcare domain are images (e.g., CT, X-ray, MRI). Those images are most typically stored in formats such as JPEG, DICOM, or PNG and include associated text data describing patient and case information.

3.2    Data integration and rank retrieval
   We have organized this project into three phases (I have finished the first two phases and am working on the last phase of my PhD work). For each phase we have identified a research question. Publications related to this work are briefly summarized in Table 1.

Table 1: Research work summary

ID   Summary
1    IRIS 1.0. Teaching file text pre-processing and indexing. Smart search
     through substitution of synonyms and interpreting negation. Query
     expansion using RadLex through an exact term match [1].
2    IRIS 1.1. Query synonym expansion. SNOMED CT ontology integration,
     showing improved results compared with other search engines [3].
3    Data integration as an iterative process, showing how each integration
     step improved IRIS results [2].
4    Cluster analysis and coverage analysis for both ontologies and radiology
     data sources. Unsupervised machine learning to identify data source
     properties, in order to identify the best data sources and ontologies
     for integration (journal paper, under review).
5    IRIS 1.2. Multimodal ranked retrieval for integrated radiology data
     sources using context of search term by considering weighted ontology
     and category terms (conference paper, under review).
6    Toward using FAIR Principles for Fine-Grained Access to aid Biomedical
     Data Driven Research [4].

3.2.1     Design an integrated smart database with heterogeneous data sources
   Research question #1: How to determine which data sources and ontologies need to be integrated?
   Most hospitals maintain a collection of teaching files, but many public teaching file collections are also available through curated online sources (e.g., RSNA MIRC, MyPacs, and EURORAD). We developed the IRIS engine as a pilot for a data integration system for the healthcare domain [1]. In IRIS, we captured heterogeneous data from the MIRC and MyPacs data sources, loading the data into an integrated data repository. Using medical ontologies, we built our own dictionary, which maps terms to their synonyms from the datasets and medical ontologies [3]. We designed an unsupervised machine learning technique that performs coverage analysis of data sources and medical ontologies to learn properties of the data (e.g., topic coverage). By learning the contents of data repositories, one can decide which data sources need to be integrated or what repository content is lacking. Thus, this coverage analysis algorithm benefits the data integration process by extracting knowledge about the repositories (addressing research question #1). Our analysis also confirmed that data integration is a continuous, iterative process [2].

3.2.2     Ranked retrieval search engine with multimodal text and image-based search capabilities
   Research question #2: How to find relevant documents given a keyword query or hybrid (text+image) query? Figure 1 shows the architecture of the IRIS engine. When a user enters a text query, IRIS performs query expansion using relevant ontologies and retrieves results relevant to the query term. Our database also stores accuracy feedback from users, which is then used to evaluate and iteratively improve IRIS results.
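
The query expansion step described for IRIS (synonym substitution plus negation handling) can be illustrated with a minimal Python sketch. This is not the IRIS implementation: the `SYNONYMS` dictionary, the `NEGATION_CUES` list, and the helper names `expand_query`, `is_negated`, and `matches` are hypothetical stand-ins for the ontology-derived dictionary and NLP features mentioned in the text.

```python
import re

# Hypothetical synonym dictionary. In IRIS such a mapping is built from
# medical ontologies and the datasets; these entries are illustrative only.
SYNONYMS = {
    "cardiomegaly": ["enlarged heart"],
    "acl tear": ["anterior cruciate ligament tear"],
}

# Simple negation cues (an assumption; real systems use richer NLP rules).
NEGATION_CUES = ("no", "without", "negative for", "no evidence of")


def expand_query(query):
    """Return the normalized query term followed by any known synonyms."""
    term = query.strip().lower()
    return [term] + SYNONYMS.get(term, [])


def is_negated(sentence, term):
    """True if a negation cue appears at most three words before the term."""
    cues = "|".join(re.escape(cue) for cue in NEGATION_CUES)
    pattern = rf"\b(?:{cues})\b(?:\s+\w+){{0,3}}?\s+{re.escape(term)}\b"
    return re.search(pattern, sentence.lower()) is not None


def matches(sentences, query):
    """Keep sentences that mention the term or a synonym without negation."""
    terms = expand_query(query)
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        for term in terms:
            if term in lowered and not is_negated(sentence, term):
                hits.append(sentence)
                break
    return hits
```

With this sketch, a query for "Cardiomegaly" would also match a sentence that mentions "enlarged heart", while a sentence such as "No cardiomegaly." would be skipped because the term is negated.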
   An integrated search may result in thousands of matches; thus, we are designing a search algorithm that ranks results by incorporating context computed through weighted ontology terms. For text-based search ranking evaluation, we used the Normalized Discounted Cumulative Gain (NDCG)⁸ measure to assess the quality of search result ranking. Our analysis showed an improvement in ranked retrieval compared to other search engines (addressing research question #2).

⁶ http://www.radlex.org/
⁷ https://www.nlm.nih.gov/healthit/snomedct/
⁸ https://en.wikipedia.org/wiki/Discounted_cumulative_gain

Figure 1: IRIS Architecture

3.2.3     Data bridges and indexing mechanism to integrate biomedical data sources
   Research question #3: How can data integration performance (time) and scalability (adding a variety of data sources) be improved using data bridges? In order to make our integration solution applicable to other biomedical data sources (e.g., EHRs, clinical reports), we plan to create data adapters that will serve as a bridge between data providers and data integration systems (this work was a part of my internship at NIH). Data providers can share their data in any file format, and bridges will interpret that data in a uniform manner. As shown in Figure 2, our data clustering indexing approach starts with collecting different biomedical data sources. From our literature survey we observed that data preparation accounts for 80% of a data scientist's work. Data preparation includes finding relevant data sources, extracting data from those data sources, data cleaning, and data integration. Our proposed data integration system would help data scientists and researchers optimize and streamline data preparation. We collected different biomedical data sources and are working on defining a standard data cleaning technique that would be applicable to most of the similar data sources that we proposed in this work. Our data categorization module categorizes data items into different sets based on the usage of those data elements in the search operation. We need support from a human to check the accuracy of data categorization, to set similarity thresholds between different data items, and to apply additional domain knowledge to categorize these data items based on relevance between data objects. Our data categorization algorithm will differentiate data items based on diagnostic relevance. For example, teaching cases with title, findings, and diagnosis would be treated as one sub-category in teaching cases (that would also integrate clinical reports), while another sub-category could integrate fields that are medically less relevant, e.g., discussion, history, or comments. Based on data categorization, we will design the database schema and will also evaluate the schema using standard database schema benchmark techniques. Data write bridges would be responsible for extracting data from different data categories and loading data into the respective database schema. This data categorization work is ongoing and we do not have any experimental results yet. We will address research question #3 by implementing this module.

Figure 2: Data Categorization with human-in-the-loop

4.    EXPERIMENTAL RESULTS
   In this section we briefly discuss the current results from the proposed system.

4.0.1     Text-based results
   We evaluated IRIS search ranking using a combination of queries received from radiologists at a well-known hospital and other queries chosen from an extensive literature survey. We initially tested a total of 28 text queries, out of which we picked a subset of 10 queries (Q1: Cardiomegaly, Q2: ACL Tear, Q3: Annular Pancreas, Q4: Pseudocoxalgia, Q5: Varicocele, Q6: Angiosarcoma, Q7: Tracheal dilation, Q8: Appendicitis, Q9: Bronchus intermedius, Q10: Cystitis glandularis) to perform an in-depth evaluation. Due to space constraints we only briefly discuss text-based results. We evaluated text-based results on a scale from 0 (“not relevant”) to 2 (“very relevant”). We defined five categories to score text search results: “not relevant” = 0 (when the term and its synonyms do not appear anywhere in the results), “relevant” = 0.5 (if the term or synonyms appear in any category of the teaching file), “more relevant” = 1 (if the term or synonyms appear in the discussion category), “most relevant” = 1.5 (if the term or synonyms appear in the history or ddx category), and “very relevant” = 2 (if the term or synonyms appear in the title, findings, or diagnosis categories).
Comparison of IRIS and MIRC relevance rank algorithm using same datasets:
   We compared the IRIS relevance rank algorithm with MIRC using the same dataset. We considered the top four teaching file results from IRIS, MIRC, and Google site search. We calculated a relevance score by scoring the top four teaching files from each engine using the weighted ontology ranking algorithm. Figure 3 shows an overall analysis of results from these 3 search engines. The score for each search engine shows that the IRIS relevance rank algorithm performs better than the other two engines.
Ranking evaluation of other medical search engines:
   We also considered how other public medical radiology teaching file search engines rank their search results. We used the same query set and performed a search using the MIRC, MyPacs, EURORAD, and Open-i search engines. We discuss only two queries (Q1: “cardiomegaly” and Q8: “appendicitis”) in detail, reporting scores for the top 10 search results. Figure 4 shows a comparative
analysis of ranked results from these four engines using the relevance scores based on our metric described above. Open-i can rank search results based on different categories (e.g., by diagnosis or by teaching file date); we used a diagnosis-based search in Open-i. MIRC ranks results based on the date of modification, with no other option available. Our analysis shows that none of the search engines returns the most relevant results first. Interestingly, top results are often less relevant than the subsequent search results. For example, for “cardiomegaly” the fourth MyPacs result is more relevant than the top three results. EURORAD does not retrieve any results for “cardiomegaly”, but we checked the “appendicitis” results, and those were also not ranked based on relevance to the search term.

Figure 3: IRIS relevance rank results comparison with MIRC

Figure 4: Rank retrieval score results from other medical search engines

4.0.2      Hybrid Text and Image based results
   The IRIS hybrid algorithm augments the text search with image search and re-ranks results based on relevance to the query. Due to space constraints we only briefly discuss hybrid search results. IRIS text-based and hybrid search results scored 0.83 out of 1. Image search scored only about 0.53 out of 1, validating our use of the image search as an enhancement to the text search (rather than a standalone search). Hybrid search scored 0.84 out of 1 because text results were augmented with image-based results. For hybrid search, some of the results were noticeably better than for text-based search. By combining text search with image results, we are striving to get a text-based match that also includes a similar image.

5.     CONCLUSIONS
   The ranking approach presented in this paper is significant because it enables IRIS to present the user with the most relevant reference cases first. By integrating term frequency and adding more weight to ontology terms, we show that teaching files can be better ranked in order of their relevance to a search query. Currently I am working on data write bridges and a categorization algorithm to improve the biomedical data integration process.

6.    ACKNOWLEDGMENTS
   This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM), and Lister Hill National Center for Biomedical Communications (LHNCBC).

7.    REFERENCES
 [1] P. Deshpande, A. Rasin, E. Brown, J. Furst, D. Raicu, S. Montner, and S. Armato III. An integrated database and smart search tool for medical knowledge extraction from radiology teaching files. In Medical Informatics and Healthcare, pages 10–18, 2017.
 [2] P. Deshpande, A. Rasin, E. Brown, J. Furst, D. S. Raicu, S. M. Montner, and S. G. Armato. Big data integration case study for radiology data sources. In 2018 IEEE Life Sciences Conference (LSC), pages 195–198. IEEE, 2018.
 [3] P. Deshpande, A. Rasin, E. T. Brown, J. Furst, S. M. Montner, S. G. Armato III, and D. S. Raicu. Augmenting medical decision making with text-based search of teaching file repositories and medical ontologies: Text-based search of radiology teaching files. International Journal of Knowledge Discovery in Bioinformatics (IJKDB), 8(2):18–43, 2018.
 [4] P. Deshpande, A. Rasin, J. Furst, D. Raicu, and S. Antani. DIIS: A biomedical data access framework for aiding data driven research supporting FAIR principles. Data, 4(2):54, 2019.
 [5] R. Gutmark, M. J. Halsted, L. Perry, and G. Gold. Use of computer databases to reduce radiograph reading errors. Journal of the American College of Radiology, 4(1):65–68, 2007.
 [6] J. R. Hemler, J. D. Hall, R. A. Cholan, B. F. Crabtree, L. J. Damschroder, L. I. Solberg, S. S. Ono, and D. J. Cohen. Practice facilitator strategies for addressing electronic health record data challenges for quality improvement: EvidenceNOW. The Journal of the American Board of Family Medicine, 31(3):398–409, 2018.
 [7] A. Holzinger, M. Dehmer, and I. Jurisica. Knowledge discovery and interactive data mining in bioinformatics - state-of-the-art, future challenges and research directions. BMC Bioinformatics, 15(6):I1, 2014.
 [8] G. Li. Human-in-the-loop data integration. Proceedings of the VLDB Endowment, 10(12):2006–2017, 2017.
 [9] Z. J. Ling, Q. T. Tran, J. Fan, G. C. Koh, T. Nguyen, C. S. Tan, J. W. Yip, and M. Zhang. GEMINI: an integrative healthcare analytics system. Proceedings of the VLDB Endowment, 7(13):1766–1771, 2014.
[10] I. Merelli, H. Pérez-Sánchez, S. Gesing, and D. D'Agostino. Managing, analysing, and integrating big data in medical bioinformatics: open problems and future perspectives. BioMed Research International, 2014, 2014.
[11] M. S. Simpson, D. Demner-Fushman, S. K. Antani, and G. R. Thoma. Multimodal biomedical image indexing and retrieval using descriptive text and global feature mapping. Information Retrieval, 17(3):229–264, 2014.
[12] R. Talanow. Radiology teacher: a free, internet-based radiology teaching file server. Journal of the American College of Radiology, 6(12):871–875, 2009.