Greg, ML: Automatic Diagnostic Suggestions. Humanity is Overrated. Or Not.

Paola Lapadula 1, Giansalvatore Mecca 1, Donatello Santoro 1, Luisa Solimando 2, and Enzo Veltri 2
1 Università della Basilicata – Potenza, Italy
2 Svelto! Big Data Cleaning and Analytics – Potenza, Italy

(Discussion Paper)

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.

Abstract. Recently, machine-learning techniques have been applied in a variety of fields. One of the most promising and challenging is the handling of medical records. In this paper we present Greg, ML, a machine-learning tool for generating automatic diagnostic suggestions based on patient profiles. At the core of our system are two machine-learning classifiers: a natural-language module that handles reports of instrumental exams, and a profile classifier that outputs diagnostic suggestions to the doctor. After discussing the architecture, we present some experimental results based on the working prototype we have developed. Finally, we examine challenges and opportunities related to the use of this kind of tool in medicine, and some important lessons learned while developing it. In this respect, despite the ironic title of this paper, we underline that Greg should be conceived primarily as a support for expert doctors in their diagnostic decisions, and can hardly replace humans in their judgment.

1 Introduction

The growing availability of digital data in all sectors of our everyday lives has created opportunities for data-based applications that would not have been conceivable a few years ago. One example is medicine: the push for the widespread adoption of electronic medical records [9, 5] and digital medical reports is paving the ground for new applications based on these data.

Greg, ML [8] is one of these applications. It is a machine-learning tool for generating automatic diagnostic suggestions based on patient profiles. In essence, Greg takes as input a digital profile of a patient, and suggests one or more diagnoses that, according to its internal models, fit the profile with a given probability. We assume that a doctor inspects these diagnostic suggestions and takes informed actions about the patient.

We notice that the idea of using machine learning to examine medical data is not new [7, 11, 10]. In fact, several efforts have been made in this direction [1, 6]. To the best of our knowledge, however, all of the existing tools concentrate on rather specific learning tasks, for example identifying a single pathology – like heart disease [14, 12], pneumonia [13], or cancer, where results of remarkable quality have been reported [15]. On the contrary, Greg has the distinguishing feature of being a broad-scope diagnostic-suggestion tool. In fact, at its core stands a generic learning model that can suggest large numbers of pathologies – currently several dozens, and in perspective several hundreds. Greg is a research project developed by Svelto!, a spin-off of the data-management group at the University of Basilicata.

The rest of the paper introduces Greg as follows. We discuss the internal architecture of the tool in Section 2. Then, we introduce the methodology and the additional tools in Section 3.
We present some experimental results based on the current version of the tool in Section 4. Finally, in Section 5 we conclude by discussing the possible applications we envision for Greg, and a few crucial lessons learned with the tool, which, in turn, have inspired the title of this paper.

Fig. 1: Architecture of Greg.

2 Architecture of Greg

The architecture and the overall flow of Greg are depicted in Figure 1.

As we have already discussed, at the core of Greg stands a classifier for patient profiles that provides doctors with diagnostic suggestions. Profiles are entirely anonymous, i.e., Greg neither stores nor requires any personal data about patients, and are composed of three main blocks:

– anonymous biographical data, mainly age and gender, and the medical history of the patient, i.e., past medical events and pathologies, especially chronic ones;
– results of lab exams, in numerical format;
– textual reports from instrumental exams, such as X-rays (RX), ultrasounds, etc.

These items compose the patient profile that is fed to the profile classifier in order to propose diagnostic suggestions to doctors. Notice that, while biographical data, medical history, and lab exam results are essentially structured data, and therefore can be easily integrated into the profile, reports of instrumental exams are essentially unstructured. As a consequence, Greg relies on a second learning module to extract what we call pathology indicators, i.e., structured labels indicating anomalies in the report that may suggest the presence of a pathology.

The report classifier is essentially a natural-language processing module. It takes the text of the report in natural language and identifies pathology indicators that are then integrated within the patient profile. The report classifier is, in a way, the crucial module for the construction of the patient profile. In fact, reports of instrumental exams often carry crucial information for the purpose of identifying the correct diagnostic suggestions. At the same time, their treatment is language-dependent, and learning is labor-intensive, since it requires labeling large sets of reports in order to train the classifier.

Once the profile for a new patient has been built, it is fed to the profile classifier, which outputs diagnostic suggestions to the doctor. There are a few important aspects to be noticed here.

– First, Greg is trained to predict only a finite set of diagnoses. This means that it is intended primarily as a tool to gain positive evidence about pathologies that might be present, rather than as a tool to exclude pathologies that are not present. In other terms, the fact that Greg does not suggest a specific diagnosis does not mean that it can be excluded, since it may simply be that Greg has not been trained for that particular pathology. Handling a large number of diagnoses is crucial, in this respect.
– Second, Greg associates a degree of probability with each diagnostic suggestion, i.e., it ranks suggestions with a confidence measure. This is important, since the tool may provide several different suggestions for a given profile, and not all of them are to be considered equally relevant.
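To make this two-stage flow concrete, the following is a minimal sketch in Python using scikit-learn. It is an illustration under our own assumptions: the feature encoding, the model choices (a TF-IDF text classifier per indicator for the reports, a random forest over the assembled profile), and all function names are hypothetical, not Greg's actual implementation.

```python
# Minimal sketch of the two-stage flow: a report classifier turns
# free-text exam reports into pathology indicators, which are merged
# with the structured part of the profile and fed to the profile
# classifier. All names and model choices are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# --- Report classifier: free text -> pathology indicators -----------
vectorizer = TfidfVectorizer(max_features=5000)
indicator_models = {}  # one binary classifier per indicator label

def train_report_classifier(report_texts, report_labels, all_indicators):
    # report_labels[i] is the set of indicators annotated on report i.
    X = vectorizer.fit_transform(report_texts)
    for ind in all_indicators:
        y = np.array([ind in labels for labels in report_labels])
        indicator_models[ind] = LogisticRegression(max_iter=1000).fit(X, y)

def extract_indicators(report_text, threshold=0.5):
    x = vectorizer.transform([report_text])
    return [ind for ind, m in indicator_models.items()
            if m.predict_proba(x)[0, 1] >= threshold]

# --- Profile classifier: structured profile -> ranked diagnoses -----
# Assumed to be fitted on labeled profiles via profile_model.fit(...).
profile_model = RandomForestClassifier(n_estimators=200)

def build_profile(age, gender, history, lab_results, indicators,
                  all_indicators):
    # All fields numerically encoded (e.g. gender as 0/1); the indicators
    # extracted from the reports become one 0/1 flag each.
    flags = [1.0 if ind in indicators else 0.0 for ind in all_indicators]
    return [age, gender] + list(history) + list(lab_results) + flags

def suggest_diagnoses(profile, top_k=5):
    # Sorted class probabilities give the confidence-ranked suggestions.
    probs = profile_model.predict_proba([profile])[0]
    ranked = sorted(zip(profile_model.classes_, probs),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]  # e.g. [("pneumonia", 0.82), ("cirrhosis", 0.41)]
```

In this scheme the confidence measure attached to each suggestion is simply the probability the profile classifier assigns to that diagnosis, which is what allows the tool to rank several simultaneous suggestions rather than emit a single verdict.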
It can be seen how crucial a seamless integration with the everyday procedures of a medical institution is for a tool like Greg to be effective. To foster this kind of adoption, Greg can be used as a stand-alone tool, with its own user interface, but it has been developed primarily as an engine-backed API that can easily be integrated with any medical information system already deployed in medical units and wards. Ideally, with this kind of integration, accessing the medical suggestions provided by Greg should cost no more than clicking a button, in addition to the standard procedure for patient-data gathering and medical-record compilation.

3 The Greg Workflow and Ecosystem

As we have discussed in the previous sections, the effectiveness of a system like Greg is strongly related to the number of pathologies for which it can provide suggestions. We therefore put quite a lot of effort into structuring the learning workflow to make it lean and easily reproducible. In this section we summarize a few key findings in this respect, which led us to the development of a number of additional tools that compose the Greg ecosystem.

A first important observation is that a system like Greg needs to refer to a standardized set of diagnoses. As is common, we rely on the international classification of diseases, ICD-10 (DRG) (http://www.who.int/classifications/icd/icdonlineversions/en/). This, however, poses a challenge when dealing with large and heterogeneous collections of medical records coming from disparate sources, which are not necessarily associated with a DRG. This creates a standardization problem for diagnosis labels. In fact, standardizing the vocabulary of pathologies and pathology indicators is crucial in the early stages of data preparation. To this end, we leveraged the consolidated suite of data-cleaning tools developed by our research group over the years [2–4].

A second important observation is that we need to handle large and complex amounts of data gathered from medical information systems, including biographical data, admissions and patient medical history, medical records, multiple lab exams, and multiple reports. These data need to be explored, selected, and prepared for the purpose of training the learning models. To streamline the data-preparation process, we decided to develop a tool to explore the available data. The tool is called Caddy and is essentially a data warehouse built on top of the transactional medical databases. This allowed us to adopt a structured approach to data exploration and data selection, which proved essential in the development of the tool.

However, the tool that proved to be the most crucial in the development of Greg is DAIMO, our instance labeler. DAIMO stands for Digital Annotation of Instances and Markup of Objects. It is a tool explicitly conceived to support the labeling phase of machine-learning projects. A snapshot of the system is shown in Figure 2.

Fig. 2: DAIMO, the ML Labeling Tool.

DAIMO is a semi-automated tool for data labeling. It provides a simple and effective interface to explore pre-defined collections of samples to label. Samples may be textual, structured – for example, in tabular format – or even of mixed type. Users tasked with labeling can cooperatively explore the samples, pick them, explore existing labels, and add more. Figure 2 shows the process of labeling one report: labels associated with the report are on the right, and each corresponds to a colored portion of the text.
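To give a feel for the workflow, the fragment below sketches a DAIMO-style labeling session under our own assumptions: labels are span annotations over the report text, drawn from a controlled vocabulary, and an optional learned suggester proposes candidates that the user merely confirms or refuses. All class and method names are hypothetical; this is not DAIMO's actual interface.

```python
# Hypothetical sketch of a DAIMO-style semi-automated labeling session.
from dataclasses import dataclass, field

@dataclass
class LabeledSpan:
    start: int   # character offsets of the colored portion of the text
    end: int
    label: str   # drawn from the shared vocabulary

@dataclass
class Sample:
    text: str
    spans: list = field(default_factory=list)

class LabelingSession:
    def __init__(self, vocabulary, suggester=None):
        self.vocabulary = set(vocabulary)  # standardized label set
        self.suggester = suggester         # learned model, optional

    def add_label(self, sample, start, end, label):
        # Rejecting unknown labels forces users to reuse the vocabulary
        # instead of inventing near-duplicates.
        if label not in self.vocabulary:
            raise ValueError(f"unknown label: {label}")
        sample.spans.append(LabeledSpan(start, end, label))

    def review_suggestions(self, sample, accept):
        # 'accept' is a user callback: the human only confirms or
        # refuses each (start, end, label) triple the suggester proposes.
        if self.suggester is None:
            return
        for start, end, label in self.suggester(sample.text):
            if label in self.vocabulary and accept(sample.text[start:end], label):
                sample.spans.append(LabeledSpan(start, end, label))
```

In a design of this kind, the suggester can start as something as simple as dictionary matching and later be replaced by a model trained on the spans users have already confirmed; it is this accept-or-refuse loop that makes labeling progressively cheaper.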
We believe that even just the availability of an intuitive tool to support cooperative labeling work significantly increases productivity. In addition, DAIMO provides further functionalities that improve the process.

First, it allows users to define label vocabularies, in order to standardize the way in which labels are assigned to samples. Users usually search for labels within the vocabulary, and add new ones only when the ones they need are not present. When dealing with complex labeling tasks with many different labels, such a systematic approach is crucial to obtaining good-quality results.

Second, DAIMO is able to learn labeling strategies from examples. After some initial training, it does not only collect new labels from users, but actually suggests them, so that users need only accept or refuse DAIMO's suggestions. This approach really transforms the labeling process, since after a while it is DAIMO, not the user, that does most of the work. In fact, in our experience, working with DAIMO may lower text-labeling times by up to one order of magnitude with respect to manual, unassisted labeling.

4 Experimental Results

Fig. 3: Experimental Results before and after Diagnosis Review, in terms of F-Measure.

We developed an advanced prototype of Greg, and used it to conduct a number of experiments to assess the feasibility of the overall approach.

We conducted a first preliminary experimental evaluation using 200 medical records over a small set of diagnoses (pneumonia, cirrhosis, anemia, urological infection) [8]. Later, we used a bigger dataset, containing a total of 22,160 medical records, over which we were able to learn about 50 diagnoses. We used the discharge letters of each medical record as labels, and called this annotated dataset GT - D.L. (Ground Truth - Discharge Letters). As usual, data were split into training, cross-validation, and test sets, and we measured the F-measure of Greg's predictions.

On GT - D.L. we obtained poor results, not as good as we had expected. Some of the results are shown in Figure 3, in terms of F-measure. Our investigation of the data, however, suggested that in many cases the quality of the results had been underestimated. In essence, in several cases Greg suggested a more thorough set of diagnoses than the one indicated by the doctor in the discharge letter. As an example, this happened frequently with patients suffering from anemia, which is often associated with cirrhosis, even though doctors had not explicitly mentioned that specific diagnosis in the discharge letter.

We therefore conducted a second experiment. We asked our team of doctors to review the set of diagnoses associated with the patient profiles used for the test. In essence, our doctors made sure that all relevant diagnoses were appropriately mentioned, including those that the hospital doctors had omitted in the discharge letter. We called this manually annotated dataset GT - Doct. (Ground Truth - Doctors). Figure 3 reports Greg's results over this revised dataset. As can be seen, we obtained an F-measure above 95% for every diagnosis.

To summarize, our preliminary tests show that Greg can effectively achieve high accuracy in its predictions. In addition, it may effectively assist doctors in formulating their diagnoses, by providing systematic suggestions.
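For reference, the per-diagnosis F-measure reported above can be computed as in the following sketch, again with scikit-learn. The split proportions, the model, and the simplifying assumption of a single ground-truth diagnosis per record are our own; the actual protocol also used a cross-validation set.

```python
# Illustrative evaluation sketch for a per-diagnosis F-measure,
# under simplified assumptions (plain train/test split, one
# ground-truth diagnosis per record).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def evaluate(profiles, diagnoses):
    # profiles: one numeric feature vector per medical record;
    # diagnoses: the ground-truth label of each record (e.g. from
    # discharge letters in GT - D.L., or reviewed labels in GT - Doct.).
    X_train, X_test, y_train, y_test = train_test_split(
        profiles, diagnoses, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    predictions = model.predict(X_test)
    labels = sorted(set(diagnoses))
    scores = f1_score(y_test, predictions, average=None, labels=labels)
    return dict(zip(labels, scores))  # one F-measure per diagnosis
```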
5 Conclusions: Opportunities and Lessons Learned

We believe that Greg can be a valid and useful tool to assist doctors in the diagnostic process. Given its ability to learn diagnostic suggestions at scale, we envision several main scenarios of use for the system in a medical facility:

– We believe Greg can be of particular help in the ER, during triage and the first diagnostic phase; in particular, based on the first evidence about the patient, it may help the ER operator to identify a few pathologies that are worth exploring, perhaps with the help of a specialized colleague.
– Then, we envision interesting opportunities related to the use of Greg in the diagnosis of rare pathologies; these are especially difficult to capture by a learning algorithm because, by definition, there are only a few training examples to use, and therefore special treatment is required. Still, we believe that supporting doctors in this respect – especially younger ones, who might have less experience in diagnosing these pathologies – is an important field of application.
– In medical institutions that rely on standardized clinical pathways or integrated care pathways (ICPs) – PDTAs in Italy – Greg may be used to quickly suggest which parts of a pathway need to be explored, and which ones can be excluded based on the available evidence.
– Finally, Greg may be used as a second-opinion tool, i.e., after the doctor has formulated her/his diagnosis, for the purpose of double-checking that all possibilities have been considered.

While in our opinion all of these represent areas in which Greg can be a valid support tool for the doctor, we would like to put them in context by discussing what we believe to be the most important lessons we have learned so far.

On the one side, the development of Greg has taught us a basic and important lesson: in many cases, probably the majority, the basic workings of the diagnostic process employed by human doctors are indeed reproducible by an automatic algorithm. In fact, it is well known that doctors tend to follow a decision process that looks for specific indicators within the patient profile – e.g., values of laboratory tests, or specific symptoms – and decides to consider or exclude pathologies based on them. As fuzzy as this process may be, like any other human-thinking process, to our surprise we learned that for a large number of pathologies it provides a perfect opportunity for the employment of a machine-learning algorithm, which, in turn, may achieve very good accuracy in mimicking the human decision process, with the additional advantage of scale – Greg can be trained to learn very high numbers of diagnostic suggestions. In this respect, ironically quoting Gregory House, we might be tempted to state that "Humanity is overrated", indeed.

However, our experience has also led us to find that there are facets of the diagnostic process that are inherently related to intuition, experience, and human factors. These are, by nature, impossible to capture by an automatic algorithm. Therefore, our ultimate conclusion is that humanity is not overrated, and that Greg can indeed provide useful support in the diagnostic process, but it cannot and should not be considered a replacement for an expert human doctor.

References

1. R. C. Deo. Machine learning in medicine. Circulation, 132(20):1920–1930, 2015.
2. F. Geerts, G. Mecca, P. Papotti, and D. Santoro. Mapping and Cleaning. In Proceedings of the IEEE International Conference on Data Engineering - ICDE, 2014.
3. F. Geerts, G. Mecca, P. Papotti, and D. Santoro. That's All Folks! LLUNATIC Goes Open Source.
In Proceedings of the International Conference on Very Large Databases - VLDB, 2014.
4. J. He, E. Veltri, D. Santoro, G. Li, G. Mecca, P. Papotti, and N. Tang. Interactive and deterministic data cleaning. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, pages 893–907, 2016.
5. T. Heinis, A. Ailamaki, et al. Data infrastructure for medical research. Foundations and Trends in Databases, 8(3):131–238, 2017.
6. A. Holzinger. Machine learning for health informatics. In Machine Learning for Health Informatics, pages 1–24. Springer, 2016.
7. I. Kononenko. Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in Medicine, 23(1):89–109, 2001.
8. P. Lapadula, G. Mecca, D. Santoro, L. Solimando, and E. Veltri. Humanity Is Overrated. Or Not. Automatic Diagnostic Suggestions by Greg, ML. In New Trends in Databases and Information Systems, pages 305–313. Springer International Publishing, 2018.
9. R. H. Miller and I. Sim. Physicians' use of electronic medical records: barriers and solutions. Health Affairs, 23(2):116–126, 2004.
10. O. Mohammed and R. Benlamri. Developing a semantic web model for medical differential diagnosis recommendation. Journal of Medical Systems, 38(10):79, 2014.
11. N. Peek, C. Combi, R. Marin, and R. Bellazzi. Thirty years of Artificial Intelligence in Medicine (AIME) conferences: a review of research themes. Artificial Intelligence in Medicine, 65(1):61–73, 2015.
12. P. Rajpurkar, A. Y. Hannun, M. Haghpanahi, C. Bourn, and A. Y. Ng. Cardiologist-level arrhythmia detection with convolutional neural networks. arXiv preprint arXiv:1707.01836, 2017.
13. P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
14. J. Soni, U. Ansari, D. Sharma, and S. Soni. Predictive data mining for medical diagnosis: an overview of heart disease prediction. International Journal of Computer Applications, 17(8):43–48, 2011.
15. I. Steadman. IBM's Watson is better at diagnosing cancer than human doctors. WIRED, 2013.