<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating Reward Functions using IRL Towards Individualized Cancer Screening</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>yiotis P</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>tousis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon X. H</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Willi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>m Hsu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>x A. T. Bui</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>UCLA Bioengineering Department</institution>
          ,
          <addr-line>Los Angeles, CA, 90095, USA</addr-line>
          ,
          <institution>UCLA Department of Radiological Sciences</institution>
          ,
          <addr-line>Los Angeles, CA, 90095</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Cancer screening is a large, population-based intervention that would benefit from tools enabling individually tailored decision making to decrease unintended consequences such as overdiagnosis. The heterogeneity of cancer screening participants underscores the need for more personalized approaches. Partially observable Markov decision processes (POMDPs) can be used to suggest optimal, individualized screening policies. However, determining an appropriate reward function can be challenging. Here, we propose the use of inverse reinforcement learning (IRL) to form reward functions for lung and breast cancer screening POMDP models. Using data from the National Lung Screening Trial and our institution's breast screening registry, we developed two POMDP models with corresponding reward functions. Specifically, the maximum entropy (MaxEnt) IRL algorithm with an adaptive step size was used to learn rewards more efficiently, and was combined with a multiplicative model to learn state-action pair rewards in the POMDP. The lung and breast cancer screening models were evaluated based on their ability to recommend appropriate screening decisions before the diagnosis of cancer. Results are comparable with experts' decisions. The lung POMDP demonstrated improved performance in terms of recall and false positive rate in the second screening and post-screening stages. Precision (0.02-0.05) was comparable to experts' (0.02-0.06). The breast POMDP has excellent recall (0.97-1.00), matching the physicians, and a satisfactory false positive rate (&lt; 0.03). The reward functions learned with the MaxEnt IRL algorithm, when combined with POMDP models in lung and breast cancer screening, demonstrate performance comparable to experts.</p>
      </abstract>
      <kwd-group>
        <kwd>Cancer screening</kwd>
        <kwd>maximum entropy inverse reinforcement learning</kwd>
        <kwd>partially-observable Markov decision processes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Annually, millions of people undergo screening for disease prevention and
surveillance. From these tests, physicians aim to make decisions based on the patient's
past results and most current observations, determining a subsequent action
(e.g., further diagnostic testing, increased monitoring, following regular screening
schedules, etc.) that optimizes early detection of health problems while
balancing other (pragmatic) concerns (e.g., patient quality of life, resource utilization,
cost). Choosing the "best" next step and tailoring screening for each person is
challenging: selecting an action of benefit in the immediate future may not be
optimal over the long-term, given the particulars of an individual (i.e., a locally
greedy approach vs. a global optimization).</p>
      <p>
        Sequential decision making methods provide a potential solution. Such
approaches can integrate and analyze multiple sources of patient data, while
handling issues related to temporal credit assignment. In particular, partially
observable Markov decision processes (POMDPs) have been applied to cancer
screening (e.g., breast, colorectal, prostate [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]) to determine policies based on
patients' risk factors and prior screening results. Notably, POMDP models used in
medicine typically use a reward function adopted from cost-effectiveness studies
or posed in terms of quality-adjusted life years (QALYs). While such
functions are informative about general populations, they do not necessarily reflect
how an experienced clinician would make a decision, especially given a specific
individual's medical history and preferences. Indeed, little work has been done
in designing reward functions that emulate experts' decision processes.
      </p>
      <p>
        Here, we propose using the Maximum Entropy Inverse Reinforcement
Learning (MaxEnt IRL) algorithm [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to establish reward functions from retrospective
screening data, learning how an expert physician may select a given action based
on observed test results. We use an adaptive step size to expedite the
convergence rate of MaxEnt IRL. Importantly, we present how to use the MaxEnt
IRL learned rewards to generate state-action pair rewards that can be used in
POMDPs. We demonstrate this work using two real-world clinical datasets for
lung and breast cancer screening, mimicking how clinicians made decisions
regarding patients. We evaluate the resultant POMDP policies using the MaxEnt
IRL reward functions, comparing model performance to experts' actions. We
conclude that the MaxEnt IRL algorithm is an efficient and accurate method for
estimating sensible reward functions for cancer screening.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Although Markov decision processes (MDPs) and POMDPs are used in a
number of domains, their application in healthcare is limited and few strategies
exist for estimating the associated reward functions that drive agent behavior
in clinical settings. Classic examples include: Bennett et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], who proposed a
cost-effectiveness metric based on the cost required to obtain one unit of outcome
change (CPUC); Hauskrecht et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], who designed a reward model that
combines economic cost and patient quality of life measures; and Tusch et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
who predicated rewards on 30-day mortality risk for a surgical procedure. In
contrast, we take advantage of growing amounts of longitudinal data, using recorded
information and actions from electronic health records (EHRs) and other
observational data sources to learn a POMDP reward function that imitates expert
physicians' behavior for desired health outcomes. Specifically, IRL is proposed
for this task.
      </p>
      <p>
        Briefly, IRL addresses the problem of obtaining a reward function given an
agent's optimal behavior over time towards a stated goal. A reward function for
the environment is unknown and is hence learned through empirical
investigation of sensory inputs (i.e., observations) that progressively change the agent's
selection of different actions. Two families of IRL algorithms exist: 1) linear
programming (LP) methods [
        <xref ref-type="bibr" rid="ref1 ref13">1,13</xref>
        ]; and 2) probabilistic IRL algorithms [
        <xref ref-type="bibr" rid="ref18 ref3">3,18</xref>
        ]. While
potentially more computationally complex, probabilistic IRL approaches have
two advantages: they guarantee a unique solution for deterministic MDPs; and
compared to LP methods, they can handle stochasticity in the data. Babes-Vroman et
al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] developed a maximum likelihood IRL algorithm using clusters of experts'
data trajectories to characterize different intentions. Applying the maximum
likelihood IRL algorithm to each cluster subsequently derives a reward function
representing the experts' behavior. Ziebart et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] describe a probabilistic IRL
algorithm that employs the principle of maximum entropy, dealing with noise
and imperfect behavior as it normalizes globally over behaviors. In this approach,
demonstrated for modeling routing preferences of vehicle drivers, behaviors with
higher rewards are exponentially preferred by the algorithm when learning the
reward function. Here, we build on and adapt this approach to obtain reward
functions for cancer screening POMDPs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Materials and Methods</title>
      <p>
        <bold>NLST Dataset.</bold>
The National Lung Screening Trial (NLST) is a multi-site randomized controlled
trial that demonstrated a 20% mortality reduction in lung cancer screening
using low-dose computed tomography (LDCT) relative to plain chest
radiography [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For this work, we used data from the NLST's LDCT arm, comprising
approximately 25,500 participants that underwent three annual screenings and
follow-up post screening. We further filtered this dataset to those subjects who had
a reported pulmonary nodule based on imaging. Preprocessing
the NLST data is not straightforward, as longitudinal tracking of the nodules
was not considered at the time of the study. Thus, to use imaging-related
information, we assumed that, for individuals with
only one reported nodule, an imaging finding in the same anatomical location over time is the
same nodule across the three screening points of the trial. This criterion
further constrained our dataset to 5,694 LDCT subjects. From this subgroup, we
learned a reward function, then trained and tested a POMDP. Note that for the
reward function we made use of the recorded diagnostic follow-up variables (e.g.,
recommendation for other procedures) to inform actions.
      </p>
      <p>
        <bold>Athena Dataset.</bold>
The Athena Breast Health Network [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a University of California (UC)-wide
initiative around breast cancer screening and treatment. The effort started in
2009 and includes women who underwent breast screening at five academic
medical centers. The portion available at our institution (UCLA) consists of 49,244
patients, with follow-ups of up to 4.8 years; this subset represents 96,515
screening and diagnostic mammograms (MGs), and 2,713 diagnostic biopsies. MG
results are reported in Breast Imaging Reporting and Data System (BI-RADS)
scores [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We selected patients with initial risk (Gail) scores, four consecutive
screenings, and valid BI-RADS scores; these scores, along with biopsy results, were
considered per breast side (i.e., left, right). In total, 2,095 patients with left breast MGs and 2,036
patients with right breast MGs (4,131 total cases) were used in this study.
      </p>
      <p><bold>Partially Observable Markov Decision Processes.</bold>
An MDP is represented by a tuple of states, actions, rewards, action-dependent
state transition dynamics (i.e., transition probabilities), and a discount factor.
A POMDP is an extension to MDPs with two additional components:
observations and state-dependent observation dynamics (i.e., observation probabilities).
The state of the agent in POMDPs is partially observable. As such, its state is
modeled as a probability distribution over the states, called the belief state, that
gets updated over time based on the observations experienced by the agent.</p>
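      <p>The belief update described above can be sketched as a discrete Bayes filter. This is a minimal illustration under stated assumptions, not the authors' implementation: the array layout (T[a, s, s2] and O[a, s2, o]), the function name belief_update, and every probability in the example are hypothetical values chosen only to make the sketch runnable.</p>
      <preformat>
```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step for a discrete POMDP.

    Assumed layout: T[a, s, s2] = P(s2 | s, a) and O[a, s2, o] = P(o | s2, a).
    """
    predicted = belief @ T[action]                        # predict the next-state distribution
    unnormalized = predicted * O[action, :, observation]  # weight by observation likelihood
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total                           # renormalize to a distribution

# Hypothetical model: 3 states, 1 action, 2 observations (0 = normal, 1 = abnormal).
T = np.array([[[0.90, 0.09, 0.01],
               [0.10, 0.80, 0.10],
               [0.00, 0.00, 1.00]]])
O = np.array([[[0.95, 0.05],
               [0.40, 0.60],
               [0.10, 0.90]]])
b0 = np.array([0.97, 0.02, 0.01])   # hypothetical initial belief, e.g. from a risk model
b1 = belief_update(b0, action=0, observation=1, T=T, O=O)
```
      </preformat>
      <p>In this toy setting an abnormal observation shifts probability mass away from the first state, which is the behavior the belief state is meant to capture.</p>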
      <p>
        We designed and evaluated two separate POMDPs for lung and breast cancer
screening. Each model consists of three states and two actions. The observations
of each POMDP are domain based: in the lung model, they represent findings
obtained from LDCT imaging studies, including nodule size, consistency, location,
and margins; in the breast model, they represent BI-RADS scores derived from
MG interpretations. Given the nature of each dataset, the lung and breast
models have horizons of three and four years, respectively, with 6-month and
1-year epochs. Each epoch represents a time point for which we have information
on the cancer status of the patient (diagnosed with cancer or not). Transition and
observation probabilities for each POMDP model are learned using the
expectation maximization (EM) algorithm for learning dynamic Bayesian networks,
from each dataset. Both models were solved using the QMDP approximation
solver [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
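      <p>The QMDP approximation solves the underlying fully observable MDP and then weights the resulting Q-values by the current belief. The sketch below illustrates that idea only; it is not the solver used in this work, and the transition matrix, rewards, discount factor, and function names are hypothetical.</p>
      <preformat>
```python
import numpy as np

def qmdp_q_values(T, R, gamma=0.95, iters=500):
    """Value iteration on the underlying MDP.

    Assumed layout: T[a, s, s2] = P(s2 | s, a); R[s, a] = reward of action a in state s.
    """
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)                              # greedy state values
        Q = R + gamma * np.einsum("ast,t->sa", T, V)   # Bellman backup
    return Q

def qmdp_action(belief, Q):
    """Choose the action with the highest belief-weighted Q-value."""
    return int(np.argmax(belief @ Q))

# Hypothetical 3-state model with actions 0 = continue screening, 1 = intervention.
P = np.array([[0.90, 0.09, 0.01],
              [0.10, 0.80, 0.10],
              [0.00, 0.00, 1.00]])
T = np.stack([P, P])                  # assumption: disease dynamics do not depend on the action
R = np.array([[ 1.0, -1.0],
              [ 0.2, -0.2],
              [-1.0,  1.0]])          # hypothetical state-action rewards
Q = qmdp_q_values(T, R)
low_risk_action = qmdp_action(np.array([0.97, 0.02, 0.01]), Q)
high_risk_action = qmdp_action(np.array([0.05, 0.15, 0.80]), Q)
```
      </preformat>
      <p>Because QMDP assumes that all uncertainty disappears after one step, it never chooses an action purely to gather information; this is a known property of the approximation.</p>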
      <p><bold>Lung cancer screening POMDP.</bold> Figure 1 (left) depicts the lung POMDP,
illustrating the state space and allowed transitions between states, as well as
the observations of each state. The state space consists of three states. The
no-cancer (NC) state represents any case with no suspicious abnormalities (i.e.,
no pulmonary nodules of 4 mm or larger). The uncertain (U) state represents any
case with a noted finding (i.e., nodules 4 mm or larger) that is not yet a lung cancer.
Lastly, the invasive-cancer (IC) state is any case with a lung cancer diagnosis
confirmed through additional diagnostic tests. The IC state is terminal,
such that any individual who enters it leaves the screening process for treatment.
An LDCT action implies continuation of screening, whereas an intervention
action refers to any diagnostic procedure (e.g., thoracotomy, biopsy, diagnostic
CT, positron emission tomography (PET) scan). Observations represent LDCT
findings (nodule size, consistency, margins, and anatomic location) and the
occurrence of an intervention. To generate initial belief states for each individual in
our dataset, we used the Tammemagi PLCOm2012 model with demographic and
clinical features at baseline to predict the risk of cancer. Demographic features
used include age, education, race, and body mass index. Clinical features used
were COPD, family history of lung cancer, personal history of cancer, smoking
status, smoking intensity, and duration of smoking.</p>
      <p><bold>Breast cancer screening POMDP.</bold> The breast POMDP model also consists
of three states: the no-cancer (NC) state in which no abnormalities are seen,
the benign (B) state in which a benign breast disease diagnosis follows the MG,
and the malignant (MA) cancer state in which the disease is confirmed through
biopsy. MA is similarly a terminal state in which the patient leaves the
screening process for treatment. Figure 1 (right) shows the breast cancer screening
POMDP, transitions, observations (BI-RADS scores 1, 2, 3, 4A, 4B, 4C, 5), and
actions. Though an intervention (biopsy in the breast cancer context) is possible
after each MG, in practice biopsies are only performed after an MG of BI-RADS
4 or higher. For an initial belief, we used the patient's Gail score. The Gail score
is an absolute risk estimate derived using age, age at menarche, age at first birth,
the number of first-degree relatives with breast cancer, the number of previous
breast biopsies, and race.</p>
      <p>
        <bold>Maximum Entropy IRL.</bold>
In IRL, the reward function, r, is assumed to be a linear combination of feature
vectors f<sub>s</sub> and weights θ (θ<sup>T</sup> is the transpose of θ):
        <disp-formula><label>(1)</label><tex-math>r(\zeta; \theta) = \theta^{T} f_{\zeta} = \sum_{s \in \zeta} \theta^{T} f_{s}</tex-math></disp-formula>
A feature count, f<sub>ζ</sub>, is the sum of feature vectors of the states visited along
a trajectory, where f<sub>s</sub> represents binary vectors indicating state values. Inputs
to the MaxEnt IRL algorithm are an MDP and a set of trajectories (D) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
A path or a trajectory (ζ) represents the sequence of states (s) and ensuing
actions followed by an agent in an MDP. For example, in the NLST dataset, a
trajectory comprises three epochs (i.e., the three annual screening exams) with
state-action pairs describing the lung cancer states and the actions taken (e.g.,
NC-LDCT, U-LDCT, and IC-I<sub>Biopsy</sub>). The probability of a trajectory occurring in
our set of trajectories is proportional to the exponential of the reward/cost of
the trajectory [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
        <disp-formula><label>(2)</label><tex-math>p(\zeta; \theta) \propto \exp\left(r(\zeta; \theta)\right)</tex-math></disp-formula>
      </p>
      <p>As such, trajectories of equal reward are equally likely to be executed by the
expert, whereas trajectories of less reward are less likely. The probability
distribution over paths with maximum information entropy is parameterized over θ,
where Z(θ) is the partition function, Z(θ) = Σ<sub>ζ∈D</sub> exp r(ζ; θ):
        <disp-formula><label>(3)</label><tex-math>p(\zeta; \theta) = \frac{1}{Z(\theta)} \exp\left(r(\zeta; \theta)\right)</tex-math></disp-formula>
The log likelihood of the trajectories (loss function) is shown in Equation 4, where M
is the number of trajectories:
        <disp-formula><label>(4)</label><tex-math>L = \frac{1}{M} \sum_{\zeta \in D} \log p(\zeta; \theta)</tex-math></disp-formula>
The gradient ∇L is the difference between the feature expectation and the sum over
state visitation frequencies multiplied by feature vectors:
        <disp-formula><label>(5)</label><tex-math>\nabla L = \tilde{f} - \sum_{s_i} D_{s_i} f_{s_i}</tex-math></disp-formula>
      </p>
      <p>
A feature expectation, f̃, is defined as the average of all feature counts across
all trajectories. The frequency of state visitation, D<sub>s<sub>i</sub></sub>, can be computed using a
dynamic programming algorithm; see [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ] for more information regarding this
algorithm. The pseudocode of the MaxEnt IRL algorithm can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
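      <p>The gradient described above (feature expectation minus expected state visitation frequencies) can be illustrated on a toy problem. The following is a hedged sketch rather than the pseudocode referenced in the text: it assumes one-hot state features, a finite horizon equal to the trajectory length, a fixed learning rate, and a hypothetical deterministic three-state chain; the function names expected_svf and maxent_irl are illustrative.</p>
      <preformat>
```python
import numpy as np

def expected_svf(T, reward, p0, horizon):
    """Expected state visitation frequencies under the MaxEnt policy.

    Assumed layout: T[a, s, s2] = P(s2 | s, a); reward[s] is the state reward;
    p0[s] is the initial state distribution.
    """
    n_states = T.shape[1]
    V = np.zeros(n_states)
    policies = []
    for _ in range(horizon):                       # backward pass (soft value iteration)
        Q = reward[None, :] + T @ V                # Q[a, s]
        V = np.logaddexp.reduce(Q, axis=0)         # soft maximum over actions
        policies.append(np.exp(Q - V[None, :]))    # pi[a, s]
    policies.reverse()                             # policies[t] is used at time step t
    d = p0.copy()
    svf = np.zeros(n_states)
    for pi in policies:                            # forward pass
        svf += d
        d = np.einsum("s,as,ast->t", d, pi, T)
    return svf

def maxent_irl(T, trajectories, p0, lr=0.1, iters=300):
    """Gradient ascent on the log likelihood with one-hot state features."""
    n_states = T.shape[1]
    horizon = len(trajectories[0])
    features = np.eye(n_states)
    # Feature expectation: average feature count over the expert trajectories.
    f_tilde = np.mean([features[traj].sum(axis=0) for traj in trajectories], axis=0)
    theta = np.zeros(n_states)
    for _ in range(iters):
        reward = features @ theta
        grad = f_tilde - expected_svf(T, reward, p0, horizon) @ features
        theta += lr * grad
    return theta

# Hypothetical deterministic chain: action 0 stays, action 1 moves one state forward.
stay = np.eye(3)
move = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
T = np.stack([stay, move])
p0 = np.array([1.0, 0.0, 0.0])
trajectories = [[0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 2]]  # made-up expert data
theta = maxent_irl(T, trajectories, p0)
```
      </preformat>
      <p>In this toy example the hypothetical expert stays in the first state more often than a uniformly random agent would, so the learned weight for that state ends up the largest.</p>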
      <p>
        <bold>Adaptive step size.</bold>
To improve the convergence of the MaxEnt IRL algorithm, we introduce an
adaptive learning rate approach for the update rule of the gradient descent. The
idea behind making the step size adaptive is to calculate the inner product of
the gradient in the current step, ∇L<sub>i</sub>, with its value from
the previous step, ∇L<sub>i-1</sub>. If the two are in the same direction then the step size can be
increased, otherwise it is decreased. Following [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] we define the learning rate as
α/(t + A), where t is dependent on the gradient inner product (which becomes
the dot product in higher dimensions), and α and A are constants. The role of t is
to regulate the learning rate:
        <disp-formula><label>(7)</label><tex-math>t_{i+1} = \max\left(t_{i} + f\left(-\langle \nabla L_{i}, \nabla L_{i-1} \rangle\right), 0\right)</tex-math></disp-formula>
In this definition, f(·) represents the following sigmoidal function:
        <disp-formula><tex-math>f(x) = f_{min} + \frac{f_{max} - f_{min}}{1 - \frac{f_{max}}{f_{min}} e^{-x/\omega}}</tex-math></disp-formula>
In the above expressions, α, A, f<sub>min</sub>, f<sub>max</sub>, and ω are
user-defined constants obtained from [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], with f<sub>min</sub> &lt; 0, f<sub>max</sub> &gt; 0, and ω &gt; 0.
      </p>
      <p>
We assumed that, given the outcome of a known cancer diagnosis for each
individual over time, partial observability was no longer a problem while training,
so learning the rewards of state-action pairs of an MDP instead of a POMDP
was sufficient and computationally more efficient. However, the MaxEnt IRL
algorithm computes the rewards of each state of an MDP, not state-action pair
rewards (r(s, a)). To estimate rewards for each state-action pair combination,
we designed two MDPs:
        <list list-type="order">
          <list-item><p>A state MDP model. The states of this MDP are the states depicted in Figure
2, for the lung and breast models. The transition matrix of the state MDP
is the same transition matrix used in its respective POMDP model.</p></list-item>
          <list-item><p>An action MDP model. In the action MDP, the states are defined by the
previous action of the agent. These states model the options for screening
(e.g., continue annual screening) and intervention (e.g., biopsy), which the
agent enters after performing each action. The action MDP transition model
represents the probability of transitioning from the LDCT/MG state to the I
state.</p></list-item>
        </list>
      </p>
      <p>
A stratified 5-fold cross validation study design was used to evaluate the POMDP
models built from the NLST and the Athena datasets. The training set of each
fold is used to learn the transition and observation matrices of the POMDPs, as
well as the rewards using the MaxEnt IRL algorithm.
      </p>
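      <p>The adaptive step size rule from this section can be sketched as follows. This is an illustration under assumptions: the constants (alpha, A, f_min, f_max, omega) are placeholder values rather than those used in the study, the function names are hypothetical, and the negated inner product inside the sigmoid is the reading under which gradients pointing in the same direction enlarge the step.</p>
      <preformat>
```python
import math

def sigmoid_f(x, f_min=-0.5, f_max=1.0, omega=0.1):
    """Sigmoid with f(0) = 0 that tends to f_min for very negative x and f_max for very positive x."""
    z = -x / omega
    if z > 700.0:          # guard against overflow in exp; the limit is f_min
        return f_min
    return f_min + (f_max - f_min) / (1.0 - (f_max / f_min) * math.exp(z))

def next_time(t, grad, prev_grad):
    """Update the artificial time t from the inner product of consecutive gradients."""
    inner = sum(g * p for g, p in zip(grad, prev_grad))
    return max(t + sigmoid_f(-inner), 0.0)

def step_size(t, alpha=1.0, A=10.0):
    """Learning rate alpha / (t + A); a smaller t gives a larger step."""
    return alpha / (t + A)

# Hypothetical gradients: one pair aligned, one pair opposed.
t0 = 5.0
t_same = next_time(t0, [1.0, 0.5], [0.8, 0.4])
t_opposite = next_time(t0, [1.0, 0.5], [-0.8, -0.4])
```
      </preformat>
      <p>With these placeholder constants, aligned gradients lower t and therefore raise the step size, while opposed gradients do the reverse.</p>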
      <p>Fig. 3. Comparison of MaxEnt IRL with and without an adaptive step size; panels (a) and (b) show the lung cancer states' rewards, and panels (c) and (d) the lung cancer actions' rewards. State and action rewards are computed using the MaxEnt IRL and normalized by
range. Left: Using an adaptive step size. Right: Without using an adaptive step size.
The adaptive step size MaxEnt IRL algorithm converges to a solution significantly
faster than the MaxEnt IRL without an adaptive step size.</p>
      <p>
        We compare the MaxEnt IRL with and without the
adaptive step size and assess the speed of convergence. Figure 3 depicts the
computed rewards for states and actions for the lung POMDP over the number
of iterations of gradient descent in the MaxEnt IRL algorithm, with and without
an adaptive step size. A similar convergence trend is observed with the breast
POMDP. As shown, the adaptive step size method converges to the correct
solution more quickly than the standard MaxEnt IRL implementation. For the
evaluation of the two models we use a reward function derived from rewards
normalized to the [-1, 1] range.
      </p>
      <p><bold>Lung and breast POMDP results.</bold>
We used the longitudinal observations from the NLST and Athena datasets as
input to POMDPs such that each sequential observation updates the belief state
of the agent. The belief state of the POMDP, at each epoch, is then used to
select the next (optimal) action, with the objective of early detection of cancer.
The POMDP models can suggest to continue screening (i.e., MG, LDCT) or to
perform an intervention (i.e., biopsy or diagnostic imaging). If an intervention
is performed, the individual is removed from further consideration. Evaluation
of the POMDP is posed as a binary problem: if the POMDP suggests continued
screening (LDCT/MG) then the patient is classified as negative for cancer; if it
suggests an intervention, then the patient is classified as positive for cancer. Based
on this definition, if the model suggests an LDCT/MG and the patient did not
have a confirmed diagnosis of cancer in a given epoch, it is considered a true
negative (TN); if the patient had a confirmed diagnosis of cancer then it is a false
negative (FN). Conversely, if the model suggests an intervention and the patient
did not have cancer in a given epoch, then it is considered a false positive (FP); if
the patient had a diagnosis of cancer then it is considered a true positive (TP).
Performance metrics were estimated for each epoch of the screening process.
Any subject diagnosed with cancer is removed from the subsequent epoch. The
POMDP models are compared against the equivalent physician decisions
(recommendations) at each epoch, applying a similar framework for TN/FN/FP/TP
to the experts, given the known cancer outcomes from each dataset (e.g., if the
physicians suggested an LDCT/MG and the patient did not have a confirmed
diagnosis of cancer, it is considered a true negative, etc.). Table 2 shows the
performance of the lung and breast POMDPs and the corresponding
performance of physicians on the same dataset. Notably, both POMDP models show
performance comparable to experts. The lung cancer screening model has worse
performance in terms of recall in the first and third screening epochs, but an
improved performance in terms of recall and false positive rate in the second
screening and post-screening. The breast cancer screening model demonstrates
excellent recall (as do the expert physicians) but a slightly worse false positive
rate. Cohen's kappa coefficient of agreement was used to assess the
concordance between the POMDP models and physicians. The kappa score of the
lung POMDP and physicians decreases over time due to the large number of
false positives: the lung POMDP and the physicians classify largely different
cases as false positives. The breast POMDP has a high kappa
score, demonstrating strong agreement with physicians in terms of false positives
and true positives. For both lung and breast models, the variance of kappa per
screening is less than 0.03.</p>
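      <p>The per-epoch evaluation described above reduces to counting TP, FP, TN, and FN and computing Cohen's kappa between two sets of recommendations. The sketch below illustrates the bookkeeping on made-up data; the function names and the eight example cases are hypothetical and do not reproduce any result from the study.</p>
      <preformat>
```python
def screening_metrics(recommended_intervention, has_cancer):
    """Recall, precision, and false positive rate for one screening epoch.

    An intervention recommendation counts as a positive call; continued
    screening counts as a negative call.
    """
    tp = fp = tn = fn = 0
    for pred, truth in zip(recommended_intervention, has_cancer):
        if pred and truth:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "recall": tp / (tp + fn) if tp + fn else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
        "fpr": fp / (fp + tn) if fp + tn else nan,
    }

def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters (1 = intervention, 0 = screening)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)   # chance agreement
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Made-up example: eight patients in a single epoch.
model     = [1, 0, 0, 1, 0, 0, 0, 1]
physician = [1, 0, 0, 0, 0, 0, 1, 1]
cancer    = [1, 0, 0, 0, 0, 0, 0, 1]
model_metrics = screening_metrics(model, cancer)
kappa = cohens_kappa(model, physician)
```
      </preformat>
      <p>In this example both raters have the same number of false positives but on different patients, which lowers kappa even though recall and false positive rate are identical.</p>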
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>
        POMDPs, through the use of beliefs and a hidden state space, can overcome some
of the limitations seen in other sequential decision making models used in cancer
screening. For instance, we modeled a hidden cancer state space in three parts
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]: no-cancer, benign/indeterminate, and malignant/invasive cancer. Modeling
the cancer state space with an additional state rather than a binary state space
allows lower risk individuals (i.e., no abnormalities), who constitute a large
portion of screening cases and thus result in highly imbalanced datasets, to be
distinguished from medium (i.e., benign growth) and high risk individuals (i.e.,
malignant abnormality).
      </p>
      <p>
        Driven by the need to define the reward function in these screening POMDPs,
we explored the use of the MaxEnt IRL algorithm towards generation of
state-action reward pairs. As noted earlier, cost and utility estimation are frequently
adopted as reward functions in healthcare models. However, cost has certain
limitations as it does not generalize to the whole population equally, and does
not reflect the importance of quality outcomes. Additionally, QALY data are
scarce, and arguably expensive to collect [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In contrast, a reward function
learned using the MaxEnt IRL algorithm aims to maximize the objective of
state-action trajectories. We introduced a multiplicative model for representing
state-action pairs as products of state rewards and action rewards. The
multiplicative model has the advantage of clearly showing the difference in utility
between rewards of different actions, which is what drives decision
recommendation. Rewards are thus learned based on the state-visitation frequency of each
trajectory. In this context, states with fewer visitations across each trajectory
earn the lowest reward (e.g., invasive or malignant cancer state), which is why
only cancer and non-cancer cases with a complete trajectory are used to learn
rewards in our framework. Modeling the expert's decisions with the MaxEnt IRL
algorithm resulted in reward functions for the POMDP models with performance
comparable to experts. We noticed that when using aggressive reward functions
(i.e., identifying all cancer cases), the true positive rate exceeded physicians' true
positive rate but at the expense of a higher false positive rate, which in clinical
practice can translate into higher costs and unnecessary psychological burden
on the patient. Including more observational variables, derived from medical
images, in the screening process can overcome this trade-off between true positive
and false positive rate. The overall true positive rate and false positive rate using
our learned reward functions in the POMDPs are comparable to experts.
Nonetheless, in some cases the experts had false negative cases, which is also captured
by our approach. When compared with other machine learning algorithms at
the baseline of the lung and breast paradigms, the POMDP models demonstrate
improved performance.
      </p>
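      <p>The multiplicative model mentioned above can be illustrated as an outer product of state rewards and action rewards. This sketch rests on assumptions: "normalized by range" is taken to mean min-max scaling to [-1, 1], and the state and action reward values are invented for the example rather than learned values from the study.</p>
      <preformat>
```python
import numpy as np

def normalize_to_unit_range(values):
    """Min-max scale values to [-1, 1] (assumed meaning of normalization by range)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0.0:
        return np.zeros_like(values)
    return 2.0 * (values - values.min()) / span - 1.0

def state_action_rewards(state_rewards, action_rewards):
    """r(s, a) as the product of a normalized state reward and a normalized action reward."""
    return np.outer(normalize_to_unit_range(state_rewards),
                    normalize_to_unit_range(action_rewards))

# Invented rewards: states (NC, U, IC) and actions (screening, intervention).
state_rewards = [0.8, -0.1, -0.6]
action_rewards = [0.5, -0.5]
R = state_action_rewards(state_rewards, action_rewards)
```
      </preformat>
      <p>With these invented numbers, the product makes screening the higher-reward action in the no-cancer state and intervention the higher-reward action in the cancer state, which is the kind of separation between actions that the multiplicative model is meant to expose.</p>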
      <p>The first limitation of using MaxEnt IRL in this study is that more
than one combination of rewards can define the same problem.
To overcome this, a policy iteration algorithm can be used rather than
value iteration, as the policy space is finite in comparison to the rewards space
(hence the policy iteration algorithm is guaranteed to optimally converge). A
second limitation is the assumption that reward functions are only based on state
visitation frequencies. To assess the quality of these reward functions, a
comparison of suggested recommendations with patient satisfaction could be used. Other
limitations concern assumptions about the nature of our datasets. While lung
and breast cancer screening tests occurred roughly at one-year intervals, we
assumed that screening occurs annually (i.e., at a fixed frequency). Moreover, data
imbalance is a function of time, as at each screening point the number of cancer
and non-cancer cases changes (i.e., at the outset of a screening period, more
cancers are found at the beginning of a dataset). We did not account for this
dynamic nature of the dataset during training. Given the small number of cancer
cases across each screening point of both datasets, we utilized a stratified 5-fold
cross-validation to obtain an unbiased estimate of model performance. To
simplify modeling, our lung POMDP model considered only cases reporting a single
pulmonary nodule over the course of the trial; this represents only a subset of the
screened individuals, as many subjects have more than one such finding. Lastly,
for the Athena dataset, in breast cancer screening, patients with BI-RADS 1, 2,
or 3 rarely undergo biopsy; thus the true FN rate is likely underestimated.
Future work involves the exploration of MaxEnt IRL in transfer learning between
other datasets and domains, by reusing learned weights.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The authors thank the National Cancer Institute (NCI) for access to the National
Lung Screening Trial data and Dr. Arash Naeim for access to the Athena Breast
Health Network data collected at our institution. This material is based upon
work supported by the National Science Foundation under Grant No. 1722516
and the Department of Radiological Sciences under the Data-Driven Diagnostic
Decision Support (D4S) initiative.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          :
          <article-title>Apprenticeship learning via inverse reinforcement learning</article-title>
          .
          <source>In: Twenty-first international conference on Machine learning - ICML '04</source>
          . p.
          <volume>1</volume>
          (
          <year>2004</year>
          ). https://doi.org/10.1145/1015330.1015430
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Deep Inverse Reinforcement Learning</article-title>
          .
          <source>Tech. rep. (</source>
          <year>2016</year>
          ), https://matthewja.com/pdfs/irl.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Babes-Vroman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marivate</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Littman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Apprenticeship learning about multiple intentions</article-title>
          .
          <source>Proceedings of the 28th International Conference on Machine Learning</source>
          , ICML 2011 pp.
          <fpage>897</fpage>
          -
          <lpage>904</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauser</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Artificial intelligence framework for simulating clinical decision-making: A Markov decision process approach</article-title>
          .
          <source>Artificial Intelligence in Medicine</source>
          <volume>57</volume>
          (
          <issue>1</issue>
          ),
          <fpage>9</fpage>
          -
          <lpage>19</lpage>
          (
          <year>2013</year>
          ). https://doi.org/10.1016/j.artmed.2012.12.003
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Chelsea</given-names>
            <surname>Finn</surname>
          </string-name>
          :
          <source>Deep RL Bootcamp Lecture 10B Inverse Reinforcement Learning - YouTube</source>
          (
          <year>2017</year>
          ), https://www.youtube.com/watch?v=d9DlQSJQAoI&amp;t=1012s
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>D'Orsi</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>ACR BI-RADS atlas: breast imaging reporting and data system</article-title>
          . American College of Radiology (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Elson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiatt</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The Athena breast health network: Developing a rapid learning system in breast cancer prevention, screening, treatment, and care</article-title>
          , vol.
          <volume>140</volume>
          . Springer US (7
          <year>2013</year>
          ). https://doi.org/10.1007/s10549-013-2612-0
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hauskrecht</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fraser</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Planning treatment of ischemic heart disease with partially observable Markov decision processes</article-title>
          .
          <source>Artificial Intelligence in Medicine</source>
          <volume>18</volume>
          (
          <issue>3</issue>
          ),
          <fpage>221</fpage>
          -
          <lpage>244</lpage>
          (
          <year>2000</year>
          ). https://doi.org/10.1016/S0933-3657(99)00042-1
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hauskrecht</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milos</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Dynamic decision making in stochastic partially observable medical domains: Ischemic heart disease example</article-title>
          .
          <source>In: Lecture Notes in Computer Science</source>
          , pp.
          <fpage>296</fpage>
          -
          <lpage>299</lpage>
          . springerlink. https://doi.org/10.1007/bfb0029462
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pluim</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viergever</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Adaptive stochastic gradient descent optimisation for image registration</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>81</volume>
          (
          <issue>3</issue>
          ),
          <fpage>227</fpage>
          -
          <lpage>239</lpage>
          (
          <year>2009</year>
          ). https://doi.org/10.1007/s11263-008-0168-y
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Maillart</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ivy</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ransom</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diehl</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Assessing Dynamic Breast Cancer Screening Policies</article-title>
          .
          <source>Operations Research</source>
          <volume>56</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1411</fpage>
          -
          <lpage>1427</lpage>
          (
          <year>2008</year>
          ). https://doi.org/10.1287/opre.1080.0614
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. National Lung Screening Trial Research Team,
          <string-name>
            <surname>Aberle</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adams</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>W.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clapp</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fagerstrom</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gareen</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatsonis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcus</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sicks</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Reduced lung-cancer mortality with low-dose computed tomographic screening</article-title>
          .
          <source>The New England journal of medicine</source>
          <volume>365</volume>
          (
          <issue>5</issue>
          ),
          <fpage>395</fpage>
          -
          <lpage>409</lpage>
          (8
          <year>2011</year>
          ). https://doi.org/10.1056/NEJMoa1102873
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Algorithms for inverse reinforcement learning</article-title>
          .
          <source>Proceedings of the Seventeenth International Conference on Machine Learning</source>
          pp.
          <fpage>663</fpage>
          -
          <lpage>670</lpage>
          (
          <year>2000</year>
          ). https://doi.org/10.2460/ajvr.67.2.323
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Petousis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>S.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aberle</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bui</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Prediction of lung cancer incidence on the low-dose computed tomography arm of the National Lung Screening Trial: A dynamic Bayesian network</article-title>
          .
          <source>Artificial Intelligence in Medicine</source>
          <volume>72</volume>
          ,
          <fpage>42</fpage>
          -
          <lpage>55</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1016/j.artmed.2016.07.001
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Schaefer</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bailey</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shechter</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          :
          <article-title>Modeling medical treatment using Markov decision processes</article-title>
          .
          <source>Operations Research and Health Care</source>
          pp.
          <fpage>597</fpage>
          -
          <lpage>616</lpage>
          (
          <year>2005</year>
          ). https://doi.org/10.1007/1-4020-8066-2_23
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Thrun</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgard</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Probabilistic robotics</article-title>
          (
          <year>2006</year>
          ). https://doi.org/10.1145/504729.504754
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Tusch</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Optimal sequential decisions in liver transplantation based on a POMDP model</article-title>
          .
          <source>In: ECAI</source>
          . pp.
          <fpage>186</fpage>
          -
          <lpage>190</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ziebart</surname>
            ,
            <given-names>B.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bagnell</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Maximum Entropy Inverse Reinforcement Learning</article-title>
          .
          <source>In: AAAI Conference on Artificial Intelligence</source>
          . pp.
          <fpage>1433</fpage>
          -
          <lpage>1438</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>