Machine Coaching with Proxy Coaches
Vassilis Markos1 , Marios Thoma1 and Loizos Michael1,2
1
    Open University of Cyprus, Nicosia, Cyprus
2
    CYENS Centre of Excellence, Nicosia, Cyprus


                                         Abstract
                                         We evaluate the Machine Coaching paradigm, a human-in-the-loop machine learning methodology,
                                         according to which a human coach and a machine engage in an iterative bidirectional exchange of
                                         explanations, towards improving the machine’s ability to reach conclusions and justify them in a way
                                         that is acceptable to the human coach. To support the systematic empirical investigation of the efficacy
                                         and efficiency of Machine Coaching, we adopt proxy (algorithmic) coaches in the stead of human ones.

                                         Keywords
                                         machine coaching, proxy coaching, explainable AI




1. Introduction
Learning in humans proceeds in many and diverse ways. Under what could be called the
autodidactic paradigm [1, 2], the human learner utilizes whatever information is available. A
human supervisor, if present, may complete / enrich that information, so that the human learner
faces a more benign / informative environment for learning.
   Another, significant, part of human learning takes place under what could be called the
coaching-based paradigm [3, 4], where a human supervisor, or coach, shares with the human
learner not only what the case is in a certain state of the environment, but chiefly why that
case is. This happens whenever parents warn their children to “not run while holding scissors,
because they will hurt themselves”, whenever teachers explain to their students to “conclude
that two triangles are similar because they have congruent angles”, when chess instructors
teach their pupils to “place pieces on squares from which they cannot be easily deflected”, and
when managers direct their assistants to “book a hotel close to the meeting venue for same-day
trips”.
   Such pieces of advice from the coach do not typically come unprompted, but as a reaction
to a wrong or wrongly-justified decision by the learner. A coach offering advice is, in effect,
proactively completing / enriching missing information in states of the environment that the
learner might encounter, by providing conditions on states under which a certain decision should
be reached. Compared to the autodidactic paradigm, the substantial amount of information
communicated under the coaching-based paradigm is conducive to more efficient, albeit coach-
specific, learning, while the reactive nature of advising entails only marginal extra effort
ArgML 2022: 1st International Workshop on Argumentation and Machine Learning @COMMA 2022, September 13, 2022,
Cardiff, Wales, UK
$ vasileios.markos@st.ouc.ac.cy (V. Markos); marios.thoma@st.ouc.ac.cy (M. Thoma); loizos@ouc.ac.cy
(L. Michael)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
from the coach, as humans are known to be competent at identifying counter-arguments when
challenged [5].
   Machine Coaching 1 [3, 4] was proposed as an argumentation-based learning-theoretic frame-
work under which the coaching-based paradigm of learning can be studied. It formalizes the
computational resources available to the coach and the learner, the protocol and language of
their interaction, and the expected quality guarantees on what is learned. Under that framework,
one considers a human coach, with access to a target policy, interacting with a machine learner,
and formally establishes that if the human coach offers appropriate, in a defined sense, advice,
then the machine learner can utilize that advice to learn a hypothesis policy that approximates
the target policy with the expected quality guarantees.
   Undertaking a study with human participants in the role of the coach is, clearly, the ultimate
way to empirically validate the Machine Coaching framework. Yet there are three major
challenges to overcome in such an envisioned study.
   The first challenge stems from the lack of fluency of average humans in formal logic, making
it hard for them to exchange explanations in the native (logic-based) language of Machine
Coaching. This issue could be alleviated by adopting a natural language interface for exchanging
explanations, which would be automatically translated to/from the native language. Although
there is some work in this direction — even using the Machine Coaching framework itself at a
meta-level [6] — the problem is sufficiently complex, and currently lacks a robust solution.
   The second challenge stems from the inaccessibility of the target policy that humans have in
mind when coaching. Without a target policy, one cannot empirically evaluate the quality of the
learned hypothesis policy. If, in addition, no policy agrees with all the advice offered by the
human coach, then the empirical study would confound the validation of the Machine Coaching
framework with the determination of the expressivity of the framework’s native language, and
with the compatibility of the interaction with human cognition.
   The third challenge stems from the physical and mental limitations of humans in prolonged
interactions. A systematic evaluation of the Machine Coaching framework would require
substantial rounds of interactions between the coach and the machine, across diverse target
policies. It is unlikely that humans would maintain efficient and consistent behavior over time,
either due to fatigue, an unconscious adaptation to the empirical setting, or a suppression of
their natural reactions when coaching with an externally-given target policy.
   To eschew these challenges, we resort to the use of a proxy (algorithmic) coach that can
communicate in the native language of Machine Coaching, cope with a given explicit target
policy, and provide advice efficiently and consistently. These features need to be balanced
against the desire to stay close to human coaches, who have only approximate access to their
target policy — conceivably as a compilation, in some intensional or extensional form, of their
relevant life experiences — and are unable to divulge the target policy on cue, but can still react
if some part of it is “challenged” by an external position and associated argument [5]. Though
the proxy coaches introduced below do not claim to serve as a comprehensive simulation of
human ones, they aim to capture the interaction aspects that are important for a sound in vitro
assessment of Machine Coaching. Consequently, we consider our work as a stepping stone
between the theoretically proven efficiency of Machine Coaching and an empirical assessment

1
    Resources available at: https://cognition.ouc.ac.cy/prudens.
of that through a study involving actual human participants.
   Accordingly, we consider proxy coaches with access to an exemplar set of data, labeled with
the predictions of a fixed (but hidden to the proxy coaches) target policy. The key technical
question, then, is how to extract appropriate pieces of advice from the exemplar set. In the sequel
we seek to answer this question by demonstrating how to develop proxy coaches with both
an intensional and an extensional compilation of the exemplar data, and continue to evaluate
them empirically. We remark at this point that our focus in this paper lies primarily in
investigating the various alternatives for proxy coaches and their qualitative differences.
Consequently, while we do present and discuss quantitative results for all chosen proxy coaches,
these results mostly serve as a means to stress the effects each proxy-coaching protocol has on
the coaching process.


2. Background and Related Work
The process whereby an agent learns under the supervision of a more experienced tutor / coach,
who offers advice as a reaction to the learner’s decisions or actions, appears as a suggestion
for the development of AI systems in John McCarthy’s seminal work on the “Advice Taker”
[7]. Recent work [3, 4] has sought to formalize this process, termed Machine Coaching, in
learning-theoretic terms, as a variant of the Probably Approximately Correct (PAC) model [8].
    The eXplainable Interactive Learning (XIL) framework [9] also considers a human supervisor
in the role of a learning “coach”, providing advice either in the form of a (more) correct label,
or a correct explanation. Unlike Machine Coaching, which adopts learning and reasoning
semantics compatible with the language of formal argumentation, XIL assumes black-box access
to an active learning algorithm [10] and a local post-hoc explainer [11]. Correspondingly, the
explanations provided by the “coach” in XIL cannot be directly and elaboration-tolerantly [12]
embedded into the learned model, but are rather used to create additional labeled data for the
further training of an opaque learned model.
    The idea of online ingestion of human advice in Machine Learning processes is present in
other works as well. A way to incorporate user advice into Support Vector Machines is proposed
in [13], using the advice to reduce the number of data points that need to be labeled. In [14], an
interactive framework is proposed that allows human users to label selected data points for the
purpose of training a document classifier. Similarly, in [15], a learning paradigm is proposed
that systematically utilizes user-provided explanations to reduce learning complexity, stressing
that favoring data richness (i.e., including explanations along with labels) over data volume (i.e., having
access to many labeled data without explanations) leads, in certain cases, to equally good or
better performance.
    Beyond speeding up learning, more recent works examine how human knowledge may
benefit the entire learning process, by being an integral part of it. Coactive Learning, proposed
in [16], is an interactive Machine Learning paradigm where a learner receives sub-optimal user
advice that is, inherently, only “slightly better” than the learner’s current prediction. However,
it is shown that several existing supervised algorithms can be altered to accommodate this type
of human-machine interaction.
    Although our work herein focuses on the empirical evaluation of Machine Coaching, the
technique of proxy coaches that we follow would also seem to be applicable, to varying extents,
to the frameworks above. The rest of the works that we briefly review below are not tied to
coaching per se, but relate to our chosen implementation of the proxy coaches.
   In [17], the authors attempt to offer post-hoc explanations for a black-box model by training
a random forest, extracting rules from it, and using them to construct an argumentation theory.
One of our considered proxy coach types also adopts the view that a random forest can be the
source of arguments, but in our case these arguments are not the end product itself, but are
rather the pieces of advice that are given by the proxy coach to the ultimate learner.
   On the other hand, we also consider a proxy coach type that does not compile the available
data into an explicit learned model, but uses them only implicitly. This approach relates to the
works in [18, 19], which investigate a paradigm of implicit learning, where one can reason and
respond to queries by directly consulting the data. Whereas the emphasis of those works is on
answering given queries, our emphasis is on constructing appropriate pieces of advice for the
proxy coach to offer to the ultimate learner.
   To facilitate the choice of appropriate advice in this implicit learning setting, we use natural
selection over an evolutionary process. The formalization of evolution that we adopt is that in
[20], where evolution is cast as a learning problem that seeks to approximate a hidden target
function. The evolutionary process in our case attempts to identify the next appropriate piece of
advice, and to integrate it into the next generation of organisms, with each organism encoding a
collection of rules aimed to approximate the entire target policy. Thus, our approach resembles
a Pittsburgh-style Learning Classifier System [21].


3. From Direct to Proxy Coaches
Following the Machine Coaching framework, a (direct) coach and a learner hold, respectively,
a fixed target policy and a revisable hypothesis policy, each represented in the form of a
partially-ordered set of “if-then” rules. Upon perceiving a new context, the learner returns the
predictions of the current hypothesis policy, and an explanation in the form of hypothesis
policy rules that support those predictions. The coach responds by offering advice in the form
of target policy rules that support why the learner’s predictions were incorrect, incomplete, or
improperly-justified according to the coach. The learner revises the hypothesis policy, and the
process repeats.
   While arguments within Machine Coaching are generally considered to be arbitrary trees
of rules, for the sequel, we restrict our attention to shallow propositional policies, whose
rules have a special atom (or its negation) as their head, so that rules are not chained during
reasoning — and, thus, each argument comprises a single rule. Relatedly, we restrict our
attention to complete contexts, which specify truth-values for all remaining atoms. Without
loss of generality, we hence let the special atom be output.
   Thus, if output were the ability to fly, and given a context {penguin, bird}, the learner
could predict output by offering the following explanation: “bird implies output”, and
the coach could react by offering the following piece of advice: “penguin implies -output”,
which would be integrated with higher priority in the revised hypothesis policy.
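   For illustration, the restricted reasoning semantics just described can be captured in a few lines of code. The following Python sketch (our own illustration, with names of our choosing; it is not the Prudens implementation) represents a policy as a list of (body, head) rules in increasing priority, and resolves the flying-penguin example accordingly.

    def predict(policy, context):
        """Prediction of a shallow prioritized policy: the head of the highest-priority rule
        whose body holds in the context, or None (abstain) if no rule applies."""
        decision = None
        for body, head in policy:                  # later rules carry higher priority
            if set(body) <= set(context):
                decision = head
        return decision

    hypothesis = [(["bird"], "output")]
    revised = hypothesis + [(["penguin"], "-output")]    # advice integrated with higher priority
    print(predict(hypothesis, {"penguin", "bird"}))      # -> "output"
    print(predict(revised, {"penguin", "bird"}))         # -> "-output"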
   Even with these restrictions, policies can still vary along other dimensions, which affect the
 Stack 1    R1 :: a implies output                           policy size = 6 rules
            R2 :: a, b implies -output                       rule width = 11 atoms / 6 rules
            R3 :: a, c implies -output                       flip count = 4 batch transitions
 Stack 2    R4 :: x implies output                           batch size = 6 rules / 5 batches
            R5 :: x, y implies -output                       stack depth = 5 batches / 2 stacks
            R6 :: x, y, z implies output
Figure 1: A shallow propositional policy over a set of atoms 𝐴 ⊇ {𝑎, 𝑏, 𝑐, 𝑥, 𝑦, 𝑧}. Rules increase in
priority from top to bottom. Colored regions indicate batches; circular arrows indicate flips.


conceptual and computational complexity of learning (through Machine Coaching, or otherwise)
and reasoning. Policies can vary in terms of their policy size, which counts the number of rules
in the policy, their rule width, which counts the average number of atoms in rule bodies, their
flip count, which counts the flips / transitions between batches of consecutive rules with the
same head polarity (positive or negative), the batch size, which counts the average number of
rules across batches, and the stack depth, which counts the average number of consecutive
batches such that rules in a batch have logically more specific conditions than the rules in the
preceding batch (cf. Figure 1).
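   As an illustration of these dimensions, the following Python sketch (our own, following the measurement conventions stated above; bodies are given as sets of atoms) reproduces the values annotated in Figure 1.

    def policy_metrics(policy):
        """Dimensions of a shallow policy given as (body, head) rules in increasing priority."""
        heads = [head for _, head in policy]
        bodies = [set(body) for body, _ in policy]
        # Batches: maximal runs of consecutive rules sharing the same head polarity.
        batches, start = [], 0
        for i in range(1, len(policy) + 1):
            if i == len(policy) or heads[i] != heads[start]:
                batches.append(range(start, i))
                start = i
        # A new stack starts whenever a batch is not strictly more specific than its predecessor.
        stacks = 1
        for prev, cur in zip(batches, batches[1:]):
            if not all(bodies[i] > bodies[j] for i in cur for j in prev):
                stacks += 1
        return {"policy size": len(policy),
                "rule width": sum(map(len, bodies)) / len(policy),
                "flip count": len(batches) - 1,
                "batch size": len(policy) / len(batches),
                "stack depth": len(batches) / stacks}

    # The policy of Figure 1; policy_metrics(figure1) yields 6 rules, 11/6 atoms per rule,
    # 4 flips, 6/5 rules per batch, and 5/2 batches per stack.
    figure1 = [({"a"}, "output"), ({"a", "b"}, "-output"), ({"a", "c"}, "-output"),
               ({"x"}, "output"), ({"x", "y"}, "-output"), ({"x", "y", "z"}, "output")]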
   Proxy coaches have only indirect knowledge of some target policy 𝑝, by being given access
to an exemplar set 𝑝̂︀ of contexts labeled according to 𝑝. In broad terms, then, a proxy coach
faces a context 𝑥 ∈ 𝒞 from some coaching set 𝒞, receives the prediction / explanation ℎ(𝑥) of
the learner’s current hypothesis policy ℎ on 𝑥, and then uses 𝑝̂ to generate a piece of advice
𝛼(𝑥, ℎ(𝑥); 𝑝̂) to be offered to the learner. An epoch concludes whenever a piece of advice is
offered, noting that the proxy coach need not offer advice for each 𝑥 ∈ 𝒞.
   The precise manner in which 𝛼(𝑥, ℎ(𝑥); 𝑝̂) is generated depends on the proxy coach type, but
it generally seeks to consider the anticipated effect that each candidate piece of advice would
have on the learner’s hypothesis policy, in terms of: (i) improving its coverage, by reducing the
number of contexts on which it abstains; and (ii) improving its accuracy, by reducing the ratio
of wrong over correct predictions it makes.

3.1. Intensional Proxy Coaches
The first class of proxy coaches that we consider encode intensionally their knowledge of the
target policy, by using the exemplar set 𝑝̂︀ once, before the coaching phase, to supervise the
training of a white-box model 𝑚. The ante-hoc explainability of 𝑚 makes it, then, natural to
seek to “extract” a piece of advice 𝛼(𝑥, ℎ(𝑥); 𝑝̂) from the explanation of 𝑚(𝑥), given a context
𝑥 ∈ 𝒞. The exemplar set, therefore, takes the role of a training set. Coaching proceeds, then, as in
Algorithm 1.
   Our chosen white-box models are those of decision trees and random forests, from which we
have identified three natural ways of extracting advice as part of Step 6 of Algorithm 1.
   Our first approach considers a white-box model 𝑚 based on a single decision tree. Given the
current context 𝑥 ∈ 𝒞, it identifies the active path of the tree, from which it computes, as usual,
Algorithm 1 Intensional Proxy Coaching
 1: input: exemplar set 𝑝̂; coaching set 𝒞.
 2: Train white-box model 𝑚, using 𝑝̂ as training instances.
 3: for the next context 𝑥 ∈ 𝒞 do
 4:    Ask for learner’s prediction / explanation ℎ(𝑥).
 5:    if predictions of 𝑚(𝑥) and ℎ(𝑥) differ then
 6:        Get advice from explanations of 𝑚(𝑥) and ℎ(𝑥).
 7:        Give advice to revise current hypothesis policy ℎ.
 8:    end if
 9: end for
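   A minimal Python rendering of Algorithm 1 follows, with scikit-learn’s CART trees standing in for the ID3 trees used in Section 4, and with learner (exposing predict and revise) and extract_advice as hypothetical interfaces for Steps 4, 6, and 7; none of these names come from the paper’s released code.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def intensional_coaching(exemplar_X, exemplar_y, coaching_contexts, learner, extract_advice):
        """Sketch of Algorithm 1; predictions are assumed to share one encoding
        (e.g., 1: output, 0: -output, None: abstain)."""
        model = DecisionTreeClassifier().fit(exemplar_X, exemplar_y)            # Step 2
        for x in coaching_contexts:                                             # Step 3
            prediction, explanation = learner.predict(x)                        # Step 4
            if prediction != model.predict(np.asarray(x).reshape(1, -1))[0]:    # Step 5
                advice = extract_advice(model, x, explanation)                  # Step 6
                learner.revise(advice)                                          # Step 7
        return learner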


   [Decision tree figure omitted; only its textual content is retained.]
   Context: {a, -b, -c}.   Prediction: output.   Explanation: a implies output.
   “Min Prefix” Advice: a, -b implies -output.
   “Full Path” Advice: a, -b, -c implies -output.
Figure 2: Examples of two ways of extracting advice from a path in a decision tree. Internal nodes indicate
the atom based on which a split is made, and leaf nodes indicate the prediction of the corresponding
path. The two numbers at the bottom of a node correspond to the negatively / positively labeled training
instances that reach that node.


the prediction of 𝑚(𝑥). The full path itself acts as the explanation of 𝑚(𝑥), and is returned
as a piece of advice. By construction, the advice so generated is “cautious”, in that a full path
in decision trees tends to have high accuracy on the exemplar set, as otherwise the decision
tree learning algorithm would have chosen to expand the path further. For the same reason,
however, the advice is also “specific”, in that it applies only on a few contexts because of the
path length. Overall, then, this type of a proxy coach chooses to provide advice based on the
expectation that the hypothesis policy will improve its coverage only marginally, but it will be
highly effective in improving and maintaining its accuracy.
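   Assuming a scikit-learn decision tree over 0/1-encoded atoms, the “full path” advice can be read off the active root-to-leaf path roughly as in the sketch below; the function name and the 0/1 encoding are our own assumptions, not part of the framework.

    import numpy as np

    def full_path_advice(clf, x, atoms):
        """Return the active root-to-leaf path of the fitted tree `clf` on context `x`
        as a (body, head) rule; atoms are 0/1 features, so every split sits at 0.5."""
        tree = clf.tree_
        x = np.asarray(x)
        path = clf.decision_path(x.reshape(1, -1)).indices     # node ids, root to leaf
        body = []
        for node in path:
            feat = tree.feature[node]
            if feat < 0:                                        # leaf node: no split here
                continue
            body.append(atoms[feat] if x[feat] > tree.threshold[node] else "-" + atoms[feat])
        predicted = clf.classes_[np.argmax(tree.value[path[-1]])]
        return body, ("output" if predicted == 1 else "-output")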
   Our second approach is a variant of the first, where instead of considering the full active
path of the decision tree as the (unique candidate for a) piece of advice, it considers all of
its prefixes as candidates. Among them, it chooses the minimal one that is simultaneously
more specific than the explanation of ℎ(𝑥), and such that the prediction of the decision tree
would have remained the same had the active path been pruned to be that particular prefix.
By construction, the advice so generated is more “loose”, in that a prefix of a path in decision
Algorithm 2 Extensional Proxy Coaching
 1: input: exemplar set 𝑝̂; coaching set 𝒞.
 2: Initialize the current hypothesis policy ℎ to be empty.
 3: for the next context 𝑥 ∈ 𝒞 do
 4:    Create mutations of types (M0), (M+), (M↓).
 5:    Compute fitness score 𝑓𝑗 of each mutation ℎ𝑗 .
 6:    Select “good” beneficial or neutral mutation ℎ𝑖 .
 7:    Set the current hypothesis policy to be ℎ𝑖 .
 8: end for


trees tends to have lower accuracy on the exemplar set, which is why the decision tree learning
algorithm had chosen to expand the prefix further. For the same reason, however, the advice is
also more “general”, in that it applies on more contexts because of the shorter prefix length.
Overall, then, this type of a proxy coach chooses to provide advice based on the expectation
that the hypothesis policy will improve its coverage measurably, but at the cost of being less
effective in improving and maintaining its accuracy.
   Our third approach considers a white-box model 𝑚 based on a random forest. Given the
current context 𝑥 ∈ 𝒞, it computes, as usual, the prediction of 𝑚(𝑥). It then proceeds to identify
the active path from each individual tree whose prediction matches that of 𝑚(𝑥), and considers
those full paths as candidates. There are many strategies for choosing one of the candidates, but
for concreteness, our particular approach chooses the one that is most “specific” (breaking ties
randomly), in that it is activated by as few contexts as possible in the exemplar set. Given that
paths in trees in random forests are typically short and hence “general”, the approach tends
to favor the generation of a “maximally cautious” advice among “typically loose” candidates.
Overall, then, this type of a proxy coach chooses to provide advice based on the expectation of
striking a balance between the improvement of the coverage and accuracy of the hypothesis
policy.
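   One possible realization of this forest-based choice (again a sketch under our own assumptions) scans the trees that agree with the forest’s overall prediction and keeps the active path covered by the fewest exemplar contexts; extract_path is assumed to behave like the tree sketch above, but to return literals as (feature index, required 0/1 value) pairs.

    import numpy as np

    def forest_advice(forest, x, exemplar_X, extract_path):
        """Most "specific" full path among the trees agreeing with the forest on context x."""
        x = np.asarray(x)
        overall = forest.predict(x.reshape(1, -1))[0]
        best, best_support = None, np.inf
        for tree in forest.estimators_:
            if tree.predict(x.reshape(1, -1))[0] != overall:
                continue                                 # keep only trees agreeing with the forest
            body, head = extract_path(tree, x)
            fires = np.all([exemplar_X[:, f] == v for f, v in body], axis=0)
            support = int(fires.sum())                   # exemplar contexts on which the rule fires
            if support < best_support:                   # ties broken by iteration order here,
                best, best_support = (body, head), support   # rather than randomly as in the text
        return best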

3.2. Extensional Proxy Coaches
The second class of proxy coaches that we consider retain extensionally their knowledge of
the target policy, without compiling it into another form. Rather, they use the exemplar set 𝑝̂︀
repeatedly, during the coaching phase, to decide whether a candidate piece of advice should
become 𝛼(𝑥, ℎ(𝑥); 𝑝̂), given a context 𝑥 ∈ 𝒞. The exemplar set, therefore, takes the role of a
validation set. Coaching proceeds, then, as in Algorithm 2.
   This extensional perspective suggests a trial-and-error view of coaching, where the coach
tests candidate pieces of advice to measure their effect. Since a systematic testing of all possible
pieces of advice is infeasible, the challenge for the coach is to select the set of candidates, and to
identify how to test the effect of each candidate on the hypothesis policy.
   Our approach adopts an evolutionary mechanism, where, roughly, the process of mutation
corresponds to the generation of the candidate pieces of advice, and the process of natural
selection corresponds to their testing towards selecting 𝛼(𝑥, ℎ(𝑥); 𝑝̂). We discuss below in more
detail the nuances that come from the fact that the same evolutionary mechanism needs to
simulate both the proxy coach and the learner.
   At the start of a generation, the population comprises the current hypothesis policy ℎ. Given
the current context 𝑥 ∈ 𝒞, certain candidate pieces of advice are considered, and each is
“provisionally” offered as advice to a different copy of ℎ, giving rise to its offspring. Specifically:
(M0) an offspring is created by offering no advice; (M+) an offspring is created by offering
the advice “𝑥 implies output”, if ℎ(𝑥) predicts -output or abstains; (M+) an offspring is
created by offering the advice “𝑥 implies -output”, if ℎ(𝑥) predicts output or abstains;
(M↓) an offspring is created, for each 𝑗, by offering the advice “body−𝑗 implies head”, where
“body implies head” is the latest advice that was integrated in ℎ, and body−𝑗 is body minus
its 𝑗-th literal.
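   The three mutation types can be generated, for instance, as in the following sketch, with policies as lists of (body, head) rules in increasing priority, contexts as sets of literals, and predict passed in as the reasoning routine; all names here are our own.

    def candidate_mutations(h, x, latest_advice, predict):
        """Candidates for one generation: (M0) no advice, (M+) a seed rule whose body is
        the full context, (M-down) the latest advice with one literal dropped."""
        candidates = [None]                                  # (M0)
        prediction = predict(h, x)
        if prediction != "output":                           # (M+) towards the positive head
            candidates.append((sorted(x), "output"))
        if prediction != "-output":                          # (M+) towards the negative head
            candidates.append((sorted(x), "-output"))
        if latest_advice is not None:                        # (M-down) generalize the latest advice
            body, head = latest_advice
            for j in range(len(body)):
                candidates.append((body[:j] + body[j + 1:], head))
        return candidates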
   These aforementioned pieces of advice are said to be offered “provisionally” in the sense that
the proxy coach may, at a subsequent generation, seek to refine a previously given piece of
advice by means of type (M↓) mutations. A piece of advice is “conclusively” given at the end of
a streak of (zero or more) type (M0) or (M↓) mutations that follow a type (M+) mutation. This is
the point at which an epoch concludes.
   At each generation, exactly one of the offspring is chosen to survive to initialize the next
generation. To support this selection process, each offspring ℎ𝑖 is evaluated in terms of its
improvement in accuracy and coverage relative to its parent against each exemplar 𝑒 ∈ 𝑝̂. The
change from ℎ(𝑒) to ℎ𝑖 (𝑒) is considered: positive, if a wrong prediction changes to a correct
one or an abstention, or an abstention changes to a correct prediction; negative, if a correct
prediction changes to a wrong one or an abstention, or an abstention changes to a wrong
prediction. By giving a +1 or −1 fitness point to offspring ℎ𝑖 for each respective positive or
negative change across the exemplar set 𝑝̂, we end up with an intuitive metric 𝑓𝑖 that aggregates
the improvement effect of the advice given to offspring ℎ𝑖 on both its coverage and its accuracy.
   The offspring are, then, grouped based on the relation of their relative fitness to a fixed
threshold parameter 𝑡. An offspring ℎ𝑖 is detrimental, neutral, beneficial if its relative
fitness 𝑓𝑖 belongs in (−∞, −𝑡), in [−𝑡, +𝑡], in (+𝑡, +∞), respectively. Among the beneficial
ones, if available, the offspring ℎ𝑖 is selected to survive with probability 𝑓𝑖ᵏ / Σ𝑗 𝑓𝑗ᵏ, where the
exponent 𝑘 is a fixed non-linearity parameter. Otherwise, a neutral offspring (whose existence
is guaranteed by a type (M0) mutation) is selected uniformly at random.
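   Putting the fitness metric and the selection rule together, the natural-selection step could be sketched as follows (predict is again an assumed reasoning routine, abstention is encoded as None, and the function names are ours).

    import random

    def relative_fitness(parent, offspring, exemplars, predict):
        """+1 for each exemplar on which the offspring improves on its parent,
        -1 for each one on which it degrades, 0 otherwise (cf. Table 1)."""
        score = 0
        for x, label in exemplars:
            before, after = predict(parent, x), predict(offspring, x)
            if before == after:
                continue
            if after == label or (after is None and before != label):
                score += 1        # wrong -> correct / abstain, or abstain -> correct
            else:
                score -= 1        # correct -> wrong / abstain, or abstain -> wrong
        return score

    def select_survivor(offspring, fitnesses, t=0, k=2):
        """Fitness-proportional selection among beneficial offspring (exponent k),
        falling back to a uniformly random neutral offspring."""
        beneficial = [(h, f) for h, f in zip(offspring, fitnesses) if f > t]
        if beneficial:
            weights = [f ** k for _, f in beneficial]
            return random.choices([h for h, _ in beneficial], weights=weights)[0]
        neutral = [h for h, f in zip(offspring, fitnesses) if -t <= f <= t]
        return random.choice(neutral)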
   The proposed approach searches greedily for advice that is as “cautious” and as “general”
as possible, while starting the greedy search from a fully “cautious” and “specific” seed advice
selected by a type (M+) mutation. Overall, then, this type of a proxy coach chooses to provide
advice based on the expectation of striking a balance between the improvement of the coverage
and of the accuracy of the hypothesis policy.


4. Empirical Investigation
We present below the key aspects and results of our empirical investigation. Additional details
are given in the appendices. All related materials may be found at https://github.com/VMarkos/
proxy-coaching-argml-2022.
4.1. Empirical Setting
We first fixed a set 𝐴 of 20 atoms (other than output) for use in contexts and rule bodies.
We then constructed 10 groups with 20 policies each, with each pair of groups corresponding
to high versus low values in one of the variability dimensions discussed in Section 3, giving
rise to a set 𝑃 of 132 distinct target policies. For each target policy 𝑝 ∈ 𝑃 , we constructed an
associated exemplar set 𝑝̂ by sampling 1000 contexts uniformly at random from 2ᴬ, labeling
each context according to 𝑝, filtering out any contexts on which 𝑝 abstained, and keeping 70%
of the labeled contexts. The other 30% was used to populate an evaluation set ℰ. An additional
500 (unlabeled) contexts were sampled to populate a coaching set 𝒞.
   We ran an experiment for each pairing of a target policy 𝑝 ∈ 𝑃 with a proxy coach type
discussed in Section 3: (E1) a decision tree with “full path” advice, (E2) a decision tree with “min
prefix” advice, (E3) a random forest, and (E4) an evolutionary mechanism. Decision trees in
experiments (E1), (E2), and (E3) were trained on the exemplar set 𝑝̂︀ using ID3 [22]. Random
forests in experiments (E3) comprised 20 trees, each trained on a 10% random fraction of 𝑝̂.
The evolutionary mechanism in experiment (E4) used a threshold parameter 𝑡 = 0, and a
non-linearity parameter 𝑘 = 2.
   The current hypothesis policy was examined at the end of each epoch in each experiment.
Performance values recorded how many of the predictions of the current hypothesis policy on
the evaluation set ℰ were correct against 𝑝, wrong against 𝑝, or abstentions, and the accuracy
of the definite predictions (i.e., the number of correct predictions over the number of correct
or wrong predictions). Relative Size values recorded the size of the current hypothesis policy
(relative to the size of 𝑝), the size of the epoch (relative to the size of 𝒞), and the size of each
given piece of advice (relative to the size of 𝐴).
   The above values for each proxy coach type were aggregated across 𝑃 , and were plotted
against the epochs. To account for different numbers of epochs across experiments, the set
of epochs of each experiment was uniformly distributed over the interval [0, 1], using spline
interpolation to fill in the values between the points that corresponded to the epochs in that
experiment. In particular, a value of 0 corresponds to the pre-coaching state of affairs, where
the current hypothesis policy is empty, and a value of 1 corresponds to the post-coaching state
of affairs, where the current hypothesis policy is the one after the integration of the final piece
of advice.
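   The normalization onto the common [0, 1] progress axis can be realized, for instance, with SciPy’s B-spline interpolation, as in the sketch below; the exact spline variant and grid size used for the plots are our own assumptions.

    import numpy as np
    from scipy.interpolate import make_interp_spline

    def normalize_run(values, grid_size=100):
        """Map the per-epoch values of one experiment onto a common [0, 1] progress axis,
        so that runs with different epoch counts can be aggregated point-wise."""
        values = np.asarray(values, dtype=float)
        progress = np.linspace(0.0, 1.0, num=len(values))   # epochs spread uniformly over [0, 1]
        grid = np.linspace(0.0, 1.0, num=grid_size)
        k = 3 if len(values) > 3 else 1                     # cubic spline, or linear for short runs
        return make_interp_spline(progress, values, k=k)(grid)

    # Aggregation across target policies then reduces to, e.g.,
    # np.mean([normalize_run(run) for run in runs], axis=0).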

4.2. Results and Analysis
Figure 3 presents plots for the four sets of experiments.

Machine Coaching Efficacy
A first general observation is that experiments agree qualitatively on the efficacy of Machine
Coaching. As more pieces of advice are given, the performance of the current hypothesis policy
improves in terms of increased correct predictions and decreased abstentions. The size of the
current hypothesis policy grows linearly with the given pieces of advice, due to the uniform
distribution of epochs on the 𝑋 axis. The sizes of the epochs and the given pieces of advice
   [Plot panels omitted: Performance (left axis) and Relative Size (right axis) against Coaching Progress (%), one panel per experiment.]
   Figure 3: Results for experiments (E1)–(E4), shown from left to right. Performance values show correct
   predictions ( ), wrong predictions ( ), abstentions ( ), and accuracy ( ) on the evaluation set,
   against the target policy. Relative Size values show the relative sizes of the current hypothesis policy
   ( ), the epoch ( ), and the given piece of advice ( ). Results are aggregated across all target policies,
   with solid lines representing the mean values, dashed lines representing the median values, and shaded
   areas representing the 𝑄1 to 𝑄3 quartile interval.

   [Plot panels omitted: Performance and Relative Size against Epochs, one panel per experiment.]
   Figure 4: Results for experiments (E1)–(E4), as in Figure 3, with the exception that the 𝑋 axis shows the
   epoch count instead of the coaching progress. The number of experiments for the various target policies
   that reach each epoch and participate in the analysis are also plotted ( ). The lines are shown faded
   out once the number of participating experiments drops below 50% of the total number of experiments
   in each set.


   remain steady or increase, confirming the intuition that advice is given less frequently and
   becomes more specific as coaching progresses.

   Experiments (E1) and (E2)
   The first set of experiments cleanly demonstrates the effect of offering “cautious” but “specific”
   advice. As expected, wrong predictions remain effectively zero throughout the coaching phase,
   while abstentions are very gradually replaced with correct predictions, maintaining an effectively
   perfect accuracy.
      Analogously, the second set of experiments demonstrates the effect of offering more “loose”
   and “general” advice. Correct predictions increase slightly faster compared to experiments (E1),
   while abstentions decrease at an even faster pace, causing the introduction of wrong predictions,
   and a sub-perfect accuracy. Interestingly, most wrong predictions are corrected over time,
   leading to sufficiently high accuracy.
      In both sets of experiments, advice becomes more specific over time and is given less fre-
quently, with the epoch size increasing significantly towards the end of the coaching phase.
These considerations, along with the continual improvement of performance, suggest that the
given advice is indeed beneficial. The quality of the given advice is further supported by the
size of the hypothesis policy remaining distinctly smaller than the size of the target policy,
indicating that the given advice is able to compress parts of the target policy. The more “loose”
and “general” advice in experiments (E2) seems to correspond to a less frequent need for advice,
and a more concise hypothesis policy, at the expense of occasionally making wrong predictions,
but also effectively never abstaining.

Experiments (E3) and (E4)
The third and fourth sets of experiments demonstrate the effect of offering advice chosen
greedily through bounded local searching, and without consulting the explanations of the
current hypothesis policy on its predictions. Accordingly, performance is measurably worse
than in the first two sets of experiments, with fewer correct predictions, more wrong predictions,
a higher number of abstentions, and lower accuracy.
   Although the last two sets of experiments demonstrate a more aggressive take on decreasing
abstentions and increasing correct predictions earlier on, this seems to come at the expense of
introducing a considerable number of wrong predictions that persist across epochs, indicating a
lower quality of the given advice. This indication is further corroborated by the persistent size
of the advice, which fails to become more “cautious” and “specific” over time, and likely leads
to the introduction of nearly as many new wrong predictions in the place of any of the
existing wrong predictions it corrects.
   Relatedly, the epoch size does not increase as rapidly as in the first two sets of experiments,
leading to a larger hypothesis policy size, to the extent, in fact, that the hypothesis policy ends
up being larger than the target policy, showing an inability for compression, and induction, in
the given advice.
   Comparatively, the size of the given piece of advice in experiments (E4) is multiple times
larger than that in experiments (E3). This can be directly attributed to the initialization of the
search used in each case, with the former constructing advice by starting from a fully “cautious”
and “specific” seed advice, and the latter selecting a piece of advice from typically “loose” and
“general” candidate pieces of advice.

Machine Coaching Efficiency
The primary metric on efficiency for Machine Coaching is the number of epochs required to
get to a certain degree of performance. Figure 4 shows the same information as Figure 3, but
with aggregation happening over each particular epoch. Since different target policies lead to
different numbers of resulting epochs, the aggregation happens over a diminishing set of target
policies which eventually becomes unrepresentative of the initial set, and should not form the
basis for conclusions.
   Looking, therefore, at the parts of the plots that include at least 50% of the entire set of target
policies being considered, we observe qualitatively that a few (around 10–20) pieces of advice
seem to suffice to get relatively high performance.
  Although reporting absolute computation times is not informative, we do remark that ex-
periments (E4) are the most time-intensive ones, as a result of the repeated testing against the
entire exemplar set, for each mutation in each generation.


5. Conclusions
We have put forward proxy (algorithmic) coaches as a means to provide empirical evidence that
complements prior theoretical work on the efficacy and efficiency of Machine Coaching. In
ongoing work, we continue to investigate coaching-based learning further with proxy coaches
in more expressive settings (e.g., relational policies, richer reasoning semantics with rule
chaining, and, relatedly, without assuming complete contexts), and contrast its performance
and scalability against autodidactic learning algorithms that map exemplar data directly into a
form of prioritized rules [23, 24, 25, 26].
   Ultimately, our plan is to undertake empirical studies with humans in the role of direct
coaches, by identifying meaningful solutions to the challenges discussed in Section 1, perhaps
by pairing human coaches with proxy (algorithmic) coaches. Whether human coaches will offer
advice in a manner analogous to one of the proxy coaches considered herein, whether they will
remain consistent across multiple interactions, and whether they will find the coaching protocol
to be cognitively light, are all questions that we wish to investigate. In turn, we expect that
answers to these questions will offer guidelines towards improving the cognitive compatibility
of the language, semantics, and protocol of Machine Coaching.

Acknowledgements This work was supported by funding from the European Regional
Development Fund and the Government of the Republic of Cyprus through the Research and
Innovation Foundation under grant agreement no. INTEGRATED/0918/0032, from the EU’s
Horizon 2020 Research and Innovation Programme under grant agreements no. 739578 and no.
823783, and from the Government of the Republic of Cyprus through the Deputy Ministry of
Research, Innovation, and Digital Policy.


References
 [1] L. Michael, Autodidactic Learning and Reasoning, Ph.D. thesis, School of Engineering and
     Applied Sciences, Harvard University, USA, 2008. URL: https://dl.acm.org/doi/abs/10.5555/
     1467943.
 [2] L. Michael, Partial Observability and Learnability, Artificial Intelligence 174 (2010) 639–669.
     doi:10.1016/j.artint.2010.03.004.
 [3] L. Michael, The Advice Taker 2.0, in: Proceedings of the 13th International Symposium
     on Commonsense Reasoning, volume 2052, London, U.K, 2017. URL: http://ceur-ws.org/
     Vol-2052/#paper13.
 [4] L. Michael, Machine Coaching, in: IJCAI 2019 Workshop on Explainable Artificial
     Intelligence, Macau, China, 2019, pp. 80–86. URL: https://cognition.ouc.ac.cy/loizos/papers/
     Michael_2019_MachineCoaching.pdf.
 [5] H. Mercier, D. Sperber, Why Do Humans Reason? Arguments for an Argumentative The-
     ory., Behavioral and Brain Sciences 34 (2011) 57–74. doi:10.1017/S0140525X10000968.
 [6] C. Ioannou, L. Michael, Knowledge-Based Translation of Natural Language into Symbolic
     Form, in: Proceedings of the 7th Linguistic and Cognitive Approaches To Dialog Agents
     Workshop - LaCATODA 2021, Montreal, Canada, 2021, pp. 24–32. URL: http://ceur-ws.org/
     Vol-2935/#paper3.
 [7] J. McCarthy, Programs with Common Sense, in: Proceedings of the Teddington Conference
     on the Mechanization of Thought Processes, London, U.K, 1959, pp. 75–91. URL: http:
     //jmc.stanford.edu/articles/mcc59/mcc59.pdf.
 [8] L. G. Valiant, A Theory of the Learnable, Communications of the ACM 27 (1984) 1134–1142.
     doi:10.1145/1968.1972.
 [9] S. Teso, K. Kersting, Explanatory Interactive Machine Learning, in: Proceedings of the 2019
     AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, Association for Computing
     Machinery, New York, NY, USA, 2019, pp. 239–245. doi:10.1145/3306618.3314293.
[10] B. Settles, Active Learning Literature Survey, Technical Report, University of Wisconsin-
     Madison Department of Computer Sciences, 2009. URL: https://minds.wisconsin.edu/
     handle/1793/60660.
[11] M. T. Ribeiro, S. Singh, C. Guestrin, "Why Should I Trust You?": Explaining the Predictions
     of Any Classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on
     Knowledge Discovery and Data Mining, KDD ’16, Association for Computing Machinery,
     New York, NY, USA, 2016, pp. 1135–1144. doi:10.1145/2939672.2939778.
[12] J. McCarthy, Elaboration Tolerance, in: Common Sense 98, London, U.K, 1998. URL:
     http://www-formal.stanford.edu/jmc/elaboration.html.
[13] H. Raghavan, J. Allan, An Interactive Algorithm for Asking and Incorporating Feature
     Feedback into Support Vector Machines, in: Proceedings of the 30th Annual International
     ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR
     ’07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 79–86. doi:10.
     1145/1277741.1277758.
[14] B. Settles, Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries
     on Features and Instances, in: Proceedings of the 2011 Conference on Empirical Methods
     in Natural Language Processing, Association for Computational Linguistics, Edinburgh,
     Scotland, UK., 2011, pp. 1467–1478. URL: https://aclanthology.org/D11-1136.
[15] O. Zaidan, J. Eisner, C. Piatko, Using “Annotator Rationales” to Improve Machine Learning
     for Text Categorization, in: Human Language Technologies 2007: The Conference of the
     North American Chapter of the Association for Computational Linguistics; Proceedings of
     the Main Conference, Association for Computational Linguistics, Rochester, New York,
     2007, pp. 260–267. URL: https://aclanthology.org/N07-1033.
[16] P. Shivaswamy, T. Joachims, Coactive Learning, Journal of Artificial Intelligence Research
     53 (2015) 1–40. doi:10.1613/jair.4539.
[17] N. Prentzas, A. Nicolaides, E. Kyriacou, A. Kakas, C. Pattichis, Integrating Machine
     Learning with Symbolic Reasoning to Build an Explainable AI Model for Stroke Prediction,
     in: 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE),
     Athens, Greece, 2019, pp. 817–821. doi:10.1109/BIBE.2019.00152.
[18] R. Khardon, D. Roth, Learning to Reason, Journal of the ACM 44 (1997) 697–725. doi:10.
     1145/265910.265918.
[19] B. Juba, Implicit Learning of Common Sense for Reasoning, in: Proceedings of the 23rd
     International Joint Conference on Artificial Intelligence, AAAI Press, Beijing, China, 2013,
     pp. 939–946. URL: https://www.ijcai.org/Proceedings/13/Papers/144.pdf.
[20] L. G. Valiant, Evolvability, Journal of the ACM 56 (2009) 1–21. doi:10.1145/1462153.
     1462156.
[21] R. J. Urbanowicz, J. H. Moore, Learning Classifier Systems: A Complete Introduction,
     Review, and Roadmap, Journal of Artificial Evolution and Applications 2009 (2009) 1–25.
     doi:10.1155/2009/736398.
[22] J. R. Quinlan, Induction of Decision Trees, Machine Learning 1 (1986) 81–106. doi:10.
     1023/A:1022643204877.
[23] R. L. Rivest, Learning Decision Lists, Machine Learning 2 (1987) 229–246. doi:10.1023/A:
     1022607331053.
[24] Y. Dimopoulos, A. Kakas, Learning Non-Monotonic Logic Programs: Learning Exceptions,
     in: N. Lavrac, S. Wrobel (Eds.), Machine Learning: ECML-95, volume 912 of Lecture
     Notes in Computer Science, Springer, Berlin, Heidelberg, 1995, pp. 122–137. doi:10.1007/
     3-540-59286-5_53.
[25] L. Michael, Causal Learnability, in: Proceedings of the Twenty-Second International Joint
     Conference on Artificial Intelligence, AAAI Press, Barcelona, Spain, 2011, pp. 1014–1020.
     doi:10.5591/978-1-57735-516-8/IJCAI11-174.
[26] L. Michael, Cognitive Reasoning and Learning Mechanisms, in: Proceedings of the 4th
     International Workshop on Artificial Intelligence and Cognition, volume 1895 of CEUR
     Workshop Proceedings, CEUR, New York City, NY, 2016, pp. 2–23. URL: http://ceur-ws.org/
     Vol-1895/#AIC16_paper1.
A. Empirical Setting (Details)
We synthetically generated a target policy by starting from a set of atoms 𝐴, an average batch
size 𝑏, a flip count ℓ, and a number of stacks 𝑠, and then partitioning ℓ + 1 into 𝑠 positive
integers 𝑑1 , . . . , 𝑑𝑠 ∈ N, and producing for each one of them a stack of depth 𝑑𝑖 , 𝑖 = 1, . . . , 𝑠. In
order to generate a stack, given 𝐴, 𝑑𝑖 , and the special atom output, we first generated a batch
containing a single rule, referred to as the stack’s root, with body from 𝐴 and head output,
and then we iteratively generated batches of average size 𝑏 and interchangeably conflicting
heads. Although this process does not explicitly determine the constructed target policy’s size
or the average body size of its rules, we manipulated these attributes indirectly, by varying
across policies the average batch size, flip count, number of stacks, and each stack’s root body
size 𝑟. Namely, for the purposes of our experiments, 𝑟 and 𝑏 both ranged from 1 to 31 with a
step of 2, while ℓ ranged from 1 to 31 with a step of 3, and 𝑠 ranged from 1 to ℓ with a step of 2.
For each assignment of values to 𝑟, 𝑏, ℓ, 𝑠, we generated 10 different target policies, to account
(to some extent) for the various possible partitions of ℓ + 1 to 𝑠 positive integers. Lastly, in all
above cases, 𝐴 contained 20 atoms. All related metrics that were manipulated during the data
generation stage are illustrated in an example policy in Figure 1.
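   The partition of ℓ + 1 into 𝑠 positive integers, which fixes the stack depths, can be drawn as in the following sketch; the sampling scheme is our own choice, since the construction above does not commit to one.

    import random

    def random_partition(total, parts):
        """Split `total` into `parts` positive integers by drawing `parts` - 1 distinct
        cut points in 1..total-1 (requires total >= parts)."""
        if parts == 1:
            return [total]
        cuts = sorted(random.sample(range(1, total), parts - 1))
        return [right - left for left, right in zip([0] + cuts, cuts + [total])]

    # e.g., stack depths for a flip count of 9 and s = 3 stacks:
    # random_partition(9 + 1, 3) might return [4, 2, 4].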
   We measured the predictive ability of each hypothesis policy against each target policy in 𝑃 ,
and, in the case of intensional proxy coaches, against the corresponding learned model. We refer
to these metrics as learning performance and learning conformance, respectively, with each of the
two recording correct predictions, wrong predictions, abstentions, and accuracy, as discussed in
Section 4.1. The fitness metric used in the evolutionary mechanism of the extensional proxy
coach was as described in Section 3.2, and is summarized in Table 1.

                                                 Offspring
                                     Correct     Abstain     Wrong
                         Correct        0          −1          −1
              Parent     Abstain       +1           0          −1
                         Wrong         +1          +1           0

Table 1
Relative fitness metric.



B. Performance Results (Details)
Table 2 contains all results regarding the average final performance using all four proxy coaches,
while in Tables 3 and 4 we present performance on the evaluation set aggregated within each of
the several groups of target policies we have produced. For evaluation purposes, we have also
labeled the coaching set and kept its labeled part, so in what follows we report results on all
three datasets (exemplar, evaluation and coaching).
   Most target policy attributes do not seem to significantly affect performance when it comes
to single-tree approaches with the exception of policies containing few flips or large rule
batches. Regarding flips, this behavior is somewhat surprising since policies containing few
                            Performance                                           Conformance
 Set         Exemplar        Coaching         Evaluation       Exemplar            Coaching             Evaluation
         𝜇   Q1 Q2 Q3   𝜇     Q1 Q2 Q3    𝜇    Q1 Q2 Q3    𝜇   Q1 Q2 Q3       𝜇     Q1 Q2 Q3        𝜇    Q1 Q2 Q3
    c .96 .97 .99 .99 .93 .94 .96 .98 .93 .93 .96 .97 .96 .97 .99 .99 .97 .98 .99 1.0 .96 .97 .99 .99
 E1 w .00 .00 .00 .05 .00 .00 .02 .00 .00 .00 .02 .05 .00 .00 .00 .00 .00 .00 .00 .00 .00 .00 .00 .00
    a .04 .01 .01 .03 .03 .00 .00 .02 .04 .01 .01 .03 .04 .01 .01 .03 .03 .00 .01 .02 .04 .01 .02 .03
    c .96 .98 .99 .99 .93 .93 .96 .98 .93 .93 .96 .98 .96 .98 .99 .99 .96 .98 .99 1.0 .96 .97 .99 .99
 E2 w .03 .01 .01 .02 .06 .02 .04 .07 .06 .02 .04 .07 .03 .01 .01 .02 .03 .00 .01 .02 .03 .01 .01 .03
    a .01 .00 .00 .00 .01 .00 .00 .00 .01 .00 .00 .00 .01 .00 .00 .00 .01 .00 .00 .00 .01 .00 .00 .00
    c .85 .80 .87 .92 .84 .77 .87 .91 .84 .78 .86 .81 .87 .83 .89 .94 .87 .82 .89 .94 .86 .81 .88 .93
 E3 w .12 .06 .10 .18 .13 .06 .12 .19 .13 .06 .12 .20 .10 .05 .09 .15 .10 .05 .09 .15 .10 .05 .09 .16
    a .03 .00 .00 .00 .03 .00 .00 .00 .03 .00 .00 .00 .03 .00 .00 .00 .03 .00 .00 .00 .03 .00 .00 .00
    c .94 .92 .94 .97 .78 .73 .78 .82 .80 .76 .81 .86      -    -   -     -   -      -   -      -   -     -   -      -
 E4 w .06 .03 .06 .08 .22 .17 .21 .26 .19 .14 .19 .23      -    -   -     -   -      -   -      -   -     -   -      -
    a .00 .00 .00 .00 .00 .00 .00 .00 .01 .00 .00 .00      -    -   -     -   -      -   -      -   -     -   -      -


Table 2
Final performance on all used sets, w.r.t. both the target policy (Performance) and the Proxy
Coach (Conformance) (c: “correct”, w: “wrong”, a: “abstain”, 𝜇: mean, Q1: 1st quartile, Q2: median, Q3:
3rd quartile).


flips are structurally simpler, which was expected to benefit coaching and thus requires further
investigation. On the contrary, poorer performance on policies containing large rule batches
was expected since larger batches impose, in general, a richer and more complex policy structure
overall. As far as forests are concerned, with the exception of relatively large policies, where the
corresponding hypotheses seem to be sufficiently accurate, correct predictions seem to remain
unaffected by variations in the target policy’s attributes. This unexpected efficiency on larger target
policies is also subject to further investigation.
   In Figure 5, we present two sets of single policies, each chosen from one of the groups defined
in Section 4.1. Namely, on the left column (Figure 5a) each policy is the group’s median with
respect to the policy attribute varied within that group, while on the right column (Figure 5b),
each policy is the group’s median with respect to the total number of epochs required during
coaching. In both columns, the four stripes correspond to experiments E1-E4, respectively,
while, within each stripe, the first row corresponds to high-value groups and the second row to
low-value ones.


C. Conformance Results (Details)
So far, we have discussed results regarding learning efficiency against the target policy (perfor-
mance). We have also computed, whenever applicable, learning efficiency with respect to proxy
coaches (conformance). Namely, Conformance values recorded how many of the predictions
of the current hypothesis policy on the evaluation set ℰ were correct against the proxy coach,
wrong against the proxy coach, or abstentions, and the accuracy of the definite predictions
(i.e., the number of correct predictions over the number of correct or wrong predictions). As in
Section 4.1, Relative Size values recorded the size of the current hypothesis policy (relative to
                                                               Performance
 Set                     Exemplar                                Coaching                               Evaluation
           Size    Flips Stacks Batches Width Size         Flips Stacks Batches Width Size         Flips Stacks Batches Width
       c .99 .98 .98 .91 .98 .99 .88 .98 .98 .99 .98 .94 .96 .88 .95 .93 .89 .94 .96 .96 .97 .94 .95 .89 .95 .93 .88 .94 .95 .95
  E1   w .00 .00 .00 .00 .00 .00 .00 .00 .00 .00 .01 .04 .04 .02 .05 .06 .00 .05 .03 .04 .01 .04 .03 .02 .04 .06 .00 .04 .03 .04
       a .01 .02 .02 .09 .02 .01 .12 .02 .02 .01 .01 .02 .00 .10 .00 .01 .11 .01 .01 .00 .01 .02 .02 .09 .02 .01 .12 .02 .02 .01
       c .98 .98 .99 .91 .99 .99 .88 .98 .98 .99 .97 .95 .96 .88 .95 .94 .88 .94 .96 .96 .97 .95 .96 .89 .96 .93 .88 .95 .95 .96
  E2   w .02 .02 .01 .09 .01 .01 .07 .02 .02 .01 .03 .05 .04 .12 .05 .06 .07 .06 .04 .04 .03 .05 .04 .11 .04 .07 .07 .05 .05 .04
       a .00 .00 .00 .00 .00 .00 .05 .00 .00 .00 .00 .00 .00 .00 .00 .00 .05 .00 .00 .00 .00 .00 .00 .00 .00 .00 .05 .00 .00 .00
       c .91 .84 .87 .81 .86 .83 .86 .85 .86 .86 .90 .83 .87 .80 .85 .81 .86 .83 .86 .85 .91 .83 .86 .82 .84 .81 .85 .83 .85 .84
  E3   w .08 .14 .13 .06 .12 .17 .06 .14 .14 .14 .09 .16 .13 .06 .13 .19 .06 .16 .14 .15 .09 .15 .14 .06 .14 .19 .07 .16 .15 .15
       a .01 .02 .00 .13 .02 .00 .03 .01 .00 .00 .01 .02 .00 .14 .02 .00 .03 .01 .00 .00 .01 .02 .00 .12 .02 .00 .03 .01 .00 .01
       c .93 .86 .87 .91 .87 .86 .92 .86 .87 .85 .80 .73 .74 .79 .75 .70 .81 .71 .75 .73 .84 .76 .78 .82 .78 .73 .83 .74 .78 .75
  E4   w .07 .12 .12 .08 .12 .14 .07 .13 .11 .11 .20 .26 .24 .20 .25 .29 .18 .28 .23 .23 .15 .22 .20 .16 .21 .26 .16 .24 .20 .20
       a .01 .02 .01 .01 .01 .01 .00 .01 .02 .04 .00 .01 .01 .01 .01 .01 .01 .01 .02 .04 .00 .02 .02 .02 .01 .01 .01 .02 .02 .04


Table 3
Average performance by group of target policies for each of the three datasets. (left: high-value group,
right: low-value group, bold: lowest performance w.r.t. dataset and protocol).

                                                     Conformance
       |                 Exemplar                 |                 Coaching                 |                Evaluation
  Set  |  Size  Flips  Stacks  Batches  Width     |  Size  Flips  Stacks  Batches  Width     |  Size  Flips  Stacks  Batches  Width
 E1  c | .99 .98 .98 .90 .98 .99 .88 .98 .98 .98  | .99 .98 1.0 .91 1.0 .99 .89 .99 .99 1.0  | .98 .98 .98 .90 .98 .99 .88 .98 .98 .99
     w | .00 .00 .00 .00 .00 .00 .00 .00 .00 .00  | .00 .00 .00 .00 .00 .00 .00 .00 .00 .00  | .00 .00 .00 .00 .00 .00 .00 .00 .00 .00
     a | .01 .02 .02 .10 .02 .01 .12 .02 .02 .02  | .01 .02 .00 .09 .00 .01 .11 .01 .01 .00  | .01 .02 .02 .10 .02 .01 .12 .02 .02 .01
 E2  c | .97 .98 .99 .90 .99 .99 .88 .98 .98 .99  | .98 .98 1.0 .90 1.0 .99 .89 .99 .98 1.0  | .97 .98 .99 .91 .99 .99 .89 .98 .98 .99
     w | .03 .02 .01 .10 .01 .01 .07 .02 .02 .01  | .02 .02 .00 .10 .00 .01 .06 .01 .02 .00  | .03 .02 .01 .09 .01 .01 .06 .02 .02 .01
     a | .00 .00 .00 .00 .00 .00 .05 .00 .00 .00  | .00 .00 .00 .00 .00 .00 .05 .00 .00 .00  | .00 .00 .00 .00 .00 .00 .05 .00 .00 .00
 E3  c | .91 .85 .89 .82 .89 .86 .86 .87 .88 .87  | .91 .86 .90 .83 .89 .86 .86 .87 .89 .88  | .91 .85 .89 .82 .88 .85 .85 .87 .88 .87
     w | .07 .13 .11 .04 .09 .14 .06 .12 .12 .12  | .08 .12 .10 .04 .09 .14 .06 .12 .11 .12  | .08 .13 .11 .05 .10 .15 .06 .12 .12 .12
     a | .01 .02 .00 .13 .02 .00 .03 .01 .00 .01  | .01 .02 .00 .13 .02 .00 .03 .01 .00 .01  | .01 .02 .00 .13 .02 .00 .03 .01 .00 .01


Table 4
Average conformance by group of target policies for each of the three datasets (left: high-value group; right: low-value group; bold: lowest conformance w.r.t. dataset and protocol).


As in Section 4.1, Relative Size values recorded the size of the current hypothesis policy (relative to the size of 𝑝), the size of the epoch (relative to the size of 𝒞), and the size of each piece of advice given (relative to the size of 𝐴).
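   To make the above metrics concrete, the sketch below shows one way the Conformance and Relative Size entries of a single epoch could be computed from the hypothesis policy's and the proxy coach's predictions on the evaluation set. The function name, the use of None to mark an abstention, and the size arguments are illustrative assumptions, not the actual implementation; 𝑝, 𝒞, and 𝐴 refer to the quantities used in the main text.

```python
from typing import Optional, Sequence


def epoch_metrics(hyp_preds: Sequence[Optional[str]],
                  coach_preds: Sequence[str],
                  hyp_size: float, p_size: float,
                  epoch_size: float, C_size: float,
                  advice_size: float, A_size: float) -> dict:
    """Conformance ratios and relative sizes for a single epoch.

    `None` in hyp_preds marks an abstention (an illustrative convention);
    p, C, and A denote the quantities referenced in the main text.
    """
    total = len(hyp_preds)
    correct = sum(h is not None and h == c for h, c in zip(hyp_preds, coach_preds))
    wrong = sum(h is not None and h != c for h, c in zip(hyp_preds, coach_preds))
    abstain = total - correct - wrong
    definite = correct + wrong
    return {
        # Conformance: fractions over all evaluation examples, plus the
        # accuracy of the definite (non-abstained) predictions only.
        "correct": correct / total,
        "wrong": wrong / total,
        "abstain": abstain / total,
        "accuracy": correct / definite if definite else float("nan"),
        # Relative Size: hypothesis vs. p, epoch vs. C, advice vs. A.
        "hyp_rel_size": hyp_size / p_size,
        "epoch_rel_size": epoch_size / C_size,
        "advice_rel_size": advice_size / A_size,
    }
```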
   All results regarding the above metrics are presented in Figure 6 (not applicable for E4). Overall, the trends observed for the three different advice protocols are similar to those presented in Figure 3, although some minor differences can be observed. To begin with, under the full path protocol, wrong predictions with respect to the proxy coach are zero. This is because the advice-giving protocol is highly conservative: as discussed in Section 3.1, the returned advice is quite narrow in terms of coverage but highly accurate.
   Another interesting observation is that, regardless of whether efficiency is measured against the target policy or the proxy coach, coaching under the forest protocol is consistently less efficient than under the other two protocols. This could again be interpreted either as further evidence of the inefficiency of the underlying advice mining mechanism when applied to forests, or simply as a side effect of a forest's relative opacity compared to single trees.
[Figure 5 image omitted: for each dataset and group, panels plot Performance (left axis) and Relative Size (right axis) against Epochs.]
   (a) Median policies per group, w.r.t. each policy attribute. (b) Median policies per group, w.r.t. the total number of epochs.
Figure 5: Performance of the median policy of each group w.r.t. each group's manipulated policy attribute (left column) and the total number of epochs within each group (right column). Within each column, the four stripes correspond to E1, E2, E3 and E4, respectively. Within each stripe, groups vary from left to right as follows: policy size, flip count, stack count, batches, and width. Also, the top row of each stripe corresponds to high-value groups, while the bottom to low-value ones. Color coding is the same as in Figure 3.
[Figure 6 image omitted: three panels plot Conformance (left axis) and Relative Size (right axis) against Coaching Progress (%).]
   Figure 6: Results for experiments (E1)–(E3), shown from left to right. Conformance values show correct predictions, wrong predictions, abstentions, and accuracy on the evaluation set, against the learned model (i.e., the proxy coach). Relative Size values show the relative sizes of the current hypothesis policy, the epoch, and the given piece of advice. Results are aggregated across all target policies, with solid lines representing the mean values, dashed lines representing the median values, and shaded areas representing the 𝑄1 to 𝑄3 quartile interval.
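   Since different target policies require different numbers of epochs to be coached, per-epoch curves have to be aligned before they can be aggregated as in Figure 6. The following is a minimal sketch of one such alignment, assuming each run's epochs are simply rescaled onto a common coaching-progress axis in [0, 1] before the point-wise mean, median, and 𝑄1–𝑄3 quartiles are taken; the rescaling scheme is our assumption, not necessarily the one used for the figure.

```python
import numpy as np


def aggregate_curves(runs, grid_points: int = 100):
    """Aggregate per-epoch metric curves from runs of unequal length.

    Each run is a list of metric values, one per epoch.  Epochs are mapped
    onto a common coaching-progress axis in [0, 1] (an assumed rescaling),
    and the mean, median, and Q1/Q3 quartiles are computed point-wise.
    """
    grid = np.linspace(0.0, 1.0, grid_points)
    resampled = []
    for values in runs:
        progress = np.linspace(0.0, 1.0, len(values))
        resampled.append(np.interp(grid, progress, values))
    stack = np.vstack(resampled)
    return {
        "progress": grid,
        "mean": stack.mean(axis=0),
        "median": np.median(stack, axis=0),
        "q1": np.percentile(stack, 25, axis=0),
        "q3": np.percentile(stack, 75, axis=0),
    }


# Example: three conformance curves of unequal length, aligned and aggregated.
curves = aggregate_curves([[0.2, 0.5, 0.8, 0.9],
                           [0.3, 0.7, 0.95],
                           [0.1, 0.4, 0.6, 0.85, 0.9]])
```

   Plotting the returned mean as a solid line, the median as a dashed line, and shading the band between q1 and q3 then reproduces the style of presentation described in the caption of Figure 6.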