Realizing a Denial of Expectation in Pipelined Neural Data-To-Text Generation

Maurice Langner, Ralf Klabunde
Department of Linguistics, Ruhr-Universität Bochum, Germany
Maurice.Langner@rub.de (M. Langner); Ralf.Klabunde@rub.de (R. Klabunde)

6th Workshop on Advances In Argumentation In Artificial Intelligence (AI³ 2022)

Abstract
This paper aims at the generation of denials of expectation in the domain of vehicle reviews. A denial of expectation expresses an apparent contradiction between some probabilistically motivated rule and a current circumstance expressed by a contrastive sentence. For generating such an argumentative sentence, we present a new approach to content selection in a neural data-to-text generation framework. In addition to selecting relevant information from tabular data that should appear in the text, further methods are required for determining evaluations that are rooted in this data but express individual appraisals of the respective vehicle. We show how a content selection module is able to decide when to express a denial of expectation. We use multi-label and binary classification for content selection on automatically extracted training data, and Random Forest regression with varying knowledge limitation for predicting expectations about feature values. These predictions are compared to manually annotated corpus instances of contrast relations in order to show that the concept of denial of expectation is a reasonable approach to determining contrasts and evaluative content at the early stage of content selection.

Keywords
Natural Language Generation (NLG), denial of expectation, evaluative adverbs, content planning, document planning

1. Introduction
When arguing in favor of (or against) a new device in order to convince a potential buyer to purchase (or disregard) that device, it is often useful to compare features of that device with feature-related expected values, and to indicate possible consequences of discrepancies between real and expected values. One of the linguistic means for realizing this is the so-called denial of expectation, an apparent contradiction between some rule, be it grounded in domain knowledge, personal experience or norms of social behaviour, and a current circumstance expressed in the corresponding sentence. In sentences like:

1. The sports car is not that big in terms of external dimensions, but it does weigh quite a bit.
2. Although the sports car is not that big in terms of external dimensions, it does weigh quite a bit. (corpus example, engl. translation)
3. (Un)fortunately, the sports car does weigh quite a bit.

we have different linguistic realizations of the denial of expectation. Example (1) expresses a contradiction that is based on technical specifications and their consequences for a car's weight, and the weight of the corresponding sports car. Given the external dimensions, the weight is expected to be lower than the real value. However, the conjunction but is ambiguous: it can also be used to express a contrast that is grounded in something other than a rule violation (as in John is short but Lea is tall). Unlike but, the concessive conjunction although in example (2) can only be used to express the denial of expectation.
Adverbs like (un)fortunately are typically not considered items expressing a denial of expectation, but they may in fact be grounded in the same semantic mechanism the conjunctions are based on. For example, unfortunately expresses that some part of the proposition in the scope of this adverb is typically evaluated neutrally, and that in this special case the evaluation has been shifted to the negative end of the evaluation scale. Example (3) can be interpreted in the following way: the writer's use of this sentence is based on her estimation of the weight of sports cars, which is grounded in their typical dimensions and other technical factors, but the sports car that is relevant here violates this expectation. It should be noted that other uses of adverbs of this type are not grounded in a denial of expectation, but in other incompatibilities. For example, unfortunately this car is red might just express the difference between the speaker's colour preference and the actual colour of the car. Therefore, in this paper we consider only those adverb uses that refer to a denial of expectation.

A denial of expectation, as a special case of linguistically expressed contrast, is inherently argumentatively motivated [1], since the opposition the addressee infers between both conjuncts of such a sentence is quite decisive for the evaluation of the object or state of affairs at hand. Using Toulmin's argumentation scheme [2], examples (1) and (2) are claims based on the expected and real values for the car as grounds, and the underlying expectation functions as the warrant.

In this paper, we analyze denials of expectation from a Natural Language Generation (NLG) perspective. Hence, we do not determine plausible oppositions for a given denial of expectation, but motivate decisions for realizing such a sentence. As application domain, we use driving reports and an associated database with technical specifications. During content selection – the first stage in a pipelined NLG system – possible data correlations must be determined, together with a violation of such a correlation in the current message to be verbalized.

2. Related Work
In traditional NLG pipelines for data-to-text generation, an interpretation module that encodes domain-expert knowledge decides which information should be contained in a message within the document plan. Avoiding the hand-coding of such a heuristic selection module is desirable for reasons of time and effort and can be achieved with data-driven learning techniques, provided that a reasonable amount of data is available.

In general, data-to-text generation is the field of transforming tabular data into surface text [3, 4, 5]. Due to the rise of neural networks, traditional pipeline approaches were abandoned in favor of end-to-end trainable encoder-decoder networks [6, 7, 8], which do not separate content selection and document planning from surface realisation, often at the price of losing control over the generated content. To improve content and informational correctness, copy mechanisms [9, 10] are employed that directly copy content from the input data to the output text, while maintaining end-to-end trainability. Mei et al. [7] use an intermediate aligner step between encoding and decoding in order to integrate more controllable content selection into the end-to-end network.
As Wiseman et al. [8, p. 2259] point out, the performance of such models is well below the gold standard on the RotoWire dataset, regarding both content selection and text coherence, although the copy mechanisms clearly improve on vanilla encoder-decoder networks in terms of BLEU. Their results also indicate that there is no correlation between the models' precision and recall for content selection and their BLEU scores. Another important point is that most models are trained to generate short phrases only, e.g. short biographies from Wikipedia tables [6, 11]. Coherence of discourse and information structure decreases as text length increases, which makes encoder-decoder models a non-optimal choice for the generation of longer texts.

Ferreira et al. [4] propose a re-modularization of neural generation networks, chaining separately trainable and evaluable networks that are specialized for the different tasks of content selection, document planning and surface realisation. They show that these pipelined neural generation models outperform end-to-end networks, especially on unseen data, where the latter tend to produce topic-unrelated, incoherent texts and hallucinations. Turning back to a pipelined neural generation system necessitates finding a suitable content selection model that determines what the document plan shall contain. Many papers have been published on how to penalize toxicity and inappropriate language use in neural NLG, resulting in more neutral word choice, but to our knowledge, the controlled generation of a denial of expectation as a model of contrast relations and evaluative content has not yet been dealt with in the context of neural NLG.

3. Methodology
We present a classification approach to content selection on automatically augmented data which predicts what information shall be present in the document plan. Furthermore, we use regression models to predict, under different knowledge limitations, whether the information to be produced in the surface text agrees with expectations about that information, in order to determine in a data-driven manner whether it is legitimate to use expressive content for putting the information into perspective.

3.1. Domain
A domain for data-to-text generation needs tabular data from which surface text shall be produced. The German Automotive Club ADAC supplied us with a proprietary data set of 1300 road test reports of 3000 to 6000 words each, including the respective database with 127 technical and economic properties for each vehicle. The road test reports are written by domain experts and contain a subset of the properties given in the database as well as surplus information, e.g. a guess of the resale value of the used car, or the authors' opinion on vehicle characteristics. Therefore, the texts also include evaluative information which is naturally grounded in subjective estimation. These evaluative decisions, in turn, will be reflected in the use of corresponding evaluative lexical items like adverbs or conjuncts. There is no information in the database about which properties were named in the respective road test report.

• The A3 completes the intermediate sprint from 60 to 100 km/h in a brisk 6.1 s.
• The petrol engine completes the intermediate sprint from 60 to 100 km/h in 5.6 s.
• The 108 kW/147 hp mean that the simulated overtaking maneuver (sprint from 60 to 100 km/h) ends after just 6.0 seconds.

Table 1: Examples of verbalisations for acceleration (English translation of German corpus examples)
In order to produce a data set that is usable for a learning-based content selection module, we need a function that maps a road test review to a binary decision on the properties in the database, marking the absence of the respective piece of information from the text with 0 and its presence in the road test report with 1. The result is a triple (T, D, M) for each road test report, consisting of the text T, a database row D, which is a 127-tuple of mixed-type information on the vehicle, and a map M, which is a 127-tuple of binary values for the presence or absence of each property in D. We selected a subset of 15 properties from the database which are related to the technical details of engine, chassis and driving performance. In order to determine M, we extracted a sample of 200 database rows and respective test reports and built a heuristic information extraction module, which uses keyword tracking and pattern matching for determining the presence of a piece of information in the text. At the same time, we manually annotated a set of 50 texts from the corpus with regard to the presence of the 15 target features we selected beforehand. This set of annotated texts serves as the gold corpus for evaluating the heuristic IE module. The set of 200 reports for building the IE heuristic and the set of 50 texts for manual annotation are disjoint, and only the remaining 1050 road test reports and database rows are used for the machine learning models of content selection and denial of expectation.

3.2. Information Extraction
For information extraction, we searched the subset of 200 reports for keywords and patterns in which the 15 target properties are verbalized in order to extract rules in the form of regular expressions. Fortunately, across test reports, the verbalisation of each piece of information is relatively uniform, despite the self-evident differences between properties (see Table 1). As the examples in Table 1 show, the data point acceleration is stereotypically verbalized as the number of seconds the car needs to accelerate from 60 to 100 km/h. Additionally, keywords like sprint and overtaking maneuver often indicate the target information acceleration. Not all of the features are as straightforward to extract as acceleration. The feature displacement, for example, occurs in highly aggregated compounds like the 2.5 litre turbo combustor motor and is less accurately extractable. This heuristic information extraction approach is applied to the remaining 1050 pairs of road reports and database rows in order to produce the necessary training instances for neural content selection. The output is a table with 15 columns such that for each road test report there is a 15-tuple of binary values indicating whether the respective property was found in the text or not.
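To make the extraction heuristic concrete, the following is a minimal sketch of one such rule for the feature acceleration. The keyword list and the pattern are illustrative simplifications of our actual rules, not the rules themselves:

```python
# Minimal sketch of one heuristic extraction rule (feature: acceleration).
# Keywords and pattern are illustrative simplifications, not the full rule set.
import re

# German keywords such as "Zwischenspurt" ('intermediate sprint') or
# "Überholvorgang" ('overtaking maneuver') signal the target information,
# followed by a quantity of seconds, e.g. "6,1 s".
ACCELERATION = re.compile(
    r"(?:Zwischenspurt|Sprint|Überholvorgang)"
    r".{0,100}?\d+[.,]\d\s*(?:s\b|Sekunden)",
    re.IGNORECASE | re.DOTALL,
)

def feature_present(text: str, pattern: re.Pattern) -> int:
    """Return 1 if the target property is verbalized in the report, else 0."""
    return int(pattern.search(text) is not None)

# Applying one such pattern per target feature to a report yields one row
# of the binary map M from the triple (T, D, M):
# row = [feature_present(text, p) for p in PATTERNS.values()]
```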
3.3. Content Selection
Content selection has little bearing in neural end-to-end networks and encoder-decoder models, especially given that the generated texts often comprise only a few sentences or a single paragraph [6, 12, 13, 14]. In encoder-decoder systems, even with copy mechanisms, the generated texts lose coherence and informational correctness as the texts unfold. Even the most recent GPT-3 models, despite perfect grammatical correctness and language style, show rather weak correctness of information. Puduppully et al. [15] succeed in generating longer texts in a domain with an average of 330 words. The authors built an encoder-decoder model that learns relations between NBA basketball game records (numerous repetitive events with 4 features each) and the respective verbalisations in the corresponding game summary texts. The difference between their dataset and the car review corpus is the structurally non-repetitive nature of the latter: there are no repetitive events that can be mapped onto a summarization; each property in the ADAC database is a unique informational unit that is dealt with mostly separately or in paragraphs containing several related properties. Ferreira et al. [4], who competitively evaluate end-to-end models against their pipelined GRU and Transformer models, use the WebNLG corpus, which maps RDF triples onto a short paragraph containing the information. The authors do not deal with content selection in the actual sense, but rather with content ordering of the preselected triples in the corpus.

Our content selector takes the very first step in an NLG pipeline and decides which pieces of information shall be integrated into the document plan, which in turn, after adding the database values for the respective properties, can also be represented as RDF triples, based on the ADAC database and the human-written car reviews. We assume that the authors of the car reviews have domain-specific reasons for choosing certain properties from the database and leaving others aside. For example, horse power is used in nearly all texts, whereas valves are never even mentioned. Hence we assume that, given the texts, the database entries, and our generated training data, we are able to train a neural classifier that is capable of predicting which features should be produced. This would mean that the classifier encodes domain-expert knowledge on what to say by finding the same patterns in the technical data as the experts do.

Before using classification, we tried to determine feature importance patterns in the data. Assuming that some of the technical details are interdependent, not all pieces of information are relevant for predicting the presence of each property. This is also reasonable from an engineering perspective, since properties of the motor might condition each other, while having no influence on the design of the interior. The heatmap in Figure 1 reveals some interesting relations. The slightly recognizable line, which resembles a linear function with gradient -1, is self-evident, since each piece of information is mapped onto itself. More important are the strong outliers beside this line, showing that each data point has only a few, but important, points that condition the target point's presence.

[Figure 1: Heatmap of feature importance for content selection]

According to feature importance and usability of feature type, we selected 40 categorical and numerical values for predicting the 15 properties. The 15 data points were first label-encoded and then one-hot encoded. By applying the heuristic information extraction module to the texts, we obtained 931 usable text-array pairs. The remaining car reviews only provided partial or fragmentary test reports, which we therefore excluded. These arrays contain 15 binary values [1,0] indicating the presence or absence of the target properties. An important point to mention here is that the input to the classifier is not a tensor with the 15 binary presence decisions, but the real values of the 40 relevant properties, e.g. HP, torque, weight and so on, taken from the database. The network is therefore trained to determine the presence or absence of a piece of information on the basis of the vehicle's data, and not on the presence or absence of the other data points. A multi-label classifier [16, 17] with a dense network of 5 layers (Keras, TensorFlow), ReLU activation functions and sigmoid activation in the final layer did not lead to satisfying results. We reached only a few percent of accuracy per feature when predicting all 15 target features at once. Reasonable accuracy is only achievable when performing binary classification for each target feature separately.
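A minimal sketch of one such per-feature binary selector, assuming the 5-layer dense architecture described above; the layer widths and training settings are illustrative assumptions, not our exact configuration:

```python
# Minimal sketch of a per-feature binary content-selection classifier:
# a dense network with ReLU activations and a sigmoid output layer.
# Layer widths and hyperparameters are illustrative assumptions.
from tensorflow import keras

def build_selector(n_inputs: int = 40) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(n_inputs,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(16, activation="relu"),
        # Sigmoid output: probability that the target property
        # appears in the review of this vehicle.
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# One classifier per target feature: X holds the 40 encoded input
# properties per vehicle, y_feature the binary labels from the IE heuristic.
# build_selector().fit(X, y_feature, epochs=20, validation_split=0.1)
```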
3.4. Surface Realisation
For surface realisation we used the multilingual T5 (MT5, [18]); the model is fine-tuned on the German WebNLG corpus [19] with varying numbers of aggregated triples. In order to fine-tune the MT5 for car reviews, we extracted triples from our ADAC corpus of car reviews by matching the corresponding surface text with the features and values given in the corresponding database. The sample below shows what the neural pipeline generated on the basis of the technical data of the Renault Clio TCe 130 GPF. The content selector predicted to generate the features torque, supercharging and motor power, which were aggregated with feature values from the database, then linearized by the document planner and fed into the custom MT5 in order to produce the output surface text. The reference text from the original car review is also listed for comparison.

Input:     type: combustor; motor power: 130 HP; cubic: 1.3L; torque: 240 Nm; ...
Content:   torque, supercharging, motor power
Planning:  Renault Clio TCe 130 GPF | torque | 240 Nm
           Renault Clio TCe 130 GPF | supercharging | Turbo
           Renault Clio TCe 130 GPF | motor power | 130 PS
           (torque, supercharging, motor power (HP))
Output:    Der Turbobenziner leistet 130 PS und entwickelt ein maximales Drehmoment von 240.
           'The turbocharged petrol engine makes 130 PS and produces a torque of 240.'
Reference: Der 1,3 Liter Vierzylinderbenziner leistet dank Turboaufladung 130 PS und entwickelt ein maximales Drehmoment von kräftigen 240 Nm, das bereits bei 1.600 Umdrehungen pro Minute bereitsteht.
           'The 1.3 liter four-cylinder petrol engine has, thanks to supercharging, 130 PS and produces a maximal torque of a powerful 240 Nm, which is available already at 1,600 revolutions per minute.'
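As an illustration of this realisation step, a sketch of generating from a linearized document plan with a fine-tuned MT5 (Hugging Face transformers assumed; the checkpoint name "our-mt5-webnlg-de" and the triple separator are hypothetical placeholders):

```python
# Minimal sketch: surface realisation from linearized triples with a
# fine-tuned MT5. The fine-tuned checkpoint name and the linearization
# format are hypothetical placeholders.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("our-mt5-webnlg-de")

plan = [
    ("Renault Clio TCe 130 GPF", "torque", "240 Nm"),
    ("Renault Clio TCe 130 GPF", "supercharging", "Turbo"),
    ("Renault Clio TCe 130 GPF", "motor power", "130 PS"),
]
# Linearize the ordered document plan as "subject | predicate | object".
source = " && ".join(" | ".join(triple) for triple in plan)

inputs = tokenizer(source, return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# e.g. "Der Turbobenziner leistet 130 PS und entwickelt ein maximales
# Drehmoment von 240."
```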
3.5. Denial of Expectation as Contrast Relation
Vehicle testers have certain expectations about the features of a vehicle when writing their reviews. These are based on domain-specific experience, such that experts have an intuition about what weight a sports car should have, given a set of exterior dimensions, motor block size and so on. Many evaluations and contrasting details are formulated with regard to technical details which contradict each other or do not agree with the expected value, in both positive and negative polarity. A denial of expectation, arising when the expected value does not agree with the real value, may be a trigger for generating a concessive or evaluative marker that signals this mismatch to the reader. Independent of the question of how such a contrast or evaluation shall be lexicalised, the underlying semantic mechanism for determining such a mismatch is data-driven and is, therefore, installed at the interface of content selection and document planning. A straightforward approach to predicting values, given a set of features, is regression.

Using a Random Forest regression implementation (scikit-learn, 100 estimators), which often faces limitations for linguistic data [20] but is a reasonable choice for our task at hand, we predict numerical values for each of the 15 target features on the basis of the set of residual features in the database (excluding the target feature). By comparing these predicted values to the real values, we can determine whether there is a significant deviation that may trigger a contrast relation or the usage of an evaluative adverb.

From the corpus of 1300 car reports we automatically extracted instances of concessive markers, namely obwohl ('although'), and the evaluative adverbs erstaunlicherweise ('surprisingly'), bedauerlicherweise ('regrettably') and leider ('unfortunately'), and filtered them by the assessability of the contrasting information in our database. Contrast relations to which the denial of expectation applies, but which cannot be modelled in a data-driven way, dealt with subjective driving experience, e.g. the adjustability of the arm rest or the noise level of the motor, or would necessitate additional reasoning or information we did not have access to. 19 instances of concessives and evaluative expressions remain for analysis, listed in Table 4. Note that we confined the evaluative adverbs to those cases with a denial of expectation as the underlying contrastive motivation. This is a small number of instances, but it relates to the fact that we only searched for the specific markers above.

The polarity defines the positive or negative direction of the denial of expectation, whereas the arrows beside the source and target property names indicate whether their values need to be high(-er) (↑) or low(-er) (↓) in order to match the polarity. Furthermore, Table 4 lists four numerical values: the real value for the target data point, which was retrieved from the database, and three possible thresholds for modeling the denial of expectation. These are the predicted value on the basis of all 40 relevant data points, the 'naive' prediction, where the input to the regressor is limited to the source information the authors name in their contrast relation, and finally the average value of the target data point across all vehicles in the database. The numerical values that differ with the correct sign from the real value, such that the denial of expectation can be captured by the threshold, are printed in boldface. A difference to the real value can be interpreted as a denial of expectation: the expected value is higher or lower than the real value, and the sign of the difference together with the semantic relation between the source and target features, e.g. higher HP may entail higher maximal velocity, determines the polarity of the whole expression. The polarity (+/-) determines whether the denial is a surprise (+) or a disappointment (-). For example, instance (8) in Table 4 can be paraphrased as although the car has a lot of horse power, the mileage is comparably low. The expected values (both regression values, 5.94 and 6.1) are higher than the real value (5.9), meaning that the real mileage falls below what is expected for a car with the respective HP. Since lower mileage is positive, the polarity of the whole expression is positive.
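The following sketch shows how the three thresholds of Table 4 can be computed with scikit-learn's RandomForestRegressor; the column names and the DataFrame layout are illustrative assumptions:

```python
# Minimal sketch of the three expectation thresholds from Table 4:
# full-knowledge prediction, 'naive' prediction limited to the source
# features, and the global average. Column names are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def expectations(db: pd.DataFrame, target: str, source: list, i: int) -> dict:
    """Expected values for vehicle i's target feature under three settings."""
    train = db.drop(index=i)                      # leave the vehicle out
    full_cols = [c for c in db.columns if c != target]

    full = RandomForestRegressor(n_estimators=100)
    full.fit(train[full_cols], train[target])
    naive = RandomForestRegressor(n_estimators=100)
    naive.fit(train[source], train[target])

    return {
        "pred": full.predict(db.loc[[i], full_cols])[0],   # full knowledge
        "naive": naive.predict(db.loc[[i], source])[0],    # source features only
        "avg": train[target].mean(),                       # global average
    }

# E.g. for instance 8 (source hp, target mileage): an expected value above
# the real one means the real mileage falls below the expectation, so the
# denial of expectation has positive polarity.
# expectations(db, target="mileage", source=["hp"], i=8)
```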
Before going into the analytic details, a few technical relations need clarification in order to understand the expression for the target property range(↑)@mileage(↓). Five of the contrast relations are established between the size of the fuel tank and the possible range of the car, indicating that despite a comparably small tank, a high range is possible. The range is listed as a numeric value in the database only for electric cars, but for combustors the range can be inferred from the tank size and the minimal consumption. Range and mileage are inversely proportional: the smaller the mileage, the higher the possible range at constant tank size. Our target feature is therefore the fuel consumption, where less is generally considered better. Furthermore, acceleration needs explanation: acceleration is quantified as the number of seconds needed to reach a certain velocity, e.g. 100 km/h. Higher acceleration leads to fewer seconds needed.

4. Evaluation

4.1. Information Extraction
Table 2 lists the results of the heuristic information extraction approach tested on the gold corpus of 50 reports. We received perfect or near-perfect values for horse power, torque, fuel consumption, acceleration and brakes. While velocity and transmission still show good recall and precision values around 0.85, displacement and supercharging reveal that the highly aggregated expressions for the motor description are too diverse to be captured as accurately as the other attributes.

Table 2: Information Extraction: Precision and recall for extracted properties, plus occurrence counts in the resulting training data

feature            precision  recall  counts
horse powers       1.0        1.0     927
torque             0.97       1.0     805
fuel consumption   0.97       1.0     926
acceleration       1.0        0.97    419
motor type         0.97       0.91    800
price              1.0        0.89    926
brakes             0.97       1.0     831
max. velocity      0.87       0.91    399
transmission       0.85       0.85    766
cylinders          0.84       0.88    119
displacement       0.65       0.77    288
supercharging      0.6        0.46    461
weight             0.25       0.27    889
valves             -          -       0

For valves we cannot offer data since the feature did not occur in the reviews at all, which shows its irrelevance to the reader from the domain experts' perspective. The low recall and precision values for weight are due to the rather indefinite nature of the car's weight: different weight-related features exist, e.g. trailing load or the maximal allowed weight of cargo on the roof or in the trunk, and sums of varying subsets of those are used in the texts. The weight of the car is therefore complicated to distinguish from other weight-related expressions.

4.2. Content Selection
For evaluating the content selection module, we used 8-fold cross-validation and calculated the average scores. In Table 3, the evaluation metrics for each feature are listed separately.

Table 3: Content Selection: Accuracy, error, precision and recall values for feature-wise binary classification

feature            acc    err    precis.  recall
horse powers       0.997  0.005  0.996    0.997
valves             0.997  0.004  0        0
fuel consumption   0.992  0.013  0.99     0.99
price              0.991  0.016  0.99     0.99
weight             0.914  0.112  0.96     0.97
motor type         0.815  0.098  0.88     0.85
brakes             0.797  1.0    0.90     0.86
cylinders          0.79   0.12   0.11     0.18
torque             0.765  0.173  0.88     0.87
transmission       0.72   0.184  0.79     0.75
displacement       0.686  0.11   0.29     0.24
supercharging      0.61   0.075  0.68     0.53
max. velocity      0.56   0.054  0.52     0.38
acceleration       0.53   0.071  0.45     0.48

The features acceleration and max. velocity, which were recognized very well by the IE, are the properties with the worst prediction accuracy, only marginally better than random. The data points supercharging and displacement agree in accuracy with the low precision and recall values the IE already suggested. Horse power and valves are predicted with the best accuracy, which is explainable by the simple categorical use of the former and the utter absence of the latter, making their selection deterministic.
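For reference, the 8-fold cross-validation behind Table 3 can be sketched as follows, assuming scikit-learn's KFold and the hypothetical build_selector constructor from the Section 3.3 sketch:

```python
# Minimal sketch of the 8-fold cross-validation of one per-feature selector.
# build_selector is the hypothetical constructor from the Section 3.3 sketch.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import KFold

def crossvalidate(X: np.ndarray, y: np.ndarray, n_splits: int = 8):
    scores = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True).split(X):
        model = build_selector(X.shape[1])
        model.fit(X[train_idx], y[train_idx], epochs=20, verbose=0)
        pred = (model.predict(X[test_idx]) > 0.5).astype(int).ravel()
        scores.append([accuracy_score(y[test_idx], pred),
                       precision_score(y[test_idx], pred, zero_division=0),
                       recall_score(y[test_idx], pred, zero_division=0)])
    # Average accuracy, precision and recall over the folds (cf. Table 3).
    return np.mean(scores, axis=0)
```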
4.3. Surface Realisation
The output of the system is evaluated against the original human-produced car reviews using BLEU. Surprisingly, the model reaches 0.57 BLEU despite the small data set. Depending on the order of the triples, the BLEU score may drop to 0.19 (random order), showing that the linear order, based on topic extraction, is of essential importance for the successful transformation into surface text. This is a clear indication that the discourse planner plays a central role in the acceptance of the generated texts.

4.4. Denial of Expectation as Contrast Relation
Figures 2 and 3 depict the sampling variance [21] of the Random Forest regressor.

[Figure 2: Confidence intervals of acceleration]
[Figure 3: Confidence intervals of fuel consumption]

The graphs show the error of the Random Forest regression model for the two features fuel consumption and acceleration, which are from the upper and the lower end, respectively, of the accuracy scale for content selection. Figure 3 shows that the predicted values agree perfectly with the real values of fuel consumption, and the confidence intervals are uniform, at least in the area below 9 liters of fuel per 100 km. Above this threshold, only few data points are available, for which the prediction is much less accurate and the confidence much weaker. A less accurate picture is drawn by the predictions for acceleration, which are still close to the real values, but fewer predictions are perfectly on point. There is also more variance in the confidence intervals, but in contrast to fuel consumption there are fewer outliers, and the values are spread across the interval of 2.5 to 15 seconds in a more balanced way. On average, regression seems to yield good results when predicting the 15 target features on the basis of the 40 most relevant features in the database. But does regression model the domain experts' decisions in a sufficiently accurate way?

Table 4: Contrasting relations and evaluative adverbs from the ADAC car review corpus (pred., naive and avg. are the three thresholds for the denial of expectation)

id  token       pol.  source property    target property        real   pred.    naive  avg.
1   obwohl      +     fuel tank(↓)       range(↑)@mileage(↓)    4.1    4.1      4.4    5.4
2   obwohl      +     fuel tank(↓)       range(↑)@mileage(↓)    6      5.997    5.7    5.4
3   obwohl      +     fuel tank(↓)       range(↑)@mileage(↓)    3.9    3.98     4.3    5.4
4   obwohl      +     weight(↑)          vehicle payload(↑)     520    549.18   545.5  497
5   obwohl      +     weight(↑)          acceleration(↑)        7.8    7.93     8.6    9.17
6   obwohl      -     dimensions(↓)      weight(↑)              2173   2134     1985   2028
7   obwohl      +     fuel tank(↓)       range(↑)@mileage(↓)    3.7    3.53     4.93   5.4
8   obwohl      +     hp(↑)              mileage(↓)             5.9    5.94     6.1    5.4
9   obwohl      +     weight(↑)          acceleration(↑)        7.6    7.7      8.8    9.17
10  obwohl      +     fuel tank(↓)       range(↑)@mileage(↓)    5.1    5.08     4.7    5.4
11  obwohl      +     weight(↑)          mileage(↓)             4.8    4.78     8.3    5.4
12  obwohl      +     weight(↑)          acceleration(↑)        5.2    4.95     6.75   9.14
13  leider      -     –                  price(↑)               31900  35120    –      42250
14  leider      -     –                  price(↑)               26295  26161    –      42250
15  leider      -     –                  mileage(↑)             5.2    5.149    –      5.4
16  obwohl      +     supercharging(↓)   torque(↑)              213    195.86   150.0  301.8
17  obwohl      +     displacement(↓)    max. velocity(↑)       182    179.95   175    200
18  obwohl      +     supercharging(↓)   acceleration(↑)        6.1    5.9      6.2    9.17
19  erstaunl.   +     hp(↓)              trailing payload(↑)    4900   3551.15  1100   1719
A closer look at the instances of denial of expectation in the corpus sheds light on the relation between author expertise and the motivation for generating this kind of contrast. As the column of predicted values in Table 4 shows, many of the expected values related to fuel consumption are modeled nearly perfectly on point, leaving no or only marginal differences. Reducing the input features of the regression model to the source information causes more deviation, which allows us to model contrast with a 'naive' prediction under limited knowledge. The average mileage across all database entries is also a good threshold estimator. The outlier with regard to mileage is instance 2. The original text gives a reasonable explanation for this: the scale by which the positiveness of reduced fuel consumption is explained is a direct comparison to the previous version of the same car model. The newer version has a smaller tank, but a longer range. Therefore, the contrast is triggered by the direct comparison of tank and mileage of two versions of the same car model.

All contrasts concerning acceleration can be modeled either with the predicted value or with the naive prediction that is limited to the knowledge given by the respective source information. The average of acceleration also captures all instances correctly. Example (2), stated in Section 1, is the only instance of obwohl with negative polarity, denying the expectation of a lower weight given the comparably small dimensions. This contrast is also correctly modeled with all three possible thresholds. The evaluative adverb leider, which semantically expresses a negative point, is not correctly predictable with the regressed values. Only for the mileage-related instance (15) does the average seem to be a reasonable threshold. The other instances of leider deal with price, which is highly dependent on build quality, brand and prestige, and is therefore possibly the most problematic feature for evaluative content. The average price is not a good estimator in this case.

Instances (16) and (18), contrasting the lack of supercharging with surprisingly good torque and acceleration values, are captured by the naively predicted values, the former even by the value predicted on the whole relevant dataset. Example (17) contrasts the minor motor displacement with a surprisingly high maximum velocity, which is captured by both predictions, while the average value is far off. The contrast relation in (19), marked by erstaunlicherweise ('surprisingly'), deals with an extraordinarily high payload given a rather low HP. All thresholds capture this contrast correctly, while the regressed value with full input features is still the closest to the original value. The usage of erstaunlicherweise instead of obwohl may indicate that the authors have a proper classification of contrast markers and evaluative adverbs that express a certain degree of deviation from the expectation: the distance to the expected value in either positive or negative direction may trigger the usage of an expression that semantically quantifies this distance.

5. Conclusion
First, the heuristic approach to information extraction and data augmentation introduces noise into the training data, possibly to the same extent to which errors occur in the manually annotated data. The consequence to be drawn from this is that any network trained on this data may also predict incorrectly on the basis of error-prone input.
The only limitation this imposes on the neural models is that they cannot outperform the correctness of either the heuristic or the human annotations, which is an issue inherent to applying machine learning models to noisy data. Regarding the content selection module in general, we can state that the performance of the content classifier is acceptable, given the small data set and the complexity of both the IE and annotation tasks which produced the model input. The content selector can, for unseen data, generalize from the encoded input properties and predict the presence of the pieces of information in the text to be generated. Although the accuracy values for some properties are still imperfect, this is a huge leap towards modeling domain-expert knowledge for content selection with minimal annotation resources. As already mentioned, the error rate of the heuristic IE approach is passed on to the content selector, which also entails that its accuracy will presumably improve with increasing recall and precision of its input.

With regard to surface realisation, it is important to mention that using Transformer models for producing surface text has a downside: neural models often hallucinate facts [22], and different methods have already been applied to prevent this [23]. A special case is expressive content such as evaluative adverbs and contrast relations, which go beyond the purely propositional content. Intentional production of such expressive, non-propositional constructions means that any sort of non-at-issue content should be suppressed during decoding where the input does not call for such verbalisation. Measures against hallucination intend to do that, but for expressives they would work on the word level only. Measures against hallucination that are based on fact-checking and ranking cannot be applied here, because non-at-issue content only puts facts into perspective; it reflects the author's opinion. In training instances where the input data motivate expressive content, the expressive content in the text output is either preserved, or the output is enriched with it in case no expressives are present. This training includes the systematic annotation of non-at-issue content [24]. Further research will show how such an approach dovetails with methods to minimize fact hallucination.

Summing up the empirical findings on modeling the denial of expectation, we can state that 42% of the evaluative expressions and contrasts we dealt with are explainable through regression of the target feature with full domain knowledge. Limiting the knowledge to the source features of the contrast relations improves the coverage to 86%, excluding the instances of leider, where no source features are mentioned explicitly (72% when including instances 13, 14 and 15). The average value as a threshold scores differently for different features. Although no reliable conclusion can be drawn from the statistics due to data sparseness, features like fuel consumption and acceleration seem more comparable to a global average than details like torque and price. A very interesting point is the difference between predictions on the full set of input features and predictions under the knowledge limitation that the authors of the reports anticipate: the empirical data show that limiting the input knowledge of the regression to the features on the basis of which the car reviewers make their assumptions doubles the percentage of correct predictions.
Due to data sparseness, we only have a very small number of instances we can use for evaluation. Nevertheless, the scores of 42% and 86%, respectively, exceed our expectations and are a good indicator that the acquisition of more instances for evaluation will validate our findings and the adequacy of our modeling approach.

6. Future Work
A topic we have not yet addressed with regard to the generation of denials of expectation is false positives. Regression may determine a trigger for evaluative expressions where no such expression shows up in the corpus of texts. There might be non-linguistic motivations behind (not) producing evaluative content, e.g. an economic bias or personal preferences. These motivations cannot be modeled, but may explain a mismatch between the observed and the data-driven usage of evaluatives. This leads to the necessity, once we have built a data set of verbalisations of properties containing both neutral ones and those enriched with evaluative content, of determining the minimal difference between expected value and real value that triggers a contrast or an evaluative adverb, such that the distribution of evaluatives in the empirical data best matches their distribution in the document plans. With more data, we are confident that we can determine these property-wise minimal deviations and deploy them for determining evaluative content at the content selection level.

References
[1] G. Winterstein, What but-sentences argue for: An argumentative analysis of but, Lingua 122 (2012) 1864–1885.
[2] S. Toulmin, The Uses of Argument, Cambridge University Press, Cambridge, 1958.
[3] E. Reiter, R. Dale, Building Natural Language Generation Systems, Cambridge University Press, Cambridge, 2000.
[4] T. C. Ferreira, C. van der Lee, E. van Miltenburg, E. Krahmer, Neural data-to-text generation: A comparison between pipeline and end-to-end architectures, in: Proceedings of EMNLP-IJCNLP 2019, 2019, pp. 552–562. doi:10.18653/v1/D19-1052. arXiv:1908.09022.
[5] A. Gatt, E. Krahmer, Survey of the state of the art in natural language generation: Core tasks, applications and evaluation, Journal of Artificial Intelligence Research 61 (2018) 1–64. doi:10.1613/jair.5714. arXiv:1703.09902.
[6] R. Lebret, D. Grangier, M. Auli, Neural text generation from structured data with application to the biography domain, in: Proceedings of EMNLP 2016, 2016, pp. 1203–1213. doi:10.18653/v1/D16-1128. arXiv:1603.07771.
[7] H. Mei, M. Bansal, M. R. Walter, What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment, in: Proceedings of NAACL HLT 2016, 2016, pp. 720–730. doi:10.18653/v1/N16-1086. arXiv:1509.00838.
[8] S. Wiseman, S. M. Shieber, A. M. Rush, Challenges in data-to-document generation, in: Proceedings of EMNLP 2017, 2017, pp. 2253–2263.
[9] S. Gehrmann, F. Z. Dai, H. Elder, A. M. Rush, End-to-end content and plan selection for natural language generation, in: E2E NLG Challenge System Descriptions, 2018, pp. 46–56. URL: http://www.macs.hw.ac.uk/InteractionLab/E2E/final_papers/E2E-HarvardNLP.pdf. arXiv:1810.04700.
[10] A. Moryossef, Y. Goldberg, I. Dagan, Step-by-step: Separating planning from realization in neural data-to-text generation, in: Proceedings of NAACL HLT 2019, volume 1, Association for Computational Linguistics, 2019, pp. 2267–2277. arXiv:1904.03396.
[11] A. Chisholm, W. Radford, B. Hachey, Learning to generate one-sentence biographies from Wikidata, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Association for Computational Linguistics, Valencia, Spain, 2017, pp. 633–642. URL: https://aclanthology.org/E17-1060.
[12] T. Liu, K. Wang, L. Sha, B. Chang, Z. Sui, Table-to-text generation by structure-aware seq2seq learning, CoRR abs/1711.09724 (2017). URL: http://arxiv.org/abs/1711.09724. arXiv:1711.09724.
[13] L. Sha, L. Mou, T. Liu, P. Poupart, S. Li, B. Chang, Z. Sui, Order-planning neural text generation from structured data, in: AAAI, 2018.
[14] L. Perez-Beltrachini, M. Lapata, Bootstrapping generators from noisy data, in: NAACL, 2018.
[15] R. Puduppully, L. Dong, M. Lapata, Data-to-text generation with content selection and planning, in: Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019, pp. 6908–6915. doi:10.1609/aaai.v33i01.33016908. arXiv:1809.00582.
[16] D. Gkatzia, H. F. Hastie, An ensemble method for content selection for data-to-text systems, CoRR abs/1506.02922 (2015). URL: http://arxiv.org/abs/1506.02922. arXiv:1506.02922.
[17] C. Kelly, A. Copestake, N. Karamanis, Investigating content selection for language generation using machine learning, in: Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), Association for Computational Linguistics, Athens, Greece, 2009, pp. 130–137. URL: https://aclanthology.org/W09-0623.
[18] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, 2020. URL: http://arxiv.org/abs/2010.11934. arXiv:2010.11934.
[19] T. C. Ferreira, D. Moussallem, S. Wubben, E. Krahmer, Enriching the WebNLG corpus, in: Proceedings of the 11th International Natural Language Generation Conference (INLG 2018), 2018, pp. 171–176. doi:10.18653/v1/W18-6521.
[20] S. Gries, On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement, Corpus Linguistics and Linguistic Theory 16 (2019). doi:10.1515/cllt-2018-0078.
[21] S. Wager, T. Hastie, B. Efron, Confidence intervals for random forests: The jackknife and the infinitesimal jackknife, Journal of Machine Learning Research 15 (2014) 1625–1651.
[22] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, CoRR abs/2202.03629 (2022). URL: https://arxiv.org/abs/2202.03629. arXiv:2202.03629.
[23] C. Rebuffel, M. Roberti, L. Soulier, G. Scoutheeten, R. Cancelliere, P. Gallinari, Controlling hallucinations at word level in data-to-text generation, CoRR abs/2102.02810 (2021). URL: https://arxiv.org/abs/2102.02810. arXiv:2102.02810.
[24] C. Hesse, M. Langner, A. Benz, R. Klabunde, Discrepancies Between Database- and Pragmatically Driven NLG: Insights from QUD-Based Annotations, in: D. Gromann, G. Sérasset, T. Declerck, J. P. McCrae, J. Gracia, J. Bosque-Gil, F. Bobillo, B. Heinisch (Eds.), 3rd Conference on Language, Data and Knowledge (LDK 2021), volume 93 of Open Access Series in Informatics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2021, pp. 32:1–32:9. URL: https://drops.dagstuhl.de/opus/volltexte/2021/14568. doi:10.4230/OASIcs.LDK.2021.32.