=Paper=
{{Paper
|id=Vol-2777/paper21
|storemode=property
|title=Do I Argue Like Them? A Human Baseline for Comparing Attitudes in Argumentations.
|pdfUrl=https://ceur-ws.org/Vol-2777/paper21.pdf
|volume=Vol-2777
|authors=Markus Brenneis,Martin Mauve
|dblpUrl=https://dblp.org/rec/conf/aiia/BrenneisM20
}}
==Do I Argue Like Them? A Human Baseline for Comparing Attitudes in Argumentations.==
Do I Argue Like Them? A Human Baseline for Comparing Attitudes in Argumentations

Markus Brenneis and Martin Mauve
Heinrich-Heine-Universität, Universitätsstraße 1, 40225 Düsseldorf, Germany
Markus.Brenneis@uni-duesseldorf.de

Abstract. In this paper, we present the results of a study in which participants were asked to rate the similarity between sets of positions and arguments. Our goal is to provide a baseline for metrics that compare the attitudes of individual persons in argumentations, with results matching human intuition. Such metrics have different applications, i.a. in recommender systems. We formulated several hypotheses about useful properties, which we then investigated in our survey. As a result, we were able to identify several properties a metric for comparing attitudes in argumentations should have, and obtained some surprising results, which we discuss in this paper (e.g., many people do not see a "neutral" position on a line between "pro" and "contra"). For some properties, further research is needed to get a clearer understanding of human intuition.

Keywords: Argumentation · Metric · Human Baseline

1 Introduction

When discussing with other people, it is interesting to know how similarly another person argues compared to yourself, i.e. how similar your attitudes are. Do you disagree on central statements, or do you generally agree but differ in some arguments? Do you have the same priorities for political positions or the same reasons, e.g. for the expansion of wind power? Having a mathematical metric for calculating the (dis-)similarity of attitudes in argumentation enables use cases like collaborative filtering for argumentation applications like kialo (https://kialo.com/) or our deliberate [5], finding representatives of a group, finding a consensus, and matching political parties and voters based on attitudes and used arguments.

People typically discuss central positions (e.g.
the improvement of a course of study [12] or the distribution of funds [8]) and support (or attack) them with other statements, which we call an argument. Each individual person agrees or disagrees more or less strongly with certain statements, and may consider some arguments more important than others when forming an opinion.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

When designing a metric for an application where arguments are exchanged, one has to ask which properties that metric should fulfill. For instance, should an opinion difference in "top-level" arguments against a position (e.g. "We should not build nuclear plants, because they are insecure") weigh more than disagreement on "deeper" arguments (e.g. "Nuclear plants are insecure, because there have been several accidents.")? Are two persons who are against and for a position as far apart from each other as two persons where one is for a position and the other has a neutral opinion? (Surprisingly for us, our results indicate that the latter is, in fact, the case, as we explain in Section 4.2.)

Any reasonable metric to answer those questions needs to be based on the perception that humans have regarding the similarity of chains of arguments, instead of the "intuition" of researchers who deal with argumentation theory every day. To establish a baseline for this, we asked our survey participants to judge the similarity of two chains of argumentation: Which pair is considered more similar? The questions asked were based on hypotheses presented in this paper. The hypotheses should help answer how a metric should behave in trade-off situations, with missing information, hierarchies, and weights in argumentations. To our knowledge, such a survey has not been conducted before.
Our contribution is the following: We formulate several hypotheses for assessing the similarity of argumentations, which should be respected by a metric comparing attitudes expressed in argumentations. We gathered a data set with human assessments of the relative similarity of argumentations for testing the real-world relevance of our hypotheses, and checked which hypotheses can be regarded as correct with high significance.

In the following section, we define central concepts of argumentation theory relevant for this paper. Afterwards, we describe the methods we used and our hypotheses. We then present our most important and surprising results. In the fifth section, we discuss our methods, and finally, we comment on related work.

2 Definitions

In this paper, we use terms based on the IBIS model [13] for argumentation. Within an argumentation context, there are arguments, which consist of two statements: a premise and a conclusion (e.g., "Nuclear power is sustainable." can be a premise for "We should build a nuclear power plant."). When we draw an argumentation graph, statements are nodes and arguments are edges. Statements which are only used as conclusions are called positions, and are typically actionable items like "We should build a nuclear power plant". The unique root of the argumentation graph is called the issue I, and connects all positions. It is typically the overall topic of the discussion, e.g. "What shall the town spend money for?".

Each person can have a specific view on the parts of an argumentation graph: A person can agree or disagree with a statement, which we call the person's opinion. Arguments and statements can be of different importance (or relevance, weight) to different persons. Each individual person may use one specific subset of all available arguments. We call the sum of opinions, importances, and arguments used by a person their attitude.
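For illustration, these concepts could be modeled as follows. This is a minimal sketch with hypothetical identifiers chosen by us, not an implementation used in this work; it only mirrors the definitions above: statements are nodes, arguments are premise–conclusion edges, positions are statements used only as conclusions, and an attitude bundles opinions, importances, and the arguments a person uses.

```python
from dataclasses import dataclass, field

@dataclass
class ArgumentationGraph:
    issue: str                                    # overall topic of the discussion
    statements: dict                              # statement id -> statement text
    arguments: set = field(default_factory=set)   # (premise_id, conclusion_id) edges

    def positions(self):
        # Positions are statements that are only used as conclusions,
        # i.e. they never appear as a premise of any argument.
        premises = {premise for premise, _ in self.arguments}
        return {s for s in self.statements if s not in premises}

@dataclass
class Attitude:
    opinions: dict      # statement id -> opinion, e.g. -1 (contra) .. +1 (pro)
    importances: dict   # argument -> weight of that argument for this person
    used: set           # subset of the graph's arguments this person uses

# Tiny example built from the nuclear-power statements above:
g = ArgumentationGraph(
    issue="What shall the town spend money for?",
    statements={"a": "We should build a nuclear power plant.",
                "a2": "Nuclear power is sustainable."},
    arguments={("a2", "a")},  # "sustainable" is a premise for the plant position
)
print(sorted(g.positions()))  # only the plant statement is a position here
```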
The results of our work are independent of this model, but it enables us to precisely formulate our hypotheses (i.a. by having statements, not arguments, as atomic elements) and to draw graphs for visualizing scenarios for our hypotheses. So our findings can also be applied to metrics working with Dung-style [7] argumentation frameworks; for instance, our issue-based graphs can be transformed into an abstract argumentation framework using the tool dabasco [16].

[Figure 1: Example of an argumentation graph G and a personal view G′ on that graph with attitudes. Statements with unknown opinion are not drawn in G′. The statements are: I: "What should the town spend money for?"; a: "We should build a nuclear power plant."; a1: "Nuclear power plants are insecure."; a11: "There have been several accidents."; a2: "Nuclear power is sustainable."; b: "We should improve the look of the park."; b1: "A nice park attracts tourists."]

To understand how our graphs should be read, Figure 1 depicts an example of an argumentation graph G for a discussion and a personal view G′ on that graph, which contains Alice's attitudes. In this example, Alice is very sure (++) that she wants the look of the park to be improved (b), and she is against a nuclear power plant (a, −). She accepts the statements that nuclear power is sustainable (a2) and that nuclear power plants are insecure (a1), but she thinks the latter weighs more (thick line) for her opinion on building a nuclear power plant. Alice has not mentioned an opinion on the statements a11 and b1. For better readability, we will not draw opinions if the focus of a scenario is not on opinions; in that case they are considered to be the same across the graphs being compared (e.g., "agree"/"+" can be assumed for all statements in Figure 2).

3 Methods

We now present how we developed our hypotheses for properties of a metric for comparing the way different persons or organizations argue, how we created questionnaire scenarios, and how we conducted the survey.
Our focus is explicitly on comparing the attitudes of different persons within an argumentation, not properties like the number of counterarguments, consistency, or the use of rhetorical devices. We are well aware that our list of properties is only a starting point for the work of finding out how the human feeling of argumentation similarity can be translated into a mathematical metric. Thus, we expect that our list can be extended with more properties in the future.

First, we formulate hypotheses about what we expect of a metric. Those hypotheses are at least somewhat reasonable for domain experts, and are partially based on properties of a metric we have presented in earlier work [4]. However, before they are used to guide the development of metrics for the comparison of argumentations, it should be checked whether they match the perception of average humans. To do so, we developed questionnaire scenarios for every single hypothesis. Participants of the survey were asked to assess the similarity of the people's argumentation by indicating which person's argumentation is most similar to the argumentation of another given person. For scenarios which involved only one topic (e.g. an argumentation on nuclear power), we had multiple versions of that scenario with different topics to prevent topic-dependent results.

The survey was conducted using Amazon Mechanical Turk (MTurk) because of its easy and fast recruiting process. Only participants from the US were allowed, to ensure sufficient knowledge of English. Although MTurk users are not representative of the US population, it has been shown that the average difference can be quite small [2]. The questions and scenarios were randomly assigned to the participants, and the order of answers was randomized. To ensure answers of good quality, only answers of participants who answered at least 3 of 5 quality control questions correctly were used in the evaluation.
The complete list of hypotheses is in Table 1. They are grouped into four categories with different motivations: First, we were interested in the influence of basic properties of argumentations, like being for/against a different number of statements and adding arguments. Then we asked ourselves what the influence of the weights of opinions and arguments is, and whether they play a role at all. The third group deals with the influence of missing information: Real-world applications often do not have complete information about a person's attitude; how should a metric behave here? The last group is about trade-off situations: What weighs more when both the opinions and the arguments mentioned differ between persons? What is the influence if the relevance of positions is rated completely differently?

[Figure 2: Visualization of the scenario for Hypothesis 4: The graphs represent the attitudes in the argumentation of each person in the scenario. Alice uses argument a for position p; Bob additionally uses a second argument b for p; Charlie additionally uses an argument c for a.]

As an example, we now present how Hypothesis 4 (deviations in deeper parts contribute less to dissimilarity than deviations in higher parts) was developed and transformed into a questionnaire scenario. All scenarios can be found in our complete data set, which is available online (https://github.com/hhucn/argumentation-similarity-survey-results).

We asked ourselves whether the level at which arguments are added is relevant. To make the idea of the hypothesis clearer, Figure 2 depicts the attitudes of the persons involved in the constructed scenario. Consider that Alice, Bob, and Charlie have the same opinions on a position p and a common argument a for it. If Bob adds another argument for p, and Charlie an argument to a, we think that Alice and Charlie are closer because their first-level argumentation is the same and the deviation is in a deeper part.
One could, however, also assume that individuals not familiar with argumentation theory do not have a notion of levels and consider both differences in argumentation behavior as similarly severe.

From our hypothesis, we constructed the following scenario and questions:

Alice argues as follows on the subject of wind power: More wind turbines should be built because wind power has a low environmental impact. Bob argues as follows: More wind turbines should be built because wind power has a low environmental impact and because wind turbines are safe. Charlie argues as follows: More wind turbines should be built because wind power has a low environmental impact. The reason for the low environmental impact is that they do not produce any emissions.

Whose attitude does Alice agree with most?
– with Bob's attitude
– with Charlie's attitude
– the attitudes are equally far apart

Whose attitude does Bob agree with most?
– with Alice's attitude
– with Charlie's attitude
– the attitudes are equally far apart

Whose attitude does Charlie agree with most?
– with Alice's attitude
– with Bob's attitude
– the attitudes are equally far apart

The question relevant for us is "Whose attitude does Alice agree with most?", and our expected answer is "with Charlie's attitude"; the other questions were added to gather additional data and to prevent biased answers.

Most other scenarios were constructed in the same way. An exception are questions related to missing information, which we asked twice: Once we forced a decision (since a complete, well-defined metric has to make some decision, too), and once we allowed participants to choose "this cannot be assessed" as an answer.

4 Results

We now present the results of our survey, and highlight and explain results which were surprising for us.
We report p-values for the null hypothesis "our expected answer is not the most frequently (relative frequency) given answer".³ For space reasons, not all numbers are presented and discussed in detail, but the aggregated raw data for all questions is available online. A summary of the relative answer frequencies for the relevant questions is depicted in Figure 3.

³ We used an intersection–union test [18, p. 240] with one-tailed tests on the variances of the difference of two multinomial proportions [9,17], i.e. H0 is that the differences of the relative answer frequencies between the expected answer and the other answers are not greater than 0.

Table 1: Our hypotheses about the assessment of attitude similarity in argumentations, grouped into basic properties, influence of weights, influence of missing information, and trade-offs

H1: Proportionally bigger overlap of opinions on positions results in greater similarity than the absolute number of differences.
H2: Proportionally bigger overlap on arguments for/against a position results in greater similarity than the absolute number of differences.
H3: A neutral opinion is between a positive and a negative opinion.
H4: Deviations in deeper parts have less contribution to dissimilarity than deviations in higher parts.
H5: Weights of arguments have an influence even if they are the only difference.
H6: Argumentation differences in a branch with lower importance contribute less to dissimilarity.
H7: No opinion is between a positive and a negative opinion.
H8: An unknown opinion is between a positive and a negative opinion.
H9: A statement for which no opinion is mentioned is like a statement for which we explicitly say the opinion is unknown.
H10: Not mentioning an argument and being against an argument have the same effect.
H11: Disagreeing on a position results in greater distance than having the same opinion on that position, but with contrary arguments.
H12: It is possible for a difference in arguments for/against positions to result in greater dissimilarity than a difference in opinions on those positions.
H13: Two argumentations with weak and contrary opinions on a statement can be closer than two argumentations with the same opinions, but with very different strengths.
H14: Two argumentations with weak arguments and contrary opinions on their premises can be closer than two argumentations with the same opinions, but with very different strengths of arguments.
H15: When determining the attitude regarding a position, opinions (not) mentioned for a not-accepted argument have no influence.
H16: Flipping the two most important positions results in a bigger difference than flipping two less important positions.
H17: Adding a new position can remove a previous dissimilarity.
H18: Adding a new position as most important position can swap a previous similarity order.
H19: Agreeing with someone's most important position is as important as having that person's most important opinion matching mine.
H20: Adding another most important position results in greater dissimilarity than flipping the priorities of two positions.
H21: Having more similar priorities of opinions can result in greater similarity even with a lower absolute number of same opinions.
H22: Not mentioning a position results in greater dissimilarity than assigning lower priorities.
[Figure 3: Results for the relevant questions for each hypothesis. Clopper–Pearson confidence intervals (α = 0.05) are indicated for the expected answer and the other answer options of the relevant question, with the p-value for H0 "expected answer is not the most frequently given answer"; ‡: p ≤ 0.01, †: p ≤ 0.05, ∼: p ≤ 0.10. H1: p = .073∼; H2: p < .001‡; H3A: p = .006‡; H3B: p < .001‡; H4: p = .36; H5: p = .001‡; H6: p = .92; H7A: p = .44; H7B: p < .001‡; H8A: p = 1.000; H8B: p < .001‡; H9: p < .001‡; H10: p = .059∼; H11: p = .174; H12: p = .004‡; H13A: p = 1.00; H13B: p = 1.00; H13C: p = .97; H13D: p = .95; H14: p = .98; H15A: p = .002‡; H15B: p < .001‡; H16: p = .018†; H17: p = .71; H18A: p = .008‡; H18B: p = .64; H19A: p < .001‡; H19B: p < .001‡; H20A: p = .13; H20B: p = .15; H20C: p < .001‡; H21: p < .001‡; H22: p < .001‡]

After removing participants who did not meet our quality standards, we had, on average, 38 answers for every question relevant to our hypotheses. Those participants have a median age of 30–39 years, which matches the US median of 2018 (36.9 years). The male/female ratio is 1.96 (total US ratio: 0.97), so we had significantly more male than female participants in our random MTurk sample.

4.1 Results that confirmed our expectations

For many scenarios, we did not get surprising results, and we summarize them here. A proportionally bigger overlap of opinions (H1) or arguments (H2) is indeed more important than the absolute number of differences (H1: expected answer given by 54%, p = .073; H2: 74%, p < .001). If the assessment of argument relevance is the only difference between attitudes, this is considered a difference by most participants (H5: 60%, p = .001).

That the most important opinion in one argumentation matches the opinion in the other argumentation is as important as the reverse case (H19), independent of whether this question is asked from a person-centric (66%, p < .001) or "bird's-eye" view (70%, p < .001).
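The exact (Clopper–Pearson) intervals reported in Figure 3 can be computed per answer option from the binomial tail probabilities. The following is only an illustrative sketch using the standard library (the counts, 25 of 38 participants giving the expected answer, are chosen for illustration and are not taken from our data set):

```python
from math import comb

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05, tol=1e-9):
    # Exact two-sided confidence interval for a binomial proportion,
    # found by bisection on the (monotone) binomial tail probabilities.
    def solve(pred):
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if pred(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # Lower bound: smallest p with P(X >= k; p) = alpha/2 (0 if k = 0).
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    # Upper bound: largest p with P(X <= k; p) = alpha/2 (1 if k = n).
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p) >= alpha / 2)
    return lower, upper

# Illustrative: 25 of 38 answers for one answer option.
print(clopper_pearson(25, 38))
```

For the full figure, one such interval is computed per answer option of each relevant question.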
Flipping the priorities of the most important positions results in a smaller perceived difference than adding a new most important position p, regardless of whether the other persons have not mentioned their opinion on p (H20A: 48%, p = .13), had an explicitly unknown opinion (H20B: 52%, p = .15), or were neutral (H20C: 79%, p < .001). Leaving out a position results in greater dissimilarity than lowering its priority (H22: 87%, p < .001). Not only the number of matching opinions on positions is relevant: if another argumentation has only a subset of the positions, it can be more important that the priorities are more similar (H21: 74%, p < .001).

4.2 Surprising Results

We now take a closer look at more surprising findings from the survey which were not in line with the expectations we originally had when designing our hypotheses.

No continuum pro–neutral–contra. In Hypothesis 3, we conjectured that a neutral opinion lies exactly between a positive and a negative opinion on a statement. As already mentioned in Section 3, we asked this question in two ways: In variant A, "this cannot be assessed" could be chosen by participants; in variant B, a decision had to be made. In both cases, our expected answer ("neutral" is equally far away from "pro" and "contra") was given by most participants (A: 66%, p = .006; B: 95%, p < .001), where the result is much clearer when a decision is forced.

Although the question relevant for us in this scenario was answered as expected, the questions regarding whose attitude is most similar to the positive or negative attitude, respectively, were answered unexpectedly: We expected that a positive opinion is considered closer to neutral than to negative, but this was only just one of the most frequent answers. In variant B with forced decision, an "equally far apart" assessment was given by around 50% of the participants.
This can be a hint that many people do not have a mental model where pro, neutral, and contra are arranged on a straight line, but rather at the corners of a triangle. This might be similar to the opinion triangle presented in [11], with the directions Belief, Disbelief, and Ignorance.

For Hypotheses 7 and 8, we could see similar effects. Hypothesis 7 dealt with whether no opinion is equally far away from pro and contra. In case A, most people gave our expected answer (48%, p = .436), but many also said that the case cannot be assessed (45%). When forced to make a decision, people chose our expected answer "equally far apart" (95%, p < .001). But in both variants, we also see the tendency that people have a mental triangle model: In variant B, around 55% saw pro (contra) as equally far away from no opinion and contra (pro). So being neutral (Hypothesis 3) and having no opinion lead to similar assessments when a decision is forced, but more people tend not to make an assessment in the no-opinion case if allowed to. Lastly, if we consider pro, contra, and unknown opinion (Hypothesis 8), an absolute majority thinks the case cannot be assessed, which makes sense. If a decision is forced, more than 75% follow the triangle model again.

Consideration of hierarchies and weights for branches. We expected that adding an argument deeper within an argumentation is considered a smaller dissimilarity than adding a new top-level argument (Hypothesis 4, also see Figure 2). This expectation was not confirmed (38%, p = .36); the answers are nearly equally distributed across all alternatives. We assume that people count the number of arguments used instead of thinking of an argument hierarchy. Here, further investigations with a more extreme example, e.g. a "deeper" argumentation, would be interesting.

[Figure 4: Visualization of the scenarios for Hypothesis 6 (a) and Hypothesis 10 (b); we expect Bob to be closer to Charlie than to Alice in both cases.]

Related to this finding are unexpected results for Hypothesis 6: Considering the example depicted in Figure 4a, when comparing Bob with Alice and Charlie, we thought that the similarity to Charlie is greater because the introduced difference is in a branch with lower importance (depicted by a thinner edge). This was not confirmed; our expected answer is the least frequently chosen answer (24%, p = .92). More participants think that Bob is most similar to Alice (40%) or that the attitudes are equally far apart (36%).

This is related to the assumption that people do not have a notion of argumentation hierarchy. If people do not catch that ap and aq are on the level below p or q, respectively, it makes sense that our expected effect cannot be seen. But this conjecture is contradicted by the answers for Hypothesis 10, where we thought that not mentioning an argument (as in Figure 4a) and being against an argument (Figure 4b) have the same effect. Our expected answer, that Bob is more similar to Charlie than to Alice, is now the most frequently chosen answer (52%, p = .059). Thus, our explanation for the unexpected results for Hypothesis 6 does not seem to be correct. Maybe the complexity of the scenario for Hypothesis 10 is so large that people pay closer attention to the nuances of the argumentation. Here, further investigations are necessary.

Trade-off between opinions and arguments. Consider a scenario where Alice and Bob have the same opinion on a position, but their arguments are contradictory. Charlie has the same arguments as Alice, but a different opinion on the position. We expect that Alice and Bob are closer than Alice and Charlie (Hypothesis 11), since people probably consider opinions on positions as more important than arguments.
Most people answered as we expected (45%, p = .174), but many also said that the attitudes are equally far apart (32%). We can conclude that the common opinion on the position has the greater influence on the assessment of attitude similarity, but arguments also play an important part in the assessment.

[Figure 5: Visualization of the scenario for Hypothesis 12; differences to Alice are encircled; we expected that Alice is considered closer to Charlie than to Bob. Alice, Bob, and Charlie share positions p1 to p5 with arguments a1, b1, ..., a5, b5; Bob differs from Alice in the opinions on arguments, while Charlie differs only in the opinion on position p5.]

In Hypothesis 12, we assumed that not only the opinions on positions are compared, but that arguments also play a role and can even "flip" the similarity. For an extreme example with many arguments, as shown in Figure 5, our expectation that Alice's attitude is more similar to Charlie's was confirmed (62%, p = .004). This is in line with the findings from Hypothesis 11: Not only common opinions on positions are important for assessing similarity, but also the arguments.

Note that our scenario for Hypothesis 12 converges to the scenario for Hypothesis 11 if p1 to p4 are removed. As we have only presented those two extreme scenarios in the questionnaire, we cannot say what the "turning point" is, i.e. what number of common arguments is needed to make up for different opinions.

Opinion tendency vs. weight. In Hypotheses 13 and 14, we wanted to know whether an argumentation with, e.g., a weak positive opinion on a position can be closer to a weak negative opinion on the same position than to a very strong positive opinion. We thought that this is possible, but we were proven wrong. The hypotheses were tested with different formulations and scenarios, as strength/weakness can be expressed in different ways: strongly for vs. slight tendency (A), for vs.
no definite opinion (B), strongly for vs. doesn't really have an opinion (C), involving a second, common position (D), and main reason vs. very unimportant reason (E). Our expected answers were not given by most participants (A: 13%, p = 1.0; B: 12%, p = 1.0; C: 33%, p = .97; D: 36%, p = .95; E: 34%, p = .98); instead, the similarity to the person with the same direction of opinion was rated greater (A: 84%, B: 81%, C: 60%, D: 60%, E: 63%). We can conclude that opinion tendencies are more important than the weights of opinions and arguments.

[Figure 6: Scenario for Hypothesis 15 on the effect of undercuts: We thought that Bob's and Charlie's attitudes would be considered equal. Alice argues in favor of wind power as follows: "I am in favor of wind power, as wind turbines do not produce CO2 emissions. Also, I'm for wind power because wind turbines look nice." Bob argues in favor of wind power as follows: "I am in favor of wind power, as wind turbines do not produce CO2 emissions. I think wind turbines look nice, but that is no argument for wind power and not relevant for the discussion." Charlie argues in favor of wind power as follows: "I am in favor of wind power, as wind turbines do not produce CO2 emissions. I don't think that wind turbines look nice."]

Understanding of undercuts. We expected that an opinion belonging to an undercut argument does not count towards the attitude to a position, i.e. in the scenario described in Figure 6, Charlie's and Bob's attitudes are considered equal, regardless of whether Charlie's last sentence is mentioned (case A) or not (case B). Our results are not clear for this question: "Do Charlie and Alice [or Bob] have the same attitude (opinion and arguments) on wind power?" was answered with "Yes" by more than 70% in all cases. We do not understand this result. It could be that the wording of the question for this case is too technical for a good assessment, so that most people only compared the opinions on the position.
Another possible explanation is that untrained persons do not understand the undercut attack correctly or find it confusing, and thus fall back to comparing opinions on positions.

Influence of adding new positions to a priority order. We wanted to know how the introduction of a new position by a participant influences the similarity order. Our anticipation was that it is possible to remove a previous dissimilarity this way (Hypothesis 17), or even to swap the similarity order (Hypothesis 18).

[Figure 7: In these scenarios, Charlie' introduces a new position not mentioned by the other participants; each person ranks positions by priority. (a) Scenario for Hypothesis 17 — Alice: b, a, c; Bob: a, c, b; Charlie: a, b, c; Charlie': d, a, b, c. (b) Scenario for Hypothesis 18 — Alice: a, c, d, b; Bob: d, a, b, c; Charlie: a, b, c, d; Charlie': e, a, b, c, d.]

To investigate whether those hypotheses can hold, we checked the scenarios depicted in Figure 7. In Figure 7a, we thought that Charlie would be considered more similar to Bob (Hypothesis 16), but Charlie' equally far away from Alice and Bob. The former was confirmed, so changing the order of the most important positions results in a greater perceived difference than flipping less important positions (57%, p = .018). The latter was not confirmed (31%, p = .71), but we see a clear difference from the 57%, indicating that the additional position has an influence on the intuition about similarity. There is no clear "correct" answer, though, since the answers are nearly evenly distributed across all alternatives.

For the scenario in Figure 7b, we anticipated that Charlie is closer to Alice (case A), but Charlie' closer to Bob (case B; one way to reach this conclusion is counting the number of absolute place differences for each common statement: Charlie–Alice: 4, Charlie–Bob: 6; Charlie'–Alice: 6, Charlie'–Bob: 4). The first expectation was confirmed (A: 55%, p = .008), but not the latter (B: 33%, p = .64).
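The "absolute place differences" count mentioned above can be sketched as follows. This is only an illustration, not part of our evaluation code; the priority orders are reconstructions consistent with the distances reported in the text (Charlie–Alice: 4, Charlie–Bob: 6; Charlie'–Alice: 6, Charlie'–Bob: 4):

```python
def place_difference(order_a, order_b):
    """Sum of |rank in a - rank in b| over the statements ranked by both persons."""
    rank_a = {s: i + 1 for i, s in enumerate(order_a)}
    rank_b = {s: i + 1 for i, s in enumerate(order_b)}
    common = rank_a.keys() & rank_b.keys()  # statements not mentioned by one person are ignored
    return sum(abs(rank_a[s] - rank_b[s]) for s in common)

# Priority orders for the Hypothesis 18 scenario (illustrative reconstruction):
alice = ["a", "c", "d", "b"]
bob = ["d", "a", "b", "c"]
charlie = ["a", "b", "c", "d"]
charlie2 = ["e", "a", "b", "c", "d"]  # Charlie' introduces the new position e

print(place_difference(charlie, alice), place_difference(charlie, bob))    # 4 6
print(place_difference(charlie2, alice), place_difference(charlie2, bob))  # 6 4
```

Under this count, Charlie is closer to Alice, but introducing the new top position e makes Charlie' closer to Bob, which is the swap we anticipated in case B.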
In case B, the answers are nearly evenly distributed. Although this is no hint that our hypothesis is sensible, we can see a tendency that the change from case A to B moves the three attitudes closer to each other.

Note that we can show neither that our hypotheses are consistent nor that they are inconsistent, because we only asked about concrete example scenarios. Other scenarios may yield different results, and having results for different scenarios leads to more precise results.

5 Discussion

Our survey was, to our knowledge, the first of its kind. Many results give valuable hints on how an intuitive metric for comparing attitudes expressed in an argumentation should behave. Such metrics have applications in, e.g., clustering and recommender systems. As seen in the previous section, a definite conclusion cannot be drawn for all hypotheses without further surveys.

Also, the way we constructed our survey questions could have been suboptimal. We chose a format which is suitable for most hypotheses to prevent differences due to different formulations of questions. We considered the option of letting people rate the similarity of argumentations on a numeric scale, but we thought that this approach is bound to fail: People are unfamiliar with rating argumentation similarity, would probably need some time for "calibration", and the task would feel more unnatural. Furthermore, asking for "attitude" could have been a problem, because some people may only consider opinions, not arguments. Asking how similarly two people "argue" would also be a problem, which we saw in an internal pretest: Some people started thinking about meta-argumentation aspects, e.g. whether counterarguments are mentioned, or how many arguments are used, and stopped looking at the person's actual attitude. For questions with ratings of several positions, we switched between complete sentences and enumerations, depending on the number of positions.
We thought that complete sentences with many positions would distract from the actual differences. The change of format could, of course, have an influence, which we did not measure. We are well aware that MTurk workers are not a representative sample of the US population, and even less so of other countries; as already mentioned, the gender distribution does not match the US population. Therefore, generalizing our results to other populations is only possible with caution. Nevertheless, we gained some useful insights and hints for further, representative, bigger studies, and for possible comparisons between different populations.

6 Related Work

We are not aware of other surveys on attitude similarity in argumentation, but surveys establishing human baselines exist for other purposes. [14] propose different measures for determining the similarity of words, and compare the measures with human ratings from a dataset created by [15]. They also think that the quality of a metric can best be determined by comparing it with human common sense. Their dataset contains absolute ratings from 0 (no similarity) to 4 (synonym) for 30 word pairs, each assessed by 38 subjects. We do not think that an absolute rating would have worked for our experiment. First, our argumentation scenarios can have fine-grained or large differences, which probably makes it hard for a person without a background in argumentation theory to map the difference onto a small absolute scale. Second, an absolute scale works well when one can grasp every pair to compare at once and revise earlier decisions to calibrate one's internal scale; this works well with short word pairs, but not with more complex descriptions of argumentations. In the context of word similarity, [6] find that "comparison with human judgments is the ideal way to evaluate a measure of similarity", which supports our initial assumption that gathering human judgments is important.
In [3], which is based on the study design of [15], 50 human subjects assessed the similarity of process descriptions on a scale from 1 to 5. They compared those assessments with the values of five metrics. Each subject had to indicate how they came to their decision for each comparison by choosing a strategy (e.g. "by process description") from a menu. We did not ask participants how they came to their assessments. Firstly, we think that reflecting on one's decision influences further decisions. We also think that writing a free-text description of one's decision process is too hard, and providing a menu with possible answers could have influenced subsequent decisions. Moreover, asking this for every question would have significantly increased the length of the questionnaire. Metrics and applications for comparing argumentations already exist, e.g. based on cosine similarity for opinion prediction [1], and for comparing one's own argumentation with others' by counting the number of agreements/disagreements on statements [10]. In both cases, no justification is given for why the similarity measure is a good choice. With our work, we want to fill that gap. For instance, we showed that simply counting agreements is not enough.

7 Conclusion and Future Work

We have conducted a survey with human subjects who had to assess the attitude similarity of argumentations. Our results are available for download and can be used as a basis when developing a metric for measuring attitude similarity in argumentation-based applications, e.g. for collaborative filtering. Our results help to transform human gut feeling into a mathematical metric. Some intuitive hypotheses were confirmed by our results, but there were also surprising results, e.g. that neutral is often not seen as falling on a line between pro and con. Our survey cannot establish "absolute truths", but we have collected first hints on what properties a metric matching human intuition should have.
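To make concrete what kind of measures such a human baseline can inform, the two existing similarity notions mentioned in Section 6 can be sketched as follows. This is our own minimal illustration, not the formulations of [1] or [10]: the stance encoding (+1 agree, −1 disagree per statement) and both function names are assumptions for the sake of the example.

```python
import math

def agreement_similarity(stance_a, stance_b):
    """Fraction of commonly rated statements on which both persons
    take the same stance; a simple agreement count in the spirit of
    [10] (the cited work's exact formulation differs)."""
    common = stance_a.keys() & stance_b.keys()
    if not common:
        return 0.0
    return sum(stance_a[s] == stance_b[s] for s in common) / len(common)

def cosine_similarity(stance_a, stance_b):
    """Cosine similarity between stance vectors over the union of
    statements; missing statements count as 0 (no opinion)."""
    keys = stance_a.keys() | stance_b.keys()
    dot = sum(stance_a.get(k, 0) * stance_b.get(k, 0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in stance_a.values()))
    norm_b = math.sqrt(sum(v * v for v in stance_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical stances on three statements:
person_a = {"wind power": 1, "coal exit": 1, "speed limit": -1}
person_b = {"wind power": 1, "coal exit": -1, "speed limit": -1}
print(agreement_similarity(person_a, person_b))  # → 2/3 ≈ 0.667
print(cosine_similarity(person_a, person_b))     # → 1/3 ≈ 0.333
```

Both measures look only at stances on statements and ignore the argumentative structure between them, which is one reason why, as argued above, counting agreements alone does not capture human intuition.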
In future work, we want to compare several metrics to see which properties they fulfill and how that matches human intuition. Moreover, further research is needed for hypotheses where we could not get clear results, and where there are turning points in trade-off scenarios. Also, more representative surveys and a comparison of different countries are needed.

References

1. Althuniyan, N., Sirrianni, J.W., Rahman, M.M., Liu, X.F.: Design of mobile service of intelligent large-scale cyber argumentation for analysis and prediction of collective opinions. In: International Conference on AI and Mobile Services. pp. 135–149. Springer (2019)
2. Berinsky, A.J., Huber, G.A., Lenz, G.S.: Using Mechanical Turk as a subject recruitment tool for experimental research (2011)
3. Bernstein, A., Kaufmann, E., Bürki, C., Klein, M.: How similar is it? Towards personalized similarity measures in ontologies. In: Wirtschaftsinformatik 2005, pp. 1347–1366. Springer (2005)
4. Brenneis, M., Behrendt, M., Harmeling, S., Mauve, M.: How Much Do I Argue Like You? Towards a Metric on Weighted Argumentation Graphs. In: Proceedings of the Third International Workshop on Systems and Algorithms for Formal Argumentation (SAFA 2020). pp. 2–13. No. 2672 in CEUR Workshop Proceedings, Aachen (Sep 2020)
5. Brenneis, M., Mauve, M.: deliberate – Online Argumentation with Collaborative Filtering. In: Computational Models of Argument. vol. 326, pp. 453–454. IOS Press (Sep 2020)
6. Budanitsky, A., Hirst, G.: Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In: Workshop on WordNet and Other Lexical Resources. vol. 2, pp. 2–2 (2001)
7. Dung, P.M.: On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence 77(2), 321–357 (1995)
8. Ebbinghaus, B., Mauve, M.: decide: Supporting Participatory Budgeting with Online Argumentation. In: Computational Models of Argument. Proceedings of COMMA 2020. Frontiers in Artificial Intelligence and Applications, vol. 326, pp. 463–464. IOS Press (Sep 2020)
9. Franklin, C.H.: The 'margin of error' for differences in polls. https://abcnews.go.com/images/PollingUnit/MOEFranklin.pdf (2007)
10. Gordon, T.F.: Structured consultation with argument graphs. In: From Knowledge Representation to Argumentation in AI. A Festschrift in Honour of Trevor Bench-Capon on the Occasion of his 60th Birthday. pp. 115–133 (2013)
11. Haenni, R.: Probabilistic argumentation. Journal of Applied Logic 7(2), 155–176 (2009)
12. Krauthoff, T., Meter, C., Mauve, M.: Dialog-Based Online Argumentation: Findings from a Field Experiment. In: Proceedings of the 1st Workshop on Advances in Argumentation in Artificial Intelligence. pp. 85–99 (Nov 2017)
13. Kunz, W., Rittel, H.W.J.: Issues as Elements of Information Systems, vol. 131. Citeseer (1970)
14. Li, Y., Bandar, Z.A., McLean, D.: An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 15(4), 871–882 (2003)
15. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1), 1–28 (1991)
16. Neugebauer, D.: DABASCO: Generating AF, ADF, and ASPIC+ Instances from Real-World Discussions. In: Computational Models of Argument. Proceedings of COMMA 2018. pp. 469–470 (2018)
17. Scott, A.J., Seber, G.A.: Difference of proportions from the same survey. The American Statistician 37(4a), 319–320 (1983)
18. Silvapulle, M.J., Sen, P.K.: Constrained Statistical Inference: Order, Inequality, and Shape Constraints, vol. 912. John Wiley & Sons (2011)