Predicting responses of individual reasoners in syllogistic
        reasoning by using collaborative filtering

                                Ilir Kola1 and Marco Ragni1
       1
           Cognitive Computation Lab, University of Freiburg, 79110 Freiburg, Germany
                        kola@informatik.uni-freiburg.de
                        ragni@informatik.uni-freiburg.de


       Abstract. A syllogism consists of two premises each containing one of four
       quantifiers (All, Some, Some not, None) and two out of three objects totaling in
       64 reasoning problems. The task of the participants is to draw or evaluate a
       conclusion, given the premise information. Most, if not all cognitive theories
       for syllogistic reasoning, focus on explaining and sometimes predicting the ag-
       gregated response pattern for participants of a whole psychological experiment.
       While only few theories focus on the level of an individual reasoner that might
       have a specific mental representation that explains her response pattern. If dif-
       ferent reasoners can be grouped into similar answer patterns then it is possible
       to identify even cognitive styles that depend on the underlying representation.
       To test the idea of individual predictions, we start by developing a pair-wise
       similarity function based on the subjects’ answers to the task. For 10% of the
       subjects, we randomly delete 15% of their answers. By using collaborative fil-
       tering techniques, we check whether it is possible to predict the deleted answers
       of a specific individual solely by using the answers given by similar subjects to
       those specific questions. Results show that not only the correct answer is pre-
       dicted in around 70% of the cases, and the answer is in the top two predictions
       in 89% of the cases, which outperforms other theoretical approaches, but the
       predictions are as well accurate for cases where participants deviate from the
       correct answer. This implies that there are cognitive principles responsible for
       the patterns. If these principles are identified, then there is no need for complex
       models, because even simple ones can achieve high accuracy. This supports that
       individual performance in reasoning tasks can be predicted leading to a new
       level of cognitive modeling.

       Keywords: computational reasoning, individual differences, syllogisms, col-
       laborative filtering, machine learning


1      Introduction

Reasoning problems have been studied in such diverse disciplines as psychology,
philosophy, cognitive science, as well as in computer science. From an artificial intel-
ligence perspective, modeling human reasoning is crucial if we want to have artificial
agents which help us in everyday life. To be successful at this, it is important to un-
derstand that each individual can have a different reasoning pattern. Sometimes devia-
2


tions of the individual participants from the norms of classical logic have led to a
qualification of such reasoners as rather irrational (e.g., [27]). Another possibility is
that there is a so-called bounded rationality [6]. An indicator could be that these “de-
viators” are inherently consistent in their answers and even more that their answers
can be predicted. Most previous work has focused on overall distribution of answers,
trying to predict the most chosen answer by subjects. However, as noted by Pachur
and colleagues [18], in presence of individual differences, tests of group mean differ-
ences can be highly misleading. For this reason, we focus on individual subjects and
try to predict the exact answer they would give.
   Collaborative filtering, a method employed in recommender systems [23], can
show that a single reasoner does not deviate from similar reasoners, and that conse-
quently her answers can be predicted based on answers of the similar reasoners.
   The rest of this paper is structured as follows: first, we give an introduction to theo-
ries on reasoning and individual differences, syllogistic reasoning, and recommender
systems. Then, we present a model which uses collaborative filtering to predict an-
swers in the syllogistic reasoning task and compare it to other models or theoretical
predictions. Lastly, we draw conclusions and suggest further steps for research.


2      Background

2.1    Theories on reasoning and individual differences in reasoning

Scientist have tried to understand human reasoning for a long time. Up to date, there
are at least five more prominent theories on how people reason. These theories are
based on heuristics [3,4,6,14] mental logic [24,25], pragmatic reasoning schemas [2],
mental models [8], and probability theory [15]. Oaksford and Chater [16] offer a gen-
eral review of these theories.
   The need for all these theories is caused by the fact that people differ in how they
answer to reasoning tasks. Theories usually aim at explaining general answering pat-
terns, but if we focus on individual answers then these differences are even more vast.
These differences can be caused by intellectual abilities, memory capacity, strategies
being used, among others [17, 26].

2.2    Syllogistic reasoning

In a syllogistic task, subjects are presented with two premises, and they have to evalu-
ate what follows or whether a third given conclusion necessarily follows. Consider the
following example [12]:

Some Artists are Bakers,
All Bakers are Chemists.
Therefore, some Artists are Chemists.

   Each premise can have four possible moods, two of which are affirmative (Some,
All), and two are negative ones (Some not, No). The premises have two terms each,
                                                                                      3


but overall only three terms are used. This is because the first two premises always
share a common term (in this case bakers), and the third premise asks about the re-
maining two terms (artists and chemists). Terms can have four figures, based on their
configuration:
                 Figure 1       Figure 2        Figure 3        Figure 4
                   A-B            B-A             A-B             B-A
                   B-C            C-B             C-B             B-C

   Since each premise can have four moods, and there are four possible figures, there
can be 64 distinct pairs of premises. 27 of them have a conclusion which is valid in
classical logic, whereas for the remaining 37 there is no valid conclusion. The conclu-
sion (a third statement) allows again four possible moods, and two figures (A-C or C-
A), so overall there are 512 syllogisms that can be evaluated.
   Studies using syllogisms with different forms of content from abstract to realistic
one have shown that errors are not random, but are systematically according to two
main factors: figure and mood (see [5]). Syllogistic reasoning has caught the attention
of many researchers. Khemlani and Johnson-Laird [12] provide a review of seven
theories of syllogistic reasoning. We will describe the ones which perform better in
the meta-analysis, and they will be later used as a baseline for the performance of our
model.
   The first theory, illicit conversions [1,22] is based on a misinterpretation of the
quantifiers interpreting All B are A when given All A are B and Some B are not A
when told Some A are not B. Both these conversions are logically invalid, and lead to
errors such as inferring All c are a given the premises All A are B and All C are B. In
order to predict the answers of syllogisms, this theory uses classical logic conversions
and operators, as well as the two aforementioned invalid conversions.
   The verbal models theory [20] claim that reasoners built verbal mental models
from syllogistic premises and either formulates a conclusion or declares that nothing
follows. The model then performs a reencoding of the information based on the in-
formation that the converse of the quantifiers Some and No are valid. In another ver-
sion, the model also reencodes invalid conversions. The authors argue that a crucial
part of deduction is the linguistic process of encoding and reencoding the information,
rather than looking for counterexamples.
   Unlike the previous example, mental models (formulated for syllogisms first in [7])
are inspired by the use of counterexamples. The core idea is that individuals under-
stand that a putative conclusion is false if there is a counterexample to it. The theory
states that when being faced with a premise, individuals build a mental model of it
based on meaning and knowledge. E.g. when given the premise All Artists are Bee-
keepers the following model is built:

  Artist Beekeeper
  Artist Beekeeper
        …
4


   Each row represents the properties of an individual, and the ellipsis denotes indi-
viduals which are not artists. This model can be fleshed out to an explicit model
which contains information on all potential individuals:

    Artist Beekeeper
    Artist Beekeeper
           Beekeeper

   In a nutshell, the theory states that many individuals simply reach a conclusion
based on the first implicit model, which can be wrong (in this case it would give the
impression that All Beekeepers are Artists). However, there are individuals who built
other alternative models in order to find counterexamples, which usually leads to a
logically correct answer.

2.3    Collaborative filtering and recommender systems

Recommender systems are software tools used to provide suggestions for items which
can be useful to users [23]. One way to implement a recommender system is through
collaborative filtering. In a nutshell, collaborative filtering suggests that if Alice likes
items 1 and 2, and Bob likes items 1, 2 and 3, then Alice also probably likes item 3.
More formally, in collaborative filtering we look for patterns in observed preference
behavior, and try to predict new preferences based on those patterns. Users’ prefer-
ences for the items are stored as a matrix, in which each row represents a user and
each column represents an item. Then, for each user we build a similarity function to
see who are the users which have similar preferences. This means, for each user we
have a neighborhood of other users similar to them. Then, when a certain item has not
been rated by our user, we rely on this neighborhood to see how would our user rate
that item. If the rate would be high enough, we can recommend that item to the user.


                        Fig. 1. Users’ ratings represented as a matrix


   The main challenge in this case would be to select the appropriate similarity func-
tion, and to determine the adequate size of the neighborhood.
                                                                                        5


3      Predicting performance in syllogistic reasoning by using
       collaborative filtering

3.1    Motivation

As aforementioned, people make mistakes when solving reasoning tasks such as syl-
logisms. When it comes to preference behavior, we have seen that collaborative filter-
ing can achieve very good results in predicting which items to recommend to users.
This shows that people are consistent in their preferences. Could it be the case that
people are also consistent in the way they perform in reasoning tasks, and can we
predict their answers (including errors) in the aforementioned reasoning domains? We
will explore this by using collaborative filtering to predict participants’ behavior in
reasoning tasks.

3.2    The experimental setting

For this model, we will use an unpublished data set from an online experiment con-
ducted at the Cognitive Computation Lab (University of Freiburg). It includes data
from 140 subjects which completed all 64 syllogistic tasks. Each subject was present-
ed with two premises, and had to choose between nine answer options (the eight
mood/figure combinations, plus the ninth option being No Valid Conclusion).

3.3    The model
In our setting, the users are the 140 subjects of the study, and the items are the 64
tasks. We define the similarity function as follows:
                                       𝑛𝑠𝑎𝑚𝑒𝐴𝑛𝑠𝑤𝑒𝑟𝑠
                               𝑠𝑖𝑚 =                                              (1)
                                            𝑁

where N represents the amount of questions which were answered by both subjects.
As we can see, similarity is a function between 0 and 1.
   We start by randomly selecting 14 subjects for which there exists at least one other
subject with a similarity of 0.6 or higher, and then randomly deleting 10 of their an-
swers. These will be the answers which have to be predicted.
   The model computes the pair-wise similarities between subjects, and then whenev-
er for the current subject there is a missing answer, it identifies all subjects in its
neighborhood (i.e., subjects with a similarity higher than 0.35) which have answered
that task, and performs a “weighted voting” as following:

for answer in possible_answers:
   for user in users:
      value[answer]=value[answer]+sim[user]*given[user]

where sim[user] represents the similarity of our subject with the user which we
are currently computing, and given[user] is a binary attribute showing whether
6


the user gave this answer to the task or not. We perform this weighting inspired by the
intuition that answers given by more similar subjects should matter more. Then, the
answer with the highest value is the predicted one.

3.4    Results

The model is very simple, and it does not include any learning, its performance is,
however, fairly accurate. It is important to notice that the model predicts one out of
nine possible options, so a model which is simply guessing would be on average cor-
rect in about 11% of times. Our model compares the predicted answers to the true
ones, and reports the percentage of correctly predicted answers.
   In order to interpret the result better it would be useful to compare the performance
of our model to other models or theoretical predictions. As we already stated, most
theories do not focus on individual answer predictions, but on most chosen answers.
For example, a theory can state that for the premises All A are B, Some B are C then
people draw the answers Some A are C, Some C are A or All A are C. We try to see
what these theories would predict for our individual missing answers, and we use the
relaxation that if the missing answer is one of the predicted answers from the theory,
then it is counted as correct. We notice that this is quite a big relaxation, since there
are theories which predict three to four answers for the same pair of premises, which
means they would of course achieve a better accuracy than our model which always
predicts just one answer. We calculate the accuracy of the predictions of theories
based on illicit conversions, verbal models, and mental models, as well as the predic-
tions of mReasoner, an implementation of the mental models theory.
   One thing to keep in mind is that for some syllogisms there is more than one valid
answer, however subjects could select in our experiment only one answer. This can
cause a difficulty for our comparison as we need to deal with cognitive theories that
often predict up to four or five answers per syllogism. For this reason, we construct
two other versions of our model. Instead of predicting only one answer, we checked
what would be the accuracy of the prediction if we predict the top two and top three
most voted answers. We repeat the procedure 100, 300 and 500 times (to check if
results converge); the results are reported in Table 1:

              Exact      Top 2      Top 3       IC         VM         MM        mReasoner
100 runs      0.68       0.89       0.95       0.61        0.77       0.95        0.87
300 runs      0.69       0.88       0.95       0.62        0.77       0.95        0.87
500 runs      0.69       0.89       0.95       0.62        0.77       0.95        0.87

Table 1. The accuracy of the cognitive theories in predicting missing answers. The reported
results are average accuracies over 100, 300, and 500 runs. (Exact, Top2 and Top3 refer to our
model producing 1, 2 and 3 answers; IC refers to Illicit Conversions; VM refers to Verbal
Models; MM refers to Mental Models)
                                                                                       7


3.5    Discussion

The results show that our model which predicts the exact answer does not only per-
form reliably better than chance, but even manages to outperform the theoretical pre-
dictions based on illicit conversions, which for almost half of the syllogisms predicts
more than one answer. Furthermore, we notice that our model with the two most vot-
ed predictions outperforms the predictions of the verbal models as well as of mRea-
soner, which is right now one of the state-of-the-art predictors for syllogistic reason-
ing. Another thing which is important to notice is that our model reaches the same
accuracy even if we delete 32 (out of the 64) answers for up to 50% of the partici-
pants, showing robust performance.
    We notice that the top performance is achieved by the predictions made by the
mental models theory. However, it is important to notice that for almost half of the
syllogisms this theory predicts four or even five answers, which means it has an ad-
vantage for this type of metric. Still, our model which predicts the top 3 answers (still
less than the mental models predictions) achieves the same performance.
    mReasoner is an implementation which is based on mental models, but it has some
parameters which limit the number of predicted answers for each syllogism (it pre-
dicts one answer for 7 syllogisms, and more than two for 16 syllogisms). In this com-
parison, we used the default setting for mReasoner, and we see that our model which
predicts the top two answers has a better performance.
    Khemlani and Johnson-Laird [13] propose a model where mReasoner learns pa-
rameters for individual subjects in a small dataset consisting of 20 participants, and
then simulates the answers of each subject and compares them to the true answers.
They report a mean correlation to the data of 0.7, which means on average in 70% of
the cases mReasoner made the right prediction. This result is comparable to our basic
model, but built on general cognitive principles. Both approaches differ in their meth-
odology: Our approach requires participants data to classify and predict other reason-
ers and does not have cognitive principles, while on the other hand mReasoners is
built on cognitive principles but trains the system parameters on the whole dataset, so
it is not actually predicting. A combination of both methods to reach a “prediction”
based on cognitive principles is important.


4      Conclusions and future steps


These results show that collaborative filtering can help in predicting individual per-
formance for reasoning tasks, but also that there are new challenges (especially by the
performance boost when considering the top two predictions). First of all, it will be
interesting to test the same model with data from other reasoning domains, e.g., the
Wason selection task [28]. This would allow us to test for consistency across different
reasoning domains. Secondly, as we mentioned the model is simple, it would be inter-
esting to build a more adaptive model which learns from the subjects’ answers and
can identify cognitive principles. This could be achieved by analyzing potential rea-
8


sons for differences in performance, combined to using more advanced techniques
from machine learning to build the recommender system.
   One alternative would be to formalize the tasks by using ternary logic, and then
learn how different subjects map logical operators to truth tables. Ternary logic has
shown to provide high flexibility in modeling Wason’s selection task [21]. Another
alternative would be to include theories’ predictions to the task, and check whether a
subject is consistent with the predictions of a certain theory (i.e. we would find simi-
larities with theoretical predictions rather than with other subjects). This would also
help us for cases where there are not enough subjects to build informative similarity
functions among them.
   We tried to use machine learning techniques to cluster the data in order to identify
potential reasoning profiles, however the dataset seems to be too diverse. A method
called fcclusterdata, a hierarchal clustering technique from the sckit-learn package
[19] in Python, identifies more than 40 clusters (for the 140 participants), whereas by
using the k-medoids technique, in which we can specify the number of clusters, for up
to 6 clusters the similarity of subjects in the cluster remains low and we do not
achieve better performance. Studies [17, 26] have identified reasons which might lead
to individual differences such as level of intellect, memory capacity etc. Our intuition
is that although these reasons are similar for different individuals, the way they are
presented in people makes it difficult to create clusters. For example, an individual
can have high intellectual capacity but bad memory, another one medium intellectual
capacity and very good memory, and so on. This is why we think that an approach
which focuses on finding similar reasoners for each individual can be more effective.
   Reasoners are relatively consistent in their performance in syllogistic reasoning,
since some tend to give similar answers and often predictable mistakes. This means it
is possible to build reasoning models which can identify a person’s reasoning pattern,
and exploit it to better understand the overall reasoning process. This is exactly what
our simple model does, and in its relaxed version it manages to be as good as state of
the art complex reasoning models.


References
    1. Chapman, L. J., & Chapman, J. P. Atmosphere effect re-examined. Journal of experi-
       mental psychology, 58(3), 220. (1959)
    2. Cheng, P. W., & Holyoak, K. J. Pragmatic reasoning schemas. Cognitive psychology,
       17(4), 391-416. (1985)
    3. Evans, J. St. B. T. Heuristic and analytic processes in reasoning. British Journal of Psy-
       chology, 75, 451-468. (1984)
    4. Evans, J. St. B. T. Bias in human reasoning: Causes and consequences. Hillsdale, NJ:
       Erlbaum. (1989)
    5. Evans, J. S. B., Newstead, S. E., & Byrne, R. M. Human reasoning: The psychology of de-
       duction. Psychology Press. (1993)
    6. Gigerenzer, G., & Hug, K. Domain-specific reasoning: Social contracts, cheating, and per-
       spective change. Cognition, 43(2), 127-171. (1992)
    7. Johnson-Laird, P. N. Models of deduction. Reasoning: Representation and process in chil-
       dren and adults, 7-54. (1975)
                                                                                             9


 8. Johnson-Laird, P. N. Mental models: Towards a cognitive science of language, inference,
    and consciousness (No. 6). Harvard University Press. (1983)
 9. Johnson-Laird, P. N., & Steedman, M. The psychology of syllogisms. Cognitive psycholo-
    gy 10.1: 64-99. (1978)
10. Johnson-Laird, P. N., & Wason, P. C. A theoretical analysis of insight into a reasoning
    task. Cognitive Psychology, 1(2), 134-148. (1970)
11. Kaufman, L., & Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster
    Analysis, Wiley New York Google Scholar. (1990)
12. Khemlani, S., & Johnson-Laird, P. N. Theories of the syllogism: A meta-analysis. Psycho-
    logical Bulletin, Vol 138(3), May 2012, 427-457. (2012)
13. Khemlani, S., & Johnson-Laird, P. N. How people differ in syllogistic reasoning. In Pro-
    ceedings of the 36th Annual Conference of the Cognitive Science Society. Austin, TX:
    Cognitive Science Society. (2016)
14. Newell, A., & Simon, H. A. Human problem solving (Vol. 104, No. 9). Englewood Cliffs,
    NJ: Prentice-Hall. (1972)
15. Oaksford, M., & Chater, N. A rational analysis of the selection task as optimal data selec-
    tion. Psychological Review, 101, 608-631. (1994)
16. Oaksford, M., & Chater, N. Theories of reasoning and the computational explanation of
    everyday inference. Thinking & Reasoning, 1(2), 121-152. (1995)
17. Oberauer, K., Süß, H. M., Wilhelm, O., & Sander, N. Individual differences in working
    memory capacity and reasoning ability. Variation in working memory, 49-75. (2007)
18. Pachur, T., Bröder, A., & Marewski, J. N. The recognition heuristic in memory‐based in-
    ference: is recognition a non‐compensatory cue? Journal of Behavioral Decision Making,
    21(2), 183-210. (2008)
19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... &
    Vanderplas, J. Scikit-learn: Machine learning in Python. Journal of Machine Learning Re-
    search, 12(Oct), 2825-2830. (2011)
20. Polk, T. A., & Newell, A. Deduction as verbal reasoning. Psychological Review, 102(3),
    533. (1995)
21. Ragni, M., Dietz, E. A., Kola, I., & Hölldobler, S. Two-Valued Logic is Not Sufficient to
    Model Human Reasoning, but Three-Valued Logic is: A Formal Analysis. Bridging 2016
    – Bridging the Gap between Human and Automated Reasoning, 1651:61–73. (2016)
22. Revlis, R. Two models of syllogistic reasoning: Feature selection and conversion. Journal
    of Verbal Learning and Verbal Behavior, 14(2), 180-195. (1975)
23. Resnick, P., & Varian, H. R. Recommender systems. Communications of the ACM, 40(3),
    56-58. (1997)
24. Rips, L. J. Cognitive processes in propositional reasoning. Psychological review, 90(1),
    38. (1983)
25. Rips, L. J. The psychology of proof: Deductive reasoning in human thinking. MIT Press.
    (1994)
26. Stanovich, K. E., & West, R. F. Individual differences in rational thought. Journal of ex-
    perimental psychology: general, 127(2), 161. (1998)
27. Tversky, A., & Kahneman, D. Availability: A heuristic for judging frequency and proba-
    bility. Cognitive psychology, 5(2), 207-232. (1973)
28. Wason, P. C. Reasoning. New Horizons in Psychology. pp. 135-151. (1966)