On the Impact of Features and Classifiers for Measuring
Knowledge Gain during Web Search - A Case Study
Wolfgang Gritz¹, Anett Hoppe¹,² and Ralph Ewerth¹,²

¹ TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany
² L3S Research Center, Leibniz University Hannover, Germany


Abstract
Search engines are normally not designed to support human learning intents and processes. The field of Search as Learning (SAL) aims to investigate the characteristics of a successful Web search with a learning purpose. In this paper, we analyze the impact of text complexity of Web pages on predicting knowledge gain during a search session. For this purpose, we conduct an experimental case study and investigate the influence of several text-based features and classifiers on the prediction task. We build upon data from a study of related work, where 104 participants were given the task to learn about the formation of lightning and thunder through Web search. We perform an extensive evaluation based on a state-of-the-art approach and extend it with additional features related to textual complexity of Web pages. In contrast to prior work, we perform a systematic search for optimal hyperparameters and show the possible influence of feature selection strategies on the knowledge gain prediction. When using the new set of features, state-of-the-art results are noticeably improved. The results indicate that text complexity of Web pages could be an important feature resource for knowledge gain prediction.

Keywords
Textual Complexity, Knowledge Gain, Search as Learning, Learning Resources, Web-based Learning



Proceedings of the CIKM 2021 Workshops, November 1–5, Gold Coast, Queensland, Australia
Email: wolfgang.gritz@tib.eu (W. Gritz); anett.hoppe@tib.eu (A. Hoppe); ralph.ewerth@tib.eu (R. Ewerth)
ORCID: 0000-0003-1668-3304 (W. Gritz); 0000-0002-1452-9509 (A. Hoppe); 0000-0003-0918-6297 (R. Ewerth)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Conventional information retrieval systems are usually designed to satisfy an information need. The research area Search as Learning (SAL), on the other hand, deals with the assumption that search sessions can also be driven by a learning intention. Research in the area of SAL is not only concerned with the ranking of search results, but also with the detection or prediction of the learning intention or even the knowledge state and knowledge gain [1, 2].

Vakkari [3] presented a survey of features which indicate the user's knowledge and learning needs, but also knowledge gain during the search process. More recently, a wide variety of features were considered, including resource-based (based on text or multimedia content) or behavioral features. For example, Syed and Collins-Thompson [4] have considered document retrieval features to improve learning outcome for short- and long-term vocabulary learning. Collins-Thompson et al. [5], on the other hand, have studied different query types and found a correlation between the variety of intrinsic query types and knowledge gain. Pardi et al. [6] further examined the time spent on Web pages with primarily textual or video content and learning outcome. One finding was that the time spent on text-based Web pages had a greater impact on knowledge gain than time spent on video-based Web pages. Gadiraju et al. [7] explored the influence of behavioral features on the learning outcome, and found a positive correlation between the average complexity of user queries and their knowledge gain. Recently, some approaches have been suggested that combine several types of features [8, 9]. For example, Otto et al. [9] studied the effect on knowledge gain prediction when complexity and linguistic features are complemented with multimedia features. They achieved slight improvements by adding multimedia features, e.g., representing the amount of image and video data on the screen or the image type (infographics, outdoor photography, etc.).

A crucial aspect of learning is the appropriateness of the text for the reader. In his survey, Collins-Thompson [10] has summarized studies that deal with the automatic assessment of the reading difficulty of texts. Hancke [11] has previously analyzed lexical, syntactic, and morphological features for German, while Kurdi [12] investigated features that allow for conclusions about the complexity of English texts.
In this paper, we investigate the influence of text complexity of Web pages on knowledge gain prediction in a comprehensive experimental case study. For this purpose, we present a large set of text-based features of various types and, furthermore, analyze the impact of different classifiers and feature selection strategies on knowledge gain prediction. First, the experimental results show that state-of-the-art results [9] can be significantly improved and, second, that the textual complexity of Web pages can be a valuable predictor for the classification of knowledge gain. Our contributions can be summarized as follows:

    • A large set of features describing the textual complexity of Web pages is presented.
    • We conduct an extensive, systematic evaluation including multiple classifiers, hyperparameter analysis and optimization, as well as feature selection strategies, and analyze their impact on knowledge gain prediction.
    • We demonstrate that the state-of-the-art results can be improved, even when only considering textual complexity features.

The remainder of this paper is structured as follows: In Section 2, the experimental setup and the feature extraction process are described. Experimental results are reported in Section 3, and the impact of text complexity features is analyzed. Finally, a summary of the main results and an outlook are given in Section 4.

2. Experimental Setup and Text-based Features

We use data from a study [13] in which participants were asked to acquire knowledge about the formation of thunder and lightning. The topic has already proven useful in previous work [14, 15]. It is a phenomenon that is generally known and requires both factual and procedural knowledge. On the Web, many sources exist on the subject, explaining it in diverse ways (texts, graphics, videos, etc.). The participants were asked to do a Web search for a maximum of 30 minutes, but were allowed to end the search earlier if they felt they had learned everything important. We could use data from N = 104 participants (88 female, 16 male, average age of 22.7 ± 2.7 years), for which the visited Web pages were downloaded during the experiment. The participants were students of the University of Tübingen, recruited via a local recruitment portal, and were compensated with 16€ per person for participating in the study. None of the participants had former expertise in meteorology.

2.1. Technical Setup of the Study

While plenty of data were collected during the study (data sources such as eye and mouse tracking information), here, we focus on the text content of the visited Web pages. During the Web search, all visited Web pages of the participants were tracked and recorded via the "ScrapbookX" (1.5.14)¹ and "ScrapbookXAutosave" (1.4.3)² plugins.

¹ https://github.com/danny0838/firefox-scrapbook
² https://github.com/danny0838/firefox-scrapbook-autosave

2.2. Knowledge Gain Measurement

To measure knowledge gain, the participants were asked to solve a 10-item multiple choice test one week before (t1) and immediately after (t2) the Web search. The knowledge gain is subsequently defined as the difference between the numbers of correct answers of t2 and t1. The potential range of values for the knowledge gain is therefore [−10, 10]. The average score was 5.24 ± 1.80 in t1 and 7.46 ± 1.43 in t2. The average knowledge gain was 2.22 ± 1.78 and lies in the range of [−3, 6].

2.3. Feature Extraction

In the study, the participants performed free Web searches, such that realistic search and browsing behavior could be recorded. Since we focus on the textual complexity of the visited pages, other page types like search engine result pages and video-based contents are filtered out. For this purpose, we used a keyword-based approach and omitted pages which contained the following keyterms in their URL: "google.", "youtu", "ecosia", "RDSIndex", "universitaetsbibliothek", "meteoros", "webcam" and "learningsnacks". For all remaining pages, we extracted all displayed text without further processing. As a consequence, elements such as tables or advertisements may be part of the analyzed texts. We decided against any further preprocessing in order to minimize the bias in the data set.
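As an illustration, the keyword-based filter can be sketched as follows; representing the recorded pages as (url, text) pairs is an assumption of this sketch and not part of the original ScrapbookX pipeline.

```python
# Minimal sketch of the keyword-based page filter described above.
# The (url, text) tuple representation is an assumption for illustration.

EXCLUDED_KEYTERMS = [
    "google.", "youtu", "ecosia", "RDSIndex",
    "universitaetsbibliothek", "meteoros", "webcam", "learningsnacks",
]

def keep_page(url: str) -> bool:
    """Return True if the URL does not contain any excluded keyterm."""
    return not any(term in url for term in EXCLUDED_KEYTERMS)

def filter_pages(pages):
    """Keep only text-based content pages (SERPs, video portals, etc. removed)."""
    return [(url, text) for url, text in pages if keep_page(url)]
```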

2.4. Website Features

To assess the complexity of text on Web pages, we extract eight different types of features:

    • syntactical features
    • readability scores
    • part of speech (POS) density
    • lexical richness
    • lexical variation
    • lexical sophistication
    • syntactic constituents features
    • connectives

Since the study was conducted in German, we mainly rely on the Common Text Analysis Platform (CTAP) tool [16], which currently provides 218 different complexity features for the German language. In total, we extract 248 features from each Web page. Below we give a short description of each feature group. For a complete overview consider the appendix³.

³ https://github.com/molpood/IWILDS_Complexity_Feature_List/
The syntactic features group consists of basic text statistics such as the number of letters, syllables, words, and sentences. Moreover, the average length of each element is considered, like sentence length in letters or word length in syllables, as well as the standard deviation. In addition, we calculate the average reading time of the Web pages by assuming 180 words per minute [17].

The second group of features consists of well-known readability scores that aim to estimate the skills a reader must have to understand the text. The features are based on combinations of the syntactic features (automated readability index (ARI), Coleman-Liau index, Flesch-Kincaid grade, Flesch reading ease) and partly on difficult or complex words. They are given either by a list (Dale-Chall readability score, Gunning fog) or by words with three or more syllables (SMOG index). For example, the formula for ARI is as follows:

    ARI = 4.71 · |characters| / |words| + 0.5 · |words| / |sentences| − 21.43

In the case of the ARI, the result is a human-interpretable numerical value on a scale of 1-14 (1: Kindergarten, 14: Professor).
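As an illustration, ARI can be computed from raw counts as in the following sketch; the naive tokenization is an assumption made here for brevity and differs from the CTAP implementation used in the paper.

```python
import re

def automated_readability_index(text: str) -> float:
    """Compute ARI = 4.71*(chars/words) + 0.5*(words/sentences) - 21.43.

    Tokenization is deliberately naive (whitespace words, sentence split on
    ., !, ?) and only meant to illustrate the formula.
    """
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = sum(len(w) for w in words)
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)
    return 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43
```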
The POS density group reflects the density of different word types like adjectives or verbs in the website text. It is based on the tokenization of the text and calculates the number of words of each word type (e.g., adjectives or verbs) in relation to all tokens, e.g.,

    density_adjectives = |adjectives| / |tokens|

The fourth group lexical richness is very similar. Here, the number of non-duplicated tokens is set in relation to all tokens. In addition to the fraction |types| / |tokens|, various variations such as the logarithm or square root are applied to the numerator and denominator.

The lexical variation group examines the subset of lexical words (LW) consisting of nouns, verbs, adjectives and adverbs. The class puts the number of individual components in relation to the number of lexical words, e.g., the lexical variation lv_adjectives for adjectives:

    lv_adjectives = |adjectives| / |LW|
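To make the two ratio types concrete, the following sketch computes the adjective POS density and lexical variation with an off-the-shelf German tagger; using spaCy and its small German model is purely an assumption for illustration, since the features in the paper are extracted with CTAP.

```python
import spacy

# German model name is an assumption for illustration; the actual features
# in the paper are computed with CTAP, not spaCy.
nlp = spacy.load("de_core_news_sm")

LEXICAL_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # lexical words (LW)

def adjective_ratios(text: str) -> dict:
    tokens = [t for t in nlp(text) if not t.is_space]
    n_tokens = max(len(tokens), 1)
    n_adjectives = sum(1 for t in tokens if t.pos_ == "ADJ")
    n_lexical = max(sum(1 for t in tokens if t.pos_ in LEXICAL_POS), 1)
    return {
        # POS density: word class relative to all tokens
        "density_adjectives": n_adjectives / n_tokens,
        # lexical variation: word class relative to lexical words only
        "lv_adjectives": n_adjectives / n_lexical,
    }
```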
The group of lexical sophistication features is based on different frequency lists [18, 19]. All words of the Web page text are assigned to sets of all words AW, lexical words LW (as mentioned before consisting of nouns, verbs, adjectives and adverbs) and functional words FW (i.e., not LW). The logarithmic or absolute frequency in the frequency lists (per million words) of AW, LW and FW is consequently used as a feature. Furthermore, the Karlsruhe Children's Text (KCT) [20] list is used to determine the average and minimum age of active use of AW, LW and FW.

The group of syntactic constituents consists of features that determine the number of different syntactic constituents, like noun phrases, relative clauses or T-units. Additionally, ratios to each other are calculated, e.g., noun phrases per T-unit, but also words per T-unit or noun phrases per sentence. Moreover, we consider the tenses in the text, based on Kurdi's [12] observation that there may be a connection between more difficult texts and more complex tenses. To extract the tenses, we use the tool of Dönicke [21].

The last group connectives (according to Breindl et al. [22]) examines units of the German language that express semantic relations between sentences. The connectives form a class consisting of subsets of defined parts of speech like conjunctions (and, or, etc.) or adverbs (in contrast, therefore, etc.). The absolute number of connectives, as well as ratios, such as multi-word connectives divided by single-word connectives, are calculated as features.

The eight groups consist of a total of 248 features that are calculated for each Web page visited during the search sessions. Since the participants accessed a different number of Web pages, we compute the average, the minimum and the maximum of each feature over each participant's visited pages. As a result, we obtain a total of 3 · 248 = 744 features for knowledge gain prediction.
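For illustration, the aggregation step could look as follows with pandas; the DataFrame layout and the column name "participant" are assumptions of this sketch.

```python
import pandas as pd

def aggregate_per_participant(page_features: pd.DataFrame) -> pd.DataFrame:
    """Aggregate page-level complexity features to participant level.

    `page_features` is assumed to hold one row per visited Web page with a
    'participant' column and 248 numeric feature columns; the result has
    3 * 248 = 744 columns (mean, min, max per feature).
    """
    aggregated = page_features.groupby("participant").agg(["mean", "min", "max"])
    # Flatten the (feature, statistic) MultiIndex into single column names.
    aggregated.columns = [f"{feat}_{stat}" for feat, stat in aggregated.columns]
    return aggregated
```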

3. Experimental Results

In this section, we report results for knowledge gain prediction using features for text complexity. For a fair comparison, we use the same evaluation setting including hyperparameter optimization for all experiments. In the same way, we replicate the results of Otto et al. [9]⁴ with our evaluation procedure.

⁴ Otto et al. [9] analyzed features for 113 participants. Technical issues with logging led to missing HTML data for nine participants, which were crawled at a later date. We rely on the data crawled during the original experiment, leading to N = 104 records for our analysis.

3.1. Knowledge Gain Definition

To categorize the measured knowledge gain, we use the common approach [7, 8, 9] to assign each search session to one of three classes C = {Low, Moderate, High} based on the Standard Deviation Classification approach. For this purpose, the knowledge gain X_i of participant i is z-normalized (X̂_i) according to Equation 1:

    X̂_i = (X_i − μ) / σ                                              (1)

Here, μ is the mean and σ is the standard deviation of all knowledge gain measures X. Then, for every z-normalized knowledge gain X̂_i the class is assigned as follows:

    C(X_i) := Low        if X̂_i < −1/2
              Moderate   if −1/2 ≤ X̂_i ≤ 1/2
              High       if X̂_i > 1/2
This yields the following class distribution: |X_Low| = 40, |X_Moderate| = 39, |X_High| = 25.
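For concreteness, the Standard Deviation Classification can be sketched as follows; the function name and the NumPy-based implementation are illustrative and not taken from the paper.

```python
import numpy as np

def standard_deviation_classification(knowledge_gain: np.ndarray) -> np.ndarray:
    """Assign Low / Moderate / High labels based on the z-normalized gain."""
    z = (knowledge_gain - knowledge_gain.mean()) / knowledge_gain.std()
    labels = np.full(z.shape, "Moderate", dtype=object)
    labels[z < -0.5] = "Low"
    labels[z > 0.5] = "High"
    return labels
```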
3.2. Metrics

To evaluate the classification results, we use precision, recall, F1 score, and accuracy. These are defined as follows:

    precision = TP / (TP + FP)                                        (2)
    recall = TP / (TP + FN)                                           (3)
    F1 score = 2 · (precision · recall) / (precision + recall)        (4)
    accuracy = (TP + TN) / (TP + TN + FP + FN)                        (5)

where TP are the values correctly classified as positive, TN are the values correctly classified as negative, FP are the values incorrectly classified as positive, and FN are the values incorrectly classified as negative.
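Since the evaluation relies on Scikit-learn, the per-class and macro scores reported later can be obtained as sketched below; the label arrays are illustrative only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["Low", "Moderate", "High", "Moderate", "Low"]   # illustrative labels
y_pred = ["Low", "Moderate", "Low", "Moderate", "High"]

# Per-class precision, recall and F1 score in a fixed class order.
per_class = precision_recall_fscore_support(
    y_true, y_pred, labels=["Low", "Moderate", "High"], zero_division=0
)
# Unweighted averages over the three classes ("macro scores" in Table 1).
macro = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)
```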
3.3. Experimental Setup

Cross-validation is a good way to evaluate the classification result, since every feature vector acts as a test sample in one fold. We thus choose a 5-fold cross-validation with an 80% train/validation and 20% test set split. This results in five elements per class in each test set in each iteration of the cross-validation.

We use min-max normalization to normalize each feature of the 80% to the interval [0, 1]. This is an essential step for some of the classifiers, e.g., Support Vector Machine. The 20% test set is then normalized by the minimum and maximum of the 80% for evaluation. It is possible that the values lie outside the interval of [0, 1]. However, we decide against clipping in order to not lose any information due to normalization. Figure 1 provides an overview of our proposed evaluation. In our evaluation, we use the implementation of Scikit-learn [23].

Figure 1: Overview of our evaluation method. A 5-fold cross-validation is performed and for each split the features are first normalized, optionally selected/reduced, and the hyperparameters of the respective classifier are optimized on the 80% train/validation data. The test data are scaled with the minimum and maximum of the train/validation data and optionally the features are filtered. Finally, the classifier optimized on the train and validation data is used to predict the knowledge gain on the test data set.
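The fold-wise scaling described above can be sketched as follows with Scikit-learn; `X` and `y` stand for the 744-dimensional feature matrix and the class labels, the untuned Random Forest is only a stand-in for the optimized classifier, and hyperparameter optimization and feature selection are omitted here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

def cross_validate(X: np.ndarray, y: np.ndarray, random_state: int = 0):
    """Sketch of the fold-wise evaluation: scale on train/val, apply to test."""
    predictions = np.empty_like(y)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    for train_idx, test_idx in skf.split(X, y):
        scaler = MinMaxScaler()  # no clipping: test values may fall outside [0, 1]
        X_train = scaler.fit_transform(X[train_idx])
        X_test = scaler.transform(X[test_idx])
        clf = RandomForestClassifier(random_state=random_state)  # stand-in classifier
        clf.fit(X_train, y[train_idx])
        predictions[test_idx] = clf.predict(X_test)
    return predictions
```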
3.3.1. Hyperparameter Optimization

The performance of classification algorithms strongly depends on the chosen hyperparameters. However, since the training, validation and test data change in each iteration due to cross-validation, these cannot be determined once and used for the entire evaluation. Therefore, to obtain valid results, we perform an optimization of the hyperparameters in each of the five iterations. We utilize Optuna [24] for a Bayesian search to efficiently find a good configuration and limit the number of runs to 500 to reduce the computational cost. From the 80% of the data coming from the 80:20 split of the cross-validation, another 80:20 split is performed, where 80% is training data and 20% is validation data. We set the maximization of the weighted F1 score as the optimization objective. This is to prevent the class imbalance from making the underrepresented class High less important, as it would be, for example, with overall accuracy.
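A condensed sketch of the per-fold search with Optuna for the Random Forest is given below; the search space shown is a plausible subset chosen for illustration and not necessarily the exact space used in the paper.

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def tune_random_forest(X_trainval, y_trainval, n_trials: int = 500):
    """Maximize the weighted F1 score on an inner 80:20 split of the fold's data."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.2, stratify=y_trainval, random_state=0
    )

    def objective(trial: optuna.Trial) -> float:
        # Illustrative search space; an assumption of this sketch.
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 50, 300),
            "max_depth": trial.suggest_int("max_depth", 2, 30),
            "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
            "criterion": trial.suggest_categorical("criterion", ["gini", "entropy"]),
            "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        }
        clf = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
        return f1_score(y_val, clf.predict(X_val), average="weighted")

    study = optuna.create_study(direction="maximize")  # TPE sampler by default
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```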
3.3.2. Feature Selection

The classification results may also depend on the number of input features (more is not always better). For example, in the Random Forest algorithm, a subset of the features is selected several times to create weak classifiers, and there is no guarantee that "good features" will prevail. For this reason, we want to reduce the number of features while trying to preserve valuable features. Again, it is important to separate the feature selection from the test data, which changes in each iteration. As with hyperparameter optimization, we use the further split into training and validation data to do this. It follows that the selected features may change in each iteration. For the selection of the features to be used for this evaluation, we rely on two strategies:

    1. χ²-based Feature Selection: This method examines whether a feature has a statistically significant relationship to knowledge gain. While one feature is analyzed for a relationship, all other features are ignored. The features with the N highest values based on the χ²-test are selected.
    2. Tree-based Feature Selection: Features without a direct correlation to the knowledge gain can be important predictors in combination with other features. For this reason, we employ a tree-based approach using a Random Forest classifier. This is fitted to the training data and then analyzed to see which features were most heavily used in the decision. The N values with the highest importance are selected. The goal is to select valuable features for the classification even without direct correlation.
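A minimal sketch of the two strategies with Scikit-learn follows; `n_features` is the number N of features to keep, the χ²-based variant assumes non-negative (min-max scaled) inputs, and the tree-based variant shown here simply ranks Random Forest feature importances, which approximates but may not exactly match the procedure used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

def chi2_selection(X_train, y_train, n_features: int):
    """chi²-based strategy: rank each feature independently against the labels."""
    selector = SelectKBest(score_func=chi2, k=n_features).fit(X_train, y_train)
    return selector.get_support(indices=True)

def tree_based_selection(X_train, y_train, n_features: int, random_state: int = 0):
    """Tree-based strategy: keep the features a Random Forest relied on most."""
    forest = RandomForestClassifier(random_state=random_state).fit(X_train, y_train)
    return np.argsort(forest.feature_importances_)[::-1][:n_features]
```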
3.3.3. Classifiers

Otto et al. [9] limit their evaluation to a Random Forest [25] classifier. In addition to that, we explore several alternative classifiers: Adaboost [26], Decision Tree [27], K-Nearest Neighbors [28], Multi-layer Perceptron [29], and Support Vector Machine [30]. The objective is to experimentally determine the best configuration in order to find the maximum potential for knowledge gain prediction, given the set of features.

3.4. Classifier Performance

In Table 1, we compare the performance for all classifiers. As baselines, we list the results for weighted guessing (WG), which is the mean of each metric for 10,000 randomly generated vectors consisting of class labels with respect to the class distribution, and the original reported results from Otto et al. [9] (Otto*). For a fair comparison with our features, we reproduced the results using the features from Otto et al. [9] with our pipeline (Otto). Furthermore, to analyze the performance for a feature set as diverse as possible, we combined the features of Otto et al. [9] and our proposed feature set for evaluation (Otto+our). For the cumulative predictions over all five iterations of cross-validation, the precision, recall, and F1 score are calculated for each class (Low, Moderate, and High), as well as the average of these metrics over all classes, and the overall accuracy.

Table 1
Results of the knowledge gain classification for the classes Low, Moderate and High for the classifiers (clf) Adaboost (Ada), Decision Tree (DT), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), Random Forest (RF) and Support Vector Machine (SVM), and for weighted guessing (WG). For the results reported by Otto et al. [9] (Otto*), the reproduced results (Otto), our results (our), and the combination of Otto et al.'s [9] and our features (Otto+our), precision (pre), recall (rec), F1 score (f1) and overall accuracy (accu) are reported.

                        Low               Moderate          High              macro scores
  set       clf   pre   rec   f1    pre   rec   f1    pre   rec   f1    pre   rec   f1    accu
  –         WG    38.4  38.6  38.4  37.4  37.3  37.2  24.0  24.0  23.8  33.3  33.3  33.1  34.6
  Otto*     RF    41.5  52.0  46.1  39.1  40.0  39.5  28.4  14.8  19.1  36.4  35.6  34.9  38.7
  Otto      Ada   42.1  40.0  41.0  35.4  43.6  39.1  16.7  12.0  14.0  31.4  31.9  31.4  34.6
  Otto      DT    40.0  50.0  44.4  40.5  38.5  39.5  23.5  16.0  19.0  34.7  34.8  34.3  37.5
  Otto      KNN   26.7  20.0  22.9  38.1  41.0  39.5  18.8  24.0  21.1  27.8  28.3  27.8  28.8
  Otto      MLP   41.2  35.0  37.8  46.3  48.7  47.5  34.5  40.0  37.0  40.7  41.2  40.8  41.3
  Otto      RF    30.2  40.0  34.4  32.6  35.9  34.1  12.5   4.0   6.1  25.1  26.6  24.9  29.8
  Otto      SVM   38.6  42.5  40.5  45.5  38.5  41.7  33.3  36.0  34.6  39.1  39.0  38.9  39.4
  our       Ada   42.1  40.0  41.0  41.9  46.2  43.9  34.8  32.0  33.3  39.6  39.4  39.4  40.4
  our       DT    39.0  40.0  39.5  41.7  38.5  40.0  29.6  32.0  30.8  36.8  36.8  36.8  37.5
  our       KNN   21.7  12.5  15.9  45.0  46.2  45.6  26.8  44.0  33.3  31.2  34.2  31.6  32.7
  our       MLP   38.9  35.0  36.8  55.9  48.7  52.1  26.5  36.0  30.5  40.4  39.9  39.8  40.4
  our       RF    40.5  42.5  41.5  51.3  51.3  51.3  34.8  32.0  33.3  42.2  41.9  42.0  43.3
  our       SVM   40.6  32.5  36.1  54.3  48.7  51.4  27.0  40.0  32.3  40.6  40.4  39.9  40.4
  Otto+our  Ada   42.3  55.0  47.8  54.8  43.6  48.6  42.9  36.0  39.1  46.7  44.9  45.2  46.2
  Otto+our  DT    53.1  42.5  47.2  42.9  38.5  40.5  27.0  40.0  32.3  41.0  40.3  40.0  40.4
  Otto+our  KNN   20.0  12.5  15.4  45.0  46.2  45.6  25.6  40.0  31.2  30.2  32.9  30.7  31.7
  Otto+our  MLP   25.0  12.5  16.7  41.3  66.7  51.0  28.6  24.0  26.1  31.6  34.4  31.2  35.6
  Otto+our  RF    37.5  45.0  40.9  45.0  46.2  45.6  37.5  24.0  29.3  40.0  38.4  38.6  40.4
  Otto+our  SVM   22.7  12.5  16.1  41.7  38.5  40.0  30.4  56.0  39.4  31.6  35.7  31.9  32.7

First, it is notable that the reproduced results of Otto et al. [9] (Otto) are better compared to their reported result (Otto*). The results of the Multi-layer Perceptron (MLP) provide a 5.9% higher F1 score (34.9% compared to 40.8%).
However, in direct comparison to the reproduced result with a Random Forest (RF), the original results are better. It is striking that the improved outcome stems mainly from better predictions for the class High. A closer look reveals that the recall scores for the tree-based classifiers Adaboost (Ada), Decision Tree (DT) and Random Forest (RF) are comparatively low. These algorithms seem to preferentially predict the more represented classes for the features of Otto et al. [9] and accept a worse result for the underrepresented class High. This impression is reinforced by the fact that, for all feature sets, the F1 score (f1) of these three classifiers is considerably worse for the class High than for the classes Low and Moderate. This is not the case for any of the other classifiers.

Nevertheless, Random Forest (RF) and Adaboost (Ada) perform best for the other feature sets (our and Otto+our). The RF using the features of textual complexity (our) yields a slightly better macro F1 score (42.0%) than the MLP using the features of Otto (41.2%). In addition, the RF achieves an overall accuracy of 43.3% while the MLP only achieves 40.6%. The best result is obtained by the Adaboost classifier for Otto+our with 45.2% macro F1 score and 46.2% overall accuracy. Examining the results for the Random Forest algorithm for all three feature sets, we notice that the F1 scores of all three classes for the combination of features are strictly between the F1 scores of the individual feature sets. At the same time, the F1 scores for the combination of features are all better than for the individual sets for Adaboost. We assume that the Random Forest algorithm is affected by too many (diverse) features. Adaboost can weight the features differently and thus utilize the strengths of both feature sets.

Another observation is that the F1 scores of all feature sets for the K-Nearest Neighbors (KNN) algorithm are significantly higher for the class Moderate than for the classes Low and High. Therefore, we suspect that search strategies with Low (or High) knowledge gain differ much more. Furthermore, we can observe that the F1 score for the class Moderate of our features is high compared to the classes Low and High, independent of the classifier. On closer inspection, we found that instances of the class Low are often classified as High and vice versa. If we put the classification results for the classes Low and High together, i.e., form a new class Not Moderate, we would get 74.1%, 70.8% and 73.1% F1 score for the classifiers MLP, RF and SVM, respectively, for this new class. It seems like the complexity features are useful to detect whether someone does not have a Moderate increase in knowledge gain. We plan to investigate this interesting aspect in the future.

For our textual complexity features, the best result was obtained with the Random Forest classifier. In each iteration of the 5-fold cross-validation, an independent hyperparameter optimization was performed. The optimized hyperparameters for each fold F1, ..., F5 are shown in Table 2. No pattern can be discovered in the parameters; they are very different in shape. This could possibly be related to the heterogeneity of the data and the weakness of the features for prediction.

Table 2
The optimized hyperparameters per fold F1, ..., F5 for the Random Forest classifier for our features.

                   F1      F2      F3      F4      F5
  estimators       242     299     154     150     223
  max_depth        22      17      8       17      17
  max_features     sqrt    log2    sqrt    log2    sqrt
  criterion        entr.   gini    sqrt    entr.   gini
  min_n_split      6       3       7       7       4
  min_n_leaf       5       8       3       8       7

3.5. Feature Selection

In Table 1, it is observable that the classification result for the Random Forest classifier (RF) is worse for the combination of features (Otto+our) than for the complexity-only features (our). It seems that considering more features does not necessarily improve the classification quality. The result for the Random Forest classifier (RF) for the textual complexity (our) features with N ∈ {1, 3, 5, ..., 99} selected features is shown in Figure 2. It can be seen that the classification result obtained with all features can already be achieved with fewer features, regardless of the feature selection strategy. With the χ²-based selection method, the result is also achieved with fewer features, but later than with the tree-based method.

Figure 2: Average F1 scores of the Random Forest classifier using N ∈ {1, 3, 5, ..., 99} of our features for the χ²-based (chi2) and the tree-based (tree) feature selection strategy. The result for all features is indicated with the dotted line.
This makes sense insofar as the χ²-based method considers the features independently of each other and only measures the individual correlation of a feature with knowledge gain. In contrast, the tree-based strategy selects features based on their importance for an upstream Random Forest. Thus, the baseline level can already be reached with N = 19 features.

Cross-validation is used for evaluation as described above (Section 3.3.1). Similarly, feature selection is performed five times. However, this implies that the features chosen in each iteration of the cross-validation may differ, which complicates the analysis of which features most influence the classification result. We therefore propose to highlight the features that were selected in at least three out of five iterations. Since the classification result of the Random Forest was already achieved with N = 17 features, we report the features based on this configuration. The features and their frequencies are shown in Table 3. Three features were selected at least three times, but none were selected in every iteration of the cross-validation. All three were aggregated by the minimum, indicating that the Web page with the lowest textual complexity is most important for the classification result. This strengthens the impression that the features or the aggregations (minimum, maximum and average) are too weak to provide a strong prediction of the knowledge gain. In the future, we aim to include more features and find aggregations that are more suitable to reflect search patterns.

Table 3
Features selected at least three out of five times during cross-validation by the tree-based selection strategy.

  type                              feature                                aggregation   count
  POS Density Feature               Subordinating Conjunction              min           4
  Lexical Sophistication Feature    SUBTLEX Word Frequency (LW Token)      min           4
  Syntactic Complexity Feature      Mean Length of Verb Cluster            min           3

In the previous section, it was observed that the F1 score for the class High is significantly below the values for the classes Low and Moderate, regardless of the feature set. We performed feature selection before hyperparameter optimization and repeated the evaluation with N ∈ {1, 3, 5, ..., 79} features. Figure 3 shows how the F1 score for the class High changes with a subset of the features of Otto et al. [9]. The green curve describes the F1 scores based on the tree-based feature selection strategy, which tries to select the most important features for classification. It is noticeable that almost any tested subset would have been more suitable than using the full feature set. Moreover, the curve does not change from N = 65 onward (the same observation holds for the classes Low and Moderate), which suggests that the tree-based feature selection strategy does not consider many features at all.

Figure 3: F1 scores for the class High for the features of Otto et al. [9] for N ∈ {1, 3, 5, ..., 79} features for the χ²-based (chi2) and the tree-based (tree) feature selection strategy. The result for all features is indicated with the dotted line.

4. Conclusions

In this paper, we have investigated the impact of textual complexity of Web pages on knowledge gain during a Web search. The experimental results demonstrated that the state of the art can be improved by only considering the textual complexity of Web pages. The results also showed that a systematic assessment of different hyperparameter settings, feature selection, and several classifiers is important – in particular, since the correlations between features and the target outcome are relatively weak. During the evaluation, it became apparent that as few as 17 features per iteration of cross-validation would have been sufficient to achieve the result. Furthermore, we found that a moderate knowledge gain can be predicted relatively well, but, interestingly, the distinction between successful and unsuccessful Web search does not work well (in terms of knowledge gain). The reasons for this effect have to be investigated in more detail.

Although we have obtained state-of-the-art results, there are some limitations.
In this case study, we analyzed only the data of a study on knowledge acquisition about a specific science topic, the formation of thunderstorms. Consequently, limited conclusions can be drawn about general Web searches, and the results need to be confirmed or extended by future studies. In this sense, the reported results need to be reproduced for (a) different types of learning tasks (e.g., procedural knowledge) and (b) conceptual learning tasks in other domains (e.g., non-science topics).

In the future, we would like to deepen our understanding of what behavioral patterns characterize effective Web searches, for instance, by examining how the sequence of Web pages (and their characteristics) influences learning success. An intuitive assumption is, for example, that a successful learning session consists of Web pages of increasing complexity. Furthermore, we have considered the textual complexity of the entire Web page, but the content of a Web page is not always read in its entirety. In future work, we would like to focus more on the content actually seen during the Web search.

Lastly, we focused on text-based Web pages in this case study. However, many of the Web searches were not unimodal but multimodal. Consequently, further investigations will need to include additional complexity measures such as the visual complexity of the Web pages or videos.

Acknowledgments

Part of this work is financially supported by the Leibniz Association, Germany (Leibniz Competition 2018, funding line "Collaborative Excellence", project SALIENT [K68/2017]).

References

[1] A. Hoppe, P. Holtz, Y. Kammerer, R. Yu, S. Dietze, R. Ewerth, Current challenges for studying search as learning processes, in: 7th Workshop on Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science, 2018.
[2] M. Machado, P. A. Gimenez, S. Siqueira, Raising the dimensions and variables for searching as a learning process: A systematic mapping of the literature, in: Anais do XXXI Simpósio Brasileiro de Informática na Educação, SBC, 2020, pp. 1393–1402.
[3] P. Vakkari, Searching as learning: A systematization based on literature, J. Inf. Sci. 42 (2016) 7–18. doi:10.1177/0165551515615833.
[4] R. Syed, K. Collins-Thompson, Exploring document retrieval features associated with improved short- and long-term vocabulary learning outcomes, in: C. Shah, N. J. Belkin, K. Byström, J. Huang, F. Scholer (Eds.), Proceedings of the 2018 Conference on Human Information Interaction and Retrieval, CHIIR 2018, New Brunswick, NJ, USA, March 11-15, 2018, ACM, 2018, pp. 191–200. doi:10.1145/3176349.3176397.
[5] K. Collins-Thompson, S. Y. Rieh, C. C. Haynes, R. Syed, Assessing learning outcomes in web search: A comparison of tasks and query strategies, in: D. Kelly, R. Capra, N. J. Belkin, J. Teevan, P. Vakkari (Eds.), Proceedings of the 2016 ACM Conference on Human Information Interaction and Retrieval, CHIIR 2016, Carrboro, North Carolina, USA, March 13-17, 2016, ACM, 2016, pp. 163–172. doi:10.1145/2854946.2854972.
[6] G. Pardi, J. von Hoyer, P. Holtz, Y. Kammerer, The role of cognitive abilities and time spent on texts and videos in a multimodal searching as learning task, in: H. L. O'Brien, L. Freund, I. Arapakis, O. Hoeber, I. Lopatovska (Eds.), CHIIR '20: Conference on Human Information Interaction and Retrieval, Vancouver, BC, Canada, March 14-18, 2020, ACM, 2020, pp. 378–382. doi:10.1145/3343413.3378001.
[7] U. Gadiraju, R. Yu, S. Dietze, P. Holtz, Analyzing knowledge gain of users in informational search sessions on the web, in: C. Shah, N. J. Belkin, K. Byström, J. Huang, F. Scholer (Eds.), Proceedings of the 2018 Conference on Human Information Interaction and Retrieval, CHIIR 2018, New Brunswick, NJ, USA, March 11-15, 2018, ACM, 2018, pp. 2–11. doi:10.1145/3176349.3176381.
[8] R. Yu, R. Tang, M. Rokicki, U. Gadiraju, S. Dietze, Topic-independent modeling of user knowledge in informational search sessions, Inf. Retr. J. 24 (2021) 240–268. doi:10.1007/s10791-021-09391-7.
[9] C. Otto, R. Yu, G. Pardi, J. von Hoyer, M. Rokicki, A. Hoppe, P. Holtz, Y. Kammerer, S. Dietze, R. Ewerth, Predicting knowledge gain during web search based on multimedia resource consumption, in: I. Roll, D. S. McNamara, S. A. Sosnovsky, R. Luckin, V. Dimitrova (Eds.), Artificial Intelligence in Education - 22nd International Conference, AIED 2021, Utrecht, The Netherlands, June 14-18, 2021, Proceedings, Part I, volume 12748 of Lecture Notes in Computer Science, Springer, 2021, pp. 318–330. doi:10.1007/978-3-030-78292-4_26.
[10] K. Collins-Thompson, Computational assessment of text readability: A survey of current and future research, ITL - International Journal of Applied Linguistics 165 (2014) 97–135. URL: https://www.jbe-platform.com/content/journals/10.1075/itl.165.2.01col. doi:10.1075/itl.165.2.01col.
[11] J. Hancke, S. Vajjala, D. Meurers, Readability classification for German using lexical, syntactic, and morphological features, in: M. Kay, C. Boitet (Eds.), COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India, Indian Institute of Technology Bombay, 2012, pp. 1063–1080. URL: https://aclanthology.org/C12-1065/.
[12] M. Kurdi, Lexical and syntactic features selection for an adaptive reading recommendation system based on text complexity, in: ICISDM '17, 2017.
[13] J. von Hoyer, G. Pardi, Y. Kammerer, P. Holtz, Metacognitive judgments in searching as learning (SAL) tasks: Insights on (mis-)calibration, multimedia usage, and confidence, in: Proceedings of the 1st International Workshop on Search as Learning with Multimedia Information, SALMM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 3–10. doi:10.1145/3347451.3356730.
[14] R. Mayer, R. Moreno, A split-attention effect in multimedia learning: Evidence for dual processing systems in working memory, Journal of Educational Psychology 90 (1998) 312–320.
[15] F. Schmidt-Weigand, K. Scheiter, The role of spatial descriptions in learning from multimedia, Comput. Hum. Behav. 27 (2011) 22–28. doi:10.1016/j.chb.2010.05.007.
[16] X. Chen, D. Meurers, CTAP: A web-based tool supporting automatic complexity analysis, in: D. Brunato, F. Dell'Orletta, G. Venturi, T. François, P. Blache (Eds.), Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity, CL4LC@COLING 2016, Osaka, Japan, December 11, 2016, The COLING 2016 Organizing Committee, 2016, pp. 113–119. URL: https://www.aclweb.org/anthology/W16-4113/.
[17] M. Ziefle, Effects of display resolution on visual performance, Hum. Factors 40 (1998) 554–568. doi:10.1518/001872098779649355.
[18] M. Brysbaert, M. Buchmeier, M. Conrad, A. Jacobs, J. Bölte, A. Böhl, The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German, Experimental Psychology 58(5) (2011) 412–424.
[19] E. L. Aiden, J. Michel, Culturomics: Quantitative analysis of culture using millions of digitized books, in: 6th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2011, Stanford, CA, USA, June 19-22, 2011, Conference Abstracts, Stanford University Library, 2011, p. 8. URL: http://xtf-prod.stanford.edu/xtf/view?docId=tei/ab-003.xml.
[20] R. Lavalley, K. Berkling, S. Stüker, Preparing children's writing database for automated processing, in: K. M. Berkling (Ed.), Language Teaching, Learning and Technology, Satellite Workshop of SLaTE-2015, LTLT@SLaTE 2015, Leipzig, Germany, September 4, 2015, ISCA, 2015, pp. 9–15. URL: http://www.isca-speech.org/archive/ltlt_2015/lt15_009.html.
[21] T. Dönicke, Clause-level tense, mood, voice and modality tagging for German, in: K. Evang, L. Kallmeyer, R. Ehren, S. Petitjean, E. Seyffarth, D. Seddah (Eds.), Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories, TLT 2020, Düsseldorf, Germany, October 27-28, 2020, Association for Computational Linguistics, 2020, pp. 1–17. doi:10.18653/v1/2020.tlt-1.1.
[22] E. Breindl, A. Volodina, U. H. Waßner, Handbuch der deutschen Konnektoren 2, De Gruyter, 2014. doi:10.1515/9783110341447.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[24] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, G. Karypis (Eds.), Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, ACM, 2019, pp. 2623–2631. doi:10.1145/3292500.3330701.
[25] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32. doi:10.1023/A:1010933404324.
[26] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139. doi:10.1006/jcss.1997.1504.
[27] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.
[28] E. Fix, J. L. Hodges, Discriminatory analysis - nonparametric discrimination: Consistency properties, International Statistical Review 57 (1989) 238.
[29] F. Rosenblatt, Principles of neurodynamics. Perceptrons and the theory of brain mechanisms, American Journal of Psychology 76 (1963) 705.
[30] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297. doi:10.1007/BF00994018.