=Paper=
{{Paper
|id=Vol-3052/paper6
|storemode=property
|title=On the Impact of Features and Classifiers for Measuring Knowledge Gain during Web Search - A Case Study
|pdfUrl=https://ceur-ws.org/Vol-3052/paper6.pdf
|volume=Vol-3052
|authors=Wolfgang Gritz,,Anett Hoppe,,Ralph Ewerth
|dblpUrl=https://dblp.org/rec/conf/cikm/GritzHE21
}}
==On the Impact of Features and Classifiers for Measuring Knowledge Gain during Web Search - A Case Study==
Wolfgang Gritz¹, Anett Hoppe¹,² and Ralph Ewerth¹,²

¹ TIB – Leibniz Information Centre for Science and Technology, Hannover, Germany
² L3S Research Center, Leibniz University Hannover, Germany

wolfgang.gritz@tib.eu (W. Gritz); anett.hoppe@tib.eu (A. Hoppe); ralph.ewerth@tib.eu (R. Ewerth)
ORCID: 0000-0003-1668-3304 (W. Gritz); 0000-0002-1452-9509 (A. Hoppe); 0000-0003-0918-6297 (R. Ewerth)

Proceedings of the CIKM 2021 Workshops, November 1–5, Gold Coast, Queensland, Australia

Abstract

Search engines are normally not designed to support human learning intents and processes. The field of Search as Learning (SAL) aims to investigate the characteristics of a successful Web search with a learning purpose. In this paper, we analyze the impact of text complexity of Web pages on predicting knowledge gain during a search session. For this purpose, we conduct an experimental case study and investigate the influence of several text-based features and classifiers on the prediction task. We build upon data from a study of related work, where 104 participants were given the task to learn about the formation of lightning and thunder through Web search. We perform an extensive evaluation based on a state-of-the-art approach and extend it with additional features related to textual complexity of Web pages. In contrast to prior work, we perform a systematic search for optimal hyperparameters and show the possible influence of feature selection strategies on the knowledge gain prediction. When using the new set of features, state-of-the-art results are noticeably improved. The results indicate that text complexity of Web pages could be an important feature resource for knowledge gain prediction.

Keywords: Textual Complexity, Knowledge Gain, Search as Learning, Learning Resources, Web-based Learning

===1. Introduction===

Conventional information retrieval systems are usually designed to satisfy an information need. The research area Search as Learning (SAL), on the other hand, deals with the assumption that search sessions can also be driven by a learning intention. Research in the area of SAL is not only concerned with the ranking of search results, but also with the detection or prediction of the learning intention or even the knowledge state and knowledge gain [1, 2].

Vakkari [3] presented a survey of features which indicate the user's knowledge and learning needs, but also the knowledge gain during the search process. More recently, a wide variety of features were considered, including resource-based (based on text or multimedia content) or behavioral features. For example, Syed and Collins-Thompson [4] have considered document retrieval features to improve the learning outcome for short- and long-term vocabulary learning. Collins-Thompson et al. [5], on the other hand, have studied different query types and found a correlation between the variety of intrinsic query types and knowledge gain. Pardi et al. [6] further examined the time spent on Web pages with primarily textual or video content and the learning outcome. One finding was that the time spent on text-based Web pages had a greater impact on knowledge gain than time spent on video-based Web pages. Gadiraju et al. [7] explored the influence of behavioral features on the learning outcome and found a positive correlation between the average complexity of user queries and their knowledge gain. Recently, some approaches have been suggested that combine several types of features [8, 9]. For example, Otto et al. [9] studied the effect on knowledge gain prediction when complexity and linguistic features are complemented with multimedia features. They achieved slight improvements by adding multimedia features, e.g., representing the amount of image and video data on the screen or the image type (infographics, outdoor photography, etc.).
A crucial aspect of learning is the appropriateness of the text for the reader. In his survey, Collins-Thompson [10] summarized studies that deal with the automatic assessment of the reading difficulty of texts. Hancke et al. [11] have previously analyzed lexical, syntactic, and morphological features for German, while Kurdi [12] investigated features that allow for conclusions about the complexity of English texts.

In this paper, we investigate the influence of text complexity of Web pages on knowledge gain prediction in a comprehensive experimental case study. For this purpose, we present a large set of text-based features of various types and, furthermore, analyze the impact of different classifiers and feature selection strategies on knowledge gain prediction. First, the experimental results show that state-of-the-art results [9] can be significantly improved and, second, that the textual complexity of Web pages can be a valuable predictor for the classification of knowledge gain. Our contributions can be summarized as follows:

* A large set of features describing the textual complexity of Web pages is presented.
* We conduct an extensive, systematic evaluation including multiple classifiers, hyperparameter analysis and optimization, as well as feature selection strategies, and analyze their impact on knowledge gain prediction.
* We demonstrate that state-of-the-art results can be improved, even when only considering textual complexity features.

The remainder of this paper is structured as follows: In Section 2, the experimental setup and the feature extraction process are described. Experimental results are reported in Section 3, and the impact of text complexity features is analyzed. Finally, a summary of the main results and an outlook are given in Section 4.

===2. Experimental Setup and Text-based Features===

We use data from a study [13] in which participants were asked to acquire knowledge about the formation of thunder and lightning. The topic has already proven useful in previous work [14, 15]. It is a phenomenon that is generally known and requires both factual and procedural knowledge. On the Web, many sources exist on the subject, explaining it in diverse ways (texts, graphics, videos, etc.). The participants were asked to do a Web search for a maximum of 30 minutes, but were allowed to end the search earlier if they felt they had learned everything important. We could use data from N = 104 participants (88 female, 16 male, average age of 22.7 ± 2.7 years), for which the visited Web pages were downloaded during the experiment. The participants were recruited via a local recruitment portal and were students of the University of Tübingen. Students were compensated with 16 € per person for participating in the study. None of the participants had former expertise in meteorology.

====2.1. Technical Setup of the Study====

While plenty of data were collected during the study (data sources such as eye and mouse tracking information), here, we focus on the text content of the visited Web pages. During the Web search, all visited Web pages of the participants were tracked and recorded via the "ScrapbookX" (1.5.14) and "ScrapbookXAutosave" (1.4.3) plugins (https://github.com/danny0838/firefox-scrapbook, https://github.com/danny0838/firefox-scrapbook-autosave).

====2.2. Knowledge Gain Measurement====

To measure knowledge gain, the participants were asked to solve a 10-item multiple-choice test one week before (t1) and immediately after (t2) the Web search. The knowledge gain is subsequently defined as the difference between the numbers of correct answers at t2 and t1. The potential range of values for the knowledge gain is therefore [−10, 10]. The average value was 5.24 ± 1.80 at t1 and 7.46 ± 1.43 at t2. The average knowledge gain was 2.22 ± 1.78 and lies in the range of [−3, 6].
====2.3. Feature Extraction====

In the study, the participants performed free Web searches, such that realistic search and browsing behavior could be recorded. Since we focus on the textual complexity of the visited pages, other page types like search engine result pages and video-based contents are filtered. For this purpose, we used a keyword-based approach and omitted pages which contained the following keyterms in their URL: "google.", "youtu", "ecosia", "RDSIndex", "universitaetsbibliothek", "meteoros", "webcam" and "learningsnacks". For all remaining pages, we extracted all displayed text without further processing. As a consequence, for example, tables or advertisements may appear in the analyzed texts. We decided against any further preprocessing in order to minimize the bias in the data set.
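The following sketch illustrates the keyword-based filter described above. The keyterm list is taken from the paper; the page data structure, variable names and example URLs are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the keyword-based page filter described in Section 2.3.
# The excluded keyterms are taken from the paper; the page records are toy data.

EXCLUDE_KEYTERMS = [
    "google.", "youtu", "ecosia", "RDSIndex",
    "universitaetsbibliothek", "meteoros", "webcam", "learningsnacks",
]

def keep_page(url: str) -> bool:
    """Return True if the URL contains none of the excluded keyterms."""
    return not any(term.lower() in url.lower() for term in EXCLUDE_KEYTERMS)

pages = [
    {"url": "https://www.google.de/search?q=blitz", "text": "..."},
    {"url": "https://de.wikipedia.org/wiki/Blitz", "text": "Ein Blitz ist ..."},
]
content_pages = [p for p in pages if keep_page(p["url"])]  # keeps only the Wikipedia page
```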
====2.4. Website Features====

To assess the complexity of text on Web pages, we extract eight different types of features:

* syntactic features
* readability scores
* part-of-speech (POS) density
* lexical richness
* lexical variation
* lexical sophistication
* syntactic constituents
* connectives

Since the study was conducted in German, we mainly rely on the Common Text Analysis Platform (CTAP) tool [16], which currently provides 218 different complexity features for the German language. In total, we extract 248 features from each Web page. Below, we give a short description of each feature group; for a complete overview, consider the appendix (https://github.com/molpood/IWILDS_Complexity_Feature_List/).

The syntactic features group consists of basic text statistics such as the number of letters, syllables, words, and sentences. Moreover, the average length of each element is considered, like sentence length in letters or word length in syllables, as well as the standard deviation. In addition, we calculate the average reading time of the Web pages by assuming 180 words per minute [17].

The second group of features consists of well-known readability scores that aim to estimate the skills a reader must have to understand the text. The features are based on combinations of the syntactic features (automated readability index (ARI), Coleman-Liau index, Flesch-Kincaid grade, Flesch reading ease) and partly on difficult or complex words, which are given either by a list (Dale-Chall readability score, Gunning fog) or identified as words with three or more syllables (SMOG index). For example, the formula for the ARI is as follows:

ARI = 4.71 · |characters| / |words| + 0.5 · |words| / |sentences| − 21.43

In the case of the ARI, the result is a human-interpretable numerical value on a scale of 1–14 (1: Kindergarten, 14: Professor).
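As an illustration of the syntactic and readability features above, the following sketch computes the ARI formula and the estimated reading time at 180 words per minute for a given text. The naive regex-based tokenization is a simplification of our own; the paper relies on CTAP for the actual feature extraction.

```python
# Illustrative computation of two features described above: the automated
# readability index (ARI) and the estimated reading time (180 words per minute).
import re

def ari(text: str) -> float:
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = sum(len(w) for w in words)
    # ARI = 4.71 * |characters|/|words| + 0.5 * |words|/|sentences| - 21.43
    return 4.71 * n_chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

def reading_time_minutes(text: str, wpm: int = 180) -> float:
    return len(re.findall(r"\w+", text)) / wpm

sample = "Ein Blitz ist eine Funkenentladung. Der Donner folgt dem Blitz."
print(ari(sample), reading_time_minutes(sample))
```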
The POS density group reflects the density of different word types like adjectives or verbs in the website text. It is based on the tokenization of the text and calculates the number of tokens of a given word type (e.g., adjectives or verbs) in relation to all tokens, e.g.,

density_adjectives = |adjectives| / |tokens|

The fourth group, lexical richness, is very similar. Here, the number of non-duplicated tokens (types) is set in relation to all tokens. In addition to the fraction |types| / |tokens|, various variations such as the logarithm or square root are applied to the numerator and denominator.

The lexical variation group examines the subset of lexical words (LW) consisting of nouns, verbs, adjectives and adverbs. The class puts the number of individual components in relation to the number of lexical words, e.g., the lexical variation lv_adjectives for adjectives:

lv_adjectives = |adjectives| / |LW|
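A minimal sketch of these token-based ratios is given below, using spaCy's small German model as a stand-in tokenizer and POS tagger. This is an assumption for illustration only; the paper computes these measures with CTAP, and the example requires the model to be installed (e.g., python -m spacy download de_core_news_sm).

```python
# Sketch of POS density, a type-token ratio variant, and lexical variation,
# using spaCy's coarse POS tags as a stand-in for the CTAP pipeline.
import math
import spacy

nlp = spacy.load("de_core_news_sm")
LEXICAL_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # lexical words (LW)

def pos_ratios(text: str) -> dict:
    tokens = [t for t in nlp(text) if not t.is_space and not t.is_punct]
    types = {t.text.lower() for t in tokens}
    lexical = [t for t in tokens if t.pos_ in LEXICAL_POS]
    adjectives = [t for t in tokens if t.pos_ == "ADJ"]
    return {
        "density_adjectives": len(adjectives) / len(tokens),  # POS density
        "type_token_ratio": len(types) / len(tokens),          # lexical richness
        "root_ttr": len(types) / math.sqrt(len(tokens)),       # a square-root variant
        "lv_adjectives": len(adjectives) / len(lexical),       # lexical variation
    }

print(pos_ratios("Der helle Blitz erhitzt die Luft sehr schnell."))
```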
The group of lexical sophistication features is based on different frequency lists [18, 19]. All words of the Web page text are assigned to the sets of all words (AW), lexical words (LW, as mentioned before consisting of nouns, verbs, adjectives and adverbs) and functional words (FW, i.e., not LW). The logarithmic or absolute frequency in the frequency lists (per million words) of AW, LW and FW is consequently used as a feature. Furthermore, the Karlsruhe Children's Text (KCT) [20] list is used to determine the average and minimum age of active use of AW, LW and FW.

The group of syntactic constituents consists of features that determine the number of different syntactic constituents, like noun phrases, relative clauses or T-units. Additionally, ratios to each other are calculated, e.g., noun phrases per T-unit, but also words per T-unit or noun phrases per sentence. Moreover, we consider the tenses in the text, based on Kurdi's [12] observation that there may be a connection between more difficult texts and more complex tenses. To extract the tenses, we use the tool of Dönicke [21].

The last group, connectives (according to Breindl et al. [22]), examines units of the German language that express semantic relations between sentences. The connectives form a class consisting of subsets of defined parts of speech like conjunctions (and, or, etc.) or adverbs (in contrast, therefore, etc.). The absolute number of connectives, as well as ratios, such as multi-word connectives divided by single-word connectives, are calculated as features.

The eight groups consist of a total of 248 features that are calculated for each Web page visited during the search sessions. Since the participants accessed a different number of Web pages, we compute the average, the minimum and the maximum of each feature for each participant. As a result, we obtain a total of 3 · 248 = 744 features for knowledge gain prediction.
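The per-participant aggregation described above can be sketched as follows with pandas; the column names and values are illustrative toy data, not taken from the study.

```python
# Sketch of the per-participant aggregation: each page-level feature is reduced
# to its average, minimum and maximum over all pages a participant visited,
# yielding 3 * 248 = 744 features in the actual setup.
import pandas as pd

page_features = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2],
    "ari": [8.3, 11.2, 9.7, 6.4, 12.1],
    "density_adjectives": [0.08, 0.05, 0.07, 0.11, 0.04],
})

aggregated = page_features.groupby("participant").agg(["mean", "min", "max"])
# Flatten the MultiIndex columns, e.g. ("ari", "mean") -> "ari_mean"
aggregated.columns = [f"{feat}_{stat}" for feat, stat in aggregated.columns]
print(aggregated)
```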
===3. Experimental Results===

In this section, we report results for knowledge gain prediction using features for text complexity. For a fair comparison, we use the same evaluation setting, including hyperparameter optimization, for all experiments. In the same way, we replicate the results of Otto et al. [9] with our evaluation procedure. (Otto et al. [9] analyzed features for 113 participants. Technical issues with logging led to missing HTML data for nine participants, which were crawled at a later date. We rely on the data crawled during the original experiment, leading to N = 104 records for our analysis.)

====3.1. Knowledge Gain Definition====

To categorize the measured knowledge gain, we use the common approach [7, 8, 9] of assigning each search session to one of three classes C = {Low, Moderate, High} based on the Standard Deviation Classification approach. For this purpose, the knowledge gain X_i of participant i is z-normalized (X̂_i) according to Equation 1:

X̂_i = (X_i − μ) / σ    (1)

Here, μ is the mean and σ is the standard deviation of all knowledge gain measures X. Then, for every z-normalized knowledge gain X̂_i the class is assigned as follows:

C(X_i) := Low if X̂_i < −1/2; Moderate if −1/2 ≤ X̂_i ≤ 1/2; High if X̂_i > 1/2

This yields the following class distribution: |X_Low| = 40, |X_Moderate| = 39, |X_High| = 25.

====3.2. Metrics====

To evaluate the classification results, we use precision, recall, F1 score, and accuracy. These are defined as follows:

precision = TP / (TP + FP)    (2)
recall = TP / (TP + FN)    (3)
F1 score = 2 · (precision · recall) / (precision + recall)    (4)
accuracy = (TP + TN) / (TP + TN + FP + FN)    (5)

where TP and TN are the instances correctly classified as positive and negative, respectively, and FP and FN are the instances incorrectly classified as positive and negative, respectively.
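A minimal sketch of the Standard Deviation Classification defined in Section 3.1, assuming the knowledge gain scores are available as a NumPy array; the values below are toy examples, not the study data.

```python
# Sketch of the Standard Deviation Classification: z-normalize the knowledge
# gain and map it to Low/Moderate/High with a threshold of half a standard deviation.
import numpy as np

def sd_classes(knowledge_gain: np.ndarray) -> np.ndarray:
    z = (knowledge_gain - knowledge_gain.mean()) / knowledge_gain.std()
    labels = np.full(z.shape, "Moderate", dtype=object)
    labels[z < -0.5] = "Low"
    labels[z > 0.5] = "High"
    return labels

gain = np.array([2, -1, 4, 3, 0, 6, 2, 1])  # toy knowledge gain values in [-10, 10]
print(sd_classes(gain))
```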
====3.3. Experimental Setup====

Cross-validation is a good way to evaluate the classification result, since every feature vector acts as a test sample in one fold. We thus choose a 5-fold cross-validation with an 80% train/validation and 20% test set split. This results in five elements per class in each test set in each iteration of the cross-validation.

We use min-max normalization to normalize each feature of the 80% split to the interval [0, 1]. This is an essential step for some of the classifiers, e.g., the Support Vector Machine. The 20% test set is then normalized with the minimum and maximum of the 80% split for evaluation; it is possible that the resulting values lie outside the interval [0, 1]. However, we decide against clipping in order not to lose any information due to normalization. Figure 1 provides an overview of our proposed evaluation. In our evaluation, we use the implementations of Scikit-learn [23].

Figure 1: Overview of our evaluation method. A 5-fold cross-validation is performed and for each split the features are first normalized, optionally selected/reduced, and the hyperparameters of the respective classifier are optimized on the 80% train/validation data. The test data are scaled with the minimum and maximum of the train/validation data and optionally the features are filtered. Finally, the classifier optimized on the train and validation data is used to predict the knowledge gain on the test data set.

=====3.3.1. Hyperparameter Optimization=====

The performance of classification algorithms strongly depends on the chosen hyperparameters. However, since the training, validation and test data change in each iteration due to cross-validation, these cannot be determined once and used for the entire evaluation. Therefore, to obtain valid results, we perform an optimization of the hyperparameters in each of the five iterations. We utilize Optuna [24] for a Bayesian search to efficiently find a good configuration and limit the number of runs to 500 to reduce the computational cost. From the 80% of the data coming from the 80:20 split of the cross-validation, another 80:20 split is performed, where 80% is training data and 20% is validation data. We set the maximization of the weighted F1 score as the optimization objective. This is to prevent the class imbalance from making the underrepresented class High less important, as it would be, for example, with overall accuracy.
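A sketch of the per-fold hyperparameter optimization with Optuna, shown for a Random Forest. The search space, variable names (X_train, y_train for the 80% train/validation part of one fold) and the helper function are illustrative assumptions, not the exact configuration used in the paper.

```python
# Sketch: Bayesian (TPE) search with Optuna, maximizing the weighted F1 score
# on an inner 80:20 train/validation split, limited to 500 trials in the paper.
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def make_objective(X_train, y_train):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2, stratify=y_train, random_state=0
    )

    def objective(trial: optuna.Trial) -> float:
        clf = RandomForestClassifier(
            n_estimators=trial.suggest_int("n_estimators", 50, 300),
            max_depth=trial.suggest_int("max_depth", 2, 30),
            max_features=trial.suggest_categorical("max_features", ["sqrt", "log2"]),
            criterion=trial.suggest_categorical("criterion", ["gini", "entropy"]),
            min_samples_split=trial.suggest_int("min_samples_split", 2, 10),
            min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 10),
            random_state=0,
        )
        clf.fit(X_tr, y_tr)
        # Weighted F1 on the validation split, as set as the objective in the paper
        return f1_score(y_val, clf.predict(X_val), average="weighted")

    return objective

# study = optuna.create_study(direction="maximize")
# study.optimize(make_objective(X_train, y_train), n_trials=500)
```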
=====3.3.2. Feature Selection=====

The classification results may also depend on the number of input features (more is not always better). For example, in the Random Forest algorithm, a subset of the features is selected several times to create weak classifiers, and there is no guarantee that "good" features will prevail. For this reason, we want to reduce the number of features while trying to preserve valuable ones. Again, it is important to separate the feature selection from the test data, which change in each iteration. As with hyperparameter optimization, we use the further split into training and validation data to do this. It follows that the selected features may change in each iteration. For the selection of the features, we rely on two strategies (illustrated in the sketch after this list):

# χ²-based feature selection: This method examines whether a feature has a statistically significant relationship to knowledge gain. While one feature is analyzed for a relationship, all other features are ignored. The features with the N highest values based on the χ²-test are selected.
# Tree-based feature selection: Features without a direct correlation to the knowledge gain can be important predictors in combination with other features. For this reason, we employ a tree-based approach using a Random Forest classifier. It is fitted to the training data and then analyzed to see which features were most heavily used in the decisions. The N features with the highest importance are selected. The goal is to select features that are valuable for the classification even without a direct correlation.
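Both strategies can be sketched with Scikit-learn as follows; the helper names are illustrative, and the exact estimator settings used in the paper are not specified here. Note that chi2 requires non-negative inputs, which holds after the min-max normalization of Section 3.3.

```python
# Sketch of the two feature selection strategies: chi2-based top-N selection
# and importance ranking from an upstream Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

def chi2_selection(X_train, y_train, n_features: int) -> np.ndarray:
    selector = SelectKBest(chi2, k=n_features).fit(X_train, y_train)
    return selector.get_support(indices=True)  # indices of the kept features

def tree_selection(X_train, y_train, n_features: int) -> np.ndarray:
    forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    # Keep the N features with the highest impurity-based importance
    return np.argsort(forest.feature_importances_)[::-1][:n_features]
```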
=====3.3.3. Classifiers=====

Otto et al. [9] limit their evaluation to a Random Forest [25] classifier. In addition, we explore several alternative classifiers: Adaboost [26], Decision Tree [27], K-Nearest Neighbors [28], Multi-layer Perceptron [29], and Support Vector Machine [30]. The objective is to experimentally determine the best configuration in order to find the maximum potential for knowledge gain prediction, given the set of features.

====3.4. Classifier Performance====

In Table 1, we compare the performance of all classifiers. As baselines, we list the results for weighted guessing (WG), which is the mean of each metric over 10,000 randomly generated label vectors drawn according to the class distribution, and the originally reported results of Otto et al. [9] (Otto*). For a fair comparison with our features, we reproduced the results using the features of Otto et al. [9] with our pipeline (Otto). Furthermore, to analyze the performance for a feature set as diverse as possible, we combined the features of Otto et al. [9] and our proposed feature set (Otto+our). For the cumulative predictions over all five iterations of the cross-validation, precision, recall, and F1 score are calculated for each class (Low, Moderate, and High), as well as the macro average of these metrics over all classes and the overall accuracy.

Table 1: Results of the knowledge gain classification for the classes Low, Moderate and High for the classifiers (clf) Adaboost (Ada), Decision Tree (DT), K-Nearest Neighbors (KNN), Multi-layer Perceptron (MLP), Random Forest (RF) and Support Vector Machine (SVM), and for weighted guessing (WG). For the reported results of Otto et al. [9] (Otto*), the reproduced results (Otto), our results (our) and the combination of Otto et al.'s [9] and our features (Otto+our), precision (pre), recall (rec), F1 score (f1) and overall accuracy (accu) are reported.

{| class="wikitable"
! rowspan="2" | features !! rowspan="2" | clf !! colspan="3" | Low !! colspan="3" | Moderate !! colspan="3" | High !! colspan="3" | macro scores !! rowspan="2" | accu
|-
! pre !! rec !! f1 !! pre !! rec !! f1 !! pre !! rec !! f1 !! pre !! rec !! f1
|-
| – || WG || 38.4 || 38.6 || 38.4 || 37.4 || 37.3 || 37.2 || 24.0 || 24.0 || 23.8 || 33.3 || 33.3 || 33.1 || 34.6
|-
| Otto* || RF || 41.5 || 52.0 || 46.1 || 39.1 || 40.0 || 39.5 || 28.4 || 14.8 || 19.1 || 36.4 || 35.6 || 34.9 || 38.7
|-
| rowspan="6" | Otto || Ada || 42.1 || 40.0 || 41.0 || 35.4 || 43.6 || 39.1 || 16.7 || 12.0 || 14.0 || 31.4 || 31.9 || 31.4 || 34.6
|-
| DT || 40.0 || 50.0 || 44.4 || 40.5 || 38.5 || 39.5 || 23.5 || 16.0 || 19.0 || 34.7 || 34.8 || 34.3 || 37.5
|-
| KNN || 26.7 || 20.0 || 22.9 || 38.1 || 41.0 || 39.5 || 18.8 || 24.0 || 21.1 || 27.8 || 28.3 || 27.8 || 28.8
|-
| MLP || 41.2 || 35.0 || 37.8 || 46.3 || 48.7 || 47.5 || 34.5 || 40.0 || 37.0 || 40.7 || 41.2 || 40.8 || 41.3
|-
| RF || 30.2 || 40.0 || 34.4 || 32.6 || 35.9 || 34.1 || 12.5 || 4.0 || 6.1 || 25.1 || 26.6 || 24.9 || 29.8
|-
| SVM || 38.6 || 42.5 || 40.5 || 45.5 || 38.5 || 41.7 || 33.3 || 36.0 || 34.6 || 39.1 || 39.0 || 38.9 || 39.4
|-
| rowspan="6" | our || Ada || 42.1 || 40.0 || 41.0 || 41.9 || 46.2 || 43.9 || 34.8 || 32.0 || 33.3 || 39.6 || 39.4 || 39.4 || 40.4
|-
| DT || 39.0 || 40.0 || 39.5 || 41.7 || 38.5 || 40.0 || 29.6 || 32.0 || 30.8 || 36.8 || 36.8 || 36.8 || 37.5
|-
| KNN || 21.7 || 12.5 || 15.9 || 45.0 || 46.2 || 45.6 || 26.8 || 44.0 || 33.3 || 31.2 || 34.2 || 31.6 || 32.7
|-
| MLP || 38.9 || 35.0 || 36.8 || 55.9 || 48.7 || 52.1 || 26.5 || 36.0 || 30.5 || 40.4 || 39.9 || 39.8 || 40.4
|-
| RF || 40.5 || 42.5 || 41.5 || 51.3 || 51.3 || 51.3 || 34.8 || 32.0 || 33.3 || 42.2 || 41.9 || 42.0 || 43.3
|-
| SVM || 40.6 || 32.5 || 36.1 || 54.3 || 48.7 || 51.4 || 27.0 || 40.0 || 32.3 || 40.6 || 40.4 || 39.9 || 40.4
|-
| rowspan="6" | Otto+our || Ada || 42.3 || 55.0 || 47.8 || 54.8 || 43.6 || 48.6 || 42.9 || 36.0 || 39.1 || 46.7 || 44.9 || 45.2 || 46.2
|-
| DT || 53.1 || 42.5 || 47.2 || 42.9 || 38.5 || 40.5 || 27.0 || 40.0 || 32.3 || 41.0 || 40.3 || 40.0 || 40.4
|-
| KNN || 20.0 || 12.5 || 15.4 || 45.0 || 46.2 || 45.6 || 25.6 || 40.0 || 31.2 || 30.2 || 32.9 || 30.7 || 31.7
|-
| MLP || 25.0 || 12.5 || 16.7 || 41.3 || 66.7 || 51.0 || 28.6 || 24.0 || 26.1 || 31.6 || 34.4 || 31.2 || 35.6
|-
| RF || 37.5 || 45.0 || 40.9 || 45.0 || 46.2 || 45.6 || 37.5 || 24.0 || 29.3 || 40.0 || 38.4 || 38.6 || 40.4
|-
| SVM || 22.7 || 12.5 || 16.1 || 41.7 || 38.5 || 40.0 || 30.4 || 56.0 || 39.4 || 31.6 || 35.7 || 31.9 || 32.7
|}
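The weighted-guessing (WG) baseline can be sketched as follows; the number of runs and the class distribution follow the paper, while the function name and implementation details are illustrative assumptions.

```python
# Sketch of the weighted-guessing baseline: predictions are drawn at random
# according to the empirical class distribution and the macro metrics are
# averaged over many repetitions.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def weighted_guessing(y_true, n_runs: int = 10_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y_true, return_counts=True)
    probs = counts / counts.sum()
    scores = []
    for _ in range(n_runs):
        y_pred = rng.choice(classes, size=len(y_true), p=probs)
        scores.append(precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0)[:3])
    return np.mean(scores, axis=0)  # macro precision, recall, F1

# Example with the class distribution reported in Section 3.1 (40/39/25)
y = np.array(["Low"] * 40 + ["Moderate"] * 39 + ["High"] * 25)
print(weighted_guessing(y, n_runs=1000))
```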
First, it is notable that the reproduced results of Otto et al. [9] (Otto) are better than their originally reported results (Otto*): the Multi-layer Perceptron (MLP) yields a 5.9% higher macro F1 score (40.8% compared to 34.9%). However, in a direct comparison with the reproduced Random Forest (RF), the original results are better. It is striking that the improved outcome stems mainly from better predictions for the class High. A closer look reveals that the recall scores for the tree-based classifiers Adaboost (Ada), Decision Tree (DT) and Random Forest (RF) are comparatively low. These algorithms seem to preferentially predict the better represented classes for the features of Otto et al. [9] and accept a worse result for the underrepresented class High. This impression is reinforced by the fact that, for all feature sets, the F1 score (f1) of these three classifiers is considerably worse for the class High than for the classes Low and Moderate. This is not the case for any of the other classifiers.

Nevertheless, Random Forest (RF) and Adaboost (Ada) perform best for the other feature sets (our and Otto+our). The RF using the features of textual complexity (our) yields a slightly better macro F1 score (42.0%) than the MLP using the features of Otto (41.2%). In addition, the RF achieves an overall accuracy of 43.3%, while the MLP only achieves 40.6%. The best result is obtained by the Adaboost classifier for Otto+our, with a 45.2% macro F1 score and 46.2% overall accuracy. Examining the results of the Random Forest algorithm for all three feature sets, we notice that the F1 scores of all three classes for the combination of features lie strictly between the F1 scores of the individual feature sets. At the same time, for Adaboost, the F1 scores for the combination of features are all better than for the individual sets. We assume that the Random Forest algorithm is affected by too many (diverse) features, whereas Adaboost can weight the features differently and thus utilize the strengths of both feature sets.

Another observation is that the F1 scores of all feature sets for the K-Nearest Neighbors (KNN) algorithm are significantly higher for the class Moderate than for the classes Low and High. Therefore, we suspect that search strategies with Low (or High) knowledge gain differ much more. Furthermore, we can observe that the F1 score for the class Moderate of our features is high compared to the classes Low and High, independent of the classifier. On closer inspection, we found that instances of the class Low are often classified as High and vice versa. If we merge the classification results for the classes Low and High, i.e., form a new class Not Moderate, we would obtain F1 scores of 74.1%, 70.8% and 73.1% for the classifiers MLP, RF and SVM, respectively, for this new class. It seems that the complexity features are useful to detect whether someone does not have a Moderate knowledge gain. We plan to investigate this interesting aspect in the future.

For our textual complexity features, the best result was obtained with the Random Forest classifier. In each iteration of the 5-fold cross-validation, an independent hyperparameter optimization was performed. The optimized hyperparameters for each fold F1, ..., F5 are shown in Table 2. No pattern can be discovered in the parameters; they differ considerably across folds. This could possibly be related to the heterogeneity of the data and the weakness of the features for prediction.

Table 2: The optimized hyperparameters per fold F1, ..., F5 for the Random Forest classifier for our features.

{| class="wikitable"
! !! F1 !! F2 !! F3 !! F4 !! F5
|-
| estimators || 242 || 299 || 154 || 150 || 223
|-
| max_depth || 22 || 17 || 8 || 17 || 17
|-
| max_features || sqrt || log2 || sqrt || log2 || sqrt
|-
| criterion || entr. || gini || sqrt || entr. || gini
|-
| min_n_split || 6 || 3 || 7 || 7 || 4
|-
| min_n_leaf || 5 || 8 || 3 || 8 || 7
|}

====3.5. Feature Selection====

In Table 1, it can be observed that the Random Forest classifier (RF) performs worse for the combination of features (Otto+our) than for the complexity-only features (our). It seems that considering more features does not necessarily improve the classification quality. The results of the Random Forest classifier (RF) for the textual complexity (our) features with N ∈ {1, 3, 5, ..., 99} selected features are shown in Figure 2. It can be seen that the classification result of the full feature set is already achieved with fewer features, regardless of the feature selection strategy. With the χ²-based selection method, the result is also achieved with fewer features, but later than with the tree-based method. This makes sense insofar as the χ²-based method considers the features independently of each other and only measures the individual correlation of a feature with knowledge gain. In contrast, the tree-based strategy selects features based on their importance for an upstream Random Forest. Thus, the baseline level can already be reached with N = 19 features.

Figure 2: Average F1 scores of the Random Forest classifier using N ∈ {1, 3, 5, ..., 99} of our features for the χ²-based (chi2) and the tree-based (tree) feature selection strategy. The result for all features is indicated with a dotted line.

Cross-validation is used for evaluation as described above (Section 3.3.1). Similarly, feature selection is performed five times. However, this implies that the features chosen in each iteration of the cross-validation may differ, which complicates the analysis of which features most influence the classification result. We therefore propose to highlight the features that were selected in at least three out of five iterations. Since the classification result of the Random Forest was already achieved with N = 17 features, we report the features based on this configuration. The features and their frequencies are shown in Table 3. Three features were selected at least three times, but none was selected in every iteration of the cross-validation. All three were aggregated by the minimum, indicating that the Web page with the lowest textual complexity is most important for the classification result. This strengthens the impression that the features or the aggregations (minimum, maximum and average) are too weak to provide a strong prediction of the knowledge gain. In the future, we aim to include more features and find aggregations that are more suitable to reflect search patterns.

Table 3: Features selected at least three out of five times during cross-validation by the tree-based selection strategy.

{| class="wikitable"
! type !! feature !! aggregation !! count
|-
| POS Density Feature || Subordinating Conjunction || min || 4
|-
| Lexical Sophistication Feature || SUBTLEX Word Frequency (LW Token) || min || 4
|-
| Syntactic Complexity Feature || Mean Length of Verb Cluster || min || 3
|}

In the last section, it was observed that the F1 score for the class High is significantly below the values for the classes Low and Moderate, regardless of the feature set. We performed feature selection before hyperparameter optimization and repeated the evaluation with N ∈ {1, 3, 5, ..., 79} features. Figure 3 shows how the F1 score for the class High changes with a subset of the features of Otto et al. [9]. The curve for the tree-based feature selection strategy, which tries to select the most important features for classification, shows that almost any tested subset would have been more suitable than using the full feature set. Moreover, the curve does not change from N = 65 onward (the same observation holds for the classes Low and Moderate), which suggests that the tree-based feature selection strategy does not consider many features at all.

Figure 3: F1 scores for the class High for the features of Otto et al. [9] for N ∈ {1, 3, 5, ..., 79} features for the χ²-based (chi2) and the tree-based (tree) feature selection strategy. The result for all features is indicated with a dotted line.

===4. Conclusions===

In this paper, we have investigated the impact of the textual complexity of Web pages on knowledge gain during Web search. The experimental results demonstrated that the state of the art can be improved by considering only the textual complexity of Web pages. The results also showed that a systematic assessment of different hyperparameter settings, feature selection strategies, and several classifiers is important – in particular, since the correlations between features and the target outcome are relatively weak. During the evaluation, it became apparent that as few as 17 features per iteration of the cross-validation would have been sufficient to achieve the result. Furthermore, we found that a moderate knowledge gain can be predicted relatively well, but, interestingly, the distinction between successful and unsuccessful Web search (in terms of knowledge gain) does not work well. The reasons for this effect have to be investigated in more detail.

Although we have obtained state-of-the-art results, there are some limitations. In this case study, we analyzed only the data of a study on knowledge acquisition about a specific science topic, the formation of thunderstorms. Consequently, only limited conclusions can be drawn about general Web searches, and the results need to be confirmed or extended by future studies. In this sense, the reported results need to be reproduced for (a) different types of learning tasks (e.g., procedural knowledge) and (b) conceptual learning tasks in other domains (e.g., non-science topics).

In the future, we would like to deepen our understanding of what behavioral patterns characterize effective Web searches, for instance, by examining how the sequence of Web pages (and their characteristics) influences learning success. An intuitive assumption is, for example, that a successful learning session consists of Web pages of increasing complexity. Furthermore, we have considered the textual complexity of the entire Web page, but the page content is not in every case read in its entirety. In future work, we would like to focus more on the content actually seen during the Web search.

Lastly, we focused on text-based Web pages in this case study. However, many of the Web searches were not unimodal but multimodal. Consequently, further investigations will need to include further complexity measures, such as the visual complexity of Web pages or videos.

===Acknowledgments===

Part of this work is financially supported by the Leibniz Association, Germany (Leibniz Competition 2018, funding line "Collaborative Excellence", project SALIENT [K68/2017]).
===References===

[1] A. Hoppe, P. Holtz, Y. Kammerer, R. Yu, S. Dietze, R. Ewerth, Current challenges for studying search as learning processes, in: 7th Workshop on Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science, 2018.

[2] M. Machado, P. A. Gimenez, S. Siqueira, Raising the dimensions and variables for searching as a learning process: A systematic mapping of the literature, in: Anais do XXXI Simpósio Brasileiro de Informática na Educação, SBC, 2020, pp. 1393–1402.

[3] P. Vakkari, Searching as learning: A systematization based on literature, J. Inf. Sci. 42 (2016) 7–18. https://doi.org/10.1177/0165551515615833

[4] R. Syed, K. Collins-Thompson, Exploring document retrieval features associated with improved short- and long-term vocabulary learning outcomes, in: C. Shah, N. J. Belkin, K. Byström, J. Huang, F. Scholer (Eds.), Proceedings of the 2018 Conference on Human Information Interaction and Retrieval, CHIIR 2018, New Brunswick, NJ, USA, March 11-15, 2018, ACM, 2018, pp. 191–200. https://doi.org/10.1145/3176349.3176397

[5] K. Collins-Thompson, S. Y. Rieh, C. C. Haynes, R. Syed, Assessing learning outcomes in web search: A comparison of tasks and query strategies, in: D. Kelly, R. Capra, N. J. Belkin, J. Teevan, P. Vakkari (Eds.), Proceedings of the 2016 ACM Conference on Human Information Interaction and Retrieval, CHIIR 2016, Carrboro, North Carolina, USA, March 13-17, 2016, ACM, 2016, pp. 163–172. https://doi.org/10.1145/2854946.2854972

[6] G. Pardi, J. von Hoyer, P. Holtz, Y. Kammerer, The role of cognitive abilities and time spent on texts and videos in a multimodal searching as learning task, in: H. L. O'Brien, L. Freund, I. Arapakis, O. Hoeber, I. Lopatovska (Eds.), CHIIR '20: Conference on Human Information Interaction and Retrieval, Vancouver, BC, Canada, March 14-18, 2020, ACM, 2020, pp. 378–382. https://doi.org/10.1145/3343413.3378001

[7] U. Gadiraju, R. Yu, S. Dietze, P. Holtz, Analyzing knowledge gain of users in informational search sessions on the web, in: C. Shah, N. J. Belkin, K. Byström, J. Huang, F. Scholer (Eds.), Proceedings of the 2018 Conference on Human Information Interaction and Retrieval, CHIIR 2018, New Brunswick, NJ, USA, March 11-15, 2018, ACM, 2018, pp. 2–11. https://doi.org/10.1145/3176349.3176381

[8] R. Yu, R. Tang, M. Rokicki, U. Gadiraju, S. Dietze, Topic-independent modeling of user knowledge in informational search sessions, Inf. Retr. J. 24 (2021) 240–268. https://doi.org/10.1007/s10791-021-09391-7

[9] C. Otto, R. Yu, G. Pardi, J. von Hoyer, M. Rokicki, A. Hoppe, P. Holtz, Y. Kammerer, S. Dietze, R. Ewerth, Predicting knowledge gain during web search based on multimedia resource consumption, in: I. Roll, D. S. McNamara, S. A. Sosnovsky, R. Luckin, V. Dimitrova (Eds.), Artificial Intelligence in Education - 22nd International Conference, AIED 2021, Utrecht, The Netherlands, June 14-18, 2021, Proceedings, Part I, volume 12748 of Lecture Notes in Computer Science, Springer, 2021, pp. 318–330. https://doi.org/10.1007/978-3-030-78292-4_26

[10] K. Collins-Thompson, Computational assessment of text readability: A survey of current and future research, ITL - International Journal of Applied Linguistics 165 (2014) 97–135. https://doi.org/10.1075/itl.165.2.01col

[11] J. Hancke, S. Vajjala, D. Meurers, Readability classification for German using lexical, syntactic, and morphological features, in: M. Kay, C. Boitet (Eds.), COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India, Indian Institute of Technology Bombay, 2012, pp. 1063–1080. https://aclanthology.org/C12-1065/

[12] M. Kurdi, Lexical and syntactic features selection for an adaptive reading recommendation system based on text complexity, in: ICISDM '17, 2017.

[13] J. von Hoyer, G. Pardi, Y. Kammerer, P. Holtz, Metacognitive judgments in searching as learning (SAL) tasks: Insights on (mis-)calibration, multimedia usage, and confidence, in: Proceedings of the 1st International Workshop on Search as Learning with Multimedia Information, SALMM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 3–10. https://doi.org/10.1145/3347451.3356730

[14] R. Mayer, R. Moreno, A split-attention effect in multimedia learning: Evidence for dual processing systems in working memory, Journal of Educational Psychology 90 (1998) 312–320.

[15] F. Schmidt-Weigand, K. Scheiter, The role of spatial descriptions in learning from multimedia, Comput. Hum. Behav. 27 (2011) 22–28. https://doi.org/10.1016/j.chb.2010.05.007
[16] X. Chen, D. Meurers, CTAP: A web-based tool supporting automatic complexity analysis, in: D. Brunato, F. Dell'Orletta, G. Venturi, T. François, P. Blache (Eds.), Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity, CL4LC@COLING 2016, Osaka, Japan, December 11, 2016, The COLING 2016 Organizing Committee, 2016, pp. 113–119. https://www.aclweb.org/anthology/W16-4113/

[17] M. Ziefle, Effects of display resolution on visual performance, Hum. Factors 40 (1998) 554–568. https://doi.org/10.1518/001872098779649355

[18] M. Brysbaert, M. Buchmeier, M. Conrad, A. Jacobs, J. Bölte, A. Böhl, The word frequency effect: A review of recent developments and implications for the choice of frequency estimates in German, Experimental Psychology 58 (5) (2011) 412–424.

[19] E. L. Aiden, J. Michel, Culturomics: Quantitative analysis of culture using millions of digitized books, in: 6th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2011, Stanford, CA, USA, June 19-22, 2011, Conference Abstracts, Stanford University Library, 2011, p. 8. http://xtf-prod.stanford.edu/xtf/view?docId=tei/ab-003.xml

[20] R. Lavalley, K. Berkling, S. Stüker, Preparing children's writing database for automated processing, in: K. M. Berkling (Ed.), Language Teaching, Learning and Technology, Satellite Workshop of SLaTE-2015, LTLT@SLaTE 2015, Leipzig, Germany, September 4, 2015, ISCA, 2015, pp. 9–15. http://www.isca-speech.org/archive/ltlt_2015/lt15_009.html

[21] T. Dönicke, Clause-level tense, mood, voice and modality tagging for German, in: K. Evang, L. Kallmeyer, R. Ehren, S. Petitjean, E. Seyffarth, D. Seddah (Eds.), Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories, TLT 2020, Düsseldorf, Germany, October 27-28, 2020, Association for Computational Linguistics, 2020, pp. 1–17. https://doi.org/10.18653/v1/2020.tlt-1.1

[22] E. Breindl, A. Volodina, U. H. Waßner, Handbuch der deutschen Konnektoren 2, De Gruyter, 2014. https://doi.org/10.1515/9783110341447

[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.

[24] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, G. Karypis (Eds.), Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, ACM, 2019, pp. 2623–2631. https://doi.org/10.1145/3292500.3330701

[25] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32. https://doi.org/10.1023/A:1010933404324

[26] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1997) 119–139. https://doi.org/10.1006/jcss.1997.1504

[27] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, Wadsworth, 1984.

[28] E. Fix, J. L. Hodges, Discriminatory analysis - nonparametric discrimination: Consistency properties, International Statistical Review 57 (1989) 238.

[29] F. Rosenblatt, Principles of neurodynamics. Perceptrons and the theory of brain mechanisms, American Journal of Psychology 76 (1963) 705.

[30] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297. https://doi.org/10.1007/BF00994018