<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Models of Literary Evaluation and Web 2.0. An Annotation Experiment with Goodreads Reviews</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Simone</forename><surname>Rebora</surname></persName>
							<email>simone.rebora@univr.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Verona</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gabriele</forename><surname>Vezzani</surname></persName>
							<email>gabriele.vezzani@univr.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Verona</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">RWTH Aachen University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Models of Literary Evaluation and Web 2.0. An Annotation Experiment with Goodreads Reviews</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">069C9BCF83B252228EBEAB6131831DA4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>literary evaluation</term>
					<term>digital social reading</term>
					<term>annotation</term>
					<term>transformer models</term>
					<term>LLMs</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the context of the Web 2.0, user-genrated reviews are becoming more and more prominent. The particular case of book reviews, often shared through digital social reading platforms such as Goodreads or Wattpad, is of particular interest, in that it offers scholars data regarding literary reception of unprecedented size and diversity. In this paper, we test whether the evaluative criteria employed in Goodreads reviews can be included in the framework of traditional literary criticism, by combining literary theory and computational methods. Our model, based on the work of von Heydebrand and Winko, is first tested through the practice of heuristic annotation. The generated dataset is then used to train a Tranformer-based classifier. Last, we compare the performance of the latter with that obtained by instructing a Large Language Model, namely GPT-4.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Nowadays, reviews are ubiquitous. Pushed by our natural tendency to share information and encouraged by the very companies that sell us their products, we constantly take part in the production and accumulation of huge amounts of data regarding our preferences and judgements. Literature has also been strongly affected by this phenomenon. On digital social reading platforms <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b18">19]</ref>, such as Goodreads, one can find terabytes of information regarding the reception of millions of books, covering the tastes of the most diverse typologies of readers. Finding efÏcient ways to manage these records is essential not only for market reasons -as demonstrated by the great effort that companies put into developing algorithms to better capture and predict users' tastes 1 -but also for research purposes. Never before, in fact, has it been possible to gather insights about the reception of literary works as extensive and diverse as the ones that can be drawn from these platforms. Designing ways to analyze this data will allow us to shed a whole new light on the dynamics that regulate literary evaluation.</p><p>Literary studies have been preoccupied with evaluation since around the 1980s, when, thanks to the influence of the emergent field of postcolonial studies <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b24">25]</ref>, it started to become more and more evident that the prominence of the works of the so-called western canon wasn't at all objective, but rather an arbitrary construct, contingent on the social and cultural background of the readers. Van Rees <ref type="bibr" target="#b27">[28]</ref> proposed a model in which literary value was built through a three-step process involving journalistic reviewers, essayists, and academic critics. 
At once more detailed and more flexible, von Heydebrand and Winko's model <ref type="bibr" target="#b12">[13]</ref> also posits the social construction of literary value, while allowing for the contribution of any agent within the literary field. Similar ideas are shared by Dalen-Oskam <ref type="bibr" target="#b5">[6]</ref>, who recently demonstrated, in an empirical fashion, how social biases can influence the evaluation of books.</p><p>According to these authors, society influences literary evaluation by setting up standards against which books are judged. This would imply the existence of criteria independent of any specific evaluative act, which manifest themselves across a multiplicity of such acts, thus offering a means to systematize them. In fact, as von Heydebrand and Winko have shown, although they can vary greatly depending on the historical context, these "standards of value" can nonetheless be organized according to some general features: they can concern the aesthetic level of a book (its style, characters, plot, etc.), the way it relates to other works (thus being original or derivative, for instance), the impact it has on its reader, and so on.</p><p>In this paper, we build on von Heydebrand and Winko's theory and interpret online reviews as acts of "linguistic evaluation". What distinguishes such acts from more implicit forms of evaluation (even the simple purchase of a book could be interpreted as such) is that they "require a standard of value -as well as certain categorizing assumptions -in order to progress from the description of a text to its evaluation" <ref type="bibr">[12, p. 227</ref>]. Seen as evaluative acts of this kind, book reviews are not simple expressions of individual taste, but the result of a socially learned practice with specific schemes and regularities. 
Different kinds of evaluative criteria can then be taken as the axes along which to situate single reviews, thus quantifying their variability and reducing their complexity. Furthermore, knowing which criteria are implemented by an individual (or a group) could allow us to predict their future judgements or to reveal the societal influences operating on them.</p><p>To make the identification of evaluative criteria as objective as possible, we turned to the practice of "interpretative markup", a form of annotation "devoted to recording a scholar's or analyst's observations and conjectures in an open-ended way" <ref type="bibr">[20, p. 202</ref>]. As argued by Gius and Jacke <ref type="bibr" target="#b9">[10]</ref>, this kind of annotation, when carried out in a collaborative way, can be a powerful tool for systematizing the interpretation of potentially ambiguous texts. Our main goal consisted in using the annotated dataset to train a classifier capable of recognizing the use of different evaluative criteria in potentially any online book review.</p><p>The paper is organized as follows. In Section 2, we present the corpus that constituted the basis of our work. The discussion of the annotation process occupies Section 3, where we outline the tagset we developed (3.1), the workflow we followed and the main problems we encountered <ref type="bibr">(3.2)</ref>. To automate the annotation process, we tried two different approaches: fine-tuning a Bert-based classifier and instructing a Large Language Model (GPT-4). The results of these approaches are presented, respectively, in Sections 4.1 and 4.2. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Data</head><p>The dataset used to perform the annotation task was built on the same basis of the AbsORB dataset: a selection of "approximately six million English language reviews of nine different genres (i.e., fantasy, romance, thriller, horror, mystery, science fiction, historical fiction, contemporary, classics)" <ref type="bibr" target="#b14">[15]</ref>, downloaded from the Goodreads website between 2018 and 2019. Out of them, a total of 100 reviews were randomly selected, with just one filter set to exclude extremely short (below 200 words) ones. To counter an intrinsic imbalance in the dataset (i.e., the dominance of reviews of "young adult" novels, frequently tagged as "fantasy" or "romance"), we added a filter which forced the selection of 20 reviews from among the ones tagged as "classics". Reviews were then automatically split into sentences by using SpaCy<ref type="foot" target="#foot_0">2</ref> , so as to allow a sentence-by-sentence annotation.</p><p>A first partitioning of our corpus, composed by 11 reviews, was used as a toy dataset to train the annotators, and was not therefore retained for analysis. The remaining 89 reviews, which constituted our main corpus, had a mean length of 1155 tokens (SD = 449). They were written by 84 different reviewers in a timespan going from 2008 to 2018. The number of reviews per year is lightly skewed towards more recent dates reflecting the growth of the Goodreads platform. There are 84 different books reviewed in the corpus, spanning across 9 different genres (with most of the books belonging to more than one genre). The distribution of reviews per genre and per year can be seen, respectively, in Figures <ref type="figure" target="#fig_1">2 and 1</ref>.</p><p>All copyright and privacy implications in using such a dataset have already been discussed <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b22">23]</ref>. 
In any case, to safeguard the rights of the authors of the reviews and to comply with copyright limitations (while still profiting from the research exceptions recently introduced in multiple European legislations, following the 2019 Directive on Copyright in the Digital Single Market), we decided not to share it publicly. Researchers who would like to access it will have to contact the authors of this paper, stating their intended use. </p></div>
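The sampling and sentence-splitting pipeline described in this section can be sketched as follows. This is a minimal illustration, not the project's actual script: the field names 'text' and 'genres' are hypothetical, and a simple regex splitter stands in for SpaCy's sentence segmentation.

```python
import re
import random

MIN_WORDS = 200   # reviews shorter than this were excluded
N_TOTAL = 100     # total reviews sampled
N_CLASSICS = 20   # forced quota of reviews tagged "classics"

def sample_reviews(reviews, seed=0):
    """Sample reviews as in Section 2: drop very short ones, then draw
    100 at random while forcing 20 from the 'classics' shelf.
    `reviews` is a list of dicts with hypothetical 'text'/'genres' keys."""
    rng = random.Random(seed)
    long_enough = [r for r in reviews if len(r["text"].split()) >= MIN_WORDS]
    classics = [r for r in long_enough if "classics" in r["genres"]]
    others = [r for r in long_enough if "classics" not in r["genres"]]
    picked = rng.sample(classics, min(N_CLASSICS, len(classics)))
    picked += rng.sample(others, min(N_TOTAL - len(picked), len(others)))
    return picked

def split_sentences(text):
    """Naive stand-in for SpaCy's sentencizer: split on ., ! or ?
    followed by whitespace (the actual pipeline used spacy's doc.sents)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

In the real pipeline, `split_sentences` would be replaced by iterating over `doc.sents` from a loaded SpaCy model, which handles abbreviations and other edge cases the regex misses.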
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Annotation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Designing the Tag Set</head><p>The starting point for the creation of our tag set was von Heydebrand and Winko's model, which systematizes literary evaluation based on the different kinds of criteria that can be employed for such a task. However, in order to be adapted to our material and goals, it needed to be simplified. First of all, as demonstrated by the high number of categories they devised, it is clear that von Heydebrand and Winko aim at maximizing the exhaustivity of their model. Our intent was slightly different. In fact, we needed to account for the highest possible number of scenarios with the fewest possible categories, in order to reach a number of examples per category high enough to train a classifier, without having to annotate an extreme number of reviews.</p><p>Furthermore, despite their intent to take into consideration the contributions of potentially any actor within the literary field, it is quite clear that the kind of evaluative acts that von Heydebrand and Winko had in mind while designing their model were quite different from what one can find today on digital social reading platforms (and quite understandably so, given that their essay predates the advent of the latter by almost ten years). Many of their categories reflect the judgment patterns of, if not professionals, at least cultivated readers, capable of assessing the salient stylistic features of a given book, as well as its positioning in the overall landscape of literary tradition. On the other hand, online reviews "are mostly expressions of consumer satisfaction or dissatisfaction" <ref type="bibr">[26, p. 3]</ref>. In them, the individual dimension is magnified, with evaluations hinging around the reader's personal relation with the book.</p><p>In our tag set, we devoted a special attention to criteria based on one's own thoughts, feelings and personal experiences. 
However, not all evaluative criteria can be traced back to the individual level. Rather, we believe literary evaluation to be a complex phenomenon, involving not only a reader and a book, but also the whole community in which the interaction between the former two takes place. To account for such complexity, we have incorporated into our tag set two further dimensions alongside that of individual criteria: one for evaluations based on the book itself, its content-related or formal aspects, and one for considerations on its societal value, its impact on a given community of readers. Furthermore, we have included a label for evaluations that do not fall into any of the aforementioned categories. In Appendix A, we list the 7 labels of our tag set, with a short description for each. Figure <ref type="figure" target="#fig_2">3</ref>, on the other hand, shows the overall structure of the tag set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">The Annotation Workflow</head><p>After a brief training regarding the task they were going to perform, two annotators were asked to independently annotate all the reviews in our corpus, attributing one (and only one) label to each of their sentences. The choice to take sentences as the basic unit of our analysis allowed us to organize the annotation in a very straightforward way, by providing annotators with tables where each row corresponded to one sentence, to which they had to associate a label in a dedicated column (columns containing the title of the reviewed book, the review and the sentence id numbers were also included). In the cases where an evaluation spanned across multiple sentences, annotators simply had to attribute the same label to each one of them.</p><p>A first partitioning of our corpus, composed by 11 reviews, was used as a toy dataset to train the annotators (annotations were therefore not retained for analysis). The actual annotation work was carried out on the remaining 89 reviews in 4 consecutive rounds, each followed by a meeting during which cases of disagreement were discussed, as well as possible shortcomings of our tagging system. Inter-annotator agreement was computed using Cohen's Kappa. The scores for each round are reported in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Considering the entirety of the tag set ('all_labels'), the coefÏcients are in the range of a moderate agreement <ref type="bibr" target="#b17">[18]</ref>, while their increase over time can be taken as a sign of the efÏcacy of our workflow. Furthermore, the growth of the agreement for the tags 'generic_val' and 'aesthetic' can be seen as a consequence of the consecutive clarifications that we were able to offer during the meetings that followed each round. 
The agreement for the category 'no_val' (i.e., the simple distinction between evaluative and non-evaluative sentences), despite a drop in round 2, remained almost unchanged throughout the entirety of the work. It should be noted that this was by far the most frequent label, accounting for 26% of all annotations. This helps explain the substantial coefficient we reached for this category, which nonetheless registered many instances of disagreement, resulting for the most part from intrinsic ambiguities in our data. Take, for instance, even a sentence as simple as the following: "[this book] is not a horror novel" <ref type="foot" target="#foot_1">3</ref> . In and of itself, this would look like a mere observation. However, with a small interpretative leap, we could see it as a negative evaluation, a way for the reader to express their disappointment in finding out that the book was not what they expected it to be. To make the identification of evaluative sentences as unambiguous as possible, we then decided that for a sentence to be considered as such it needed to contain an explicit evaluation of the reviewed book, or an explicit mention of the impact it had on the reviewer. Although focusing only on explicit cases could be seen as limiting, such an approach allowed us: a) to reduce disagreement between annotators, and b) to gather a corpus of clearly evaluative sentences to successfully train a classifier, if not in recognizing all of our categories, at least in distinguishing evaluative statements from other elements that can be found in a review, like summaries of the book's plot, accounts of the reviewer's personal experiences, and so on.</p><p>The absence of agreement registered for the 'social' label in the first three rounds can be explained by the extremely limited number of occurrences of this category, which made it difficult to develop specific guidelines. 
To understand such scarcity, recall that the label was meant to account for references to socially established evaluative standards, such as prestigious prizes and canonical works. In many cases, such standards are the manifestation of an academic -as in the case of canonical works, see <ref type="bibr" target="#b10">[11]</ref> -, or professional perspective on literature. They are the standards held by that cohesive community of readers that Chervel calls the "literary establishment" <ref type="bibr" target="#b4">[5]</ref>. By not referencing the standards of the latter, Goodreads users simply show that they belong to a different community, characterized by its own standards and rituals (consider the Goodreads Choice Award, a literary prize whose winners are voted exclusively by members of the Goodreads community).</p><p>Rather than giving rise to a stable and universal canon, the social component of literary evaluation on Goodreads seems to operate at the level of specific literary genres. Indeed, both annotators reported noticing that reviews tended to follow genre-specific evaluative patterns. One of these patterns, concerning the over-represented genre of young adult novels (in Figure <ref type="figure" target="#fig_1">2</ref> split between romance and fantasy), could be at the heart of the problems we encountered with the labels in the 'individual' category, reflected in the erratic trends of the corresponding kappa coefficients. Something that became apparent from the first rounds of annotation is that readers of this genre tend to develop very strong personal relationships with the characters in the books they read, either treating them as if they were real persons or completely identifying with them. This phenomenon gave rise to an ambiguity that caused a good part of the disagreement between annotators. 
Take, for instance, the sentence "there is a depth to his character that bit by bit when revealed you can't help but fall that much harder for him because his character demands nothing less" <ref type="foot" target="#foot_2">4</ref> . Apart from the poor grammar, one notes right away that the reviewer is evaluating the way a character is built, which would make the sentence fall under the category 'aesthetic'. However, far from being a neutral assessment of a stylistic feature, the sentence is highly emotionally charged, which would justify the label 'ind_emotional'. Lastly, based on the patent involvement of the reader in what they are reading, one could argue for the attribution of this sentence to the 'ind_pragmatic' category. To resolve such ambiguities, we decided to tag as 'aesthetic' all the sentences that contained an explicit reference to literary art (e.g., plot, characters, style).</p><p>The last step after annotation was a curation phase, during which we 'settled' the cases of disagreement between the annotators and attributed a definitive label to the respective sentences. This was necessary to allow for the subsequent phase, the training of a classifier, for which each sentence in our dataset needed to be assigned one and only one label. In extremely rare cases (less than one percent of the sentences) we also intervened to correct attributions that, despite the annotators' agreement, were blatantly wrong.<ref type="foot" target="#foot_3">5</ref>  </p></div>
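The inter-annotator agreement scores discussed in this section can be reproduced with a few lines of code. This is a generic sketch of Cohen's Kappa over two annotators' label sequences, plus the collapse of the tagset to the evaluative/non-evaluative distinction used for the 'no_val' figures; it is not the project's own script.

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labelling the same sentences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    p_o = sum(a == b for a, b in zip(ann_a, ann_b)) / n  # observed agreement
    freq_a, freq_b = Counter(ann_a), Counter(ann_b)
    labels = set(freq_a) | set(freq_b)
    # chance agreement from each annotator's marginal label distribution
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

def collapse_no_val(labels):
    """Collapse the full tagset to evaluative vs. non-evaluative,
    as done when reporting agreement for the 'no_val' category."""
    return ["no_val" if l == "no_val" else "val" for l in labels]
```

Computing the kappa per round on both the full and the collapsed label sequences yields the two levels of agreement ('all_labels' and 'no_val') reported in Table 1.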
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Traininig the Classifier. BERT vs. GPT</head><p>The following step in our project relied on the annotated dataset to train a classifier able to recognize evaluative acts in large amounts of text. In recent years, two main approaches have become dominant in machine learning for text classification. The first implies fine-tuning base (or task-specific) Transformer models thanks to the knowledge stored in annotated datasets <ref type="bibr" target="#b13">[14]</ref>. The second benefits from the flexibility of Large Language Models, instructing them to perform ex novo the annotation task based on a set of instructions (or examples) <ref type="bibr" target="#b20">[21]</ref>. The materials created in our project offered a valid groundwork for both approaches.</p><p>Analyses have been carried out with a series of Python scripts, which can all be consulted in the project's GitHub repository. <ref type="foot" target="#foot_4">6</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Fine-tuning Bert models</head><p>For the first approach, the 6,014 annotated and curated sentences were used as ground truth to fine-tune multiple Transformer models of the Bert family. We decided to test the following three models:</p><p>• google-bert/bert-base-uncased (from now on, referred to as google-bert), as representative of a large, general-purpose model for English language; • LiYuan/amazon-review-sentiment-analysis (from now on, LiYuan), because it was finetuned on a similar kind of data (general product reviews) for a similar kind of task (sentiment analysis); • JoelVIU/bert-base-uncased-finetuned-amazon_reviews_books (from now on, JoelVIU ), because it was finetuned on an even more coherent dataset (Amazon book reviews).</p><p>Given the already-discussed issues of unbalance and underrepresentation in the annotation labels, we decided to simplify them by adopting two different strategies:</p><p>our tag set and guidelines were addressed during the course of the annotation itself, thanks to the insights we gathered at each new step. 1. we reduced the labels to three classes: 'eval_individual' (under which the three categories related to the impact on the individual reviewer were merged), 'eval_generic' (under which all the evaluation categories were merged), and 'no_val'; 2. we further reduced the labels to two classes: 'eval' (under which all the evaluation/impact labels were merged) and 'no_val'.</p><p>Testing was first performed via a 5-fold cross validation, to establish which one among the chosen models was the most performant. As shown by Table <ref type="table" target="#tab_1">2</ref>, performance depended on the task, with the JoelVIU model slightly outperforming the others for the three-label classification and google-bert producing the best results for the binary classification. 
However, it should be noted that the three-label setup produced low F1-macro scores because of the substantial failure in classifying the 'eval_individual' sentences (F1 scores for that label were never higher than 0.144 across models/folds), probably due to the very low number of available samples in the dataset (444 sentences, corresponding to 7.4% of the total). In light of these considerations, we decided to use the google-bert model as a reference point for our subsequent analyses.</p><p>A second level of analysis concerned whether the number of annotated sentences was sufficient to effectively train the classifier. To address this question, we performed a series of 5-fold cross validations by fine-tuning the model with an increasing number of sentences (from 600 to 6,000, selected randomly from the dataset). Figure <ref type="figure" target="#fig_3">4</ref> shows how, for both the binary and three-class classification, F1-macro scores reach a plateau at around 3,000 sentences, thus suggesting that the learning threshold for the classifier is substantially lower than the number of annotated sentences.</p><p>A third and final level of analysis was designed to obtain efficiency scores comparable to the ones obtained with Large Language Models (see Section 4.2). Here the selection strategy was changed slightly, by randomly choosing entire reviews instead of single sentences, without performing any k-fold cross validation. 
After testing different configurations, we selected one with 18 reviews in the test set (corresponding precisely to 20% of the sentences in the dataset, with comparable distributions of labels), which produced efficiency scores similar to the ones obtained before (for an overview, see Tables <ref type="table" target="#tab_3">3 and 4</ref>).</p><p>Given the unsatisfactory performance reached on the three-label dataset (the model simply ignored the under-represented 'eval_individual' category), we decided to share only the model fine-tuned for the binary classification of evaluative and non-evaluative sentences, which can now be freely accessed on Hugging Face.<ref type="foot" target="#foot_5">7</ref> </p></div>
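The two label-reduction strategies and the 5-fold splitting described in this section can be sketched as follows. This is an illustrative reconstruction, not the project's script: the mapping assumes the three 'ind_*' labels merge into 'eval_individual' and the remaining evaluative labels into 'eval_generic', and the fold splitter stands in for whatever cross-validation utility the actual pipeline used.

```python
import random

# Strategy 1: reduce the 7 annotation labels to three classes.
THREE_CLASS = {
    "no_val": "no_val",
    "ind_cognitive": "eval_individual",
    "ind_pragmatic": "eval_individual",
    "ind_emotional": "eval_individual",
    "generic_val": "eval_generic",
    "aesthetic": "eval_generic",
    "social": "eval_generic",
}

def to_three_class(label):
    return THREE_CLASS[label]

# Strategy 2: further reduce to a binary evaluative/non-evaluative scheme.
def to_binary(label):
    return "no_val" if label == "no_val" else "eval"

def k_fold_indices(n, k=5, seed=0):
    """Shuffled (train, test) index splits for a k-fold cross validation
    over n annotated sentences."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]
```

In each fold, the training indices would select the sentences used to fine-tune the Bert model (e.g., via the Hugging Face `transformers` Trainer), and the held-out test indices the sentences used to compute the F1-macro scores.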
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Instructing GPT 4</head><p>For the second approach, the annotation guidelines were used as a starting point to develop system prompts for the GPT-4 Large Language Model. By situating itself in the emergent area of "prompt engineering" <ref type="bibr" target="#b3">[4]</ref>, the work here became more exploratory, aiming at the identification of the best technique to instruct the model.</p><p>In the first phase of testing, the overall prompting strategy was a simple zero-shot (i.e., instructions plus input), with system prompts (i.e., the instructions) composed by adopting three different approaches:</p><p>• The first, defined as complex, was a straightforward adaptation of the annotation guidelines in their latest stage. Note that guidelines were loosely structured, with occasional repetitions and multiple addenda; • The second, defined as simple, was a drastic simplification of the annotation guidelines, extracting only the most relevant information and giving it a simpler structure; • The third, defined as procedural, was inspired by the chart in Figure <ref type="figure" target="#fig_2">3</ref> and by the "Tree of Thoughts" (ToT) prompting technique <ref type="bibr" target="#b3">[4]</ref>, producing a more structured prompt, where the assignment of a label was the result of a set of nested choices.</p><p>These three approaches were then combined with three different tagsets:</p><p>• The first, defined as full, adopted the 7 labels originally used by the annotators (note that this setup was not even tested with Bert models, because of the scarcity of annotations for some labels); • The second, defined as 3-class, reduced the labels to the first setup with Bert models: 'eval_individual', 'eval_generic', and 'no_val'; • The third, defined as binary, further reduced the labels to the second Bert setup: 'val' and 'no_val'.</p><p>For each approach, the system prompt produced with the full tagset was then adapted 
into 3-class and binary by performing the minimum possible amount of modifications, so as to keep the core of the prompt intact. The final result was a set of 9 different system prompts, which can be consulted in Appendix B.</p><p>To profit from the ability of GPT-4 to process large amounts of text, and to emulate as closely as possible the work of the annotators, we decided to give it entire reviews split into sentences as input. The user prompt was therefore structured as a .csv file with three columns: 'book_title' (containing the title of the reviewed book), 'sentence_id' (with a numeric identifier of the sentence), and 'sentence' (with the sentence text). At each trial, GPT-4 was prompted 18 times, with the 18 reviews identified with the criteria described in Section 4.1.</p><p>Requests were processed by using the "gpt-4o-2024-05-13" model (with temperature set to 0 to produce the most deterministic behavior) on the OpenAI API. The whole operation cost a total of $3. Comparisons of the GPT-4 annotations with the ground truth are shown in Table <ref type="table" target="#tab_4">5</ref>. Note that the adaptation of the full tagset to 3-class and binary, and of the 3-class tagset to binary, could also be accomplished ex post (i.e., after having performed the annotation with GPT-4).</p><p>Overall, GPT-4 efficiency is higher than that of fine-tuned Bert models for the 3-class condition (in fact, GPT-4 performs substantially better on the 'eval_individual' label, overcoming the issue of its underrepresentation in the dataset), while it is lower for the binary condition. When comparing the different system prompts, the complex approach performs slightly better than the procedural one, suggesting that a structured prompt may not be necessary to obtain the highest efficiency. 
Finally, ex-post tagset adaptation produces mixed results (with the best efficiency for the 3-class and binary tagsets obtained with and without adaptation, respectively), even if in the majority of cases (6 out of 9) it worsens efficiency. Tables <ref type="table" target="#tab_6">6, 7</ref>, and 8 show in more detail the efficiency of the best-performing setups.</p><p>In the second phase of testing, we adopted a "few-shot" strategy, using as a basis the best-performing setup (complex approach with binary tagset). The "few-shot" strategy implies providing not only instructions, but also examples for performing the task. Examples were extracted from the remaining 71 reviews, selecting the ones that showed the proportion of 'val' vs. 'no_val' labels closest to the overall mean in the dataset (i.e., the most "balanced" ones). These reviews were presented to the model together with the curated annotations, before asking it to annotate the new reviews. Three different tests were then performed, with an increasing number of sample reviews:</p><p>• two reviews, corresponding to 172 sentences (the whole operation cost $0.66);</p><p>• four reviews, corresponding to 354 sentences ($1.21);</p><p>• eight reviews, corresponding to 596 sentences ($1.95).</p><p>As costs increased substantially, we decided not to test larger selections of sample reviews. Also, quite surprisingly, this prompting technique had a detrimental effect on the efficiency of the model, with F1-macro scores decreasing to 0.780 (with two sample reviews), 0.754 (four reviews), and 0.724 (eight reviews).</p></div>
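The zero-shot request format described in this section, with one review per prompt serialized as a three-column .csv, can be sketched as follows. This is an illustrative reconstruction under the assumptions stated in the text (column names, one request per review); with the OpenAI Python client, the returned messages would be passed to `client.chat.completions.create(model="gpt-4o-2024-05-13", temperature=0, messages=...)`.

```python
import csv
import io

def build_user_prompt(book_title, sentences):
    """Serialize one review as the CSV user prompt of Section 4.2:
    columns 'book_title', 'sentence_id', 'sentence'."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["book_title", "sentence_id", "sentence"])
    for i, sentence in enumerate(sentences):
        writer.writerow([book_title, i, sentence])
    return buf.getvalue()

def build_messages(system_prompt, book_title, sentences):
    """Messages for one zero-shot annotation request: the system prompt
    carries the (complex/simple/procedural) instructions, the user
    message carries the review to annotate."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": build_user_prompt(book_title, sentences)},
    ]
```

A few-shot variant would simply prepend example reviews and their curated annotations as additional user/assistant message pairs before the review to be annotated.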
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>We believe that our work can contribute to the field in several ways. First, the development of a tag set to capture evaluative criteria in unstructured reviews enriches the current search for possible ways to operationalize literary evaluation and study it from an empirical perspective <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b5">6]</ref>. Furthermore, the application of the aforementioned tag set during the work of annotation revealed some interesting features of the material analyzed. First, let's note that many scholars seem to interpret online book reviewing as animated by a distrust for (if not a full-fledged opposition to) more traditional practices of literary criticism, as embodied by bourdeauian "gatekeepers" <ref type="bibr" target="#b2">[3]</ref> such as publishers, professors or jurors of a literary prize. For Franzen, online critics are characterized by "a general distrust of established institutions of aesthetic opinion formation" <ref type="bibr">[9, p. 3]</ref>. Even when it is not so clearly thematized, a similar view of today's critical landscape is implicit in many works that aim at unveiling the differences between lay criticism and its traditional counterpart <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b28">29]</ref>. Last, scholars have interpreted the decline of journalistic criticism as an effect of the rise of internet's rating culture, which, by empowering readers and allowing them to express their own judgements, brings to the slow and inevitable demise of those figures once charged with directing the public's taste <ref type="bibr" target="#b4">[5]</ref>. 
Contrasting such interpretations, the reviewers in our corpus showed no concern for established critical discourse, as demonstrated by the extremely low number of instances of the 'social' label.</p><p>The detection of genre-specific evaluative patterns is another interesting finding that deserves further investigation in future work. Two interpretations of this result can be hypothesized: first, that the different features of books of different genres <ref type="bibr" target="#b7">[8]</ref> somehow 'call for' different evaluative approaches, or, second, that different literary genres have different social constituencies, that is, they are read by different 'kinds of persons' <ref type="bibr" target="#b16">[17]</ref>, who, in turn, elaborate different evaluative patterns. Unfortunately, our data do not allow us to effectively test either of these hypotheses.</p><p>Last, many researchers are searching for computational ways to analyze the evaluation of books through user-generated online reviews <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b6">7]</ref>. We contribute to this particular line of research by developing two models for the classification, in online book reviews, of: a) evaluative and non-evaluative sentences and b) generic evaluations, evaluations based on the impact of the book on the reader, and non-evaluative sentences. Furthermore, the comparison between the performance of these models and that of GPT-4 confirms the findings of recent studies <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b23">24]</ref>, demonstrating the validity of employing LLMs for classification tasks, where they can attain results that are comparable to those of fine-tuned Transformer models. Further research is needed on the reasons for the failure of the "few-shot" prompting technique, which could equally be ascribed to limitations of LLMs or to the complexity of the annotation task. 
However, this result highlights the importance of applying LLMs to datasets like ours, which could pose new challenges and stimuli for their development.</p><p>• no_val: The sentence does not express an evaluation of the reviewed book. • generic_val: Into this category fall all those evaluations related to the work as a whole ("beautiful book," "highly recommended," "an unbearable read," and so on). Also, tag with this category all evaluative sentences that do not fall into any of the following categories. • aesthetic: Any evaluation concerning the specifics of literary language, both in its formal aspects (use of rhetorical figures, writing style, etc.) and content aspects (character or plot construction, narratological features, etc.). • social: This value concerns the impact that a book has had not on a single reader, but on a community of readers (references to literary awards, to the popularity of the book).</p><p>Of particular interest here are all those judgments that seek to enact (or to reaffirm) a canonization of the judged work, that is, to place it on the roster of 'important' readings. • ind_cognitive: Evaluations regarding the cognitive impact of a book on a reader, the information that the latter extracted from the former, or the intellectual stimulation they experienced while reading it. • ind_pragmatic: Evaluations regarding the impact of the book on the reader's life or the existential "lessons" that the latter learned from the former. • ind_emotional: Evaluations regarding the emotional impact of the book on the reader, the way the former made the latter feel.</p></div>
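The coarser 3-class and binary tagsets used in the experiments collapse the full tagset above by grouping labels; the mapping below is a sketch of such an ex-post adaptation (the groupings follow the definitions in the Appendix B prompts, but the constant and function names are ours):

```python
# Sketch: collapse the full tagset into the 3-class and binary tagsets.
# 'aesthetic', 'social', and 'generic_val' merge into 'eval_generic';
# the three ind_* labels merge into 'eval_individual'.
TO_3_CLASS = {
    "no_val": "no_val",
    "aesthetic": "eval_generic",
    "social": "eval_generic",
    "generic_val": "eval_generic",
    "ind_pragmatic": "eval_individual",
    "ind_emotional": "eval_individual",
    "ind_cognitive": "eval_individual",
}

def to_binary(label: str) -> str:
    """Every evaluative label collapses to 'val'."""
    return "no_val" if label == "no_val" else "val"

labels = ["aesthetic", "ind_emotional", "no_val"]
print([TO_3_CLASS[l] for l in labels])  # ['eval_generic', 'eval_individual', 'no_val']
print([to_binary(l) for l in labels])   # ['val', 'val', 'no_val']
```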
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. The System Prompts</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1. complex_full</head><p>You will receive as input a .csv table with the following structure: book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines.</p><p>When the sentence does not express an evaluation of the reviewed book, assign the label "no_val".</p><p>When the sentence expresses an evaluation of the reviewed book, you will have to choose between six different labels.</p><p>1. "aesthetic": Any evaluation concerning the specifics of literary language, both in its formal aspects (use of rhetorical figures, writing style, etc...) and content aspects (character or plot construction, narratological features, etc...).</p><p>The following three labels do not refer to features present in the text (such as formal values), but rather to the impact it had on the reader. They are divided into:</p><p>2. "ind_pragmatic": Evaluations regarding the impact of the book on the reader's life, the existential "lessons" that the latter learned from the former.</p><p>3. "ind_emotional": Evaluation regarding the emotional impact of the book on the reader, the way the former made the latter feel.</p><p>4. "ind_cognitive": Evaluation regarding the cognitive impact of a book on a reader, the information that the latter extracted from the former or the intellectual stimulation they experienced while reading it.</p><p>The last two labels are: 5. "social": This value concerns the impact that a book has had not on a single reader, but on a community of readers (references to literary awards, to the popularity of the book). 
Of particular interest here are all those judgments that seek to enact (or to reaffirm) a canonization of the judged work, that is, to place it on the roster of 'important' readings.</p><p>6. "generic_val": Into this category fall all those evaluations related to the work as a whole ("beautiful book," "highly recommended," "an unbearable read," and so on). Also, tag with this category all evaluative sentences that do not fall into any of the above categories.</p><p>In assigning one (and only one) of these labels to each sentence, please follow these generic guidelines:</p><p>-Work on individual sentences. ONLY in cases where one sentence is incomprehensible without the next, or expresses a concept that necessarily requires continuation in the next, treat the two as a single block and assign them the same label.</p><p>-If judgments related to several categories are made in a sentence, tag it with the category that seems the most important to you.</p><p>-Judgements regarding other books than the one reviewed, or judgments related to past readings of the same book must be tagged as "no_val".</p><p>-What does NOT constitute an evaluation: interpretations ("the author wishes to express... " and the like), personal anecdotes ("I first read the book when I was in college"), plot summaries.</p><p>-"ind_pragmatic" must contain an explicit reference to the reviewer's real life or experience.</p><p>-"aesthetic" must have explicit reference to literary art (plot, characters, style).</p><p>-An evaluation must specify, explicitly, its object.</p><p>-An evaluation may have neutral, or ambiguous ('mixed feelings') valence.</p><p>-Any comparison, resulting in the priority of one over the other, between the book in question and other cultural products is to be considered evaluative.</p><p>-All references to story, characters, style, or any other features that relate back to the writing are tagged as aesthetic regardless of the simplicity of the rating. 
Example: I like the story, I like the characters, etc.</p><p>-"generic_val" consists of all those statements not accompanied by explanation, i.e., expressions of appreciation not related to specific aspects of the book or to specific effects it had on the reader.</p></div>
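The input/output contract described by this prompt (and repeated by those that follow) can be sketched in a few lines; the helper names are ours, and the model call itself is omitted (any chat-completion API would slot in between the two helpers):

```python
import csv
import io

def build_input_csv(book_title, sentences):
    """Build the book_title,sent_id,sentence table the prompt expects."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["book_title", "sent_id", "sentence"])
    for i, sent in enumerate(sentences, start=1):
        writer.writerow([book_title, i, sent])
    return buf.getvalue()

def parse_output_csv(text):
    """Parse the sent_id,label table the model returns."""
    rows = csv.DictReader(io.StringIO(text))
    return {int(r["sent_id"]): r["label"] for r in rows}

table = build_input_csv("The Hobbit", ["Loved the plot.", "I read it in 2010."])
# ... 'table' would be sent to the model together with the system prompt ...
labels = parse_output_csv("sent_id,label\n1,val\n2,no_val\n")
print(labels)  # {1: 'val', 2: 'no_val'}
```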
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. complex_3-class</head><p>You will receive as input a .csv table with the following structure:</p><p>book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines.</p><p>When the sentence does not express an evaluation of the reviewed book, assign the label "no_val".</p><p>When the sentence expresses an evaluation of the reviewed book, you will have to choose between two different labels.</p><p>Assign the label "eval_generic" for any evaluation concerning the specifics of literary language, both in its formal aspects (use of rhetorical figures, writing style, etc...) and content aspects (character or plot construction, narratological features, etc...).</p><p>The other label, "eval_individual" does not refer to features present in the text (such as formal values), but rather to the impact it had on the reader. It can be used for:</p><p>1. Evaluations regarding the impact of the book on the reader's life, the existential "lessons" that the latter learned from the former.</p><p>2. Evaluation regarding the emotional impact of the book on the reader, the way the former made the latter feel.</p><p>3. Evaluation regarding the cognitive impact of a book on a reader, the information that the latter extracted from the former or the intellectual stimulation they experienced while reading it.</p><p>The label "eval_generic" can also be interpreted in two additional ways:</p><p>1. This label concerns the impact that a book has had not on a single reader, but on a community of readers (references to literary awards, to the popularity of the book). 
Of particular interest here are all those judgments that seek to enact (or to reaffirm) a canonization of the judged work, that is, to place it on the roster of 'important' readings.</p><p>2. Into this category fall all those evaluations related to the work as a whole ("beautiful book," "highly recommended," "an unbearable read," and so on). Also, tag with this category all evaluative sentences that do not fall into any of the above cases.</p><p>In assigning one (and only one) of these labels to each sentence, please follow these generic guidelines:</p><p>-Work on individual sentences. ONLY in cases where one sentence is incomprehensible without the next, or expresses a concept that necessarily requires continuation in the next, treat the two as a single block and assign them the same label.</p><p>-If judgments related to several categories are made in a sentence, tag it with the category that seems the most important to you.</p><p>-Judgements regarding other books than the one reviewed, or judgments related to past readings of the same book must be tagged as "no_val".</p><p>-What does NOT constitute an evaluation: interpretations ("the author wishes to express... " and the like), personal anecdotes ("I first read the book when I was in college"), plot summaries.</p><p>-"eval_individual" must contain an explicit reference to the reviewer's real life or experience.</p><p>-An evaluation must specify, explicitly, its object.</p><p>-An evaluation may have neutral, or ambiguous ('mixed feelings') valence.</p><p>-Any comparison, resulting in the priority of one over the other, between the book in question and other cultural products is to be considered evaluative.</p><p>-All references to story, characters, style, or any other features that relate back to the writing are tagged as "eval_generic" regardless of the simplicity of the rating. 
Example: I like the story, I like the characters, etc.</p><p>-"eval_generic" can consist of all those statements not accompanied by explanation, i.e., expressions of appreciation not related to specific aspects of the book or to specific effects it had on the reader.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3. complex_binary</head><p>You will receive as input a .csv table with the following structure:</p><p>book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines.</p><p>When the sentence does not express an evaluation of the reviewed book, assign the label "no_val".</p><p>When the sentence expresses an evaluation of the reviewed book, you will have to use the label "val".</p><p>Assign this label for any evaluation concerning the specifics of literary language, both in its formal aspects (use of rhetorical figures, writing style, etc...) and content aspects (character or plot construction, narratological features, etc...).</p><p>Assign the label "val" also when the evaluation does not refer to features present in the text (such as formal values), but rather to the impact it had on the reader. It can be used for:</p><p>1. Evaluations regarding the impact of the book on the reader's life, the existential "lessons" that the latter learned from the former.</p><p>2. Evaluation regarding the emotional impact of the book on the reader, the way the former made the latter feel.</p><p>3. Evaluation regarding the cognitive impact of a book on a reader, the information that the latter extracted from the former or the intellectual stimulation they experienced while reading it.</p><p>The label "val" can also be interpreted in two additional ways: 1. This label concerns the impact that a book has had not on a single reader, but on a community of readers (references to literary awards, to the popularity of the book). 
Of particular interest here are all those judgments that seek to enact (or to reaffirm) a canonization of the judged work, that is, to place it on the roster of 'important' readings.</p><p>2. Into this category fall all those evaluations related to the work as a whole ("beautiful book," "highly recommended," "an unbearable read," and so on). Also, tag with this category all evaluative sentences that do not fall into any of the above cases.</p><p>In assigning one (and only one) of these labels to each sentence, please follow these generic guidelines:</p><p>-Work on individual sentences. ONLY in cases where one sentence is incomprehensible without the next, or expresses a concept that necessarily requires continuation in the next, treat the two as a single block and assign them the same label.</p><p>-If judgments related to several categories are made in a sentence, tag it with the category that seems the most important to you.</p><p>-Judgements regarding other books than the one reviewed, or judgments related to past readings of the same book must be tagged as "no_val".</p><p>-What does NOT constitute an evaluation: interpretations ("the author wishes to express... " and the like), personal anecdotes ("I first read the book when I was in college"), plot summaries.</p><p>-An evaluation must specify, explicitly, its object.</p><p>-An evaluation may have neutral, or ambiguous ('mixed feelings') valence.</p><p>-Any comparison, resulting in the priority of one over the other, between the book in question and other cultural products is to be considered evaluative.</p><p>-All references to story, characters, style, or any other features that relate back to the writing are tagged as "val" regardless of the simplicity of the rating. 
Example: I like the story, I like the characters, etc.</p><p>-"val" can consist of all those statements not accompanied by explanation, i.e., expressions of appreciation not related to specific aspects of the book or to specific effects it had on the reader.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4. simple_full</head><p>You will receive as input a .csv table with the following structure:</p><p>book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines:</p><p>-"no_val", when the sentence does not evaluate the reviewed book; -"aesthetic", any evaluation concerning the specifics of literary language, both in its formal and content aspects;</p><p>-"ind_pragmatic", evaluations regarding the impact of the book on the reader's life; -"ind_emotional": it is about the value the reader places on the work based on what it made him or her feel. It can range from aspects more related to the book itself, to more intimate and personal issues;</p><p>-"ind_cognitive": in this category fall all considerations of a book's ability to teach the reader something or stimulate him intellectually;</p><p>-"social": this value concerns the impact that a book has had not on a single reader, but on a community of readers (references to literary awards, to the popularity of the book); -"generic_val": tag with this label all evaluative sentences that do not fall into any of the above categories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.5. simple_3-class</head><p>You will receive as input a .csv table with the following structure: book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines:</p><p>-"no_val", when the sentence does not evaluate the reviewed book; -"eval_individual", evaluations regarding the impact of the book on the reader's life; the value the reader places on the work based on what it made him or her feel (it can range from aspects more related to the book itself, to more intimate and personal issues); all considerations of a book's ability to teach the reader something or stimulate him intellectually;</p><p>-"eval_generic", any evaluation concerning the specifics of literary language, both in its formal and content aspects; it also concerns the impact that a book has had not on a single reader, but on a community of readers (references to literary awards, to the popularity of the book); tag with this label all evaluative sentences that do not fall into any of the above categories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.6. simple_binary</head><p>You will receive as input a .csv table with the following structure:</p><p>book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines:</p><p>-"no_val", when the sentence does not evaluate the reviewed book; -"val", evaluations regarding the impact of the book on the reader's life; the value the reader places on the work based on what it made him or her feel (it can range from aspects more related to the book itself, to more intimate and personal issues); all considerations of a book's ability to teach the reader something or stimulate him intellectually; any evaluation concerning the specifics of literary language, both in its formal and content aspects; it also concerns the impact that a book has had not on a single reader, but on a community of readers (references to literary awards, to the popularity of the book); tag with this label all evaluative sentences that do not fall into any of the above categories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.7. procedural_full</head><p>You will receive as input a .csv table with the following structure: book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines.</p><p>Guideline 1: You will have to treat each sentence as a single unit of meaning. Therefore, you will assign a label based on the sentence alone. Only in cases where one sentence is incomprehensible without the next, treat the two as a single block and assign them the same label.</p><p>Guideline 2: You will have to assign one (and only one) label to each sentence. If several labels can be assigned to one sentence, choose the label that fits best with the sentence.</p><p>Guideline 3: Possible labels are: "no_val", "aesthetic", "ind_pragmatic", "ind_emotional", "ind_cognitive", "social", "generic_val". Labels can be grouped into two main categories: nonevaluative sentences and evaluative sentences. When assigning a label to a sentence, you will have to (1) identify the best-fitting category and (2) choose the best-fitting label.</p><p>Possible labels are categorized and described here below. Category 1: Non-evaluative sentences.</p><p>A non-evaluative sentence is a sentence that does not express an evaluation of the reviewed book, or that does not explicitly describe the impact it had on the reader.</p><p>Category 1, Label 1: "no_val" Use this label for all non-evaluative sentences. Use it also when the sentence expresses evaluations regarding other books than the one reviewed, or evaluations related to past readings of the reviewed book. Use it also in the case of: interpretations ("the author wishes to express... 
" and the like), personal anecdotes ("I first read the book when I was in college"), plot summaries.</p><p>Category 2: Evaluative sentences. An evaluative sentence expresses an evaluation of the reviewed book, or it explicitly describes the impact it had on the reader. An evaluative sentence may have positive, negative, or ambiguous ('mixed feelings') valence. An evaluative sentence must specify, explicitly, its object. Any comparison, resulting in the priority of one over the other, between the reviewed book and other cultural products is to be considered as an evaluative sentence.</p><p>Category 2, Label 1: "aesthetic" Use this label when the sentence explicitly expresses an evaluation concerning the specifics of literary language, both in its formal aspects (use of rhetorical figures, writing style, etc...) and content aspects (character or plot construction, narratological features, etc...). The evaluation must have explicit reference to literary art (e.g., plot, characters, style).</p><p>Category 2, Label 2: "ind_pragmatic" Use this label when the sentence explicitly describes the impact the reviewed book had on the reader. In particular, use it for sentences regarding the impact of the book on the reader's life, the existential 'lessons' that the reader learned from the book. The sentence must contain an explicit reference to the reviewer's real life or experience.</p><p>Category 2, Label 3: "ind_emotional" Use this label when the sentence explicitly describes the impact the reviewed book had on the reader. In particular, use it for sentences regarding the emotional impact of the book on the reader, the way the former made the latter feel. Still, be careful in not assigning this label to sentences that use an emotional language in a generic and/or metaphorical way (e.g. "I loved this book," "I enjoyed the reading").</p><p>Category 2, Label 4: "ind_cognitive" Use this label when the sentence explicitly describes the impact the reviewed book had on the reader. 
In particular, use it for sentences regarding the importance of the information that the reader extracted from the book, or the intellectual stimulation the reader experienced while reading it. Category 2, Label 5: "social" Use this label when the sentence evaluates a book not based on the experience of a single reader, but on the experience of a community of readers (e.g., references to literary awards, to the popularity of the book). Of particular interest here are all those judgments that enact (or reaffirm) a canonization of the judged work, placing it on the roster of 'important' readings.</p><p>Category 2, Label 6: "generic_val" Use this label for all those statements not accompanied by explanation, i.e., expressions of appreciation not related to specific aspects of the book or to specific effects it had on the reader (e.g., "beautiful book," "highly recommended," "an unbearable read"). Also, tag with this category all sentences that do not fall into any of the above categories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.8. procedural_3-class</head><p>You will receive as input a .csv table with the following structure:</p><p>book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines.</p><p>Guideline 1: You will have to treat each sentence as a single unit of meaning. Therefore, you will assign a label based on the sentence alone. Only in cases where one sentence is incomprehensible without the next, treat the two as a single block and assign them the same label.</p><p>Guideline 2: You will have to assign one (and only one) label to each sentence. If several labels can be assigned to one sentence, choose the label that fits best with the sentence. Guideline 3: Possible labels are: "no_val", "eval_generic", and "eval_individual". They are described here below. Label 1: "no_val" Use this label for all non-evaluative sentences. A non-evaluative sentence is a sentence that does not explicitly express an evaluation of the reviewed book, or that does not explicitly describe the impact it had on the reader. Use it also when the sentence expresses evaluations regarding other books than the one reviewed, or evaluations related to past readings of the reviewed book. Use it also in the case of: interpretations ("the author wishes to express... " and the like), personal anecdotes ("I first read the book when I was in college"), plot summaries. Label 2: "eval_individual" Use this label when the sentence explicitly describes the impact the reviewed book had on the reader.</p><p>In particular, use it for sentences regarding the impact of the book on the reader's life, the existential 'lessons' that the reader learned from the book. 
The sentence must contain an explicit reference to the reviewer's real life or experience.</p><p>Use it also for sentences regarding the emotional impact of the book on the reader, the way the former made the latter feel. Still, be careful in not assigning this label to sentences that use an emotional language in a generic and/or metaphorical way (e.g. "I loved this book," "I enjoyed the reading").</p><p>Finally, use it for sentences regarding the importance of the information that the reader extracted from the book, or the intellectual stimulation the reader experienced while reading it.</p><p>Label 3: "eval_generic" Use this label for all evaluative sentences. An evaluative sentence is a sentence that explicitly expresses an evaluation of the reviewed book. An evaluative sentence may have positive, negative, or ambiguous ('mixed feelings') valence. An evaluative sentence must specify, explicitly, its object. Any comparison, resulting in the priority of one over the other, between the reviewed book and other cultural products is to be considered as an evaluative sentence.</p><p>Use this label when the sentence explicitly expresses an evaluation concerning the specifics of literary language, both in its formal aspects (use of rhetorical figures, writing style, etc...) and content aspects (character or plot construction, narratological features, etc...). The evaluation must have explicit reference to literary art (e.g., plot, characters, style).</p><p>In addition, use this label when the sentence evaluates a book not based on the experience of a single reader, but on the experience of a community of readers (e.g., references to literary awards, to the popularity of the book). 
Of particular interest here are all those judgments that enact (or reaffirm) a canonization of the judged work, placing it on the roster of 'important' readings.</p><p>Finally, use this label for all those statements not accompanied by explanation, i.e., expressions of appreciation not related to specific aspects of the book or to specific effects it had on the reader (e.g., "beautiful book," "highly recommended," "an unbearable read"). Also, tag with this category all evaluative sentences that do not fall into any of the above cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.9. procedural_binary</head><p>You will receive as input a .csv table with the following structure: book_title,sent_id,sentence The table includes the review of one book (identified by "book_title"), split into sentences. You will have to produce as output another .csv table with the following structure: sent_id,label You will assign the label to each sentence (even when the sentence is broken or incomplete) by following these guidelines.</p><p>Guideline 1: You will have to treat each sentence as a single unit of meaning. Therefore, you will assign a label based on the sentence alone. Only in cases where one sentence is incomprehensible without the next, treat the two as a single block and assign them the same label.</p><p>Guideline 2: You will have to assign one (and only one) label to each sentence. If several labels can be assigned to one sentence, choose the label that fits best with the sentence.</p><p>Guideline 3: Possible labels are: "no_val" and "val". They are described here below. Label 1: "no_val" Use this label for all non-evaluative sentences. A non-evaluative sentence is a sentence that does not explicitly express an evaluation of the reviewed book, or that does not explicitly describe the impact it had on the reader. Use it also when the sentence expresses evaluations regarding other books than the one reviewed, or evaluations related to past readings of the reviewed book. Use it also in the case of: interpretations ("the author wishes to express... " and the like), personal anecdotes ("I first read the book when I was in college"), plot summaries. Label 2: "val" Use this label for sentences expressing the impact of the reviewed book on the reader and for all evaluative sentences.</p><p>Use this label when the sentence explicitly describes the impact the reviewed book had on the reader. 
In particular, use it for sentences regarding the impact of the book on the reader's life, the existential 'lessons' that the reader learned from the book. The sentence must contain an explicit reference to the reviewer's real life or experience.</p><p>Use it also for sentences regarding the emotional impact of the book on the reader, the way the former made the latter feel. Still, be careful in not assigning this label to sentences that use an emotional language in a generic and/or metaphorical way (e.g. "I loved this book," "I enjoyed the reading").</p><p>Finally, use it for sentences regarding the importance of the information that the reader extracted from the book, or the intellectual stimulation the reader experienced while reading it.</p><p>Use this label also for evaluative sentences. An evaluative sentence is a sentence that explicitly expresses an evaluation of the reviewed book. An evaluative sentence may have positive, negative, or ambiguous ('mixed feelings') valence. An evaluative sentence must specify, explicitly, its object. Any comparison, resulting in the priority of one over the other, between the reviewed book and other cultural products is to be considered as an evaluative sentence.</p><p>Use this label when the sentence explicitly expresses an evaluation concerning the specifics of literary language, both in its formal aspects (use of rhetorical figures, writing style, etc...) and content aspects (character or plot construction, narratological features, etc...). The evaluation must have explicit reference to literary art (e.g., plot, characters, style).</p><p>In addition, use this label when the sentence evaluates a book not based on the experience of a single reader, but on the experience of a community of readers (e.g., references to literary awards, to the popularity of the book). 
Of particular interest here are all those judgments that enact (or reaffirm) a canonization of the judged work, placing it on the roster of 'important' readings.</p><p>Finally, use this label for all those statements not accompanied by explanation, i.e., expressions of appreciation not related to specific aspects of the book or to specific effects it had on the reader (e.g., "beautiful book," "highly recommended," "an unbearable read"). Also, tag with this category all evaluative sentences that do not fall into any of the above cases.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Distributions of books per genre</figDesc><graphic coords="3,193.46,84.17,208.36,153.93" type="bitmap" /></figure>
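The input/output contract described in the B.9 prompt (a sentence table in, a label table out) can be sketched as follows. This is a minimal illustration only, not part of the paper's pipeline: `classify_sentence` is a hypothetical keyword stand-in for the model call that actually assigns "val"/"no_val".

```python
import csv
import io

def classify_sentence(sentence):
    # Hypothetical stand-in for the model call: flags a sentence as
    # evaluative ("val") if it contains a crude evaluative cue drawn
    # from the guideline's examples; otherwise "no_val". In the
    # experiment these labels come from the LLM, not from keywords.
    cues = ("beautiful", "recommended", "unbearable", "masterpiece")
    return "val" if any(c in sentence.lower() for c in cues) else "no_val"

def annotate_review(csv_text):
    """Read a book_title,sent_id,sentence table and return a
    sent_id,label table, as the B.9 prompt requires."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["sent_id", "label"])
    for row in reader:
        writer.writerow([row["sent_id"], classify_sentence(row["sentence"])])
    return out.getvalue()
```

One label per sent_id, and every sentence (even broken ones) receives exactly one label, mirroring Guidelines 1 and 2.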
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of reviews per year</figDesc><graphic coords="4,193.46,84.17,208.36,153.93" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Chart of the annotation workflow</figDesc><graphic coords="5,89.28,84.18,416.72,396.01" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: F1-macro scores for the google-bert model, depending on the number of available sentences.</figDesc><graphic coords="9,89.28,84.17,416.72,312.54" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Kappa values for different labels throughout the rounds</figDesc><table><row><cell>Label</cell><cell>Second round</cell><cell>Third round</cell><cell>Fourth round</cell><cell>Fifth round</cell></row><row><cell>all_labels</cell><cell>0.49</cell><cell>0.46</cell><cell>0.51</cell><cell>0.57</cell></row><row><cell>no_val</cell><cell>0.66</cell><cell>0.57</cell><cell>0.64</cell><cell>0.64</cell></row><row><cell>aesthetic</cell><cell>0.43</cell><cell>0.52</cell><cell>0.49</cell><cell>0.52</cell></row><row><cell>ind_cognitive</cell><cell>0.25</cell><cell>0.18</cell><cell>0.16</cell><cell>0</cell></row><row><cell>ind_emotional</cell><cell>0.39</cell><cell>0.05</cell><cell>0.27</cell><cell>0.57</cell></row><row><cell>ind_pragmatic</cell><cell>0.25</cell><cell>0.17</cell><cell>0</cell><cell>0</cell></row><row><cell>social</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0.33</cell></row><row><cell>generic_val</cell><cell>0.35</cell><cell>0.32</cell><cell>0.44</cell><cell>0.47</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Mean efficiency scores for the three models. A full report on the 5-fold cross-validation can be found in our GitHub repository</figDesc><table><row><cell>Model</cell><cell>Metric</cell><cell>3-class</cell><cell>Binary</cell></row><row><cell>google-bert</cell><cell>Accuracy</cell><cell>0.820</cell><cell>0.860</cell></row><row><cell></cell><cell>F1-macro</cell><cell>0.553</cell><cell>0.832</cell></row><row><cell></cell><cell>F1-weighted</cell><cell>0.795</cell><cell>0.860</cell></row><row><cell>LiYuan</cell><cell>Accuracy</cell><cell>0.809</cell><cell>0.848</cell></row><row><cell></cell><cell>F1-macro</cell><cell>0.528</cell><cell>0.818</cell></row><row><cell></cell><cell>F1-weighted</cell><cell>0.782</cell><cell>0.850</cell></row><row><cell>JoelVIU</cell><cell>Accuracy</cell><cell>0.819</cell><cell>0.858</cell></row><row><cell></cell><cell>F1-macro</cell><cell>0.558</cell><cell>0.826</cell></row><row><cell></cell><cell>F1-weighted</cell><cell>0.796</cell><cell>0.858</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Efficiency report for two classes on the sample dataset</figDesc><table><row><cell>Class</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-score</cell><cell>Support</cell></row><row><cell>no_val</cell><cell>0.895</cell><cell>0.933</cell><cell>0.913</cell><cell>854</cell></row><row><cell>val</cell><cell>0.817</cell><cell>0.731</cell><cell>0.772</cell><cell>349</cell></row><row><cell>Accuracy</cell><cell></cell><cell></cell><cell>0.874</cell><cell>1203</cell></row><row><cell>Macro avg</cell><cell>0.856</cell><cell>0.832</cell><cell>0.843</cell><cell>1203</cell></row><row><cell>Weighted avg</cell><cell>0.872</cell><cell>0.874</cell><cell>0.872</cell><cell>1203</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Efficiency report for three classes on the sample dataset</figDesc><table><row><cell>Class</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-score</cell><cell>Support</cell></row><row><cell>eval_generic</cell><cell>0.628</cell><cell>0.705</cell><cell>0.664</cell><cell>244</cell></row><row><cell>eval_individual</cell><cell>0.000</cell><cell>0.000</cell><cell>0.000</cell><cell>105</cell></row><row><cell>no_val</cell><cell>0.871</cell><cell>0.947</cell><cell>0.907</cell><cell>854</cell></row><row><cell>Accuracy</cell><cell></cell><cell></cell><cell>0.815</cell><cell>1203</cell></row><row><cell>Macro avg</cell><cell>0.500</cell><cell>0.551</cell><cell>0.524</cell><cell>1203</cell></row><row><cell>Weighted avg</cell><cell>0.746</cell><cell>0.815</cell><cell>0.779</cell><cell>1203</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>F1-macro scores for the different system prompts</figDesc><table><row><cell></cell><cell cols="3">full</cell><cell cols="2">3-class</cell><cell>binary</cell></row><row><cell>adapted ex post to</cell><cell>-</cell><cell>3-class</cell><cell>binary</cell><cell>-</cell><cell>binary</cell><cell>-</cell></row><row><cell>complex</cell><cell>0.425</cell><cell>0.634</cell><cell>0.758</cell><cell>0.668</cell><cell>0.803</cell><cell>0.825</cell></row><row><cell>simple</cell><cell>0.336</cell><cell>0.531</cell><cell>0.717</cell><cell>0.630</cell><cell>0.801</cell><cell>0.779</cell></row><row><cell>procedural</cell><cell>0.386</cell><cell>0.675</cell><cell>0.797</cell><cell>0.641</cell><cell>0.807</cell><cell>0.803</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Efficiency report for the best-performing prompt (complex_full) on the full tagset</figDesc><table><row><cell>Class</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-score</cell><cell>Support</cell></row><row><cell>aesthetic</cell><cell>0.312</cell><cell>0.863</cell><cell>0.458</cell><cell>131</cell></row><row><cell>generic_val</cell><cell>0.593</cell><cell>0.444</cell><cell>0.508</cell><cell>108</cell></row><row><cell>ind_cognitive</cell><cell>0.263</cell><cell>0.333</cell><cell>0.294</cell><cell>15</cell></row><row><cell>ind_emotional</cell><cell>0.365</cell><cell>0.613</cell><cell>0.458</cell><cell>75</cell></row><row><cell>ind_pragmatic</cell><cell>0.000</cell><cell>0.000</cell><cell>0.000</cell><cell>15</cell></row><row><cell>no_val</cell><cell>0.977</cell><cell>0.691</cell><cell>0.809</cell><cell>854</cell></row><row><cell>social</cell><cell>0.500</cell><cell>0.400</cell><cell>0.444</cell><cell>5</cell></row><row><cell>Accuracy</cell><cell></cell><cell></cell><cell>0.668</cell><cell>1203</cell></row><row><cell>Macro avg</cell><cell>0.430</cell><cell>0.478</cell><cell>0.425</cell><cell>1203</cell></row><row><cell>Weighted avg</cell><cell>0.809</cell><cell>0.668</cell><cell>0.704</cell><cell>1203</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc>Efficiency report for the best-performing prompt (procedural_full, adapted ex-post to 3-class) on the 3-class tagset</figDesc><table><row><cell>Class</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-score</cell><cell>Support</cell></row><row><cell>eval_generic</cell><cell>0.553</cell><cell>0.807</cell><cell>0.657</cell><cell>244</cell></row><row><cell>eval_individual</cell><cell>0.431</cell><cell>0.629</cell><cell>0.512</cell><cell>105</cell></row><row><cell>no_val</cell><cell>0.954</cell><cell>0.775</cell><cell>0.855</cell><cell>854</cell></row><row><cell>Accuracy</cell><cell></cell><cell></cell><cell>0.769</cell><cell>1203</cell></row><row><cell>Macro avg</cell><cell>0.646</cell><cell>0.737</cell><cell>0.675</cell><cell>1203</cell></row><row><cell>Weighted avg</cell><cell>0.827</cell><cell>0.769</cell><cell>0.785</cell><cell>1203</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 8</head><label>8</label><figDesc>Efficiency report for the best-performing prompt (complex_binary) on the binary tagset</figDesc><table><row><cell>Class</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-score</cell><cell>Support</cell></row><row><cell>no_val</cell><cell>0.962</cell><cell>0.808</cell><cell>0.878</cell><cell>854</cell></row><row><cell>val</cell><cell>0.663</cell><cell>0.923</cell><cell>0.771</cell><cell>349</cell></row><row><cell>Accuracy</cell><cell></cell><cell></cell><cell>0.841</cell><cell>1203</cell></row><row><cell>Macro avg</cell><cell>0.812</cell><cell>0.865</cell><cell>0.825</cell><cell>1203</cell></row><row><cell>Weighted avg</cell><cell>0.875</cell><cell>0.841</cell><cell>0.847</cell><cell>1203</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://spacy.io/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://www.goodreads.com/review/show/1373835925</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://www.goodreads.com/review/show/225468198</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">The reason for the presence of such cases is not hard to find, given that, as we have said, several problems with</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://github.com/SimoneRebora/CHR2024</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://huggingface.co/GVezzani/literary_evaluation_classifier</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research was developed in the context of the "Inclusive Humanities" Excellence Project at the Department of Foreign Languages and Literatures of the University of Verona.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Power to the Reader&apos; or &apos;Degradation of Literary Taste&apos;? Professional Critics and Amazon Customers as Reviewers of the Inheritance of Loss</title>
		<author>
			<persName><forename type="first">D</forename><surname>Allington</surname></persName>
		</author>
		<idno type="DOI">10.1177/0963947016652789</idno>
	</analytic>
	<monogr>
		<title level="j">Language and Literature: International Journal of Stylistics</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="254" to="278" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Ashcroft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Griffiths</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Tiffin</surname></persName>
		</author>
		<title level="m">The Empire Writes Back: Theory and Practice in Post-colonial Literatures. New accents</title>
				<meeting><address><addrLine>London; New York</addrLine></address></meeting>
		<imprint>
			<publisher>Routledge</publisher>
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Les Règles de L&apos;Art: Genèse et Structure du Champ Littéraire</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bourdieu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>du Seuil</publisher>
			<pubPlace>Paris</pubPlace>
		</imprint>
	</monogr>
	<note>Points Essais</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Unleashing the Potential of Prompt Engineering in Large Language Models: A Comprehensive Review</title>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Langrené</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arxiv.2310.14735</idno>
		<ptr target="https://arxiv.org/abs/2310.14735" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">18 Die Kritik und ihre Päpste: Rückblick auf ein Genre</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chervel</surname></persName>
		</author>
		<idno type="DOI">10.14361/9783839454435-019</idno>
		<ptr target="https://www.transcript-open.de/doi/10.14361/9783839454435-019" />
	</analytic>
	<monogr>
		<title level="m">Digital Humanities</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Graf</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Knackstedt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Petzold</surname></persName>
		</editor>
		<meeting><address><addrLine>Bielefeld, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>transcript Verlag</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="297" to="302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">The Riddle of Literary Quality: A Computational Approach</title>
		<author>
			<persName><forename type="first">K</forename><surname>van Dalen-Oskam</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>Amsterdam University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Aspect-Based Sentiment Analysis for German: Analyzing &quot;Talk of Literature&quot; Surrounding Literary Prizes on Social Media</title>
		<author>
			<persName><forename type="first">L</forename><surname>De Greve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Van Hee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lefever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Martens</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics in the Netherlands Journal</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="85" to="104" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Kinds of Literature: An Introduction to the Theory of Genres and Modes</title>
		<author>
			<persName><forename type="first">A</forename><surname>Fowler</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1985">1985</date>
			<publisher>Clarendon Press</publisher>
			<pubPlace>Oxford</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Everyone&apos;s a Critic: Rezensieren in Zeiten des Ästhetischen Plebiszit</title>
		<author>
			<persName><forename type="first">J</forename><surname>Franzen</surname></persName>
		</author>
		<idno type="DOI">10.37189/duepublico/74186</idno>
		<ptr target="https://duepublico2.uni-due.de/receive/duepublico%5C%5Fmods%5C%5F00074186" />
	</analytic>
	<monogr>
		<title level="m">Unterstellte Leseschaften: Tagung</title>
				<meeting><address><addrLine>Essen</addrLine></address></meeting>
		<imprint>
			<publisher>DuEPublico</publisher>
			<date type="published" when="2020-09-30">29. bis 30. September 2020. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">The Hermeneutic Profit of Annotation: On Preventing and Fostering Disagreement in Literary Analysis</title>
		<author>
			<persName><forename type="first">E</forename><surname>Gius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jacke</surname></persName>
		</author>
		<idno type="DOI">10.3366/ijhac.2017.0194</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Humanities and Arts Computing</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="233" to="254" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Cultural Capital: The Problem of Literary Canon Formation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Guillory</surname></persName>
		</author>
		<idno type="DOI">10.7208/chicago/9780226830605.001.0001</idno>
	</analytic>
	<monogr>
		<title level="m">First edition, enlarged</title>
				<meeting><address><addrLine>Chicago</addrLine></address></meeting>
		<imprint>
			<publisher>The University of Chicago Press</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The Qualities of Literatures: A Concept of Literary Evaluation in Pluralistic Societies</title>
		<author>
			<persName><forename type="first">R</forename><surname>Heydebrand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Winko</surname></persName>
		</author>
		<ptr target="https://www.jbe-platform.com/content/books/9789027291516-lal.4.16hey" />
	</analytic>
	<monogr>
		<title level="m">The Quality of Literature</title>
				<imprint>
			<publisher>John Benjamins</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="223" to="239" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">V</forename><surname>Heydebrand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Winko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Einführung in die Wertung von Literatur: Systematik - Geschichte - Legitimation</title>
		<title level="s">UTB für Wissenschaft</title>
		<meeting><address><addrLine>Schöningh</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
		</imprint>
		<respStmt>
			<orgName>Uni-Taschenbücher ; Paderborn München</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">A Comprehensive Review on Transformers Models For Text Classification</title>
		<author>
			<persName><forename type="first">R</forename><surname>Kora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohammed</surname></persName>
		</author>
		<idno type="DOI">10.1109/miucc58832.2023.10278387</idno>
		<ptr target="https://ieeexplore.ieee.org/document/10278387/" />
	</analytic>
	<monogr>
		<title level="m">2023 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC)</title>
				<meeting><address><addrLine>Cairo, Egypt</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Absorption in Online Reviews of Books: Presenting the English-Language AbsORB Metadata Corpus and Annotation Guidelines</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kuijpers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lendvai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lusetti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rebora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ruh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tadres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ternes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vogelsanger</surname></persName>
		</author>
		<idno type="DOI">10.5334/johd.116</idno>
		<ptr target="http://openhumanitiesdata.metajnl.com/articles/10.5334/johd.116/" />
	</analytic>
	<monogr>
		<title level="j">Journal of Open Humanities Data</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">13</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Detection of Reading Absorption in User-Generated Book Reviews: Resources Creation and Evaluation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lendvai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Darányi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kuijpers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lopez De Lacalle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Mensonides</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rebora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Reichel</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.lrec-1.595" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twelfth Language Resources and Evaluation Conference</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Maegaard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Mariani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Mazo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Moreno</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Odijk</surname></persName>
		</editor>
		<editor>
			<persName><surname>Piperidis</surname></persName>
		</editor>
		<meeting>the Twelfth Language Resources and Evaluation Conference<address><addrLine>Marseille, France</addrLine></address></meeting>
		<imprint>
			<publisher>European Language Resources Association</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4835" to="4841" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Birds of a Feather Sing Together</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mark</surname></persName>
		</author>
		<idno type="DOI">10.2307/3005535</idno>
	</analytic>
	<monogr>
		<title level="j">Social Forces</title>
		<imprint>
			<biblScope unit="volume">77</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">453</biblScope>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Interrater Reliability: The Kappa Statistic</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Mchugh</surname></persName>
		</author>
		<idno type="DOI">10.11613/bm.2012.031</idno>
	</analytic>
	<monogr>
		<title level="j">Biochemia Medica</title>
		<imprint>
			<biblScope unit="page" from="276" to="282" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Pianzola</surname></persName>
		</author>
		<title level="m">Digital Social Reading: Sharing Fiction in the Twenty-First Century</title>
				<meeting><address><addrLine>Cambridge, Massachusetts</addrLine></address></meeting>
		<imprint>
			<publisher>The MIT Press</publisher>
			<date type="published" when="2025">2025</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Towards Hermeneutic Markup: An architectural outline</title>
		<author>
			<persName><forename type="first">W</forename><surname>Piez</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:15829935" />
	</analytic>
	<monogr>
		<title level="m">Digital Humanities Conference</title>
				<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">GPT Is an Effective Tool for Multilingual Psychological Text Analysis</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rathje</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D.-M</forename><surname>Mirea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sucholutsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Marjieh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Van Bavel</surname></persName>
		</author>
		<idno type="DOI">10.31234/osf.io/sekf5</idno>
		<ptr target="https://osf.io/sekf5" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Digital Humanities and Digital Social Reading</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rebora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Boot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pianzola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Gasser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Herrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kraxenberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Kuijpers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lendvai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">C</forename><surname>Messerli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sorrentino</surname></persName>
		</author>
		<idno type="DOI">10.1093/llc/fqab020</idno>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">Supplement_2</biblScope>
			<biblScope unit="page" from="230" to="250" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Mining Goodreads. A Digital Humanities Project for the Study of Reading Absorption</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rebora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kuijpers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lendvai</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.3897251</idno>
		<ptr target="https://zenodo.org/record/3897251" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Comparing ChatGPT to Human Raters and Sentiment Analysis Tools for German Children&apos;s Literature</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rebora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Heumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lauer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computational Humanities Research Conference</title>
				<imprint>
			<date type="published" when="2023">CHR2023. 2023</date>
			<biblScope unit="page" from="333" to="343" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Orientalism</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">W</forename><surname>Said</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1978">1978</date>
			<publisher>Pantheon Books</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
	<note>1st ed</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Literary Value in the Era of Big Data. Operationalizing Critical Distance in Professional and Non-professional Reviews</title>
		<author>
			<persName><forename type="first">M</forename><surname>Salgaro</surname></persName>
		</author>
		<idno type="DOI">10.22148/001c.36446</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Cultural Analytics</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">How to Measure the Social Prestige of a Nobel Prize in Literature? Development of a Scale Assessing the Literary Value of a Text</title>
		<author>
			<persName><forename type="first">M</forename><surname>Salgaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sorrentino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gerhard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jacobs</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Txt</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="138" to="148" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">How a Literary Work Becomes a Masterpiece: On the Threefold Selection Practised by Literary Criticism</title>
		<author>
			<persName><forename type="first">C</forename><surname>Van Rees</surname></persName>
		</author>
		<idno type="DOI">10.1016/0304-422x(83)90015-3</idno>
	</analytic>
	<monogr>
		<title level="j">Poetics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">4-5</biblScope>
			<biblScope unit="page" from="397" to="417" />
			<date type="published" when="1983">1983</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">The Legitimacy of Book Critics in the Age of the Internet and Omnivorousness: Expert Critics, Internet Critics and Peer Critics in Flanders and the Netherlands</title>
		<author>
			<persName><forename type="first">M</forename><surname>Verboord</surname></persName>
		</author>
		<idno type="DOI">10.1093/esr/jcp039</idno>
	</analytic>
	<monogr>
		<title level="j">European Sociological Review</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="623" to="637" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Exploring Goodreads Reviews for Book Impact Assessment</title>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Han</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.joi.2019.07.003</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Informetrics</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="874" to="886" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>


				</listBibl>
			</div>
		</back>
	</text>
</TEI>
