      Can Models of Author Intention Support
          Quality Assessment of Content?

              A J Casey1 , Bonnie Webber1 , and Dorota Glowacka2
          1
              University of Edinburgh {a.j.casey,bonnie}@inf.ed.ac.uk
                 2
                   University of Helsinki glowacka@cs.helsinki.fi



      Abstract. Academics seek to find, understand and critically review the
      work of other researchers through published scientific articles. In recent
      years, the volume of available information has significantly increased,
      partly due to technological advancements and partly due to pressures
      on academics to ‘publish or perish’. This volume of papers presents a
      challenge not only for the peer-review process but also for readers, particularly
      inexperienced ones, who must identify publications of high quality. Whilst
      one might rely on citation or journal rankings to help guide this decision,
      this approach may not be completely reliable due to biased peer-review
      processes and the fact that the citation count of an article does not per
      se indicate its quality. Here, we analyse how expected author intentions
      in a Related Work section can be used to indicate its quality. We show
      that author intentions can predict the quality with reasonable accuracy
      and propose that similar approaches could be used in other sections to
      provide an overall picture of quality. This approach could be useful in
      supporting peer-review processes and for a reader in prioritising articles
      to read.

      Keywords: Article Quality · Author Intentions · Supporting peer-review


1   Introduction

Recent years have seen an increase in the volume of scientific publications. The
amount of published material poses a challenge for the reader, in particular an
inexperienced one, who must navigate this overwhelming wealth of material to
find relevant and high-quality content. The peer-review process also faces a
challenge: there is only a limited pool of experts to undertake peer-review, and
the high volume of submitted material puts pressure on this limited resource.
Having automated ways to assess quality could support the peer-review process
and help the overwhelmed reader to prioritise their ever-growing reading list.
    Automating the judgement of research quality is challenging, as it requires
domain knowledge. Bridges [2] describes this judgement of research quality as a
connoisseurship which draws on one's own knowledge and experience of the field.
This, in turn, not only allows one to comment on specific features but also gives
one the ability to appreciate the overall composition of the text. It is recognised
that it would be difficult, if not impossible, to emulate this level of human
judgement in an automated fashion. We propose that considering how argument
intentions are represented linguistically, and quantifying the depth of this
representation, may help to build quality indicators that could prove useful in
supporting the peer-review process or in helping readers identify better reading
material. The intuition behind using argument elements to assess quality is
supported by the existing literature, where essay scores have been shown to be
linked to argumentative elements identified through discourse analysis [4, 15].
    Based on this premise, we consider Related Work sections from published
papers as a case study. We assess these sections, rating them as Good (G), Average
(Avg) or Poor (P). We use Related Work sections annotated with author intentions
designed to give content feedback [5]. We analyse the relationship between these
author intentions and the quality ratings, showing that quality and author intention
occurrence are related and that the quality rating of a Related Work section can
be predicted with reasonable accuracy.


2   Related Work

Peer-review, generally accepted as the gold standard of assessing quality, is not
without issue. There are problems of bias, publication delays, problems with
detecting fraud and/or errors, and unethical practices [18]. Metrics, such as
citations and download counts, have also been considered as indicators of quality.
But these too have known issues, such as dependence on the size of the discipline,
and they take time to accumulate. Authors and research teams have been known to
carry out unnecessary self-citations to inflate their own citation counts [8]. Despite
these problems, we do not believe peer-review or citation measures should be
replaced. Rather, we see our work as an additional tool. It could, for example,
be used for triage: if our tool rates a paper Poor or Good, perhaps it needs only
one reviewer to confirm it, with a second one only needed if the first reviewer
disagrees with the automated assessment. Papers rated Average would always
have two reviewers. This indication of quality could also be used alongside such
measures as citation count to help a reader in prioritising which papers to read
first.
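    The triage rule sketched above could be implemented along the following lines.
This is a minimal illustration under our own assumptions about the workflow; the
function and its arguments are hypothetical and not part of any existing system.

    def reviewers_needed(automated_rating, first_reviewer_agrees=None):
        # Hypothetical triage rule: Average-rated papers always get two
        # reviewers; Poor- or Good-rated papers start with one reviewer,
        # and a second is added only if that reviewer disagrees with the
        # automated assessment.
        if automated_rating == "Avg":
            return 2
        if first_reviewer_agrees is None or first_reviewer_agrees:
            return 1
        return 2

    # Example: a paper automatically rated Good whose first reviewer disagrees.
    print(reviewers_needed("G", first_reviewer_agrees=False))  # prints 2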
    Automated recognition of the author intentions contained in scientific
publications has been successful in the past, as in Argumentative Zoning (AZ) [17].
Other recent work [7, 14] also supports our idea that author intentions can be
linked to better Related Work sections. These works show that author goals (intentions)
identified within a text can be reliably linked to human essay scores. Burstein et
al. [3] take this a step further and use discourse analysis to label what they call
essay-specific goals, e.g. thesis aim or conclusion. They propose that missing
labels could be used by students to identify aspects of their essay that need
improvement. This relates to our idea that missing author intentions may point to poorer
quality material. Whilst these works use the individual labels within their schema
to highlight specific missing intentions, our work could be seen as an extension,
using the combination of author intentions to suggest an overall indication of
quality.

3     Methods
3.1   Author Intentions in Related Work
The author-intention-labelled data we use comes from [5], which draws on a data
set from [13] consisting of published scientific papers from the ACL Anthology [1].
The labels are based on qualities that Kamler and Thomson [9] have argued should
be present in Related Work. They aim to distinguish neutral citations, which merely
describe cited work, from those that highlight gaps or problems, and to identify
where an author talks about their own work and how it relates to the cited work
or to the background in general.
    The author intention labels used can be found in Table 1. Certain labels from
the original schema were rare and were collapsed into more frequent categories.
These included sentences that are positive about a citation or the field; works that
the author's work builds on, uses or is similar to; and comparisons of two cited
works, as reflected in the description column of Table 1.
                 Table 1. Related Work Author Intention Labels

Label   Description
BG-NE   Description of the state of the field, describing/listing known methods
        or common knowledge. No evidence, i.e. no citation is included
BG-EP   As above, but evidence is provided, i.e. a citation is included
BG-EVAL Author highlights a positive aspect, shortcoming/problem or gap in
        the field
CW-D    Describes cited work or compares two cited works; this could be
        specific details, very high-level details or nothing more than a
        reference for further information
A-CW-U  Author's work uses, builds on or is similar to a cited work
CW-EVAL A positive aspect, shortcoming/problem or gap in the cited work is
        highlighted
A-DESC  Author describes their work with no linguistic marking relating it to
        others' work or distinguishing it
A-GAP   Author specifically says they address a gap or highlights the novelty
        of their work
A-CW-D  Author highlights how their work differs from cited work
TEXT    Sentence provides information about what will be discussed in the
        next section
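    To illustrate how sentence-level labels of this kind can be turned into
section-level features (as used in Sections 4 and 5), the sketch below counts how
often each label occurs in one Related Work section. Representing an annotated
section as a plain list of sentence labels is our own simplification of the data
format in [5].

    from collections import Counter

    LABELS = ["BG-NE", "BG-EP", "BG-EVAL", "CW-D", "A-CW-U",
              "CW-EVAL", "A-DESC", "A-GAP", "A-CW-D", "TEXT"]

    def label_counts(sentence_labels):
        # Map the per-sentence labels of one Related Work section to a
        # fixed-length vector of label counts, in the order of LABELS.
        counts = Counter(sentence_labels)
        return [counts.get(label, 0) for label in LABELS]

    # Example: a short section with three annotated sentences.
    print(label_counts(["BG-EP", "CW-D", "A-CW-D"]))
    # [0, 1, 0, 1, 0, 0, 0, 0, 1, 0]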

3.2 Assessing Quality
An experiment was set up to rate the quality of each Related Work section from
the data set in [5]. Participants were presented with the Title, Abstract and
Related Work section and asked to rate the quality as Poor (P), Good (G) or
Average (Avg). Besides this, they were asked (i) whether there was enough previous
work material; (ii) how well the author related their work to the previous work;
and (iii) whether it was clear how the author's work differed from previous work.
However, in this work we use only the quality rating given by the participants.
Guidance given to participants suggested that it was not enough to list previous
work, but that authors should demonstrate the relation of cited work to their own
work. This guidance also indicated that conference papers are usually limited in
length, so an in-depth explanation of the state of the art is not expected.
    There were six assessors: four experts and two PhD students, all in
computational linguistics except for one student in computer vision. One assessor
rated all items; the others rated ten each. Assessor agreement was measured as
the differences between each of the five assessors and the main assessor, who
looked at all the articles. Four of the five assessors were in good agreement with
the main assessor: two were in complete agreement and two agreed on seven out
of ten papers. The remaining assessor agreed in only four instances, which is
likely because they are a PhD student in another area with less experience of
ACL papers. All disagreements were discussed until agreement was reached,
resulting in 50 double-rated papers and 44 rated by one assessor only. This gave
a final data set of 94 papers, rated P (36%), G (31%) and Avg (33%).


4   Mean Label Occurrence in Rated Sections

Table 2. Mean (variance) of sentence labels per section by rating. Significance is
denoted by * in the order Poor/Avg, Avg/Good, Poor/Good

               Label   P         Avg        G          Significance
               BG-EP   1.2 (0.7) 2   (2)    2.5 (5.1) * - *
               BG-NE   2.2 (10) 3.4 (5.4) 2 (4.5) - * -
               BG-EVAL 0.8 (1.4) 1.4 (3.7) 1.2 (2.5) - - -
               CW-D    8 (46.4) 8    (35.2) 5.6 (20.7) - - -
               CW-EVAL 1.3 (2)   2.3 (5.2) 1.3 (3.2) * - -
               A-CW-U 0.4 (0.3) 0.57 (0.7) 1 (1.3) - - *
               A-DESC 0.5 (0.9) 1.5 (2.4) 1.4 (2.7) * - *
               A-CW-D 0.2 (0.2) 1.2 (1.4) 3.7 (3.7) * * *
               A-GAP   0.1 (0.3) 0.5 (0.5) 1.4 (1.2) * * *
               TEXT    0.2 (0.9) 0.2 (0.2) 0.3 (0.3) - - -



    Table 2 shows the mean number of times a label occurs in each section,
grouped by quality rating, with the variance in brackets. Our intuition is that the
occurrence of some labels will vary between the different ratings. We use Welch's
t-test, which corrects for unequal variances, to test whether the differences between
the group means are significant. Each pair of groups is tested in the order P/Avg,
Avg/G and P/G, where * denotes that the test was significant (p < 0.05).
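    For reference, the per-label comparison can be sketched with SciPy as below;
the counts shown are placeholders rather than our data, and grouping per-section
counts into lists per rating is an implementation assumption.

    from scipy import stats

    # Per-section counts of one label (e.g. A-GAP) grouped by rating;
    # the numbers below are placeholders, not our data.
    poor_counts = [0, 0, 1, 0, 0, 0, 1, 0]
    good_counts = [1, 2, 1, 0, 2, 1, 3, 1]

    # Welch's t-test: equal_var=False applies the unequal-variance correction.
    t_stat, p_value = stats.ttest_ind(poor_counts, good_counts, equal_var=False)
    print(p_value < 0.05)  # True if the difference in means is significant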
    Our background label with evidence (BG-EP) occurs at a significantly different
rate in P rated sections than in Avg or G rated sections. For background statements
that provide no evidence (BG-NE), there is a significant difference between Avg
and G rated sections. Work should not be cited merely because it is on the same
topic as the citing work; rather, it should be cited because it has implications for
the author's study [10], and the author should say what these implications are.
The findings in Table 2 support this: the mean number of sentences in a G rated
section that describe how the author's work differs from a cited work (A-CW-D)
or fills a gap (A-GAP) is significantly higher than in Avg and P rated sections.
Additionally, we see a significant difference in the number of sentences that
describe the author's own work (A-DESC) between P rated sections and both Avg
and G rated sections.


5    Predicting Quality from Annotated Data
Related Work quality is classified as P, Avg or G. We trained classifiers,
experimenting with an SVM (linear kernel), a Decision Tree (C4.5) and Linear
Logistic Regression (LLR) [6, 12, 16]. We use feature sets built from our annotated labels
only. Whilst there are many other features that we could include, our focus here
is to understand how well our author intentions relate to quality ratings. We use
10-fold cross validation and a majority classifier as our baseline. We report on
how our features rank in terms of importance in our best performing classifier.
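    A sketch of this experimental setup is given below using scikit-learn, whose
estimators stand in for the LibSVM, C4.5 and logistic regression implementations
we used [6, 12, 16]; the feature matrix and ratings are randomly generated
placeholders for the real label counts and quality ratings.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import cross_val_score

    # Placeholder data: 94 sections x 10 label-count features, one rating each.
    rng = np.random.default_rng(0)
    X = rng.poisson(2, size=(94, 10))
    y = rng.choice(["P", "Avg", "G"], size=94)

    models = {
        "SVM (linear)": SVC(kernel="linear"),
        "Decision tree": DecisionTreeClassifier(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Majority baseline": DummyClassifier(strategy="most_frequent"),
    }

    # 10-fold cross-validation accuracy for each classifier.
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
        print(f"{name}: mean {scores.mean():.2f}, variance {scores.var():.3f}")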


Table 3. Classifier Performance, variance in brackets

Classifier           Precision   Recall      Accuracy
LibSVM               0.7 (0.01)  0.7 (0.01)  70 (1.9)
J45                  0.6 (0.04)  0.6 (0.05)  57 (5)
Logistic Regression  0.7 (0.02)  0.7 (0.02)  70 (2.2)
Majority Baseline    -           .36         36

Table 4. Ranked Labels, Logistic Regression

Ranked labels
 0.32766  A-CW-D
 0.21277  A-GAP
 0.08085  A-DESC
 0.07021  BG-NE
 0.04681  A-CW-U
 0.04468  BG-EP
 0.00213  CW-EVAL
-0.00851  BG-EVAL
-0.01064  TEXT
-0.03723  CW-D



    Table 3 shows the precision, recall and accuracy of all three classifiers and
our majority-class baseline. To ensure consistency of results, we ran our models
over 10 iterations and report mean performance (variance in brackets). We test
for differences between our classifiers using a corrected t-test (p < 0.05) [11].
All classifiers significantly outperform our baseline. Unsurprisingly, SVM and
LLR produce similar results; SVM displays marginally less variation across runs,
although there is no significant difference between the two. The accuracy of both
SVM and LLR is significantly different from that of the decision tree. One reason
for the latter's poorer performance may be that the label features are not exclusive:
for example, although author gap and difference sentences (A-GAP, A-CW-D) are
rare in P examples, they are not completely absent. We do not have any directly
comparable systems, but e-rater 2.0 [4] reports agreement between system and
human essay scores of 97%. e-rater is, however, a commercial system built on
multiple elements, not just author intentions. Whilst we do not achieve this level
of accuracy, our results are promising
as a first step, and the accuracy could be improved with additional features. For
example, when we experimented with adding sentence counts and citation counts,
we were able to consistently improve the accuracy by 4%.
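    The corrected resampled t-test of Nadeau and Bengio [11] used above can be
sketched as follows; the per-fold accuracies are placeholders, and the train/test
sizes correspond to 10-fold cross-validation on 94 papers.

    import math
    from scipy import stats

    def corrected_resampled_ttest(scores_a, scores_b, n_train, n_test):
        # Paired t-test over per-fold scores with the Nadeau-Bengio
        # variance correction term (n_test / n_train).
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        k = len(diffs)
        mean_d = sum(diffs) / k
        var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
        t = mean_d / math.sqrt((1.0 / k + n_test / n_train) * var_d)
        p = 2 * stats.t.sf(abs(t), df=k - 1)
        return t, p

    # Placeholder per-fold accuracies for two classifiers.
    svm_folds  = [0.7, 0.7, 0.8, 0.6, 0.7, 0.7, 0.8, 0.7, 0.6, 0.7]
    tree_folds = [0.6, 0.5, 0.7, 0.5, 0.6, 0.6, 0.6, 0.5, 0.6, 0.6]
    print(corrected_resampled_ttest(svm_folds, tree_folds, n_train=85, n_test=9))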
    Table 4 ranks the labels in terms of importance in the SVM, showing that an
author highlighting how their work differs from a cited work, or how it addresses
a gap, are the most important labels for distinguishing between quality ratings.
This seems plausible, as we observe that these labels occur more often in better
Related Work sections. This supports the point made by Maxwell [10] that cited
work needs to be shown to have implications for the study: if this type of
connection is missing, the work tends to be rated as poorer.
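    A ranking of this kind can be approximated from the weights of a fitted linear
model. The sketch below assumes a scikit-learn logistic regression trained on
label-count features (the data here are random placeholders) and averages absolute
per-class coefficients, which is one of several reasonable ways to summarise
importance; it is not necessarily how the scores in Table 4 were computed.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rank_labels(fitted_linear_model, label_names):
        # For a multi-class linear model, coef_ holds one weight vector per
        # class; average the absolute weights as a simple importance summary.
        weights = np.abs(fitted_linear_model.coef_).mean(axis=0)
        order = np.argsort(weights)[::-1]
        return [(label_names[i], float(weights[i])) for i in order]

    # Placeholder fit on random label-count data (10 features, 3 ratings).
    rng = np.random.default_rng(0)
    X = rng.poisson(2, size=(94, 10))
    y = rng.choice(["P", "Avg", "G"], size=94)
    labels = ["BG-NE", "BG-EP", "BG-EVAL", "CW-D", "A-CW-U",
              "CW-EVAL", "A-DESC", "A-GAP", "A-CW-D", "TEXT"]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    print(rank_labels(model, labels))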
    Finally, for our best performing model, the SVM, we checked the confusion
matrix for all 10 iterations. We were interested to see whether misclassification
occurred into the nearest group, i.e. whether G sections were misclassified as Avg
rather than as P. We observed that, over the 10 iterations, a P section was classified
as G twice, and in six iterations a G document was classified as P. We speculate
that we could improve performance by studying patterns of labels occurring
together. When we considered the mean occurrence and variance of labels in
Table 2, we saw that it is not simply the case that a P section has no sentences
about the author's work or never mentions a gap. We believe there is more to
learn about patterns of co-occurring labels that could support better classification
of the different ratings.
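    This confusion-matrix check can be sketched as below; the gold and predicted
ratings are placeholders, and in practice one matrix would be accumulated per
iteration.

    from sklearn.metrics import confusion_matrix

    RATINGS = ["P", "Avg", "G"]

    # Placeholder gold and predicted ratings for one iteration.
    y_true = ["P", "Avg", "G", "G", "Avg", "P", "G", "Avg"]
    y_pred = ["P", "Avg", "G", "Avg", "Avg", "G", "P", "Avg"]

    # Rows are gold ratings, columns are predictions; the P->G and G->P
    # cells are the "far" misclassifications discussed above.
    print(confusion_matrix(y_true, y_pred, labels=RATINGS))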


6   Conclusions

Using Related Work sections, we have shown that the occurrence of some author
intentions differs significantly across sections rated P, Avg and G. These author
intentions show promise as viable indicators of content quality. We speculate that
differently rated sections will have label co-occurrence patterns that provide
stronger indications of the differences between quality ratings, an aspect we intend
to investigate in future work. Our study does have limitations: the sample size is
small (94 papers) and only one domain is considered. Our chosen section, Related
Work, is also one that does not occur in every domain. Our prediction of quality
ratings is consistently accurate at 70% with only author intentions as features.
Whilst this does not match the accuracy of commercial tools such as e-rater (97%),
it is a very promising result that could possibly be improved with additional
features. Reaching a human level of judgement for peer-review of scientific papers
is most likely impossible. For example, it is hard to tell what is missing, specifically
what has not been addressed, or to identify something that is incorrect; these
aspects might still require a human expert. Nonetheless, we believe that this type
of quality rating, if developed at a section-specific level, could prove useful in
supporting peer-review by directing where reviewers' time should be focused and
on which papers. In addition, it could help a reader prioritise their reading list.

References
 1. Bird, S., Dale, R., Dorr, B., Gibson, B., Joseph, M., Kan, M.Y., Lee, D., Pow-
    ley, B., Radev, D., Tan, Y.F.: The ACL anthology reference corpus: A reference
    dataset for bibliographic research in computational linguistics. In: LREC 2008
    (2008), http://www.lrec-conf.org/proceedings/lrec2008/pdf/445_paper.pdf
 2. Bridges, D.: Research quality assessment in education: impossible science, possible
    art? British Educational Research Journal (2009)
 3. Burstein, J., Marcu, D., Knight, K.: Finding the WRITE stuff: Automatic identi-
    fication of discourse structure in student essays. IEEE Intelligent Systems (2003)
 4. Burstein, J., Chodorow, M., Leacock, C.: Automated essay evaluation: The crite-
    rion online writing service. AI Magazine 25, 27–36 (09 2004)
 5. Casey, A.J., Webber, B., Glowacka, D.: A framework for annotating related works,
    to support feedback to novice writers. In: Proceedings of the 13th Linguistic Anno-
    tation Workshop held in conjunction with ACL 2019 (LAW-XIII 2019). Association
    for Computational Linguistics, Florence, Italy (Aug 2019)
 6. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM
    Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011)
 7. Ghosh, D., Khanam, A., Han, Y., Muresan, S.: Coarse-grained argumentation fea-
    tures for scoring persuasive essays. In: Proceedings of the 54th Annual Meeting
    of the Association for Computational Linguistics (Volume 2: Short Papers). pp.
    549–554. Association for Computational Linguistics, Berlin, Germany (Aug 2016).
    https://doi.org/10.18653/v1/P16-2089
 8. Glanzel, W., Debackere, K., Thijs, B., Schubert, A.: A concise review on the role
    of author self-citations in information science, bibliometrics and science policy.
    Scientometrics (2006)
 9. Kamler, B., Thomson, P.: Helping doctoral students write: Pedagogies for super-
    vision. Routledge (2006). https://doi.org/10.4324/9780203969816
10. Maxwell, J.A.: Literature reviews of, and for, educational research: A commentary
    on Boote and Beile's “scholars before researchers”. Educational Researcher 35(9),
    28–31 (2006). https://doi.org/10.3102/0013189X035009028
11. Nadeau, C., Bengio, Y.: Inference for the generalization error. In: Proceed-
    ings of the 12th International Conference on Neural Information Processing
    Systems. pp. 307–313. NIPS’99, MIT Press, Cambridge, MA, USA (1999),
    http://dl.acm.org/citation.cfm?id=3009657.3009701
12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publish-
    ers Inc., San Francisco, CA, USA (1993)
13. Schäfer, U., Spurk, C., Steffen, J.: A fully coreference-annotated corpus of scholarly
    papers from the ACL anthology. In: Proceedings of COLING 2012: Posters. pp.
    1059–1070. The COLING 2012 Organizing Committee, Mumbai, India (Dec 2012),
    https://www.aclweb.org/anthology/C12-2103
14. Song, Y., Heilman, M., Beigman Klebanov, B., Deane, P.: Applying argumentation
    schemes for essay scoring. In: Proceedings of the First Workshop on Argu-
    mentation Mining. pp. 69–78. Association for Computational Linguistics (2014),
    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.672.5185
15. Song, Y., Heilman, M., Beigman Klebanov, B., Deane, P.: Applying argu-
    mentation schemes for essay scoring. In: Proceedings of the First Workshop
    on Argumentation Mining. pp. 69–78. Association for Computational Linguis-
    tics, Baltimore, Maryland (Jun 2014). https://doi.org/10.3115/v1/W14-2110,
    https://www.aclweb.org/anthology/W14-2110
16. Sumner, M., Frank, E., Hall, M.A.: Speeding up logistic model tree induction.
    PKDD LNCS 3721, 675–683 (2005), https://hdl.handle.net/10289/1446
17. Teufel, S.: Argumentative zoning: Information extraction from scientific text. Ph.D.
    thesis, University of Edinburgh (1999)
18. Walker, R., da Silva, P.R.: Emerging trends in peer review-a survey. Frontiers in
    Neuroscience (2014)