
Can Models of Author Intention Support Quality Assessment of Content?

A J Casey

Bonnie Webber (University of Edinburgh), bonnieg@inf.ed.ac.uk

Dorota Glowacka (University of Helsinki), glowacka@cs.helsinki.fi

Academics seek to find, understand and critically review the work of other researchers through published scientific articles. In recent years, the volume of available information has significantly increased, partly due to technological advancements and partly due to pressures on academics to 'publish or perish'. This volume of papers presents a challenge not only for the peer-review process but also for readers, particularly inexperienced readers, who need to find publications of high quality. Whilst one might rely on citation counts or journal rankings to help guide this decision, this approach may not be completely reliable due to biased peer-review processes and the fact that the citation count of an article does not per se indicate its quality. Here, we analyse how expected author intentions in a Related Work section can be used to indicate its quality. We show that author intentions can predict the quality with reasonable accuracy (around 70%) and propose that similar approaches could be used in other sections to provide an overall picture of quality. This approach could be useful in supporting peer-review processes and for a reader in prioritising articles to read.

Keywords: Article quality, Author intentions, Supporting peer-review

Recent years have seen an increase in the volume of scientific publications. The amount of published material poses a challenge for the reader, in particular an inexperienced one, who must navigate this overwhelming wealth of material to find relevant and high quality content. Another challenge is for the peer-review process. There is only a limited pool of experts to undertake peer-review and the high volume of submitted material puts pressure on this limited resource. Having automated ways to assess quality could support the peer-review process and help the overwhelmed reader to prioritise their ever growing reading list.

Automating judgement of quality in research is challenging as it requires knowledge of the field. Bridges [2] describes this judgement of research quality as a connoisseurship which draws on one's own knowledge and experience of the field. This, in turn, not only allows one to comment on specific features but also gives one the ability to appreciate the overall composition of the text. It is recognised that it would be difficult, if not impossible, to try to emulate this level of human judgement in an automated fashion. We propose that considering how argument intentions are represented linguistically and quantifying the depth of this representation may help to build quality indicators that could prove useful in supporting the peer-review process or in helping readers identify better reading material. The intuition behind using argument elements to define quality has support in existing literature, with essay scores shown to be linked to argumentative elements identified through discourse analysis [4, 15].

Based on this premise, we consider Related Work sections from published papers as a case study. We assess these sections, rating them as Good (G), Average (Avg) or Poor (P). We use Related Work sections annotated with author intentions designed to give content feedback [5]. We analyse the relationship between these author intentions and the quality ratings, showing that quality and author intention occurrence are related, and predicting with reasonable accuracy the quality rating of a Related Work section.

2 Related Work

Peer-review, generally accepted as the gold standard of assessing quality, is not without issue. There are problems of bias, publication delays, problems with detecting fraud and/or errors, and unethical practices [18]. Metrics, such as citations and download counts, have also been considered as indicators of quality. But these too have known issues, such as dependence on the size of the discipline, and they take time to accumulate. Authors and research teams have been known to carry out unnecessary self-citations to increase their own citations [8]. Despite these problems, we do not believe peer-review or citation measures should be replaced. Rather, we see our work as an additional tool. It could, for example, be used for triage: if our tool rates a paper Poor or Good, perhaps it needs only one reviewer to confirm it, with a second one only needed if the first reviewer disagrees with the automated assessment. Papers rated Average would always have two reviewers. This indication of quality could also be used alongside such measures as citation count to help a reader in prioritising which papers to read first.

Automated recognition of author intentions contained in scientific publications has been successful in the past, as in Argumentative Zoning (AZ) [17]. Other recent work [7, 14] also supports our idea that author intentions can be linked to better Related Work sections. These works show that author goals (intentions) identified within a text can be reliably linked to human essay scores. Burstein et al. [3] take this a step further and use discourse analysis to label what they call essay-specific goals, e.g. thesis aim or conclusion. They propose that missing labels could be used by students to identify aspects that need improvement in their essay. This relates to our idea that missing author intentions may point to poorer quality material. Whilst these works use the individual labels within their schema to highlight specific missing intentions, our work could be seen as an extension, using the combination of author intentions to suggest an overall indication of quality.

3 Methods

3.1 Author Intentions in Related Work

The author-intention labelled data we use is from [5]. They use a data set from [13] consisting of published scientific papers from the ACL anthology [1]. The labels, based on qualities that Kamler and Thomson [9] have argued should be present in Related Work, try to distinguish neutral citations, which provide mere description, from those that highlight gaps or problems. They also identify where an author talks about their own work and how this relates to the cited work or to the background in general.

The author intention labels used can be found in Table 1. Certain labels from the original schema were rare and were collapsed into more frequent categories. These included sentences that are positive about a citation/field; works that the author's work builds on, uses or is similar to; and comparisons of two cited works, as described in the description field of Table 1.

3.2 Assessing Quality

An experiment was set up to rate the quality of each Related Work section from the data set in [5]. Participants were presented with the Title, Abstract and Related Work section and asked to rate the quality as Poor (P), Good (G) or Average (Avg). Besides this, they were asked (i) if there was enough previous work material; (ii) how well the author related their work to the previous work; and (iii) whether it was clear how the author's work differed from previous work. However, for this work we only use the quality rating given by the participants. Guidance given to participants suggested that it was not enough to list previous work, but that authors should demonstrate the relation of cited work to their own work. This guidance also indicated that conference papers are usually limited in length, so an in-depth explanation of the state of the art is not expected.

There were six assessors: four experts and two PhD students, all in computational linguistics except one student in computer vision. One assessor (the main assessor) rated all items; the others rated ten each. Assessor agreement was measured between each of the five assessors and the main assessor, who looked at all the articles. Four out of the five assessors were in good agreement with the main assessor: two were in complete agreement and two agreed on seven out of the ten papers. The other assessor agreed in only four instances, which is likely due to them being a PhD student in another area and having less experience with ACL papers. All disagreements were discussed and agreement reached, resulting in 50 double-rated papers and 44 rated by one assessor only. This resulted in a final data set of 94 papers rated P (36%), G (31%) and Avg (33%).
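The agreement figures reported above can be turned into a raw (percentage) agreement with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and raw agreement does not correct for chance. A chance-corrected measure such as Cohen's kappa would need the per-paper ratings, which are not reported here.

```python
# Number of papers (out of ten) on which each of the five secondary
# assessors agreed with the main assessor, as stated in the text.
agreements = [10, 10, 7, 7, 4]
papers_per_assessor = 10

# Per-assessor raw agreement with the main assessor.
per_assessor = [a / papers_per_assessor for a in agreements]

# Pooled raw agreement over all 50 double-rated papers.
overall_agreement = sum(agreements) / (papers_per_assessor * len(agreements))

print(per_assessor)
print(f"{overall_agreement:.2f}")
```

Pooling the five assessors gives 38 agreements out of 50 double-rated papers before discussion.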

4 Mean Label Occurrence in Rated Sections

Table 2 shows the mean number of times a label occurs in each section, grouped by quality rating, with variance in brackets. Our intuition is that the occurrence of some labels will vary between the different ratings. We use Welch's t-test, which corrects for unequal variances, to test whether the differences between the group means are significant. The groups are tested pairwise in the order P/Avg, Avg/G and P/G, where * denotes that the test was significant (p < 0.05).
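As a minimal sketch of the test described above, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed from two groups of per-section label counts. The function name and the sample counts below are hypothetical and are not taken from Table 2. The p-value itself requires the t distribution, which the Python standard library does not provide; a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` could be used for that step.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical counts of one label per section in two rating groups.
poor_counts = [1, 2, 3, 4, 5]
good_counts = [2, 4, 6, 8, 10, 12]

t_stat, dof = welch_t(poor_counts, good_counts)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Unlike Student's t-test, the degrees of freedom here are generally not an integer, because they depend on how unequal the two group variances are.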

The mean occurrence of our background label with evidence (BG-EP) in P sections is significantly different from that in Avg or G rated sections. There is also a significant difference between Avg and G rated sections in the number of background statements that provide no evidence (BG-NE). Work is not meant to be cited merely because it is on the same topic as the citing work; rather, it should be cited because it has implications for the author's study [10], and the author should say what these implications are. The findings in Table 2 support this: G rated sections differ significantly in the mean number of sentences that describe how the author's work is different to a cited work (A-CW-D) and how the author's work fills a gap (A-GAP). Additionally, we see a significant difference in the number of sentences that describe an author's work (A-DESC) in P rated sections compared to both Avg and G sections.

5 Predicting Quality from Annotated Data

Related Work quality is classified into P, Avg or G. We trained a classifier, experimenting with: SVM (linear kernel), Decision Tree (C4.5) and Linear Logistic Regression (LLR) [6, 12, 16]. We use feature sets of our annotated labels only. Whilst there are many other features that we could include, our focus here is to understand how well our author intentions relate to quality ratings. We use 10-fold cross validation and a majority classifier as our baseline. We report on how our features rank in terms of importance in our best performing classifier.
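The evaluation protocol for the baseline can be sketched as follows: a majority-class predictor scored with 10-fold cross validation. This is an illustration under stated assumptions, not the setup actually used. The class counts (34 P, 31 Avg, 29 G) are reconstructed by rounding the reported percentages of the 94 papers, and the fold assignment, seed and function names are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and deal them into k roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_baseline_cv(labels, k=10, seed=0):
    """Mean accuracy of a majority-class predictor under k-fold cross validation."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        # The baseline always predicts the most frequent training label.
        majority = Counter(train_labels).most_common(1)[0][0]
        correct = sum(labels[i] == majority for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Assumed class counts, rounded from the reported 36% / 33% / 31% of 94 papers.
labels = ["P"] * 34 + ["Avg"] * 31 + ["G"] * 29
baseline_accuracy = majority_baseline_cv(labels)
print(f"{baseline_accuracy:.3f}")
```

With three classes this close in size, such a baseline can only be right about a third of the time, which is the bar the trained classifiers need to clear.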

Table 3 shows precision, recall and accuracy for all three classifiers and our majority class baseline. To ensure consistency of results, we ran our models over 10 iterations and report mean performance (variance in brackets). We test for differences between our classifiers using the corrected t-test (p < 0.05) [11]. All classifiers outperform our baseline significantly. Unsurprisingly, SVM and LLR produce similar results. SVM displays marginally less variation across runs, although there is no significant difference between SVM and LLR. The accuracy of both SVM and LLR is significantly different from that of the decision tree method. One of the reasons for the latter's poor performance may be that the label features are not exclusive. For example, although author gap and differences (A-GAP, A-CW-DIFF) are rare in P examples, they are not completely absent. We do not have any directly comparable systems, but e-rater 2.0 [4] reports agreement between system and human essay scores of 97%. e-rater is, however, a commercial system built on multiple elements, not just author intentions. Whilst we do not achieve this level of accuracy, our results are promising as a first step, and with the addition of other features we could improve the accuracy. For example, we experimented with adding sentence counts and citation counts and were able to consistently improve the accuracy by 4%.
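Assuming the corrected t-test referred to above is the corrected resampled t-test associated with Nadeau and Bengio [11], its statistic can be sketched as below. The correction inflates the variance term by the test/train size ratio, because accuracy differences from overlapping training sets are not independent. The function name and the example differences are hypothetical; with 10 runs of 10-fold cross validation there would be 100 paired differences and a test/train ratio of 1/9.

```python
import math
import statistics


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for paired accuracy differences.

    diffs holds one accuracy difference (classifier A minus B) per fold/run.
    Returns the statistic and its degrees of freedom (len(diffs) - 1).
    """
    j = len(diffs)
    mean_diff = statistics.mean(diffs)
    var_diff = statistics.variance(diffs)
    # The standard paired t-test uses var/j; the correction adds the
    # n_test/n_train term to account for overlapping training sets.
    t_stat = mean_diff / math.sqrt((1.0 / j + n_test / n_train) * var_diff)
    return t_stat, j - 1


# Hypothetical per-fold accuracy differences between two classifiers.
diffs = [0.1, 0.2, 0.3, 0.2, 0.2]
t_corrected, dof = corrected_resampled_t(diffs, n_train=9, n_test=1)
print(f"t = {t_corrected:.3f}, df = {dof}")
```

Because the correction only enlarges the denominator, the corrected statistic is always smaller in magnitude than the uncorrected paired t statistic, making the test more conservative.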

Table 4 ranks labels in terms of importance in SVM, showing that an author highlighting a difference between their work and a cited work, or how their work addresses a gap, are the most important labels for distinguishing between quality ratings. This seems plausible, as we observe that these occur more often in "better" Related Work sections. This supports the idea from Maxwell [10], who states that cited work needs to be shown to have implications for the study. It seems that if this type of connection is missing then the work is rated as poorer.
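How the ranking in Table 4 was produced is not specified here. One common heuristic for a linear SVM is to score each feature by the sum of its squared weights across the per-class weight vectors and sort by that score. The sketch below illustrates this heuristic only; the weight values are made up for the example and are not results from this study.

```python
def rank_features(weights_by_class, feature_names):
    """Rank features by the sum of squared linear-model weights across classes."""
    scores = {
        name: sum(class_weights[i] ** 2 for class_weights in weights_by_class)
        for i, name in enumerate(feature_names)
    }
    ranking = sorted(feature_names, key=lambda name: scores[name], reverse=True)
    return ranking, scores


feature_names = ["A-CW-D", "A-GAP", "A-DESC", "BG-NE"]

# Made-up weights (one row per class: P, Avg, G); NOT values from the study.
weights_by_class = [
    [-0.9, -0.8, 0.3, 0.1],
    [0.2, 0.1, 0.4, -0.3],
    [0.7, 0.7, -0.1, 0.2],
]

ranking, scores = rank_features(weights_by_class, feature_names)
print(ranking)
```

Squaring the weights means that a label counts as important whether it pushes a section towards or away from a given rating.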

Finally, for our best performing model, SVM, we checked the confusion matrix for all 10 iterations. We were interested to see whether misclassification was occurring in the nearest group, i.e. whether G sections were misclassified as Avg and not as P. We observed that, out of the 10 iterations, one P section was classified as G in two iterations and one G document was classified as P in six iterations. We speculate that we could improve performance by studying patterns of labels occurring together. When we considered the mean occurrence and variance of labels in Table 2, we saw that it is not simply a case of a P section not having any sentences about the author's work or never mentioning a gap. We believe there may be more to learn about patterns of labels occurring together that would support better classification of the different ratings.
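The check described above, whether errors fall into the neighbouring rating or jump between P and G, can be sketched by treating the ratings as ordered (P < Avg < G) and measuring the distance between the true and predicted rating. The label lists below are hypothetical and the helper names are illustrative.

```python
from collections import Counter

# Ratings treated as ordered categories: P < Avg < G.
ORDER = {"P": 0, "Avg": 1, "G": 2}


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; together these form the confusion matrix."""
    return Counter(zip(true_labels, predicted_labels))


def error_distances(true_labels, predicted_labels):
    """Return (adjacent, distant) error counts for the ordered ratings."""
    adjacent = distant = 0
    for true, predicted in zip(true_labels, predicted_labels):
        gap = abs(ORDER[true] - ORDER[predicted])
        if gap == 1:
            adjacent += 1  # e.g. G predicted as Avg
        elif gap == 2:
            distant += 1   # P predicted as G, or G predicted as P
    return adjacent, distant


# Hypothetical ratings for seven sections.
true_labels = ["P", "P", "Avg", "Avg", "G", "G", "G"]
predicted_labels = ["P", "Avg", "Avg", "G", "G", "P", "Avg"]

confusion = confusion_counts(true_labels, predicted_labels)
adjacent, distant = error_distances(true_labels, predicted_labels)
print(adjacent, distant)
```

A classifier whose errors are almost all adjacent is confusing neighbouring ratings, whereas distant errors suggest that the features fail to separate the extremes.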

6 Conclusions

Using Related Work sections, we have shown that some author intentions differ significantly across sections rated P, Avg and G. These author intentions show promise as viable indicators of the quality of the content. We speculate that these differently rated sections will have co-occurrence patterns of labels that may provide stronger indications of differences between the quality ratings, an aspect we intend to investigate in the future. Our study has limitations: the sample is small (94 papers) and only one domain is considered. Our chosen section, Related Work, is also one that does not occur in every domain. Our prediction of quality rating is consistently accurate at 70% with only author intentions as features. Whilst this does not match the accuracy of commercial tools such as e-rater (97%), it is a promising result that could possibly be improved with additional features. Reaching a human level of judgement for peer-review of scientific papers is most likely impossible. For example, it is hard to tell what is missing, specifically what has not been addressed, or to identify something that is incorrect; these aspects might still require a human expert. Nonetheless, we believe that this type of quality rating, if developed at a section-specific level, could prove useful in supporting peer-review, directing where reviewers' time should be focused and on which papers. In addition, it could help a reader prioritise their reading list of papers.

1. Bird, S., Dale, R., Dorr, B., Gibson, B., Joseph, M., Kan, M.Y., Lee, D., Powley, B., Radev, D., Tan, Y.F.: The ACL anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In: LREC 2008 (2008), http://www.lrec-conf.org/proceedings/lrec2008/pdf/445_paper.pdf

2. Bridges, D.: Research quality assessment in education: impossible science, possible art? British Educational Research Journal (2009)

3. Burstein, J., Marcu, D., Knight, K.: Finding the WRITE stuff: Automatic identification of discourse structure in student essays. IEEE Intelligent Systems (2003)

4. Burstein, J., Chodorow, M., Leacock, C.: Automated essay evaluation: The criterion online writing service. AI Magazine 25, 27–36 (Sep 2004)

5. Casey, A.J., Webber, B., Glowacka, D.: A framework for annotating related works, to support feedback to novice writers. In: Proceedings of the 13th Linguistic Annotation Workshop held in conjunction with ACL 2019 (LAW-XIII 2019). Association for Computational Linguistics, Florence, Italy (Aug 2019)

6. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011)

7. Ghosh, D., Khanam, A., Han, Y., Muresan, S.: Coarse-grained argumentation features for scoring persuasive essays. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 549–554. Association for Computational Linguistics, Berlin, Germany (Aug 2016). https://doi.org/10.18653/v1/P16-2089

8. Glänzel, W., Debackere, K., Thijs, B., Schubert, A.: A concise review on the role of author self-citations in information science, bibliometrics and science policy. Scientometrics (2006)

9. Kamler, B., Thomson, P.: Helping doctoral students write: Pedagogies for supervision. Routledge (2006). https://doi.org/10.4324/9780203969816

10. Maxwell, J.A.: Literature reviews of, and for, educational research: A commentary on Boote and Beile's "Scholars before researchers". Educational Researcher 35(9), 28–31 (2006). https://doi.org/10.3102/0013189X035009028

11. Nadeau, C., Bengio, Y.: Inference for the generalization error. In: Proceedings of the 12th International Conference on Neural Information Processing Systems. pp. 307–313. NIPS'99, MIT Press, Cambridge, MA, USA (1999), http://dl.acm.org/citation.cfm?id=3009657.3009701

12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)

13. Schäfer, U., Spurk, C., Steffen, J.: A fully coreference-annotated corpus of scholarly papers from the ACL anthology. In: Proceedings of COLING 2012: Posters. pp. 1059–1070. The COLING 2012 Organizing Committee, Mumbai, India (Dec 2012), https://www.aclweb.org/anthology/C12-2103

14. Song, Y., Heilman, M., Beigman Klebanov, B., Deane, P.: Applying argumentation schemes for essay scoring. In: Proceedings of the First Workshop on Argumentation Mining. pp. 69–78. Association for Computational Linguistics (2014), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.672.5185

15. Song, Y., Heilman, M., Beigman Klebanov, B., Deane, P.: Applying argumentation schemes for essay scoring. In: Proceedings of the First Workshop on Argumentation Mining. pp. 69–78. Association for Computational Linguistics, Baltimore, Maryland (Jun 2014). https://doi.org/10.3115/v1/W14-2110, https://www.aclweb.org/anthology/W14-2110

16. Sumner, M., Frank, E., Hall, M.A.: Speeding up logistic model tree induction. PKDD LNCS 3721, 675–683 (2005), https://hdl.handle.net/10289/1446

17. Teufel, S.: Argumentative zoning: Information extraction from scientific text. Ph.D. thesis, University of Edinburgh (1999)

18. Walker, R., da Silva, P.R.: Emerging trends in peer review - a survey. Frontiers in Neuroscience (2014)