
Conference and Labs of the Evaluation Forum, September

Accenture at CheckThat! 2023: Impacts of Back-translation on Subjectivity Detection

Sieu Tran

Paul Rodrigues

Benjamin Strauss

Evan M. Williams

Accenture, 1201 New York Ave NW, Washington, DC 20005, United States

Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, United States

18-21 September 2023

This paper discusses the CLEF CheckThat! Lab Task 2 on Subjectivity in News Articles, and our approach of using back-translation to augment the minority classes in Arabic, English, Turkish, German, Italian, and Dutch in order to distinguish subjective from objective statements. While we find that back-translation works well for other tasks in the fact-checking pipeline, we find that it does not work as well for subjectivity detection. This paper begins to examine several reasons why back-translation as an NLP data augmentation strategy could inhibit subjectivity detection.

Keywords: subjectivity detection, opinion detection, news analysis, data-driven journalism

1. Introduction

children, including one entitled What Is White Privilege?" is labeled as ‘Objective’, rather than ‘Subjective’. As the sentence contains specific, falsifiable claims, this seems to be a reasonable labeling. However, the characterization of the books as tools of ‘Leftist indoctrination’ is clearly a subjective editorialization on the part of the author. This highlights the inherent ambiguity present in the task and underscores a core challenge that both the annotators and the models face in learning a clear decision boundary.

In this work, we describe the back-translation augmentation strategies and models employed by Team Accenture’s submissions to Task 2. Team Accenture’s back-translation and transformer approach yielded the 3rd highest submissions in Arabic, 4th in Turkish, 5th in Dutch, and 8th in German and English. While back-translation has been shown to be an effective means of NLP data augmentation to improve checkworthiness identification [7], we speculate that the approach may reduce the ability of models to generalize in a subjectivity detection task, and we explore some reasons why this may be the case.

2. Exploratory Analysis

Table 1 shows the number of samples and unique word counts for each of the datasets provided. We see that Italian had the largest number of samples in training (1,613). However, Arabic had the highest count of unique words (12,181), while German (4,622) and Dutch (3,944) had the lowest. Assuming consistent data collection methodology and annotation standards across languages, we would hypothesize that a larger quantity of unique words would yield higher-accuracy models. The sample size of all languages in this task is relatively small compared to the other tasks in the CheckThat! Lab.

As shown in Figure 1, all of the datasets provided by the CheckThat! organizers had label bias, with each dataset skewed towards sentences labeled as ‘objective’.

Transformer models utilize WordPiece tokenization schemes that are dependent on the model being evaluated. At the time of pre-training, the WordPiece algorithm determines which pieces of words will be retained, and which will be discarded. An Unknown (UNK) token is utilized as a placeholder in the lexicon, and is used to represent WordPiece tokens in novel input that were not retained at model creation.

The proportion of out-of-vocabulary tokens has been shown to correlate inversely with overall accuracy [8], so we explore the proportions of UNK tokens in each dataset to ensure our models are not excluding too many tokens from any language. We present our analysis in Table 2. Most notably, the Arabic training set has the highest WordPiece count, at 43,601. Since the differences in unknown token rates between languages are mostly negligible, we expect the count and diversity of WordPieces to influence model performance the most. Unexpectedly, the RoBERTa tokenizers we used did not return UNK tokens on any dataset provided by the CLEF CheckThat! organizers.
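To make the UNK-rate measurement concrete, the following is a minimal sketch of greedy longest-match WordPiece splitting and the resulting proportion of UNK tokens. It is an illustration only: the toy vocabulary, the whitespace and lowercasing pre-tokenization, and the function names `wordpiece_tokenize` and `unk_rate` are assumptions, not the tokenizers used in our experiments, each of which ships its own vocabulary and normalization rules.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # If any part of the word cannot be matched, the whole word is UNK.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


def unk_rate(sentences, vocab, unk="[UNK]"):
    """Fraction of produced tokens that are UNK (assumes whitespace splitting)."""
    total = 0
    unknown = 0
    for sentence in sentences:
        for word in sentence.lower().split():
            tokens = wordpiece_tokenize(word, vocab, unk)
            total += len(tokens)
            unknown += tokens.count(unk)
    return unknown / total if total else 0.0


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"sub", "##ject", "##ive", "news"}
rate = unk_rate(["subjective news xyz"], toy_vocab)
```

In this toy case, "subjective" splits into three pieces, "news" is kept whole, and "xyz" maps to UNK, so one of five tokens is unknown.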

3. Transformer Architectures and Pre-Trained Models

In this work, we utilize BERT and RoBERTa models. Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based architecture that was introduced in 2018 [9]. BERT has had a substantial impact on the field of NLP, and achieved state-of-the-art results on 11 NLP benchmarks at the time of its release. RoBERTa, introduced by [10], modified various parts of BERT’s training process. These modifications include more training data, more pre-training steps with bigger batches, removing BERT’s Next Sentence Prediction objective, training on longer sequences, and dynamically changing the masking pattern applied to the training data [10].

For the Arabic dataset, we used lanwuwei/GigaBERT-v4-Arabic-and-English [11], which was trained on a large-scale corpus (the Arabic version of OSCAR, an Arabic Wikipedia dump, and Gigaword) with ∼10B tokens. The model shows state-of-the-art zero-shot transfer performance from English to Arabic on information extraction tasks. The Arabic model contains a vocabulary of length ∼21,000 and ∼26,000 for English and Arabic, respectively. For English, we used roberta-large [10]. The English RoBERTa model contains 50,265 WordPieces. For Turkish, German, and Italian, we used dbmdz/bert-base-turkish-cased [12], dbmdz/bert-base-german-uncased [13], and dbmdz/bert-base-italian-xxl-uncased [14], respectively. The vocabulary sizes of the Turkish, German, and Italian models are 32,000, 31,102, and 32,102, respectively. For Dutch, we used GroNLP/bert-base-dutch-cased [15], which has a vocabulary size of 30,073. The foundation model for each language was selected based on models we have used in the past. Recognizing that this was a problem that should not benefit from case signaling, we chose the uncased variant for any new model.

For experimentation and comparison with roberta-large, we also fine-tuned a model pre-trained on a subjectivity/style classification task, cffl/bert-base-styleclassification-subjective-neutral [16]. This BERT-based model was fine-tuned on the Wiki Neutrality Corpus (WNC), a parallel corpus of 180,000 biased and neutralized sentence pairs along with contextual sentences and metadata. The model can be used to classify text as subjectively biased or neutrally toned.

4. Method

4.1. Data Augmentation

For each language, we augmented the training data via back-translation into the respective language using AWS translation. We back-translated the minority class in each dataset, which was always the subjective class, and appended the back-translated subjective documents to the training set. In our 2021 experiment [7], we found that this form of augmentation resulted in a significant increase in recall and F1-score for the positive class. We did not use any dataset other than the one provided by the organizers for data augmentation.
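The augmentation step can be sketched as follows. This is a minimal illustration under stated assumptions, not our production pipeline: `translate` is a hypothetical callable standing in for a machine translation service (no real API is called here), and the helper names `back_translate` and `augment_minority`, the language codes, and the choice to return to the source language after the last pivot are all assumptions made for the example.

```python
def back_translate(text, pivots, translate, source_lang):
    """Pass text through each pivot language in order, then back to the source.

    `translate(text, src, dst)` is a hypothetical stand-in for an MT service.
    """
    current, lang = text, source_lang
    for pivot in pivots:
        current = translate(current, lang, pivot)
        lang = pivot
    return translate(current, lang, source_lang)


def augment_minority(samples, minority_label, pivots, translate, source_lang):
    """Append one back-translated copy of every minority-class sample."""
    augmented = list(samples)
    for text, label in samples:
        if label == minority_label:
            new_text = back_translate(text, pivots, translate, source_lang)
            augmented.append((new_text, label))
    return augmented


# Toy translator that only records the route taken, for illustration.
def fake_translate(text, src, dst):
    return f"{text}|{src}>{dst}"


train = [("sentence a", "SUBJ"), ("sentence b", "OBJ")]
augmented_train = augment_minority(train, "SUBJ", ["en"], fake_translate, "ar")
```

Only the sample labeled SUBJ gains a back-translated copy, so the objective class is left unchanged and the class ratio moves towards balance.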

In this work, we fine-tune lanwuwei/GigaBERT-v4-Arabic-and-English at different levels of data augmentation and compare performance on the gold test set provided by the organizers.

Table 3 shows the BLEU score for each back-translation scheme. The higher the score, the more consistent or similar the translation is to the original text. Table 4 shows the training sample size before and after data augmentation, and Table 5 shows the number of new tokens acquired after back-translation for each language. For Arabic and Italian, BLEU scores decrease as more pivot languages are used for back-translation, as we would expect. Since a perfect translation would not provide variation in the training samples, and a translation with a low BLEU score may not provide consistent variation, this may suggest there is a sweet spot in BLEU score for an NLP data augmentation task, one that provides diverse word selection but consistent translations.
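For readers unfamiliar with the metric, the following is a minimal sketch of an unsmoothed sentence-level BLEU score (clipped n-gram precision, geometric mean, brevity penalty). It is meant only to illustrate why an exact back-translation scores 1.0 and a divergent one scores lower; the whitespace tokenization and the function names are assumptions, and it is not necessarily the implementation used to produce Table 3.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed BLEU of one hypothesis against one reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


identical = sentence_bleu("a b c d e", "a b c d e")
shorter = sentence_bleu("the cat sat down", "the cat sat", max_n=2)
```

An identical back-translation adds no lexical variation and scores 1.0, while the shortened hypothesis is penalized only through the brevity penalty.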

4.2. Classification

For all BERT and RoBERTa models utilized across all languages, we added a mean-pooling layer and a dropout layer on top of the model prior to the final classification layer. These additional layers have been shown to help prevent over-fitting while fine-tuning. We used an Adam optimizer with a learning rate of 2e-5 and an epsilon of 1.5e-8. We used a binary cross-entropy loss function, 4 epochs, and a batch size of 32.
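The classification head described above can be sketched in plain Python as follows. This is a simplified illustration, not our training code: the masking of padding positions, the inverted-dropout scaling, the single sigmoid output, and all function names are assumptions, and a real implementation would use a deep learning framework operating on batched tensors.

```python
import math
import random


def mean_pool(token_vectors, attention_mask):
    """Average token vectors over positions where the mask is 1."""
    dim = len(token_vectors[0])
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        return [0.0] * dim
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def dropout(vector, rate, training, rng=random):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training or rate == 0.0:
        return list(vector)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def classify(pooled, weights, bias):
    """Linear layer followed by a sigmoid, giving P(subjective)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(prob, target, eps=1e-12):
    """Binary cross-entropy for one example with target 0 or 1."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


# Toy encoder output: three token positions, the last one is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = mean_pool(hidden, [1, 1, 0])
features = dropout(pooled, rate=0.1, training=False)
prob = classify(features, weights=[0.0, 0.0], bias=0.0)
loss = binary_cross_entropy(prob, target=1)
```

With zero weights the head outputs a probability of 0.5, so the loss equals ln 2, which is a convenient sanity check before training.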

Figure 1: Number of subjective (SUBJ) and objective (OBJ) sentences per language.

Language   SUBJ   OBJ
Arabic     280    905
Dutch      311    489
English    298    532
German     308    492
Italian    382    1231
Turkish    378    422

5. Results

Tables 6 and 7 contain the performance of all models on the test set provided by the organizers. Our Arabic model had an accuracy of 0.800 with a weighted average F1-score of 0.816. Our English model had an accuracy of 0.696 with a weighted average F1-score of 0.687. For Turkish, we had an accuracy of 0.788 and a weighted average F1-score of 0.784. German received an accuracy of 0.337 and an F1-score of 0.174. Italian had an accuracy of 0.689 and an F1-score of 0.706. Finally, our Dutch model had an accuracy of 0.646 and a weighted F1-score of 0.618.
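As a reminder of how the two reported metrics relate, the following minimal sketch computes accuracy and the support-weighted average F1-score from label lists. The toy labels are invented for illustration and the function names are assumptions; this is not the organizers' scoring script.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * count
    return total / len(y_true)


# Invented toy labels with an objective-heavy class distribution.
gold = ["OBJ", "OBJ", "OBJ", "SUBJ"]
pred = ["OBJ", "OBJ", "SUBJ", "SUBJ"]
toy_accuracy = accuracy(gold, pred)
toy_weighted_f1 = weighted_f1(gold, pred)
```

On this toy input the weighted F1-score is slightly higher than the accuracy, which shows that the two metrics need not be ordered in any fixed way on imbalanced data.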

Tables 8 and 9 show the Arabic model’s performance on the gold test set with different levels of data augmentation.

Table columns: Unique tokens in source | Unique tokens in MT | New tokens in MT

6. Discussion

We observe that a specialized style-classification model outperformed the RoBERTa-large model quite significantly, as seen in Tables 10 and 11. This is likely because a subjectivity classification task places heavy emphasis on vocabulary and terminology, which is lacking in the relatively small training set provided. The raw RoBERTa model did not have enough training vocabulary to outperform a specialized model.

We also observe diminishing returns when over-augmenting the Arabic training set. As mentioned before, vocabulary plays a key role, and augmenting with several pivot languages may have affected the data quality, potentially removing keywords that determine subjectivity. Consider the example below of a document labeled subjective, after only one translation from Arabic to English:

"Are there any resolutions that the Security Council may issue to ensure that Egypt’s water

The second round of back-translation (Arabic > English > Spanish > English) then produces:

"Is there a resolution that the Security Council can issue to ensure that Egypt’s water quota in the Nile River is not affected?"

And the third (Arabic > English > French > English) produces:

"Are there resolutions that the Security Council could adopt to ensure that Egypt’s share of water in the Nile is not affected?"

By the second or third translation, the tone of the statement has shifted towards being much more objective. This results in much lower model performance. We can see the results of these experiments in Table 8.

Due to the extremely small sample size of the subjective class, we augmented the Arabic and Italian training data three times. Table 12 shows the average cosine similarity score between each translation result and the original, and the weighted average sentiment score of the pivoting English back-translation based on the Vader Lexicon [17]. For Arabic, there was no notable difference between the scores. For Italian, however, cosine similarity shows small decreases as more layers of back-translation are added, indicating a small level of semantic drift. Additionally, the mean sentiment score decreases, indicating that the subjectivity level of the lexicon decreases as well.
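The similarity measurement can be illustrated with a minimal sketch of cosine similarity between two texts. Note the loud assumption: the vector representation used here is a simple bag-of-words count vector over whitespace tokens, chosen only to keep the example self-contained; it is not a statement of which representation underlies Table 12, and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words count vectors (an assumed representation)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


original = "egypt water quota in the nile"
round_two = "egypt share of water in the nile"
drift = 1.0 - cosine_similarity(original, round_two)
```

Under this representation, a back-translation that swaps or adds words lowers the similarity, so averaging the score over all augmented sentences gives a rough indicator of drift per augmentation round.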

Our paper suggests there may be a ‘sweet spot’ in BLEU score for back-translation data augmentation, where a perfect translation would not add sufficient noise to the training data and a poor translation would not add sufficient context. We would recommend exploration of the BLEU score space as an optimization problem in future work.

7. Conclusion

We have described the back-translation augmentation strategies and models employed by Team Accenture’s submissions to Task 2. Team Accenture’s back-translation and foundation model approach yielded the 3rd highest submissions in Arabic, 4th in Turkish, 5th in Dutch, and 8th in German and English. In future work, we hope to explore in more detail to what extent back-translation data augmentation can inhibit subjectivity detection systems.

[1] Cambria, Poria, Gelbukh, Thelwall, Sentiment analysis is a big suitcase, IEEE Intelligent Systems 32 (2017) 74-80. doi:10.1109/MIS.2017.4531228.

[2] Chaturvedi, E. Cambria, R. E. Welsch, Herrera, Distinguishing between facts and opinions for sentiment analysis: Survey and challenges, Information Fusion 44 (2018) 65-77.

[3] L. L. Vieira, C. L. M. Jeronimo, C. E. Campelo, L. B. Marinho, Analysis of the subjectivity level in fake news fragments, in: Proceedings of the Brazilian Symposium on Multimedia and the Web, 2020, pp. 233-240.

[4] C. L. Jeronimo, L. B. Marinho, C. E. Campelo, Veloso, A. S. da Costa Melo, Characterization of fake news based on subjectivity lexicons, J. Data Intell. 1 (2020) 419-441.

[5] Kasnesis, Toumanidis, C. Z. Patrikakis, Combating fake news with transformers: A comparative analysis of stance detection and subjectivity analysis, Information 12 (2021) 409.

[6] Galassi, Ruggeri, A. Barrón-Cedeño, Alam, Caselli, Kutlu, J. M. Struss, Antici, Hasanain, Köhler, Korre, Leistra, Muti, Siegel, M. D. Turkmen, Wiegand, W. Zaghouani, Overview of the CLEF-2023 CheckThat! lab task 2 on subjectivity in news articles, in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum, CLEF '2023, Thessaloniki, Greece, 2023.

[7] Williams, Rodrigues, Tran, Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation, 2021. arXiv:2107.05684.

[8] G. A. Aye, Kim, Li, Learning autocompletion from real-world datasets, in: 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), IEEE, 2021, pp. 131-139.

[9] Devlin, M.-W. Chang, Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[10] Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.

[11] Lan, Chen, Xu, Ritter, An empirical study of pre-trained transformers for Arabic information extraction, arXiv preprint arXiv:2004.14519 (2020).

[12] Schweter, BERTurk - BERT models for Turkish, 2020. URL: https://doi.org/10.5281/zenodo.3770924. doi:10.5281/zenodo.3770924.

[13] Chan, Schweter, T. Möller, German's next language model, 2020. arXiv:2010.10906.

[14] Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.

[15] W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord, M. Nissim, Bertje: A Dutch BERT model, 2019. arXiv:1912.09582.

[16] Sundararajan, Taly, Yan, Axiomatic attribution for deep networks, 2017. arXiv:1703.01365.

[17] Hutto, E. Gilbert, Vader: A parsimonious rule-based model for sentiment analysis of social media text, Proceedings of the International AAAI Conference on Web and Social Media 8 (2014) 216-225. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/14550. doi:10.1609/icwsm.v8i1.14550.