<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tyche at Factify 2022: Fusion Networks for Multi-Modal Fact-Checking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nainesh Hulke</string-name>
          <email>nainesh.18je0513@cse.iitism.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bharath Raj Siva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ankesh Raj</string-name>
          <email>ankesh.18je0122@cse.iitism.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali Asgar Saifee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Multimodal Fact-Checking, Fusion Network, Fake News Detection</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology (Indian School of Mines)</institution>
          ,
          <addr-line>Dhanbad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our approach for the multimodal fact-checking (Factify) challenge at AAAI 2022. Fake news has the potential to harm people and society. To combat fake news, fact-checking becomes crucial. False claims are prevalent in both visual and textual forms. Multimodal techniques can merge information from both these modalities. In our approach, we treated this challenge as a multi-class classification task. We extracted textual and visual features from the texts and images, respectively. We used a single BERT module as a text feature extractor for both claim and reference text so that the extracted feature representations show the perspective of the same model on both claim and reference texts rather than two feature representations from two different models. The same goes for the image feature extractor. EfficientNet-B3 was used for extracting image features. Then the features were passed through a proposed fusion module and the classifier. The purpose of the fusion module is to enable the model to learn semantic similarities between texts and images. We achieved the best F1 score of 0.692 on the test set of the FACTIFY dataset.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        News and information spread quickly through the internet and social media. This has also led
to a vast increase in the spread of misinformation and fake news. According to one study, Facebook
engagements with fake news sites average roughly 70 million per month [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Fake news has
the potential to harm people and society. This spread has been blamed for incidents ranging
from ethnic violence, inter-racial violence, and religious conflicts to mass riots. For example,
some studies argue that fake news is gradually becoming an essential aspect of South Africa’s
xenophobia discourse[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        There has been a lot of recent work on fact-checking claims. Early works on fact-checking
focused on using textual information extracted from the text of the article, such as statistical text
features[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], emotional information[4][5][6], or integrating metadata with the text[7]. Although
the textual content can be a significant indicator for fake news detection, it is not sufficient
when used alone. Some researchers have proposed systems that use the credibility of the pages
that post the news[8] or profile characteristics of the users that shared the post to detect the
articles that contain manipulated content[9][10].
        <fn id="fn1"><label>1</label><p>All authors contributed equally.</p></fn>
      </p>
      <p>Online articles and posts usually contain additional information in the form of images and social
context that can be useful for fake news detection. Online news contains images that generally
attract the attention of users. Images in fake and real news may follow different patterns or be
modified to attract users’ attention and encourage sharing. There are several reasons why
an image may be deemed fake. In most cases, this involves digital manipulation, e.g., cropping,
splicing, etc. However, there are cases when an image is entirely legitimate, but it is published
alongside some text that does not reflect its content accurately.</p>
      <p>Hence, it is essential that a system also exploits information extracted from the images for
effective fake news detection. Visual information can complement the textual one for fake
news detection. Multimodal systems can merge information from both these modalities. They
can also disambiguate wrong classifications and improve the results by using the
combined modalities. Some researchers have proposed multimodal systems that combine textual
and visual information for determining whether an article or a post is fake or not[11][12], or
that combine textual, visual, and semantic information for fact-checking claims using a multimodal
architecture[13].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Early attempts to use deep learning methods to counter fake news employed data of a single
modality. Vo et al. proposed a novel framework, Fact-checking Response Generator (FCRG)[14],
to generate a fact-checking response tweet to combat fake news on Twitter. They showed
distinguishing linguistic features of fact-checking tweets compared with Normal and Random
Replies.</p>
      <p>Wang et al.[7] introduced a new dataset, LIAR, for fake news detection and proposed a CNN
model to integrate metadata with text. The proposed method outperforms text-only models,
suggesting that using multiple modalities expresses the news article’s intent more
clearly.</p>
      <p>News articles often contain images and text; hence, multimodal deep learning methods have
produced good results. Wang et al. [15] proposed the EANN (event adversarial neural network)
model with three major components: the multimodal feature extractor, the fake news detector,
and the event discriminator. The textual and visual latent feature representations are learned
and concatenated to form the final multimodal feature representation.</p>
      <p>Multimodal approaches that use features from the different modalities and feed them to
different types of networks have also been successful. Gallo et al.[16] introduced multimodal
fusion networks where the text and image feature representations are concatenated and passed
through a neural network. The proposed method achieves state-of-the-art results on the
UPMC Food-101 dataset.</p>
      <p>Recent research in multimodal fake news detection by Giachanou et al.[13] proposed a
multimodal multi-image network for fake news detection. The proposed system combines textual,
visual, and semantic information. For the textual representation, BERT-Base [17] was used
to capture the underlying semantic and contextual meaning. Image tags were extracted from
multiple images that the articles contained using the VGG-16 model [18] for the visual
representation. The semantic information was represented by the image-text similarity calculated
using the title and image tags embeddings cosine similarity. All these features are concatenated
to make the final prediction.</p>
      <p>[Figure: model overview diagram. Labels: Claim Text, Document Text, Claim Image, Document Image; Text Feature Extractor; Claim Text Features, Document Text Features; Image Feature Extractor; Claim Image Features, Document Image Features; Fusion Module.]</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>
        Our work is based on the FACTIFY dataset [19] [
        <xref ref-type="bibr" rid="ref4">20</xref>
        ]. In our approach, we extract the textual
and visual features from the claim and reference texts and images, respectively. We propose
a fusion module for the model to learn semantic similarities between the texts and images.
The extracted features from this module are concatenated and passed through a classifier. Our
approach aims to train a model that can use the relationship between the different modalities,
i.e., text and image. Our model mainly consists of three components (Figure 1):
• Text Feature Extractor
• Image Feature Extractor
• Fusion Module
      </p>
      <p>We used a single text feature extractor for both claim and reference text so that the extracted
feature representations show the perspective of the same model on both the texts rather than
two feature representations from two different models. A similar argument can be applied to
images as well.</p>
      <p>[Figure 3: Text_X2_Image_X2 Fusion Module. Diagram labels: Claim Text Features, Document Text Features, Claim Image Features, Document Image Features, Dense Layer.]</p>
      <sec id="sec-3-1">
        <title>3.1. Text Feature Extractor</title>
        <p>We use the pre-trained Bidirectional Encoder Representations from Transformers (BERT) [17]
base model because of its ability to tackle a broad set of NLP tasks. We fine-tuned the model for
our specific task of text feature extraction. Specific details about the BERT model architecture,
which we used, are given in Appendix A.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3.2. Image Feature Extractor</title>
      <p>
        We wanted a model with high accuracy on ImageNet [
        <xref ref-type="bibr" rid="ref5">21</xref>
        ] that can be trained with
limited hardware resources, so we used the EfficientNetB3 architecture, pre-trained on ImageNet,
and replaced the last dense layer with a dense layer of size 256 followed by a batch normalization
layer. EfficientNetB3 has 12 million parameters and achieves a top-1 accuracy of 81.6% on
ImageNet. It balances the trade-off between classification accuracy and computational efficiency
for our task.
      </p>
      <sec id="sec-4-1">
        <title>3.3. Fusion Module</title>
        <p>The extracted feature outputs are combined using a fusion module. We propose two methods of
fusion with different intuitions on the output.</p>
        <p>• Text_Image_X2: In this method (Figure 2), the claim text features and claim image
features are concatenated and passed through a dense layer of size 256 to give the claim
features. Similarly, we also get the reference features. This captures the similarity between
text and image features of both claim and reference.
• Text_X2_Image_X2: In this method (Figure 3), the claim text features and reference text
features are concatenated and passed through a dense layer of size 256 to give the text
features. Similarly, we also get the image features. This captures the similarity between
text features of claim and reference and image features of claim and reference.</p>
        <p>The output features are then concatenated and passed through a dense layer of size 5. Finally,
a Softmax layer is added to get the class probabilities.</p>
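        <p>The two fusion schemes differ only in which feature pairs are concatenated before the size-256 dense layers. The pure-Python sketch below illustrates that wiring under stated assumptions: the helper names (dense, softmax, classify), the random weights, the ReLU activation, and the toy 4-dimensional inputs are hypothetical stand-ins. The text above specifies only the layer sizes (256 and 5) and the final Softmax; a real model would use a deep learning framework with trained weights.</p>

```python
import math
import random

def dense(x, out_dim, rng, relu=True):
    """Hypothetical dense layer with random weights; the ReLU is an assumption."""
    out = []
    for _ in range(out_dim):
        s = sum(xi * rng.uniform(-0.1, 0.1) for xi in x)
        out.append(max(s, 0.0) if relu else s)
    return out

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def classify(a, b, rng):
    """Concatenate the two fused vectors, project to 5 classes, apply softmax."""
    return softmax(dense(a + b, 5, rng, relu=False))

def text_image_x2(claim_txt, claim_img, doc_txt, doc_img, rng):
    """Fuse per source: (claim text + claim image) and (reference text + reference image)."""
    claim = dense(claim_txt + claim_img, 256, rng)
    doc = dense(doc_txt + doc_img, 256, rng)
    return classify(claim, doc, rng)

def text_x2_image_x2(claim_txt, claim_img, doc_txt, doc_img, rng):
    """Fuse per modality: (claim text + reference text) and (claim image + reference image)."""
    text = dense(claim_txt + doc_txt, 256, rng)
    image = dense(claim_img + doc_img, 256, rng)
    return classify(text, image, rng)

rng = random.Random(0)
# Toy stand-ins for the extracted features, in the order:
# claim text, claim image, reference text, reference image.
feats = [[rng.random() for _ in range(4)] for _ in range(4)]
probs_a = text_image_x2(*feats, rng)
probs_b = text_x2_image_x2(*feats, rng)
print(len(probs_a), len(probs_b))
```

        <p>Both functions return five class probabilities; only the pairing of the inputs changes between them.</p>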
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <sec id="sec-5-1">
        <title>4.1. Data Preprocessing</title>
        <sec id="sec-5-1-1">
          <title>4.1.1. Text Preprocessing</title>
          <p>The FACTIFY dataset has both text and images; therefore, preprocessing was performed
separately.</p>
          <p>
            First, the text was converted to lowercase, and all redundant elements, such as ASCII
characters and weblinks, were removed. The stop words (i.e., articles, connectors, prepositions, and others)
were removed since they carry little information and removing them leaves the semantics of the sentences
intact. We used the nltk [
            <xref ref-type="bibr" rid="ref6">22</xref>
            ] stopwords list to remove stopwords. The words of a sentence were
then lemmatized. Lemmatization involves using a vocabulary and morphological analysis of
words to remove inflectional endings only and return the base or dictionary form of the word.
Thus, the size of the text was reduced, and the semantics of the sentence were intact. Only
important words remained.
          </p>
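          <p>The text-cleaning steps above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the authors' code: the function name preprocess, the regular expressions, and the tiny stop-word set are hypothetical (the paper used the NLTK stop-word list), the sketch assumes that non-ASCII characters are the ones being stripped, and lemmatization is left as a pluggable callable because the lemmatizer used is not named.</p>

```python
import re

# Tiny illustrative stop-word set; the paper used the NLTK stop-word list instead.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is", "are", "was", "were"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")  # assumption: non-ASCII characters are removed
TOKEN_RE = re.compile(r"[a-z0-9]+")

def preprocess(text, lemmatize=lambda word: word):
    """Lowercase, strip links and non-ASCII characters, drop stop words, lemmatize."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    return [lemmatize(tok) for tok in tokens if tok not in STOPWORDS]

tokens = preprocess("The Minister was seen in Delhi… see https://example.com/news")
print(tokens)  # ['minister', 'seen', 'delhi', 'see']
```

          <p>A real lemmatizer can be supplied through the lemmatize argument; the default leaves each token unchanged.</p>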
        </sec>
        <sec id="sec-5-1-2">
          <title>4.1.2. Image Preprocessing</title>
          <p>
            The training data have a variable aspect ratio and size, so we adopted a stage-dependent
rescaling policy. In the training stage, we rescale each image so that its shorter side is 256 pixels
while keeping the original image’s aspect ratio. As a result, the spatial information was not lost,
and at the same time, the input to the models was within trainable limits. Various other augmentations
from Albumentations [
            <xref ref-type="bibr" rid="ref7">23</xref>
            ] were adopted to prevent overfitting. Since all the images are similar
to real-world images, all augmentations were chosen to mirror real-world images.
          </p>
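          <p>The aspect-preserving rescale amounts to a single scale factor applied to both sides. The helper below is a minimal sketch of that arithmetic only; the name rescale_size is hypothetical, and the actual resizing and augmentation in the paper were done with Albumentations.</p>

```python
def rescale_size(width, height, min_side=256):
    """Return (new_width, new_height) with the shorter side scaled to min_side."""
    scale = min_side / min(width, height)
    return round(width * scale), round(height * scale)

print(rescale_size(1024, 768))  # (341, 256)
print(rescale_size(300, 600))   # (256, 512)
```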
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Experimental Settings, Hyperparameter Tuning, and Results</title>
        <p>
          We trained the text and image feature extractors together. The preprocessed images were resized
to 256×256 before passing them through the image feature extractor model. We used an initial
learning rate of 10<sup>−5</sup> with a linear scheduler. For the BERT model, we used an initial learning
rate of 3×10<sup>−5</sup> and kept decreasing it using the scheduler. The BERT sequence size was set to
64. A learning rate of 10<sup>−5</sup> was used for the fusion module. F1 loss was used as the loss function and
Adam[
          <xref ref-type="bibr" rid="ref8">24</xref>
          ] as the optimizer.
        </p>
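        <p>A linear scheduler interpolates the learning rate between its initial value and a final value over the course of training. The sketch below shows one such schedule under assumptions that the text does not state: decay to zero, no warm-up, and a made-up total of 1000 steps. The function name linear_lr is hypothetical.</p>

```python
def linear_lr(step, total_steps, initial_lr, final_lr=0.0):
    """Linearly interpolate from initial_lr at step 0 to final_lr at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_lr + (final_lr - initial_lr) * frac

# Initial rates taken from the text; total_steps=1000 is illustrative only.
for name, lr0 in [("BERT", 3e-5), ("other modules", 1e-5)]:
    print(name, [linear_lr(s, 1000, lr0) for s in (0, 500, 1000)])
```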
        <p>[Figure 4: Loss Curves. Validation loss (y-axis ticks from 0.48 to 0.62) for the Textx2-Imagex2 and (Text-Image)x2 models.]</p>
        <p>
          The best way to maximize the F1 metric would be to minimize the F1 loss, defined as
1 − F1 score. The issue is that the F1 score is not differentiable. We took inspiration from Lee et
al.[
          <xref ref-type="bibr" rid="ref9">25</xref>
          ] and implemented it for multiclass classification. We accept probabilities instead of actual
counts of true positive, true negative, false positive, or false negative. Say the class ”refute” is
predicted with probability 0.2, while the true label is ”refute”. Then we calculate true positive
as 0.2 and false negative as 0.8.
        </p>
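        <p>The probability-based counts described above can be written out directly. The sketch below computes one minus a macro-averaged soft F1; averaging per class is an assumption, since the text does not say how the per-class scores are combined, and the function name soft_f1_loss and the eps term are hypothetical. In training, the same arithmetic would be expressed with framework tensors so that gradients can flow; plain Python is used here only to make the formula concrete.</p>

```python
def soft_f1_loss(probs, labels, num_classes, eps=1e-8):
    """Return 1 - macro soft-F1, with counts replaced by predicted probabilities."""
    f1_sum = 0.0
    for c in range(num_classes):
        tp = fp = fn = 0.0
        for p, y in zip(probs, labels):
            if y == c:
                tp += p[c]        # probability mass on the correct class
                fn += 1.0 - p[c]  # mass that missed the correct class
            else:
                fp += p[c]        # mass wrongly assigned to class c
        f1_sum += 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1_sum / num_classes

# Example from the text: the true class (index 0 here) is predicted with
# probability 0.2, contributing 0.2 to true positives and 0.8 to false negatives.
probs = [[0.2, 0.8], [0.1, 0.9]]
labels = [0, 1]
loss = soft_f1_loss(probs, labels, num_classes=2)
print(loss)
```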
        <p>We trained the model using the NVIDIA Tesla P100 GPU with 16 GB of memory and also
used RTX5000, having 32 GB of memory.</p>
        <p>The model with the Text_X2_Image_X2 fusion module was trained for 18 epochs, and the model with the Text_Image_X2
fusion module was trained for 17 epochs, after which the decrease in validation loss (Figure 4)
became insignificant.</p>
        <p>We also tried different architectures, as mentioned in Table 1. We observed that the
text-only classifier performs better than the image-only classifier, indicating that text provides
better insights into news articles. We trained on full data (i.e., text and image combined)
using EfficientNetB3 for images and BERT for text, followed by an early fusion
module [16] (a fusion module that concatenates all the extracted features together and passes
them through a dense layer). Then we tried our proposed models, Text_X2_Image_X2 and
Text_Image_X2, and found that these outperformed both the early fusion models and the models
trained on a single modality.</p>
        <p>The Text_X2_Image_X2 model gave an F1 score of 0.679 on the test set, performing better
than the Text_Image_X2 model, which gave a score of 0.664. We attribute this to the way the
extracted features are concatenated: semantic similarities between data of the same modality (text to text and
image to image) lead to a better estimation of differences between claim and reference. We
ensembled the two models using a weighted average with different ratios and found that
the F1 score increased to 0.692 on the test set (equal weights).</p>
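        <p>A weighted-average ensemble of two classifiers can be sketched as follows. The sketch assumes the average is taken over the class probabilities of the two models, which the text does not state explicitly; the function name ensemble and the probability values are made up for illustration.</p>

```python
def ensemble(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' class probabilities, plus the argmax index."""
    avg = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(probs_a, probs_b)]
    return avg, max(range(len(avg)), key=avg.__getitem__)

# Made-up probabilities over the five classes, for illustration only.
p_text_x2_image_x2 = [0.10, 0.40, 0.30, 0.10, 0.10]
p_text_image_x2 = [0.05, 0.25, 0.50, 0.10, 0.10]
avg, label = ensemble(p_text_x2_image_x2, p_text_image_x2)  # equal weights
print(avg, label)
```

        <p>With weight_a set to 0.5 both models contribute equally, which matches the equal-weights setting reported above.</p>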
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion and Future Work</title>
      <p>Combating fake news is crucial as it can destroy one’s credibility and hurt many people. In
this paper, we focused on countering the problem of fake news using fact-checking with the
help of the FACTIFY dataset. We proposed two novel methods for combining the multimodal
features. The Text_X2_Image_X2 method leverages the similarity between the claim and
document text features and between the claim and document image features to predict the required class.
The Text_Image_X2 method leverages the similarity between the text and image features of the claim
and between the text and image features of the document to predict the required class. Both proposed methods
outperform the existing early fusion network. Our results suggest that combining
similar features separately (EfficientNetB3 + BERT-base + Text_X2_Image_X2 and EfficientNetB3
+ BERT-base + Text_Image_X2) is more effective than combining all the features together
(EfficientNetB3 + BERT-base + Early Fusion). In future work, we expect that using deeper CNN models
as image feature extractors can improve model performance. Similarly, different
variants of BERT could provide a more robust text feature extractor. Experimenting with the neural networks
used in the fusion modules and classifiers may also give better results.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Acknowledgement</title>
      <p>We are indebted to team CyberLabs for their generous support. Also, our work was made
possible with the support of JarvisLabs.ai. We thank them for providing a cloud GPU instance
for our experiments. We also thank the organisers of DE-FACTIFY 2022 for giving us the
opportunity to work on the dataset.
      </p>
      <p>[4] O. Ajao, D. Bhowmik, S. Zargari, Sentiment aware fake news detection on online social networks, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 2507–2511.</p>
      <p>[5] B. Ghanem, P. Rosso, F. Rangel, An emotional analysis of false information in social media and news articles, ACM Transactions on Internet Technology (TOIT) 20 (2020) 1–18.</p>
      <p>[6] A. Giachanou, P. Rosso, F. Crestani, Leveraging emotional signals for credibility detection, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 877–880.</p>
      <p>[7] W. Y. Wang, “Liar, liar pants on fire”: A new benchmark dataset for fake news detection, arXiv preprint arXiv:1705.00648 (2017).</p>
      <p>[8] K. Popat, S. Mukherjee, A. Yates, G. Weikum, Declare: Debunking fake news and false claims using evidence-aware deep learning, arXiv preprint arXiv:1809.06416 (2018).</p>
      <p>[9] A. Giachanou, E. A. Ríssola, B. Ghanem, F. Crestani, P. Rosso, The role of personality and linguistic patterns in discriminating between fake news spreaders and fact checkers, in: International Conference on Applications of Natural Language to Information Systems, Springer, 2020, pp. 181–192.</p>
      <p>[10] K. Shu, S. Wang, H. Liu, Understanding user profiles on social media for fake news detection, in: 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), IEEE, 2018, pp. 430–435.</p>
      <p>[11] S. Singhal, R. R. Shah, T. Chakraborty, P. Kumaraguru, S. Satoh, Spotfake: A multi-modal framework for fake news detection, in: 2019 IEEE fifth international conference on multimedia big data (BigMM), IEEE, 2019, pp. 39–47.</p>
      <p>[12] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, J. Gao, Eann: Event adversarial neural networks for multi-modal fake news detection, in: Proceedings of the 24th acm sigkdd international conference on knowledge discovery &amp; data mining, 2018, pp. 849–857.</p>
      <p>[13] A. Giachanou, G. Zhang, P. Rosso, Multimodal multi-image fake news detection, 2020, pp. 647–654. doi:10.1109/DSAA49011.2020.00091.</p>
      <p>[14] N. Vo, K. Lee, Learning from fact-checkers: Analysis and generation of fact-checking language, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 335–344.</p>
      <p>[15] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, J. Gao, Eann: Event adversarial neural networks for multi-modal fake news detection, 2018, pp. 849–857. doi:10.1145/3219819.3219903.</p>
      <p>[16] I. Gallo, G. Ria, N. Landro, R. L. Grassa, Image and text fusion for upmc food-101 using bert and cnns, in: 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), 2020, pp. 1–6. doi:10.1109/IVCNZ51579.2020.9290622.</p>
      <p>[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[18] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).</p>
      <p>[19] S. Mishra, S. Suryavardan, A. Bhaskar, P. Chopra, A. Reganti, P. Patwa, A. Das, T. Chakraborty, A. Sheth, A. Ekbal, C. Ahuja, Factify: A multi-modal fact verification dataset, in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Text Feature Extractor Details</title>
      <p>The model used was the BASE version of BERT with the following configuration:
English, uncased, 12 hidden layers (L), a hidden size of 768 (H), 12 self-attention heads (A),
a 30522-word dictionary (vocab size), and 110 million parameters in total. Additionally, a Dropout
layer with a probability of 0.3, a Dense layer of size 256, and a Batch-Normalization layer are
added to the final layer corresponding to the CLS token.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Allcott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gentzkow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Trends in the diffusion of misinformation on social media</article-title>
          ,
          <source>Research &amp; Politics</source>
          <volume>6</volume>
          (
          <year>2019</year>
          )
          <fpage>2053168019848554</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Chenzi</surname>
          </string-name>
          ,
          <article-title>Fake news, social media and xenophobia in south africa</article-title>
          ,
          <source>African Identities</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poblete</surname>
          </string-name>
          ,
          <article-title>Information credibility on twitter</article-title>
          ,
          <source>in: Proceedings of the 20th international conference on World wide web</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>675</fpage>
          -
          <lpage>684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Patwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suryavardan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ahuja</surname>
          </string-name>
          ,
          <article-title>Benchmarking multi-modal entailment for fact verification</article-title>
          ,
          <source>in: Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection</source>
          , CEUR,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE conference on computer vision and pattern recognition</source>
          , IEEE,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Loper</surname>
          </string-name>
          ,
          <article-title>Natural language processing with Python: analyzing text with the natural language toolkit</article-title>
          , O'Reilly Media, Inc.,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Buslaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Iglovikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Khvedchenya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Druzhinin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Kalinin</surname>
          </string-name>
          ,
          <article-title>Albumentations: fast and flexible image augmentations</article-title>
          ,
          <source>Information</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>125</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <article-title>A surrogate loss function for optimization of F<sub>β</sub> score in binary classification with imbalanced data</article-title>
          ,
          <source>arXiv preprint arXiv:2104.01459</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>