<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Multi-modal Classification of Violent Events using Image Captioning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Vallejo-Aldana</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrián Pastor López-Monroy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esaú Villatoro-Tello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Mathematics Research Center (CIMAT)</institution>
          ,
          <addr-line>Guanajuato</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This research paper presents our involvement in the collaborative evaluation campaign of DAVINCIS@IberLEF 2023. Our focus lies on tackling the Violent-Event Identification (VEI) task, wherein we employ a multi-modal approach that combines textual input with image captions extracted from visual data. The obtained results demonstrate a competitive performance, as we achieved the top position in the VEI task with an average F1 score of 0.92638.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-modal models</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Data Oversampling</kwd>
        <kwd>Parameter Tuning</kwd>
        <kwd>Image Captioning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the previous edition of this challenge, a multitask approach that captures pertinent information, effectively combining the VEI and VEC
tasks into a single model, achieved the best results in Violent-Event-Identification (VEI). In the
VEC task, the utilization of a prompt-based approach, as detailed by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], yielded the highest
performance. Additionally, in the Violent-Events Categorization shared task, strategies such as
data augmentation through back-translation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], noise reduction [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and model ensembles [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
were employed to enhance the results.
      </p>
      <p>
        In the 2023 edition of the DA-VINCIS@IberLEF [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] challenge, a set of images accompanied
by corresponding social media texts is presented as the input data. In this work, we explore
how to use visual and textual information together to create a multi-modal approach that
solves both sub-tasks, and we study the importance of parameter tuning during training to
obtain better results at the inference stage. In this research paper, we propose to use modern
Transformer architectures like RoBERTa [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to address both sub-tasks, complemented with important
features extracted from the images. To extract these features, we employ BLIP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a pre-trained model
specifically designed for image captioning. Subsequently, we merge the generated captions
with their respective text counterparts using a designated text separator. We want to see
how visual information complements textual information and helps to improve classification
performance. To boost the classification performance of the model, we use data oversampling, a
weighted loss function, and an ensemble configuration. Our proposal obtained first place in the
Violent-Event-Identification (VEI) shared task (sub-task 1) with an F1 score of 0.92638, and an
F1 score of 0.84207 in the Violent-Event-Categorization (VEC) shared task (sub-task 2).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description and Data</title>
      <p>
        The DA-VINCIS@IberLEF 2023 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] challenge is composed of two tasks: (1) Violent-Event-Identification
(VEI), which consists of detecting whether the input data (text and image) contains information
about a violent event, and (2) Violent-Event-Categorization (VEC), aiming to detect the violent
event sub-type (Traffic Accident, Murder, Robbery, Other). The training dataset comprises 2996
examples; for sub-task 1 we have 1277 positive examples and 1719 negative examples. For
sub-task 2, the category percentages are shown in Table 1.
      </p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Percentage of training examples per violent-event category.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Violent-Events Categories</th>
              <th>Categories Percentage</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Traffic Accident</td><td>31.38 %</td></tr>
            <tr><td>Murder</td><td>6.01 %</td></tr>
            <tr><td>Robbery</td><td>5.24 %</td></tr>
            <tr><td>Other</td><td>57.38 %</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>From Table 1 we observe that sub-task 2 is quite challenging due to the low number of
examples of the classes Murder and Robbery. The Violent-Event-Categorization task is designed
to be a multi-label task; however, the number of input texts that belong to multiple classes
is around 1% of the whole dataset. Hence we decided to treat the VEC problem as a multi-class
classification task. Each text example is associated with one or multiple images related to the
tweet content; for the training dataset we have 4259 images across all text examples.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        To determine how to use the visual and textual information in a way that can accurately
identify and categorize violent events, we propose multiple multi-modal approaches, either
combining the models' representation vectors or joining textual descriptions
of each of the images related to a text. To extract the important information from the
images, we propose different approaches, such as using a Convolutional Neural Network (CNN)
or a pre-trained image captioning model such as BLIP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. With a viable multi-modal setup,
our proposal entails the utilization of data oversampling, a weighted loss function, and a
model ensemble configuration. These subtle modifications in model training have the potential
to substantially enhance the model's performance.
      </p>
      <sec id="sec-3-0">
        <title>3.1. Image Feature Extraction</title>
        <p>• Convolutional Neural Networks: Our first approach to extracting important features
from the images is to use a Convolutional Neural Network (CNN). For this purpose, we
used a version of Inception-v3 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] pre-trained on the IMAGENET [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] dataset. We then
fine-tune this model to either detect or categorize violent events. To create a single image
per text, we concatenate the corresponding images for each tweet to form a final image
that is fed to the Convolutional Neural Network. The image pre-processing steps applied
to the images are the ones described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].</p>
        <p>• Image Captioning: We evaluate another approach to extracting important information
from the visual data using image captioning. To this aim, we apply a BLIP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] pre-trained
model to each of the images obtained from the training data. This generates a
caption describing each image. To use this information in the same language
as the tweets, we use a pre-trained Marian Neural Machine Translation model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to
translate the captions from English to Spanish. To join the information obtained from
different images, we use the connecting word and (y in Spanish) to generate a single
sentence for each input text. To correct some of the generated captions, we use
regular expressions to eliminate immediately repeated words. An example
of the connected sentence and its correction using regular expressions is shown below.
      </p>
      <p>Original obtained caption: "two ak ak ak ak ak ak ak ak ak ak ak ak ak ak ak ak ak ak and a man with a bald haircut and a bald face". Caption after correction with regular expressions: "two ak and a man with a bald haircut and a bald face".</p>
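      <p>The following is a minimal, illustrative sketch of this captioning pipeline. The Hugging Face checkpoints Salesforce/blip-image-captioning-base and Helsinki-NLP/opus-mt-en-es are stand-ins for the BLIP and Marian models; the exact checkpoints and the regular expression used in our system are assumptions of this sketch.</p>
      <preformat>
# Illustrative sketch (not necessarily the exact pipeline used in this work):
# caption each image with BLIP, translate to Spanish with Marian,
# join the captions with "y", and collapse immediately repeated words.
import re
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          MarianMTModel, MarianTokenizer)

# Assumed checkpoints (hypothetical choices for this sketch).
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
mt_tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
mt_model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def caption_image(path):
    """Generate an English caption for one image with BLIP."""
    image = Image.open(path).convert("RGB")
    inputs = blip_processor(image, return_tensors="pt")
    output = blip_model.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(output[0], skip_special_tokens=True)

def translate_to_spanish(text):
    """Translate an English caption to Spanish with Marian."""
    batch = mt_tokenizer([text], return_tensors="pt", padding=True)
    output = mt_model.generate(**batch)
    return mt_tokenizer.decode(output[0], skip_special_tokens=True)

def clean_repetitions(text):
    """Collapse runs of the same word, e.g. "ak ak ak" becomes "ak"."""
    return re.sub(r"\b(\w+)(?: \1\b)+", r"\1", text)

def caption_for_tweet(image_paths):
    """Build a single Spanish sentence describing all images of one tweet."""
    captions = [translate_to_spanish(caption_image(p)) for p in image_paths]
    return clean_repetitions(" y ".join(captions))
      </preformat>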
      <p>
        The histograms depicted in Figure 1 display the lengths of tweets and captions in the
training dataset. It is evident that both tweets and captions are relatively short, thereby falling
well within the text-length limitations of the Transformer model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] (approximately 512
tokens) and allowing for easy handling.
      </p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Multi-Modal Approaches</title>
        <p>
          The textual information contains the most relevant information for violent-event
identification and categorization (see EXP6 in Section 4.1.2). We use a RoBERTa [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] model in its base
configuration (embedding dimension of 768), with pre-trained weights from the Spanish tweet
domain described in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. To increase the information related to an event, we incorporate the
visual information through different multi-modal approaches, using the features from a CNN or
from an image captioning model as described in Subsection 3.1.
        </p>
        <p>
          • Concatenating the CNN pooler output with separate text models: We conduct full
fine-tuning on two distinct RoBERTa [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] models. The first model, referred to as
RoBERTa-Tweet (EXP6), is employed solely for tweet analysis to create a classifier for violent events.
The second model, known as RoBERTa-Captions (EXP5), is utilized exclusively for image
caption analysis. Additionally, an Inception-v3 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] model is incorporated to process the
accompanying images. The representation vector of the captions model (embedding size
of 768), the text model (embedding size of 768), and the pooler output of the CNN (vector
size of 2048) are concatenated into a single vector that is then passed to a classification
head consisting of a Multi-Layer Perceptron [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] whose outputs are the class probabilities
for each tweet (EXP7). This proposal is illustrated in Figure 2.
        </p>
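        <p>A minimal sketch of this fusion head is shown below; it assumes the three pooled representations have already been computed, and the hidden size and dropout rate are hypothetical since they are not specified above.</p>
        <preformat>
# Illustrative sketch of the EXP7 fusion head: concatenate the pooled
# representations of RoBERTa-Tweet (768), RoBERTa-Captions (768) and the
# Inception-v3 pooler output (2048), then classify with a small MLP.
# Hidden size and dropout are hypothetical choices for this sketch.
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    def __init__(self, num_classes, hidden_size=512, dropout=0.1):
        super().__init__()
        fused_dim = 768 + 768 + 2048  # tweet + captions + CNN pooler output
        self.head = nn.Sequential(
            nn.Linear(fused_dim, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, tweet_vec, caption_vec, image_vec):
        fused = torch.cat([tweet_vec, caption_vec, image_vec], dim=-1)
        return self.head(fused)  # class logits for each tweet

# Example call with random tensors standing in for the encoders' outputs.
model = ConcatFusionClassifier(num_classes=4)
logits = model(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 2048))
        </preformat>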
        <p>• Using Captions and Text in the same sentence to train a single model: The second
approach to merging the visual and the textual parts is to connect the captions and
the tweet of the input data using a separator (&lt;/s&gt;&lt;/s&gt; for the RoBERTa model). This new
representation is then passed to a single Transformer model and fine-tuned for each
of the two sub-tasks.</p>
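        <p>A minimal sketch of this second approach (EXP8) is shown below; the tokenizer checkpoint is an assumption based on [14], and only the input construction is illustrated.</p>
        <preformat>
# Illustrative sketch: join the tweet and the translated caption with the
# RoBERTa separator and tokenize the pair for a single Transformer model.
from transformers import AutoTokenizer

# Assumed checkpoint for the Spanish tweet-domain RoBERTa described in [14].
tokenizer = AutoTokenizer.from_pretrained("pysentimiento/robertuito-base-uncased")

tweet = "texto del tweet"
caption = "descripción generada de la imagen"

# Passing a text pair has the same effect as manually writing
# "tweet &lt;/s&gt;&lt;/s&gt; caption": the tokenizer inserts the separator tokens.
encoding = tokenizer(tweet, caption, truncation=True, max_length=512,
                     return_tensors="pt")
print(tokenizer.decode(encoding["input_ids"][0]))
        </preformat>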
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Proposed Randomized Data Oversampling, Weighted-Loss function and</title>
      </sec>
      <sec id="sec-3-3">
        <title>Ensemble Configuration.</title>
        <p>Aside from the proposed multi-modal configurations, we believe that by tuning the parameters
of the loss function and randomly increasing the number of examples of the under-represented
categories, we can increase the performance of the proposed models. In this work, we propose
an approximation of the weights that may help to reduce the impact of the class imbalance.
We also propose to use data oversampling to make sure that at least one example of the
under-represented categories is contained in a training batch.</p>
        <p>• Proposed Randomized Data Oversampling: As described in Section 2, the dataset for
Violent-Event-Categorization (VEC) has a high level of imbalance. Therefore we propose
a randomized data oversampling strategy. To this aim, we consider the under-represented
classes Murder and Robbery. For each positive example $x$ whose label is in $\{\text{Murder}, \text{Robbery}\}$,
we take a random number $r \sim U(0, 1)$ and an oversampling factor $f$, and the number of
copies of $x$ is determined by $r \cdot f$. This results in a higher number of positive examples of the
under-represented classes. To avoid any bias, after performing this data oversampling,
we shuffle the data randomly. This oversampling procedure is only applied to the training
dataset. Comparing the class distributions before and after the randomized data oversampling,
we see that after this procedure the dataset has significantly more positive examples for the
classes Murder and Robbery. A minimal sketch of this procedure is shown below.</p>
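        <p>The following sketch illustrates the oversampling described above; rounding $r \cdot f$ down to an integer number of extra copies is an assumption of this sketch.</p>
        <preformat>
# Illustrative sketch of the proposed randomized data oversampling.
# Each under-represented example is replicated a random number of times
# governed by r ~ U(0, 1) and the oversampling factor f (set to 10 in Sec. 4.1.1).
# Rounding r * f down to an integer copy count is an assumption of this sketch.
import random

MINORITY_CLASSES = {"Murder", "Robbery"}

def randomized_oversample(examples, factor=10, seed=0):
    """examples: list of (text, label) pairs from the training split only."""
    rng = random.Random(seed)
    augmented = list(examples)
    for text, label in examples:
        if label in MINORITY_CLASSES:
            copies = int(rng.random() * factor)  # floor of r * f extra copies
            augmented.extend([(text, label)] * copies)
    rng.shuffle(augmented)  # shuffle afterwards to avoid ordering bias
    return augmented
        </preformat>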
        <p>• Weighted Loss Function: To reduce the impact of class imbalance for the classes Murder
and Robbery in VEC, we propose a modification of the Cross-Entropy Loss function,
adding weights to balance the importance of the under-represented classes. We use the
expression of the weighted Cross-Entropy Loss used by [16], which has the following form:
$\ell(x, y) = \{l_1, \ldots, l_N\}$, where
$$ l_n = -w_{y_n} \log \left( \frac{\exp(x_{n, y_n})}{\sum_{c=1}^{C} \exp(x_{n, c})} \right) \cdot \mathbb{1}\{y_n \neq \text{ignore\_index}\} $$
where $C$ is the number of classes, and $x$ and $y$ correspond to the input and the target,
respectively.</p>
        <p>The weights $w_c$ were adjusted according to the probability $p_c$ of selecting a positive
example of class $c$, obtained as
$$ p_c = \frac{\sum_{x \in D} \mathbb{1}\{y_x = c\}}{|D|} $$
where $D$ represents the original dataset without data oversampling. From experimental
results, the weights used in this work are presented in Table 2.</p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption>
            <p>Loss weight associated with each class.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Class</th>
                <th>Associated Loss Weight</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>Traffic Accident</td><td>0.15</td></tr>
              <tr><td>Murder</td><td>1.8</td></tr>
              <tr><td>Robbery</td><td>1.7</td></tr>
              <tr><td>Other</td><td>0.1</td></tr>
            </tbody>
          </table>
        </table-wrap>
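        <p>In PyTorch [16], these per-class weights are passed directly to the loss function; a minimal sketch with the Table 2 values is shown below, where the class index ordering is an assumption.</p>
        <preformat>
# Illustrative sketch: weighted Cross-Entropy Loss with the Table 2 weights.
# The class index ordering (Traffic Accident, Murder, Robbery, Other) is assumed.
import torch
import torch.nn as nn

class_weights = torch.tensor([0.15, 1.8, 1.7, 0.1])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 4)           # batch of 8 examples, 4 classes
targets = torch.randint(0, 4, (8,))  # gold class indices
loss = criterion(logits, targets)
        </preformat>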
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section we present the results of the proposed models and strategies on a custom
dataset created from the original training dataset; these experiments helped us determine the
importance of parameter tuning to boost model performance and the best multi-modal approach
for each of the tasks. We also present the official results obtained for the
VEI and VEC shared tasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Results in Custom Dataset</title>
        <p>In order to evaluate the proposed models for sub-tasks 1 and 2, we created a stratified partition
of the original training dataset consisting of 80% of the examples for training and the remaining
20% for validation. We used all available labeled examples to train the final models for the final
submission.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. TEXT-ONLY experiments</title>
          <p>
            To test the impact of parameter tuning on model performance, we conducted an experiment
to test how a Weighted Loss Function, Data Oversampling, and the combination of these two
strategies may help improve the classification capabilities of a model, especially when there is a
high imbalance among classes as in sub-task 2. For these experiments, we set the oversampling
factor $f$ to 10. We tested the impact on the performance of the model in sub-task 2 of using a
Weighted Loss Function (EXP1), Randomized Data Oversampling (EXP2), and the combination
of these two strategies (EXP3). To enable comparison, we also trained a model
without any of the suggested strategies (EXP0). For this
comparison, we consider just the textual information and we use a RoBERTa [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] model with
pre-trained weights from [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] as our base model. The learning rate for this experiment is set to
$9 \times 10^{-6}$ with a weight decay of 0.1, using AdamW [18] as the optimizer.
          </p>
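          <p>A minimal sketch of this training configuration is shown below; only the optimizer setup is illustrated, and the model checkpoint is an assumption based on [14].</p>
          <preformat>
# Illustrative sketch of the EXP0-EXP3 training configuration:
# a RoBERTa base classifier optimized with AdamW, learning rate 9e-6,
# weight decay 0.1.
import torch
from transformers import AutoModelForSequenceClassification

# Assumed checkpoint for the pre-trained Spanish tweet-domain weights of [14].
model = AutoModelForSequenceClassification.from_pretrained(
    "pysentimiento/robertuito-base-uncased", num_labels=4)

optimizer = torch.optim.AdamW(model.parameters(), lr=9e-6, weight_decay=0.1)
          </preformat>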
          <table-wrap id="tbl3">
            <label>Table 3</label>
            <caption>
              <p>Boosting strategies evaluated for sub-task 2 and their experiment identifiers.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th>Boosting Strategy</th>
                  <th>Experiment ID</th>
                </tr>
              </thead>
              <tbody>
                <tr><td>None</td><td>EXP0</td></tr>
                <tr><td>Weighted Cross-Entropy Loss (WCEL)</td><td>EXP1</td></tr>
                <tr><td>Randomized Data Oversampling (RDO)</td><td>EXP2</td></tr>
                <tr><td>RDO+WCEL</td><td>EXP3</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>From Table 3 we see that the combination of the Weighted Loss Function and the Data
Oversampling increased the model's capabilities (EXP3). It is worth mentioning that each of
the strategies significantly increased the classification performance, with the Weighted Loss
Function (EXP1) being the boosting strategy with the highest increase on its own.</p>
          <p>In this case, we see that by using these two boosting strategies together we obtained a better
performance in Violent Event Categorization. While the enhancement in model performance
may not be significantly superior compared to using any of the individual proposed strategies,
it is evident that each strategy contributes to the model’s ability to identify important features
in a distinct manner. The Weighted-Cross-Entropy approach assigns greater significance
to underrepresented classes, whereas data oversampling ensures a higher representation of
examples from these classes in the training batches.</p>
          <p>As stated in [17], model ensembles may be useful to increase the classification performance
of the models. In this case, we obtained a 0.01 to 0.02 increase when using ensembles compared
to not using them. The inference time is not compromised, as the ensemble procedure can be
done in parallel.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Multi-modal configuration</title>
          <p>
            The DA-VINCIS@IberLEF 2023 challenge [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] offers textual and visual data for each of the shared
tasks, making it crucial to identify an appropriate multi-modal configuration that can effectively
handle the VEC and VEI tasks. To obtain the best multi-modal configuration, we measured the
prediction performance of each of the extracted features and of the multi-modal approaches
proposed in this work for both VEI and VEC. We consider the Macro-Averaged F1 as our metric
because it is the same metric used to evaluate the final submissions. All models were trained
using data oversampling combined with a weighted loss function (EXP3). The models were
trained using AdamW [18] as the optimizer, except for the Inception-v3 model, where Adam [19] is
used instead. Table 4 shows the results from five runs with the different proposals.
          </p>
          <table-wrap id="tbl4">
            <label>Table 4</label>
            <caption>
              <p>Evaluated models and multi-modal configurations with their experiment identifiers.</p>
            </caption>
            <table>
              <thead>
                <tr>
                  <th>Model</th>
                  <th>Experiment ID</th>
                </tr>
              </thead>
              <tbody>
                <tr><td>Inception-v3 [<xref ref-type="bibr" rid="ref10">10</xref>]</td><td>EXP4</td></tr>
                <tr><td>RoBERTa-Captions</td><td>EXP5</td></tr>
                <tr><td>RoBERTa-Tweet</td><td>EXP6</td></tr>
                <tr><td>Outputs concatenation</td><td>EXP7</td></tr>
                <tr><td>Joining Caption and Tweet with separator</td><td>EXP8</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>
            From Table 4 we see that the features obtained from the images do not provide sufficient
information to either detect or categorize violent events (EXP4). However, when this
information is combined with the tweet using a separator as in our proposal (EXP8), it slightly
increases the classification performance of the text model. This may be due to the inherent
Self-attention mechanism of all the Transformer models. Self-attention is an attention
mechanism that establishes connections between various positions within a single sequence, enabling
the computation of a representation for the entire sequence [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. Therefore a Transformer
model is able to determine which information (from text and image) is important to create the
representation vector of the sequence to correctly identify or categorize a violent event.
          </p>
          <p>To demonstrate this, we employ an integrated gradients approach to extract significant
features from the texts and ascertain the words that are pertinent for classifying the presence
of a violent event.</p>
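          <p>A sketch of this attribution step is shown below, using Captum's LayerIntegratedGradients over the embedding layer of a fine-tuned classifier; the checkpoint path and the target class index are hypothetical placeholders.</p>
          <preformat>
# Illustrative sketch: per-token attributions with Integrated Gradients (Captum)
# for a fine-tuned RoBERTa-style classifier over "tweet + separator + caption".
# The checkpoint path and the target class index are hypothetical placeholders.
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/fine-tuned-model")
model = AutoModelForSequenceClassification.from_pretrained("path/to/fine-tuned-model")
model.eval()

def forward_logits(input_ids, attention_mask):
    return model(input_ids=input_ids, attention_mask=attention_mask).logits

# Attribute the prediction to the embedding layer, token by token.
lig = LayerIntegratedGradients(forward_logits, model.roberta.embeddings)

text = "texto del tweet &lt;/s&gt;&lt;/s&gt; descripción generada de la imagen"
enc = tokenizer(text, return_tensors="pt")
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

attributions = lig.attribute(
    inputs=enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    target=1,  # assumed index of the positive ("violent") class
)
token_scores = attributions.sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
for token, score in zip(tokens, token_scores.tolist()):
    print(token, round(score, 4))
          </preformat>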
          <p>Figure 5 reveals that certain tokens within the caption have an influence on the model to
infer a positive prediction, indicating its capability to identify the crucial features within the
combined input text and generated caption. Based on the observations from Figure 5, it becomes
apparent that words such as "accidente" (accident) and "carretera" (road) play a crucial role in
the model’s ability to recognize a Violent Event mentioned in the tweet. Notably, some of these
important words are included in the generated caption. Figure 6 illustrates a tweet in which no
violent events were detected. It is noteworthy that words like "emoji" play a role in the model’s
negative prediction. Similarly to the positive example, some of the significant words utilized to
classify the tweet as a "Non Violent" example are present within the caption generated by the
associated image.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Official submissions to the Violent-Event-Detection shared task (Sub-task 1)</title>
        <p>We made two submissions for sub-task 1: the first one, denoted RUN1, corresponds to an ensemble
of five models with different weight initializations, and the second submission (RUN2) corresponds
to a single model used for inference.</p>
        <p>The official results are shown in Table 5. We see that our proposals obtained the best results
among all participants, either with the single-model proposal or with the ensemble boosting
strategy.</p>
        <p>For the second sub-task, referring to Violent-Events Categorization, we made two submissions
using the same model configurations as mentioned in Section 4.2. The official results are shown
in Table 6.</p>
        <p>The results shown in Table 6 correspond to the final results released for sub-task 2. In this
task, our proposal obtained seventh place with an average F1 score of 0.842074 over all the
classes. The small difference between the results obtained by our proposal and the winning
team suggests that with further parameter adjustments, the F1 score could be increased. The
boosting strategies presented in this study have the potential to be adjusted in order to further
enhance model performance. For instance, instead of data oversampling, data augmentation
techniques such as generative models like GPT-3 [20] can be employed to increase text
diversity. This approach enables the model to learn broader patterns, thereby improving the
categorization of Violent Events. Similarly, in the case of the Weighted Cross-Entropy, evolutionary
strategies like the Covariance Matrix Adaptation (CMA-ES) [21] algorithm can be utilized to
determine optimal weights that maximize the F1 score for the under-represented classes.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Error Analysis</title>
        <p>According to the predictions made by the proposed model, the classes
Murder and Robbery have the lowest F1 scores of all classes in violent-event categorization.
In this section, we present some misclassified examples of these two classes. The first example
belongs to the Murder class and was misclassified and assigned to the Other
class.</p>
        <p>As we see from the picture associated with the text, it does not provide any useful information
that could be related to a murder. The text contains words like investigación (investigation)
suggesting that the content described in the tweet is not a confirmed fact.</p>
        <p>el @usuario inició investigación por delito de
homicidio, luego de un hecho registrado en la 5
de mayo, donde fallecieron dos personas ’en el
área se recopilan indicios’ señala la entidad
#exitosanoticias url &lt;/s&gt;&lt;/s&gt; un grupo de
personas de pie alrededor de una calle
(English: the @usuario opened an investigation for the crime of homicide, after an incident
registered on 5 de mayo, where two people died; ’evidence is being gathered in the area,’
states the entity #exitosanoticias url &lt;/s&gt;&lt;/s&gt; a group of people standing around a street)</p>
        <p>The example shown below is labeled as a positive example of Robbery; however, the
model predicted it as Murder. It is suggested that the word disparo (shot) is mostly
associated by the model with murder. As in the previous example, the image related to the text does
not provide much information. This shows that the visual part lacks information
to complement the textual information.</p>
        <p>naucalpan edomex emoji coche de policía emoji
emoji ambulancia emoji una mujer recibió un
disparo durante un asalto a bordo de una combi
los agresores huyeron fue en la esquina de
avenida naucalpan y calle allende, en la colonia
hidalgo @usuario url &lt;/s&gt;&lt;/s&gt; un hombre está
durmiendo en el suelo en un autobús
(English: naucalpan edomex emoji police car emoji emoji ambulance emoji a woman was shot
during a robbery aboard a combi; the attackers fled; it happened at the corner of avenida
naucalpan and calle allende, in colonia hidalgo @usuario url &lt;/s&gt;&lt;/s&gt; a man is sleeping on
the floor of a bus)</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This paper describes our participation in the DA-VINCIS@IberLEF 2023 challenge on the
Violent-Event-Identification and Violent-Event-Categorization sub-tasks. Our approaches rely on careful
parameter tuning and multi-modal strategies to solve both sub-tasks. The proposed solutions
showed strong performance in the VEI task, obtaining the best results in the challenge. For the
Violent-Event-Categorization task, further parameter tuning is required to improve the results.
Based on the experimental findings presented in this research paper, it is evident that achieving
better model performance necessitates greater diversity in text generation. Consequently,
we propose the incorporation of data augmentation strategies, such as back-translation or
generative models, to generate data specifically for the underrepresented classes. In order
to obtain more precise weights for the cross-entropy loss, we hypothesize that employing
evolutionary strategies will lead to further improvements in model performance.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors thank CONACYT, INAOE and CIMAT for the computer resources provided through
the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies
(Laboratorio de Supercómputo: Plataforma de Aprendizaje Profundo) under the project
"Identification of Aggressive and Offensive text through specialized BERT's ensembles", and the CIMAT Bajío
Supercomputing Laboratory (#300832). Esaú Villatoro-Tello was supported by the Idiap Research
Institute during the elaboration of this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Nelson</surname>
          </string-name>
          , “
          <article-title>dark participation” without representation: A structural approach to journalism's social media crisis</article-title>
          ,
          <source>Social Media+ Society</source>
          <volume>8</volume>
          (
          <year>2022</year>
          )
          <fpage>20563051221129156</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Veenstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alajmi</surname>
          </string-name>
          ,
          <article-title>Twitter as “a journalistic substitute”? examining# wiunion tweeters' behavior and self-perception</article-title>
          ,
          <source>Journalism</source>
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <fpage>488</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jarquín-Vásquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I. H.</given-names>
            <surname>Farías</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arellano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Villaseñor-Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y-Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sánchez-Vega</surname>
          </string-name>
          ,
          <article-title>Overview of DA-VINCIS at IberLEF 2023: Detection of Aggressive and Violent Incidents from Social Media in Spanish</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Arellano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Escalante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Villaseñor</given-names>
            <surname>Pineda</surname>
          </string-name>
          , M. Montes y Gómez,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sanchez-Vega</surname>
          </string-name>
          ,
          <article-title>Overview of da-vincis at iberlef 2022: Detection of aggressive and violent incidents from social media in spanish (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vallejo-Aldana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>López-Monroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Villatoro-Tello</surname>
          </string-name>
          ,
          <article-title>Leveraging events sub-categories for violent-events detection in social media</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2022</year>
          ),
          <source>CEUR Workshop Proceedings. CEUR-WS. org</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Prompt based framework for violent event recognition in spanish</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2022</year>
          ),
          <source>CEUR Workshop Proceedings. CEUR-WS. org</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Turón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Pablos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zotova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cuadros</surname>
          </string-name>
          , Vicomtech at da-vincis:
          <article-title>Detection of aggressive and violent incidents from social media in spanish</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2022</year>
          ),
          <source>CEUR Workshop Proceedings. CEUR-WS. org</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:
          <year>1907</year>
          .
          <volume>11692</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>12888</fpage>
          -
          <lpage>12900</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ioffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alemi</surname>
          </string-name>
          ,
          <article-title>Inception-v4, inception-resnet and the impact of residual connections on learning</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>31</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition</article-title>
          , Ieee,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Junczys-Dowmunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grundkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Dwojak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hoang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heafield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Neckermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Seide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Germann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Aji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bogoychev</surname>
          </string-name>
          , et al.,
          <source>Marian: Fast neural machine translation in c++</source>
          , arXiv preprint arXiv:
          <year>1804</year>
          .
          <volume>00344</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Furman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Alemany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Luque</surname>
          </string-name>
          ,
          <article-title>Robertuito: a pre-trained language model for social media text in spanish</article-title>
          ,
          <source>arXiv preprint arXiv:2111.09453</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rosenblatt</surname>
          </string-name>
          ,
          <article-title>Principles of neurodynamics. perceptrons and the theory of brain mecha-</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>