<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HomoCIC at HOMO-LAT 2025: Retrieval Augmented Classification using Transformer Architectures and Vector Knowledgebases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Cardoso-Moreno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Moreno-Mendieta</string-name>
          <email>lmorenom2021@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diana Jiménez</string-name>
          <email>dianaljl.99@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Alberto Torres-León</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Valdez-Rodríguez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Politécnico Nacional, Center for Computing Research, Computational Cognitive Science Laboratory</institution>
          ,
          <addr-line>Mexico City, 07700</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents HomoCIC's approach to the HOMO-LAT 2025 task, focusing on polarity detection in LGBT+ related content from Latin American Reddit posts. Our methodology combines retrieval augmented classification using transformer architectures with vector knowledge databases to address the challenges of cross-dialect hate speech detection. We implemented a comprehensive preprocessing pipeline to handle social media text artifacts, including URL removal, Markdown cleaning, emoji conversion to Spanish, and whitespace normalization. Our approach uses two embedding models: Amazon's Titan Text Embeddings V2 and a fine-tuned BETO model; the resulting embeddings were stored in an OpenSearch vector database. Classification was performed using k-Nearest Neighbors (kNN) with additional filtering by country and keyword to account for regional linguistic variations across Latin American countries. The fine-tuned BETO model achieved an F1-score of 0.4661, outperforming the general-purpose Titan model (F1-score: 0.3792) by approximately 0.09 points. Our results demonstrate the importance of domain-specific fine-tuning and the feasibility of using retrieval augmented classification for polarity detection in multilingual, cross-dialectal scenarios; however, the moderate performance levels suggest that working with limited training data and cross-dialect classification presents substantial difficulties for current NLP approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>hate speech detection</kwd>
        <kwd>transformer architectures</kwd>
        <kwd>vector databases</kwd>
        <kwd>BETO</kwd>
        <kwd>Spanish NLP</kwd>
        <kwd>retrieval augmented classification</kwd>
        <kwd>embedding models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent years have seen a rise in hate speech, primarily driven by increased social media
activity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The European Union defines hate speech as: “All conduct publicly inciting to violence
or hatred directed against a group of persons or a member of such a group defined by reference to
race, colour, religion, descent or national or ethnic origin” [2]. Because hate expressions contain offensive
and harmful content targeting specific communities, they tend to cause harm and conflict, and they
can spread widely due to existing prejudices [3]. Homophobic hate speech holds
particular significance, considering that LGBT+ individuals face substance abuse disorders, mental
health challenges, workplace discrimination, and restricted access to healthcare services [4, 5]. This
becomes even more critical within Mexican society, where drug consumption represents a social problem
affecting not only the LGBT+ community [6].
      </p>
      <p>Within this framework, the HOMO-MEX task [7, 8] emerged as part of IberLEF (Iberian Languages
Evaluation Forum) [9]. The task’s primary goal is to develop Natural Language Processing (NLP)
systems capable of identifying LGBT+ related hate speech in Spanish tweets, regardless of how subtle
the expression might be. The 2024 edition of HOMO-MEX [8] featured three tracks. The first was a multi-class hate
speech detection track mapping tweets to three categories (LGBT+phobic, not LGBT+phobic, and not
LGBT+ related). The second was a multilabel hate speech detection track with the possible classes Lesbophobia,
Gayphobia, Biphobia, Transphobia, Other LGBT+phobia, and Not LGBT+ related. The third track focused
on classifying song lyrics containing LGBT+phobic hate speech, with classes defined as LGBT+phobic
and Not LGBT+phobic. This marked the first inclusion of such a track in the HOMO-MEX task,
acknowledging the difficulty of identifying hate speech in songs, since detection depends on the context
and culture in which the songs were created.</p>
      <p>For 2025, the HOMO-MEX task has evolved into HOMO-LAT [10], as part of the IberLEF 2025
edition [11]. The task was expanded beyond Mexican texts to include texts from several
countries, for instance Argentina, Chile, Colombia, and Mexico. The task has two tracks.
In the first one, given a text, the country it came from, and a keyword (a hint to which segment
of the LGBT+ community the text is directed at), one must determine whether the text has positive
(POS), neutral (NEU), or negative (NEG) polarity toward the LGBT+ community. Both the training
and test datasets were composed of texts from the same countries. Track 2 required the same
classification, with the difference that the training data contained texts from Argentina, Chile, Colombia,
and Mexico, while the test data contained texts from Bolivia, Costa Rica, Cuba, Dominican Republic, Ecuador,
El Salvador, Guatemala, Honduras, Nicaragua, Panama, Paraguay, Peru, Puerto Rico, Uruguay, and
Venezuela. Track 1 is the Multi-dialect polarity detection track (Multi-class), while Track 2 is
a Cross-dialect polarity detection (Multi-labeled) task.</p>
      <p>This paper presents our approach to detecting these hate expressions against the LGBT+ community.
It consists of building vector databases and then classifying test texts with the k-Nearest
Neighbors (kNN) algorithm, based on similarity measures between contextual embeddings. The
embeddings were created with two approaches: i) using a proprietary model from Amazon with multilingual
support and ii) fine-tuning a BETO model on this particular dataset and, once trained, extracting the
contextual embeddings from BETO.</p>
      <p>The manuscript follows this structure: Section 2 presents a brief literature review for hate speech
detection, first providing a general overview and then focusing specifically on the HOMO-MEX task;
Section 3 details our approach, covering preprocessing, models and metrics; Section 4 presents the
obtained results; finally, Section 5 emphasizes the significance of our approach and indicates future
research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>This section provides a concise overview of existing research concerning hate speech and homophobic
language. The review begins with general approaches to the task and subsequently addresses works
specifically related to the HOMO-MEX task.</p>
      <sec id="sec-2-1">
        <title>2.1. General Overview</title>
        <p>Conventional Machine Learning (ML) models have been frequently combined with Natural Language
Processing (NLP) preprocessing strategies. For example, in [12], a voting-based ensemble composed
of Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR) was utilized in
conjunction with character-level and word-level n-grams, as well as syntactic n-grams. This system
was designed for the Hate Speech Spreaders (HSSs) profiling task at PAN CLEF 2021 and reported
accuracy scores of 0.73 for English and 0.83 for Spanish. Likewise, [13] implemented three
tree-based classifiers (RF, Light Gradient Boosting Machine (LightGBM), and CatBoost) optimized using
Bayesian search, employing unigram and bigram features, achieving accuracies ranging from 0.85 to
0.87 depending on the model.</p>
        <p>Convolutional Neural Networks (ConvNets) have also seen widespread application in hate speech
detection tasks. Ribeiro and da Silva [14] introduced a ConvNet architecture for hate speech classification
within the SemEval-2019 Task 5. Their model incorporated pre-trained word embeddings such as GloVe
and FastText (300 dimensions), yielding F1-scores between 0.48 and 0.69. Similarly, Siino et al. [15]
addressed the HSSs task using a ConvNet with a single convolutional layer, reaching an accuracy of
0.79 in a multilingual context (English and Spanish), with language-specific scores of 0.85 and 0.73 for
Spanish and English, respectively.</p>
        <p>The study in [2] proposed the A-stacking classifier, an ensemble method incorporating a Recurrent
Neural Network (RNN) for word embedding generation, a Long Short-Term Memory (LSTM) unit, and
a softmax output layer. This architecture was evaluated across multiple datasets under both in-domain
and cross-domain configurations. In a related effort, Corazza et al. [16] developed a modular neural
model consisting of an RNN layer, a 100-neuron dense layer, and a final output unit. This system
supported both word-level and tweet-level representations and was validated on English, German, and
Italian datasets.</p>
        <p>In the SemEval-2019 competition, the NULI team [17] fine-tuned the Bidirectional Encoder
Representations from Transformers (BERT) model to identify hate speech. Their minimal preprocessing
pipeline included emoji normalization, hashtag segmentation, and lowercasing, which contributed to
their first-place ranking in the competition.</p>
        <p>Finally, Caselli et al. [18] introduced HateBERT, a variant of BERT retrained on the Reddit Abusive
Language English (RAL-E) corpus for English abusive language detection. Evaluated alongside the
original BERT on several datasets, HateBERT consistently outperformed the baseline model, demonstrating
the advantages of task-specific pretraining.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. HOMO-MEX Literature Review</title>
        <p>The HOMO-MEX task was introduced in 2023 during the IberLEF forum [7]. It involved the construction
of a Mexican Spanish corpus of tweets containing terms associated with the LGBT+ community for the
purpose of hate speech detection. The vocabulary selection process included slurs, slang, and general
terminology gathered from social media platforms such as Twitter, Facebook, and Instagram. Variants
of key terms were also accounted for; for example, the word puto included derived forms such as pute
and putx (feminized), putito, putín (diminutive), and putote, putón (augmentative), among others.</p>
        <p>Following the compilation of terms, a large-scale web scraping process was conducted, yielding
706,886 tweets in Mexican Spanish. From this dataset, 11,000 tweets were manually annotated into
three categories: LGBT+phobic, non-LGBT+phobic, and not relevant to the LGBT+ context. Moreover,
under a multilabel classification setting, tweets could be tagged with one or more of the following:
Gayphobia, Lesbophobia, Biphobia, Transphobia, or other forms of LGBT+phobia.</p>
        <p>Among the submissions to this initial edition, the work of [19] utilized standard NLP preprocessing
and feature extraction methods, including Bag of Words (BoW), term frequency (TF), and inverse
document frequency (IDF), resulting in TF-IDF representations. The models applied were a Linear
Support Vector Machine (LSVM) and a Bagging ensemble using LSVM as the base classifier. Rivadeneira
et al. [20] focused exclusively on the multilabel task, training separate classifiers for each LGBT+phobic
category. They employed n-gram features from word tokens and weighted TF-IDF BoW representations,
using Random Forest and SVM classifiers.</p>
        <p>In [21], transformer-based models such as BERT and RoBERTa demonstrated strong performance in
both multiclass and multilabel subtasks, effectively identifying LGBT+phobic content. Similarly, the
study in [6] employed BERT models for the first track, reporting a Macro F1 score of 0.73. Additionally,
Rosauro and Cuadros [22] applied BETO [23], RoBERTuito [24], and mDeBERTa [25]—all BERT-derived
models—across both tracks, achieving Macro F1 scores of 0.84 and 0.68, respectively.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposal</title>
      <sec id="sec-3-1">
        <title>3.1. Preprocessing</title>
        <p>This section describes our proposal for both tracks of the 2025 edition of HOMO-LAT [10].
The data for this task consist of Reddit posts from different Latin American countries, including
Argentina, Chile, Colombia, and Mexico for Track 1.</p>
        <p>Several issues arise when working with this kind of data. Post length
varies widely. Since Reddit is a social network, there is no homogeneous or formal
writing style, i.e., each user may write as they wish, and slang is prevalent. Scraped text also tends to
come with formatting tags and marks and with links to other resources (news, webpages, images, links to
other users or subreddits). Therefore, preprocessing is an essential step that needs to be
handled with care.</p>
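        <p>Our preprocessing pipeline consisted of the following steps (each explained below):</p>
        <list list-type="bullet">
          <list-item><p>Remove URLs</p></list-item>
          <list-item><p>Remove quote markers</p></list-item>
          <list-item><p>Remove Markdown</p></list-item>
          <list-item><p>Remove List Prefixes</p></list-item>
          <list-item><p>Remove Separator Lines</p></list-item>
          <list-item><p>Remove Sarcasm Tags</p></list-item>
          <list-item><p>Anonymize Reddit References</p></list-item>
          <list-item><p>Convert Emojis to Spanish</p></list-item>
          <list-item><p>Limit Repeated Characters</p></list-item>
          <list-item><p>Normalize Whitespaces</p></list-item>
        </list>
        <p><bold>Remove URLs</bold> Two cases were handled when dealing with URLs: i) plain URLs, for which the URL
was simply replaced by the word ENLACE: https://example.com -&gt; ENLACE; ii) URLs formatted
with Markdown, where the user can provide a caption that is displayed in place of the URL. In this
case we removed the URL and kept only the caption: [Caption](url)
-&gt; Caption.</p>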
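        <p>The two URL cases can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the regular expressions and the function name replace_urls are assumptions, and trailing punctuation attached to a bare URL is swallowed together with it.</p>
        <preformat>
```python
import re

# Hypothetical patterns; the exact expressions used in the pipeline are not given.
MARKDOWN_LINK = re.compile(r"\[([^\]]+)\]\((?:https?://|www\.)[^)\s]+\)")
PLAIN_URL = re.compile(r"(?:https?://|www\.)\S+")


def replace_urls(text: str) -> str:
    """Keep the caption of Markdown links and replace bare URLs with ENLACE."""
    # Markdown links go first so their URLs are not caught by the plain pattern.
    text = MARKDOWN_LINK.sub(r"\1", text)
    return PLAIN_URL.sub("ENLACE", text)


example = "Mira [esta nota](https://example.com/a) y https://example.com"
print(replace_urls(example))  # Mira esta nota y ENLACE
```
        </preformat>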
        <p><bold>Remove quote markers</bold> Quote blocks are typically used in Reddit when referring to another post
or a given source. In plain text, a quote block uses the greater-than (&gt;) symbol at the start of each
of its lines. This step therefore detects lines starting with that symbol and
removes it along with any blank characters that follow, so that only the actual text is
preserved.</p>
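        <p>A possible implementation of this step is shown below. It is a sketch under two assumptions: the marker appears as a literal &gt; character (not as an HTML entity), and nested quotes with several markers should be stripped as well. The name remove_quote_markers is illustrative.</p>
        <preformat>
```python
import re

# Optional indentation, then one or more '>' markers, each with trailing blanks.
QUOTE_PREFIX = re.compile(r"^[ \t]*(?:>[ \t]*)+", flags=re.MULTILINE)


def remove_quote_markers(text: str) -> str:
    """Strip quote markers at the start of each line, keeping the quoted text."""
    return QUOTE_PREFIX.sub("", text)


quoted = "> cita\n>> anidada\ntexto > normal"
print(remove_quote_markers(quoted))
```
        </preformat>
        <p>A greater-than symbol in the middle of a line is left untouched, since only line-initial markers denote a quote.</p>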
        <sec id="sec-3-1-1">
          <title>Remove Markdown</title>
          <p>All Markdown markers were removed from the text.</p>
          <p><bold>Remove List Prefixes</bold> Markdown list prefixes (-, *, +, 1., etc.) were removed from each line of text.</p>
          <p><bold>Remove Separator Lines</bold> Some users insert their own separator lines between sections
of their posts. These vary from person to person, but a common example is a line containing only
equals (=) or dash (-) symbols. Since these separators are user specific, there is no general rule for
detecting them. Our proposal considers a line to be a separator if it consists only of the following
symbols: -_*=#\+~|/()[]{}, or if the count of these special characters is larger than the count of
alphanumeric characters.</p>
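          <p>The separator heuristic can be written down directly. The sketch below assumes that the symbol set includes both the backslash and the plus sign, and that blank lines are left for the whitespace step; the function names are hypothetical.</p>
          <preformat>
```python
# Symbol set taken from the description above (backslash and plus both included).
SEPARATOR_CHARS = set("-_*=#\\+~|/()[]{}")


def is_separator_line(line: str) -> bool:
    """Apply the two rules: only special symbols, or more symbols than alphanumerics."""
    stripped = line.strip()
    if not stripped:
        return False  # blank lines are handled by whitespace normalization
    special = sum(ch in SEPARATOR_CHARS for ch in stripped)
    alnum = sum(ch.isalnum() for ch in stripped)
    only_special = all(ch in SEPARATOR_CHARS or ch.isspace() for ch in stripped)
    return only_special or special > alnum


def remove_separator_lines(text: str) -> str:
    """Drop every line that the heuristic flags as a separator."""
    return "\n".join(l for l in text.split("\n") if not is_separator_line(l))


print(remove_separator_lines("intro\n=====\ntexto"))
```
          </preformat>
          <p>Note that the second rule is aggressive on very short lines: a line such as (a) has two special symbols and one letter, so it is treated as a separator.</p>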
          <p><bold>Remove Sarcasm Tags</bold> Reddit users tend to use the \s tag to emphasize that the content they
express is to be understood as sarcasm. Given the nature of this task, we consider that
preserving the tone is mandatory whenever possible. Therefore, we replaced the tag with
the word SARCASMO: \s -&gt; SARCASMO.</p>
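          <p>A sketch of the tag replacement follows. It assumes the tag stands alone, delimited by whitespace or the text boundaries, and it also accepts the /s spelling, which is an assumption not stated in the description. The name mark_sarcasm is illustrative.</p>
          <preformat>
```python
import re

# A backslash or slash followed by "s", standing alone as a token.
SARCASM_TAG = re.compile(r"(^|\s)[\\/]s(?=\s|$)")


def mark_sarcasm(text: str) -> str:
    """Replace the sarcasm tag with the word SARCASMO, keeping the leading blank."""
    return SARCASM_TAG.sub(r"\1SARCASMO", text)


print(mark_sarcasm("Claro que si \\s"))  # Claro que si SARCASMO
```
          </preformat>
          <p>Requiring a whitespace boundary keeps the pattern from firing inside paths or subreddit mentions.</p>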
          <p><bold>Anonymize Reddit References</bold> Reddit usernames, usually found in the form /u/username or
u/username, were replaced by the word USUARIO; subreddit mentions, in the form /r/subreddit
or r/subreddit, were replaced by SUBREDDIT; Twitter-like usernames, in the form @username,
were replaced by USUARIO; lastly, any other reference was replaced by the default OTRA
REFERENCIA.</p>
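          <p>The three explicit reference types can be handled with one pattern each, as in the sketch below. The patterns and the name anonymize_references are assumptions, and the OTRA REFERENCIA fallback is omitted because the description does not specify which references it covers.</p>
          <preformat>
```python
import re

# Each pattern requires a non-word, non-slash character (or the text start) before
# the mention, so URLs and e-mail addresses are not rewritten.
SUBREDDIT = re.compile(r"(^|[^\w/])/?r/[A-Za-z0-9_]+")
REDDIT_USER = re.compile(r"(^|[^\w/])/?u/[A-Za-z0-9_-]+")
AT_USER = re.compile(r"(^|[^\w@])@[A-Za-z0-9_]+")


def anonymize_references(text: str) -> str:
    """Replace subreddit and user mentions with fixed Spanish placeholders."""
    text = SUBREDDIT.sub(r"\1SUBREDDIT", text)
    text = REDDIT_USER.sub(r"\1USUARIO", text)
    return AT_USER.sub(r"\1USUARIO", text)


print(anonymize_references("gracias u/juan_99 por postear en /r/mexico, saludos @maria"))
```
          </preformat>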
          <p><bold>Convert Emojis to Spanish</bold> We used the emoji module for Python to first convert the emojis to
English text; then, we defined a dictionary to map each English emoji translation to Spanish. Some
examples are:</p>
          <list list-type="bullet">
            <list-item><p>confused face: cara confundida</p></list-item>
            <list-item><p>worried face: cara preocupada</p></list-item>
            <list-item><p>slightly frowning face: cara ligeramente fruncida</p></list-item>
            <list-item><p>frowning face: cara fruncida</p></list-item>
            <list-item><p>face with open mouth: cara con boca abierta</p></list-item>
            <list-item><p>hushed face: cara callada</p></list-item>
            <list-item><p>astonished face: cara asombrada</p></list-item>
          </list>
          <p><bold>Limit Repeated Characters</bold> It is common in social networks to repeat characters as a way
to intensify an expression. Therefore, for each word that presents repeated characters,
we trimmed the repetitions to two. Example: puuuuutooo -&gt; puutoo.</p>
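          <p>Both steps are easy to illustrate. The sketch below assumes that emojis have already been converted to colon-delimited English names such as :confused_face: (the default output style of the emoji module, to the best of our knowledge), and that only letters are subject to the repetition limit so that numbers stay intact. The dictionary is a short excerpt, and the function names are hypothetical.</p>
          <preformat>
```python
import re

# Excerpt of the English-to-Spanish dictionary, using entries listed above.
EMOJI_ES = {
    "confused face": "cara confundida",
    "worried face": "cara preocupada",
    "astonished face": "cara asombrada",
}
DEMOJIZED = re.compile(r":([a-z_]+):")
# A letter followed by two or more copies of itself.
REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")


def translate_demojized(text: str) -> str:
    """Replace :english_name: tokens with their Spanish text when known."""
    def to_spanish(match):
        name_en = match.group(1).replace("_", " ")
        return EMOJI_ES.get(name_en, match.group(0))
    return DEMOJIZED.sub(to_spanish, text)


def limit_repeated_characters(text: str) -> str:
    """Trim runs of three or more identical letters down to two."""
    return REPEATED_LETTER.sub(r"\1\1", text)


print(translate_demojized("no entiendo :confused_face:"))
print(limit_repeated_characters("puuuuutooo"))  # puutoo
```
          </preformat>
          <p>Tokens without a dictionary entry are left unchanged, which makes missing translations easy to spot.</p>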
          <p><bold>Normalize Whitespaces</bold> Whitespace is mostly used as a formatting element and does not add to
or modify the semantics of the message. Therefore, whitespace was normalized
with the following steps:</p>
          <list list-type="bullet">
            <list-item><p>Replace various types of whitespace with a standard space</p></list-item>
            <list-item><p>Replace multiple spaces with a single space</p></list-item>
            <list-item><p>Remove spaces at the beginning and end of lines</p></list-item>
            <list-item><p>Limit consecutive blank lines, keeping at most one blank line to delimit paragraphs</p></list-item>
            <list-item><p>Remove leading and trailing whitespace for the entire text</p></list-item>
          </list>
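          <p>The five steps above map onto a handful of substitutions. The following is a minimal sketch; the function name normalize_whitespace and the treatment of carriage returns as line breaks are assumptions.</p>
          <preformat>
```python
import re


def normalize_whitespace(text: str) -> str:
    """Apply the five normalization steps in order."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # 1. Any whitespace other than a newline (tabs, non-breaking spaces, ...) becomes a space.
    text = re.sub(r"[^\S\n]", " ", text)
    # 2. Runs of spaces collapse to a single space.
    text = re.sub(r" {2,}", " ", text)
    # 3. Spaces at the beginning and end of each line are removed.
    text = "\n".join(line.strip(" ") for line in text.split("\n"))
    # 4. Three or more newlines shrink to two, i.e. at most one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 5. Leading and trailing whitespace of the whole text is removed.
    return text.strip()


print(repr(normalize_whitespace("  Hola\u00a0\u00a0mundo \t\n\n\n\nadios  ")))
```
          </preformat>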
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Models Used</title>
        <p>Our approach consisted of using different models to generate semantic contextual embeddings as
a numeric representation of each text. Once the embeddings were generated, they were stored in an
OpenSearch [26] Vector Knowledge Database. Classification was then performed using the k-Nearest
Neighbors (kNN) algorithm: each test text was embedded and placed in the same vector space as the
database, and it was assigned the class with the most votes among its neighbors. We used two models for the embeddings:</p>
        <list list-type="bullet">
          <list-item><p>Titan Text Embeddings V2 [27], which is a proprietary model from Amazon</p></list-item>
          <list-item><p>BETO [28]</p></list-item>
        </list>
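        <p>The voting scheme can be illustrated without a vector database. The sketch below is a plain in-memory version, not the OpenSearch-based setup used in this work; the value k=3, the two-dimensional toy vectors, and the function names are assumptions made only for the example.</p>
        <preformat>
```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_classify(query, index, k=3):
    """index is a list of (embedding, label) pairs; returns the majority label."""
    ranked = sorted(index, key=lambda item: cosine_similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    # Ties are resolved in favor of the label seen first among the nearest neighbors.
    return votes.most_common(1)[0][0]


toy_index = [
    ([1.0, 0.0], "NEG"),
    ([0.9, 0.1], "NEG"),
    ([0.0, 1.0], "NEU"),
    ([0.1, 0.9], "NEU"),
    ([0.7, 0.7], "POS"),
]
print(knn_classify([1.0, 0.2], toy_index))  # NEG
```
        </preformat>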
        <sec id="sec-3-2-1">
          <title>3.2.1. Titan Text Embeddings V2</title>
          <p>Amazon presented the Titan model in April 2024. It is specifically designed to generate
text embeddings and is multilingual, with support for over 100 languages, including Spanish.
Since this model is proprietary, no fine-tuning can be performed; therefore, after preprocessing, texts were
directly converted into vector embeddings. Performance validation was carried out with the complete
dev set.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. BETO</title>
          <p>For BETO, we fine-tuned the model with the training dataset; the dev set was partitioned equally into
two subsets. The first subset monitored the fine-tuning process, while the second subset evaluated
the performance of the kNN classifier. After fine-tuning, we froze the best-performing weights and
extracted contextual embeddings from the [CLS] token, since it provides a sentence-level representation of
each text, for both the training set and the first partition of the dev set. These embeddings were indexed
in the OpenSearch vector database. Finally, we used the second partition of the dev set to perform
classification via kNN search over the stored embeddings.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Classification using kNN</title>
        <p>The HOMO-LAT competition asked for the models to classify a text into one of three polarity labels:
positive (POS), neutral (NEU) and negative (NEG). In addition to the texts, the dataset provided the
country of the subreddits where the text was written and a keyword, a reference to which segment of
the LGBT+ community the text was talking about.</p>
        <p>Considering that slang and expressions change between countries, even when in all these countries
the prevalent language is Spanish, we decided to further filter the search over the vector database by
country and keyword on top of the semantic similarity search usually performed, allowing for a more
fine-grained classification.</p>
        <p>In Listing 1 we present the information stored in the OpenSearch database. The size of the embedding
depends on the model: 1536 for Titan and 768 for BETO. The metric used for semantic search is cosine
similarity. In addition to the embedding, we also stored the country and keyword of the text (for filtering
the search), the sentiment, which corresponds to the ground-truth polarity from the dataset, and the
preprocessed text.</p>
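        <p>For readers unfamiliar with OpenSearch k-NN indices, an index holding these fields might be declared roughly as below. This is an illustrative sketch and not a reproduction of Listing 1: the field names, the HNSW method, and the choice of engine are assumptions, and engine support for a given dimension depends on the OpenSearch version.</p>
        <preformat>
```python
def build_index_body(dimension: int) -> dict:
    """Return a hypothetical index body with one vector field and four metadata fields."""
    return {
        "settings": {"index": {"knn": True}},  # enables k-NN search on the index
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",  # cosine similarity
                        "engine": "lucene",  # assumed engine
                    },
                },
                "country": {"type": "keyword"},
                "keyword": {"type": "keyword"},
                "sentiment": {"type": "keyword"},
                "text": {"type": "text"},
            }
        },
    }


beto_index = build_index_body(768)
titan_index = build_index_body(1536)
print(beto_index["mappings"]["properties"]["embedding"]["dimension"])
```
        </preformat>
        <p>Declaring country and keyword as exact-match fields is what makes them usable as filters alongside the vector search.</p>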
        <p>kNN classification was enhanced with filters. For Track 1, we restricted the vectors used for classification
to those that had either the same keyword (segment of the LGBT+ community) or the same country
(from which the text came).</p>
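        <p>The effect of the filter can be seen on a toy example. The sketch below runs in memory rather than against OpenSearch; the records, the keyword values, k=3, and the fallback to the unfiltered index when no record matches are all assumptions made for illustration.</p>
        <preformat>
```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_knn(query_vec, country, keyword, records, k=3):
    """Vote among the k nearest records sharing the country or the keyword."""
    candidates = [r for r in records if r["country"] == country or r["keyword"] == keyword]
    if not candidates:  # assumption: fall back to the whole index
        candidates = records
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    votes = Counter(r["sentiment"] for r in ranked[:k])
    return votes.most_common(1)[0][0]


records = [
    {"embedding": [1.0, 0.0], "country": "Mexico", "keyword": "gay", "sentiment": "NEG"},
    {"embedding": [0.9, 0.1], "country": "Chile", "keyword": "trans", "sentiment": "POS"},
    {"embedding": [0.95, 0.05], "country": "Chile", "keyword": "trans", "sentiment": "POS"},
    {"embedding": [0.2, 0.8], "country": "Mexico", "keyword": "lesbiana", "sentiment": "NEU"},
    {"embedding": [0.1, 0.9], "country": "Argentina", "keyword": "gay", "sentiment": "NEU"},
]
print(filtered_knn([1.0, 0.05], "Mexico", "gay", records))  # NEU
```
        </preformat>
        <p>In this toy index the three nearest neighbors overall are labeled NEG, POS, and POS, so an unfiltered vote would return POS; the filter changes the outcome by excluding the Chilean records. The example also shows a risk of the approach: a filter can replace close neighbors with distant ones.</p>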
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Listing 1: OpenSearch Index Configuration</p>
      <p>The moderate performance scores obtained by both models can be attributed to several factors
inherent to the dataset composition and the challenging nature of the classification task. An analysis
of the dataset reveals that the neutral (NEU) class was significantly overrepresented in the training
and development datasets, with 3174 and 888 samples respectively, creating a class imbalance that
influenced the models’ decisions. The positive (POS) class had
633 samples in the training dataset and 80 in the development dataset, while the negative (NEG) class had 1960
samples in the training dataset and 475 in the development dataset. It is important to note that in our
proposal we enlarged the training data by combining the training and development datasets
into a single one before classifying the test dataset. Therefore, the class distribution of
our custom training set is:</p>
      <list list-type="bullet">
        <list-item><p>NEU: 4062 samples</p></list-item>
        <list-item><p>POS: 713 samples</p></list-item>
        <list-item><p>NEG: 2435 samples</p></list-item>
      </list>
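      <p>The combined counts and the resulting class shares can be checked with a few lines of arithmetic, using only the per-split counts reported above:</p>
      <preformat>
```python
train = {"NEU": 3174, "POS": 633, "NEG": 1960}
dev = {"NEU": 888, "POS": 80, "NEG": 475}

# Merge the two splits and express each class as a percentage of the total.
combined = {label: train[label] + dev[label] for label in train}
total = sum(combined.values())
shares = {label: round(100 * count / total, 1) for label, count in combined.items()}

print(combined)  # {'NEU': 4062, 'POS': 713, 'NEG': 2435}
print(total)     # 7210
print(shares)    # {'NEU': 56.3, 'POS': 9.9, 'NEG': 33.8}
```
      </preformat>
      <p>The neutral class thus accounts for more than half of the merged set, while the positive class is just under a tenth of it.</p>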
      <p>This imbalance led to a systematic bias where both models, particularly BETO, exhibited a tendency
to classify ambiguous or borderline cases as neutral, resulting in numerous false positive predictions for
the NEU class. The higher precision (0.4762) compared to recall (0.4641) for the BETO model suggests
that while the model was reasonably confident in its positive predictions, it struggled to capture all
instances of the minority classes due to the overwhelming presence of neutral samples.</p>
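      <p>To make the relation between these metrics concrete, the sketch below computes macro-averaged precision, recall, and F1 on invented toy labels in which the neutral class is over-predicted. It assumes macro averaging, which is not stated explicitly here; the labels and the function name are hypothetical and unrelated to the actual predictions.</p>
      <preformat>
```python
def macro_scores(y_true, y_pred, labels):
    """Return macro-averaged precision, recall and F1 over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: several NEG and POS texts are wrongly predicted as NEU.
y_true = ["NEU"] * 4 + ["NEG"] * 3 + ["POS"] * 2
y_pred = ["NEU", "NEU", "NEU", "NEG", "NEG", "NEU", "NEU", "POS", "NEU"]
p, r, f1 = macro_scores(y_true, y_pred, ["NEU", "NEG", "POS"])
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.6667 0.5278 0.5556
```
      </preformat>
      <p>Under macro averaging, the F1-score is the mean of the per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall.</p>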
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work presents an approach for polarity detection in LGBT+ related content from Latin American
Reddit posts, combining retrieval augmented classification with transformer architectures and vector
knowledge databases. Our methodology demonstrates the feasibility of using preprocessed text
embeddings stored in OpenSearch vector databases for k-Nearest Neighbors classification, particularly when
enhanced with country and keyword filtering to account for regional linguistic variations.</p>
      <p>The preprocessing pipeline proved crucial for handling the inherent challenges of social media text,
including variable post lengths, informal writing styles, slang usage, and formatting artifacts. Our
systematic approach to removing URLs, quote markers, markdown elements, and normalizing emojis
to Spanish text representations significantly improved the quality of the input data for downstream
processing.</p>
      <p>The comparative analysis between Amazon’s Titan Text Embeddings V2 and the fine-tuned BETO
model reveals the importance of domain-specific adaptation. The fine-tuned BETO model achieved an
F1-score of 0.4661, outperforming the general-purpose Titan model (F1-score: 0.3792) by
approximately 0.09 points. This performance gap underscores the value of task-specific fine-tuning for
contextual embeddings, even when compared to multilingual models with broader language support.
The integration of additional metadata filtering by country and keyword within the vector similarity
search represents a novel contribution to the field, acknowledging that expressions and slang vary
significantly across Latin American countries despite sharing Spanish as the prevalent language. This
approach enables more fine-grained classification that considers both semantic similarity and regional
linguistic characteristics.</p>
      <p>Despite these methodological contributions, the overall F1-scores obtained by our models and similar
results reported by other competitors indicate significant challenges inherent to this task. The moderate
performance levels suggest that working with limited training data and cross-dialect classification
presents substantial difficulties for current NLP approaches. The scarcity of labeled data, particularly for
underrepresented dialects and nuanced expressions of polarity toward LGBT+ communities, constrains
the ability of transformer models to learn robust representations. Furthermore, the cross-dialect nature
of Track 2, where models trained on data from Argentina, Chile, Colombia, and Mexico were evaluated
on texts from fifteen different Latin American countries, highlights the complexity of generalizing
across regional linguistic variations. To address these limitations, future work should focus on data
augmentation techniques specifically designed for dialectal variations, exploration of few-shot learning
approaches that can better leverage limited training examples, and the development of transfer learning
strategies that can more effectively bridge the gap between source and target dialects. Additionally,
incorporating external linguistic resources and dialect-aware pre-training could potentially improve
model robustness when faced with the inherent data sparsity and cross-dialectal challenges characteristic
of this domain.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Claude in order to perform: Grammar and
spelling check and Paraphrase and reword. After using this tool/service, the author(s) reviewed and
edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors gratefully acknowledge the Instituto Politécnico Nacional (Secretaría Académica, COFAA,
SIP under Grant 20230140, Centro de Investigación en Computación) and the Secretaría de Ciencia,
Humanidades, Tecnología e Innovación (SECIHTI) for their financial support in developing this work.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[2] S. Agarwal, C. R. Chowdary, Combating hate speech using an adaptive ensemble learning
model with a case study on covid-19, Expert Systems with Applications 185 (2021) 115632.
URL: https://www.sciencedirect.com/science/article/pii/S0957417421010265. doi:10.1016/j.eswa.2021.115632.</p>
      <p>[3] F. Alkomah, X. Ma, A literature review of textual hate speech detection methods and datasets,
Information 13 (2022). URL: https://www.mdpi.com/2078-2489/13/6/273. doi:10.3390/info13060273.</p>
      <p>[4] K. I. Fredriksen-Goldsen, H.-J. Kim, S. E. Barkan, A. Muraco, C. P. Hoy-Ellis, Health disparities
among lesbian, gay, and bisexual older adults: Results from a population-based study, American
journal of public health 103 (2013) 1802–1809.</p>
      <p>[5] K. I. Fredriksen-Goldsen, L. Cook-Daniels, H.-J. Kim, E. A. Erosheva, C. A. Emlet, C. P. Hoy-Ellis,
J. Goldsen, A. Muraco, Physical and mental health of transgender older adults: An at-risk and
underserved population, The Gerontologist 54 (2014) 488–500.</p>
      <p>[6] M. Shahiki-Tash, J. Armenta-Segura, Z. Ahani, O. Kolesnikova, G. Sidorov, A. Gelbukh, Lidoma
at homomex2023@ iberlef: Hate speech detection towards the mexican spanish-speaking lgbt+
population. the importance of preprocessing before using bert-based models, in: Proceedings of
the Iberian Languages Evaluation Forum (IberLEF 2023), 2023.</p>
      <p>[7] G. Bel-Enguix, H. Gómez-Adorno, G. Sierra, J. Vásquez, S. T. Andersen, S. Ojeda-Trueba, Overview
of homo-mex at iberlef 2023: Hate speech detection in online messages directed towards the
mexican spanish speaking lgbtq+ population, Natural Language Processing 71 (2023).</p>
      <p>[8] H. Gómez-Adorno, G. Bel-Enguix, H. Calvo, J. Vásquez, S. T. Andersen, S. Ojeda-Trueba, T.
Alcántara, M. Soto, C. Macias, Overview of homo-mex at iberlef 2024: Hate speech detection towards
the mexican spanish speaking lgbt+ population, Natural Language Processing 73 (2024).</p>
      <p>[9] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language
Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages
Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for
Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.</p>
      <p>[10] G. Bel-Enguix, H. Gómez-Adorno, S. Ojeda-Trueba, G. Sierra, J. Barco, E. Lee, J. Dunstan, R.
Manrique, Overview of HOMO-LAT at IberLEF 2025: Human-centric polarity detection in Online
Messages Oriented to the Latin American-speaking lgbtq+ populaTion, Procesamiento del lenguaje
natural 75 (2025) –.</p>
      <p>[11] J. Á. González-Barba, L. Chiruzzo, S. M. Jiménez-Zafra, Overview of IberLEF 2025: Natural
Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the
Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the
Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org, 2025.</p>
      <p>[12] F. Balouchzahi, H. L. Shashirekha, G. Sidorov, Hssd: Hate speech spreader detection using n-grams
and voting classifier, in: CLEF (Working Notes), 2021, pp. 1829–1836.</p>
      <p>[13] E. Roberts, Automated hate speech detection in a low-resource environment, Journal of the Digital
Humanities Association of Southern Africa 5 (2024).</p>
      <p>[14] A. Ribeiro, N. Silva, Inf-hateval at semeval-2019 task 5: Convolutional neural networks for
hate speech detection against women and immigrants on twitter, in: Proceedings of the 13th
International Workshop on Semantic Evaluation, 2019, pp. 420–425.</p>
      <p>[15] M. Siino, E. Di Nuovo, I. Tinnirello, M. La Cascia, et al., Detection of hate speech spreaders using
convolutional neural networks, in: CLEF (Working Notes), 2021, pp. 2126–2136.</p>
      <p>[16] M. Corazza, S. Menini, E. Cabrio, S. Tonelli, S. Villata, A multilingual evaluation for online hate
speech detection, ACM Transactions on Internet Technology (TOIT) 20 (2020) 1–22.</p>
      <p>[17] P. Liu, W. Li, L. Zou, Nuli at semeval-2019 task 6: Transfer learning for offensive language detection
using bidirectional transformers, in: Proceedings of the 13th International Workshop on Semantic
Evaluation, 2019, pp. 87–91.</p>
      <p>[18] T. Caselli, V. Basile, J. Mitrović, M. Granitzer, Hatebert: Retraining bert for abusive language
detection in english, arXiv preprint arXiv:2010.12472 (2020).</p>
      <p>[19] C. Macias, M. Soto, T. Alcántara, H. Calvo, Impact of text preprocessing and feature selection on
hate speech detection in online messages towards the lgbtq+ community in mexico, in: Proceedings
of the Iberian Languages Evaluation Forum (IberLEF 2023), 2023.</p>
      <p>[20] E. Rivadeneira-Pérez, M. de Jesús García-Santiago, C. Callejas-Hernández, Cimat-nlp at
homomex2023@ iberlef: Machine learning techniques for fine-grained speech detection task (2023).</p>
      <p>[21] M. G. Yigezu, O. Kolesnikova, G. Sidorov, A. Gelbukh, Transformer-based hate speech detection
for multi-class and multi-label classification (2023).</p>
      <p>[22] C. F. Rosauro, M. Cuadros, Hate speech detection against the mexican spanish lgbtq+ community
using bert-based transformers (2023).</p>
      <p>[23] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained bert model and
evaluation data, arXiv preprint arXiv:2308.02976 (2023).</p>
      <p>[24] J. M. Pérez, D. A. Furman, L. A. Alemany, F. Luque, Robertuito: a pre-trained language model for
social media text in spanish, arXiv preprint arXiv:2111.09453 (2021).</p>
      <p>[25] P. He, J. Gao, W. Chen, Debertav3: Improving deberta using electra-style pre-training with
gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).</p>
      <p>[26] OpenSearch Contributors, Opensearch: Open source search and analytics suite, 2021. URL:
https://opensearch.org/.</p>
      <p>[27] Amazon Web Services, Titan embed text v2, 2023. URL:
https://us-east-1.console.aws.amazon.com/bedrock/home?region=us-east-1#/model-catalog/serverless/amazon.titan-embed-text-v2:0.</p>
      <p>[28] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained bert model and
evaluation data, in: PML4DC at ICLR 2020, 2020.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Utami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Hartanto</surname>
          </string-name>
          ,
          <article-title>Systematic literature review of hate speech detection with text mining</article-title>
          ,
          <source>in: 2020 2nd International Conference on Cybernetics and Intelligent System (ICORIS)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICORIS50180.2020.9320755</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>