<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UZH at PAN-2023: Profiling Cryptocurrency Influencers using Ensemble of Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abhinav Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Le Hoang Minh Trinh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Afshan Anam Saeed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>In the era of social media, the impact of social content can be massive, especially when cryptocurrency investors rely on peer advises. The crypto market has been shown to be volatile, and certain social media users have a stronger influence in the market through their social media posts than others. In this paper, we aim to classify the social media based cryptocurrency influencers into categories depicting their influence in the crypto market based on their English tweets. The task is performed under low resource setting due to limited data availability, and is done using fine tuning approaches for few shot learning. We fine tune various large language models including Bert, Roberta and Electra. We additionally fine tune a combination of models to create ensemble models to take advantages of multiple pre-trained models for our classification task. We find that the ensemble models provide better test accuracy than the single models for few shot learning tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Natural Language Processing</kwd>
        <kwd>Few Shot Learning</kwd>
        <kwd>Transformers</kwd>
        <kwd>Fine Tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Cryptocurrencies, characterized by their decentralized and digital nature, have garnered
immense attention from investors seeking lucrative opportunities. However, the volatile and
often chaotic nature of cryptocurrency markets poses challenges for investors looking to make
informed investment decisions. Unfortunately, traditional forecasting approaches that rely
solely on numerical historical data are often insuficient in capturing the complex dynamics
and sentiment surrounding cryptocurrencies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        In recent years, social media platforms have emerged as influential spaces where discussions
and opinions about cryptocurrencies flourish. Tweets and online conversations about various
cryptocurrencies have the power to shape public perception and drive market sentiment. This
phenomenon is particularly evident during significant events, such as major price fluctuations
or the introduction of new cryptocurrencies. As a result, conventional forecasting methods that
overlook the impact of social media sentiment fail to capture crucial information [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Behavioral finance theories suggest that individuals are prone to making decisions based on
biases and herd behavior when faced with uncertainty. In the context of cryptocurrencies, the
abundance of tweets and social media discussions creates an environment where investors are
susceptible to the influence of others. The so-called "hype" surrounding certain cryptocurrencies,
often fueled by social media activity, can trigger herd behavior and impact investment decisions
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Studies have shown that cryptocurrencies valued and endorsed by experts tend to be more
successful, underscoring the role of social media in shaping investor sentiment.
      </p>
      <p>In volatile environments, where rapid price fluctuations and market uncertainty prevail,
social media platforms become even more important for investors. Users seek advice, opinions,
and insights from other social media users to gain a better understanding of market trends
and potential investment opportunities. The real-time nature of social media, coupled with
the ability to connect with a large community of investors, makes it an attractive source of
information and a platform for sharing experiences. However, caution must be exercised, as the
reliability and expertise of social media users can vary significantly.</p>
      <p>In this work, we aim to categorize users into an influencer profile based on their English
tweets using Transformer Encoder models. Since availability of social media tweets specific
to Cryptocurrency influencers is limited, the goal of this project is to use this limited amount
of data to create a model that is able to profile users based on their influence in the crypto
market. We will begin by introducing Few-Shot Learning, Ensemble Modeling techniques, and
the dataset. We would then discuss the methodology used, which includes details about the
hugging face pre-trained models taken under consideration. We conclude by describing the
results of our single body and two body models respectively.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>2.1. NLP in Finance</title>
        <p>The field of finance has undergone a significant transformation in recent years, with traditional
forecasting techniques being complemented by innovative approaches that leverage the power
of natural language processing (NLP). By incorporating NLP into financial analysis, professionals
can gain valuable insights into market trends, investor sentiment, and the forecasting of future
trends. NLP techniques, such as sentiment analysis, event detection, trend analysis, and risk
management, enhance decision-making processes and enable more accurate forecasting. In the
present context, we will be using fine-tuning approaches of NLP in understanding the coherence
within user tweets of a specific class, so as to identify the users significantly impacting the
cryptocurrency market.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Few Shot Learning</title>
        <p>Obtaining large amounts of data for training large models is a challenging and expensive
afair. Within the realm of natural language processing (NLP) tasks, the focus is on eficiently
solving new tasks using only a small number of labeled examples. Recent advancements in
self-supervised pre-training of transformer models, which employ language modeling objectives,
have yielded remarkable achievements in learning general-purpose parameters applicable to
diverse downstream NLP tasks. Nevertheless, despite the benefits of pre-training, these models
are not optimized for fine-tuning with limited supervision. Consequently, they still require
substantial amounts of task-specific data to achieve satisfactory performance.</p>
        <p>
          Few-shot learning tackles the challenge of training classifiers with minimal amounts of
training data. This includes the extreme case of zero-shot learning, where no labeled data is
available for training purposes, and few-shot classification, where classifiers are trained with
only a few labeled examples per class. Few-shot learning typically relies on representing task
labels in a textual format, such as their names or concise descriptions. Its applications span
various domains and are widely used in image classification, sentiment classification from short
text, and object recognition [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Ensemble Models</title>
        <p>The instances in Few Shot Learning can depend on Fine-tuning of a single pre-trained
transformer model. To leverage the collective intelligence of multiple transformer models, Ensemble
Deep Learning methods exist that combine the predictions of multiple individual models to
obtain enhanced and more accurate results. This method allows for more robust and reliable
predictions and the diversity in the models can help compensate for the weaknesses or biases of
individual models, resulting in better overall performance. Due to its multiple model strength,
this method is good for handling complex tasks and is a valuable technique for tasks where
accurate predictions are essential.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>Two dataset files are provided by the shared task organisers, namely, train_text and train_truth.
The train_text dataset provided contains the following features - ’Twitter User Id’, ’Text’ (English
tweets), and ’Tweet Id’. For every Twitter user ID, the labels are mentioned in the train_truth
dataset. There are five classes present in the train_truth dataset, namely, ’no influencer’, ’nano’,
’micro’, ’macro’, and ’mega’. These labels are grouped in the order of the User Id’s influence
on their followers, with ’no influencer’ being the least influential and ’mega’ being the most
influential. Consequently, it depicts how much of an influence the user has in the crypto market.
There are multiple tweets given for each user, ranging from 1-10 tweets per user, and 32 users
given for every class label.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data pre-processing</title>
        <p>In this section, we pre-process the tweets which we got for training our model. We joined the
train_text and train_truth dataset into a single data frame on the ’User Ids’ and removed the
’Tweet Ids’ from the resulting data frame, as the identifier of each tweet was not needed for our
approach.</p>
        <p>Twitter dataset underwent the following process:
• Noise data removal
• Short tweet removal
1) Noise data removal: Here noise means words or any text that doesn’t add any relevance to
the classification task. Based on this reasoning, we removed URLs present in the tweets as they
are mostly short links or dead links. User mentions starting with ’@’ are also removed from the
tweets.</p>
        <p>2) Short tweet removal: Short tweets often lack suficient information, serving as noise in
datasets. Removing them enhances data quality by preserving more relevant information for
processing. Some short tweets are merely retweets or emotional responses, lacking substantial
content. The limited information in these tweets poses a challenge for models to discern
meaningful patterns. We remove any tweet that is shorter than five words.</p>
        <p>These features are input into various models and outputs are used to validate the models. In
our experiment, we found two body ensemble model more accurate than other and in rest of
the article we will focus on that. First we will talk about pre-trained models that we used and
then we discuss how we combined the output of these models to get the desired result.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Problem Modeling</title>
        <p>
          Our initial proposal for the given Few-Shot classification problem is to make use of standard
ifne-tuning pre-trained language models individually. Thus, we use pre-trained Large Language
Models (LLMs) on tweets or for general text classification. Most of these models have more
than 350M parameters and the evaluation was done on a subset of the dataset, which we took
to be the test dataset. The models utilised are: Twitter RoBERTa Large [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], RoBERTa Large [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
Electra Large [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], BERTweet Large [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] finetuned on TweetNER7 dataset [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]. For simplicity,
this last model will be referred to as ’BERTweet Large’ throughout the rest of this paper.
        </p>
        <p>To evaluate the fine-tuning results on the test data, we used two strategies—referred to as the
Concatenation Method and the Single Tweet Classification Method—each gauged on accuracy
and F1 score metrics. Additionally, to prevent overfitting, early stopping was employed.</p>
        <p>We used two approaches for evaluating our fine-tuning results on the test data. Both these
approaches were evaluated based on the accuracy and F1 score. Early stopping was added to
avoid over-fitting:</p>
        <p>Concatenation Method: For a given user, we concatenate their tweets up to a maximum
token size of 512. Each model then classifies these concatenated tweets into their corresponding
labels. We ensure the tweets are separated by a model-specific separation token (e.g., ’[SEP]’
for BERT and ’&lt;/s&gt;’ for RoBERTa).</p>
        <p>Single Tweet Classification Method: This strategy involves each model classifying tweets
individually, grouping them based on user id. To determine a user’s label, we use majority
voting among the classified tweets. In case of a tie, we extract the final label by averaging the
probabilities.</p>
        <p>From our preliminary analysis, it became apparent that the performance of an individually
ifne-tuned model left room for improvement. In addition, we observed that the Concatenation
Method consistently outperformed the Single Tweet Classification Method. With these findings
in mind, we sought to further enhance our performance and thus, we formulated a novel
approach, which we refer to as the "Two Body Model".</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Pretrained Models Used</title>
        <sec id="sec-3-4-1">
          <title>TwHIN-BERT</title>
          <p>
            TwHIN-BERT [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] stands for Twitter Heterogeneous Information Network BERT. It’s a
pretrained language model built on BERT’s architecture, trained on 7 billion multilingual tweets.
Enhanced with social media-specific features, it is highly efective in capturing the nuances of
tweets. The dataset used for training is of high quality, having been thoroughly preprocessed to
remove noisy or irrelevant tweets.
          </p>
        </sec>
        <sec id="sec-3-4-2">
          <title>Twitter RoBERTa Large</title>
          <p>
            The Twitter RoBERTa Large model [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], is a RoBERTa based language model, specifically
finetuned for Twitter data. Initially trained on a corpus of 90 million tweets, the model has since
been continuously updated on a rapidly growing Twitter dataset, reaching a total of 154 million
tweets by the time of our usage. This model’s specialization and ongoing adaptation make it
particularly efective at understanding the dynamic nature of discourse on Twitter. Its ability to
stay current is especially valuable in fast-evolving fields such as cryptocurrencies.
          </p>
        </sec>
        <sec id="sec-3-4-3">
          <title>DistilBERT</title>
          <p>
            DistilBERT, a streamlined variant of BERT by Hugging Face [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], has 40% fewer parameters,
thus ofering speed and eficiency benefits. Despite its smaller size, it retains 95% of BERT’s
performance on certain benchmarks. Due to its balance between performance and computational
needs, DistilBERT serves as a reliable baseline for assessing larger language models, particularly
in resource-constrained environments.
          </p>
        </sec>
        <sec id="sec-3-4-4">
          <title>ELECTRA-Small</title>
          <p>
            ELECTRA-Small is a compact version of the ELECTRA model [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Its novel two-step approach
enables eficient data use during training while its smaller size leads to less computational
requirement. Despite its compactness, ELECTRA-Small maintains competitive performance,
making it an excellent baseline for evaluating the eficiency of larger transformer models.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Two Body Model</title>
        <p>In our methodology, we adopt a novel approach named the "Two Body Model". This approach
involves extracting the embeddings of the CLS tokens from the last hidden layer of a pair
of Transformer Encoder Models. We experimented with two sets of pairs: a pair of smaller
models (Electra Small + DistilBERT) for computational eficiency, and a pair of larger models
(TwHIN-BERT + Twitter RoBERTa Large), which bear the advantage of being pre-trained on
Twitter data. The larger models, besides being pre-trained, also possess a more significant
number of parameters. This feature allows them to capture more complex and subtle signals,
leading to potentially richer and more insightful embeddings.</p>
        <p>
          Once the embeddings are extracted, they are concatenated and fed into a standard
MultiLayer Perceptron that employs the Gaussian Error Linear Unit (GELU) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] activation function.
The motivation behind GELU is to bridge stochastic regularizers, such as dropout, with
nonlinearities, i.e., activation functions. The output of this model provides the classification of the
level of influence of the tweet’s author based on a concatenation of their tweets (using the
Concatenation Method).
        </p>
        <p>The primary design of this configuration is to harness the unique strengths of each component
model within the pairs. The ultimate goal is to optimize performance for our specific task,
which is Twitter data classification. Through this methodology, we capitalize on the specialized
capabilities of the selected models, thereby ensuring that our approach is well-equipped to
handle the intricacies and subtleties of Twitter discourse.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Head First Fine-tuning (HeFit)</title>
        <p>
          Within our experiments, we adapted a recent alternative two stage fine-tuning procedure by
Head-First Fine-Tuning (HeFiT) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] which demonstrated increase ability of Language Models
to adapt to the Twitter Domain during fine-tuning.
        </p>
        <p>The HeFiT approach is a two-stage process designed to gradually adapt the model to the
specific task at hand. In the first stage, only the parameters of the new classification head
are updated, keeping the pre-trained transformer encoder parameters frozen. This allows
the new classification head to learn to make predictions using the existing representations
produced by the pre-trained model. After this stage, which in our case lasted for 3 epochs, all
model parameters are unfrozen and updated during training. This allows for a more nuanced
adaptation of the model to the specific characteristics of the task and the data, as the encoders
bodies can now adjust its representations to better suit the classification problem. An additional
benefit we found with this approach was selected is signs of more stabile training and consistent
performance improvements when the labeled data is scarce.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.7. Experimental Setup</title>
        <p>
          We incorporated a gradient accumulation technique to deal with the constraints of GPU memory.
This technique allows us to efectively increase the batch size without exceeding the memory
capacity of our GPU. Specifically, we set the mini-batch size to 1 and used an accumulation step
of 16, efectively simulating a batch size of 16. This was done to ensure that our model, which
has a significant number of parameters, could be eficiently trained on a T4 GPU which has
only 16 GB of memory. For updating network weights, gradient descent is used with Adam
optimisation [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. We have initialized learning rate with value of 0.000003 which get updated
while training.
        </p>
        <p>The entire fine-tuning process was carried out for a total of 20 epochs. As a result of this
procedure, our model was not only able to leverage the power of two pre-trained language
models but also efectively adapt to the specific characteristics of Twitter text data, demonstrating
the eficacy of our Two Body Model approach.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Result</title>
      <sec id="sec-4-1">
        <title>4.1. Standard Fine-tuning</title>
        <p>Our evaluation results from the two fine-tuning strategies—Concatenation Method and Single
Tweet Classification Method—are presented in Tables 1 and 2, respectively. From these results,
it’s evident that the Concatenation Method yields higher accuracy and F1 scores for the BERT
and RoBERTa Models. Conversely, the Electra model demonstrates superior performance when
using the Single Tweet Classification Method, achieving the highest accuracy score of 0.60
amongst all models. However, the overall accuracy and F1 scores gleaned from fine-tuning
these pre-trained models are less than optimal. This may be attributed to the limited size of our
labeled dataset, as well as the relative low number of parameters in these models, which can
impede their capability to accurately classify labels.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Two-body Models</title>
        <p>We used the Two-body Models method to accomplish the classification task to take advantage
of multiple models at the same time. In our project, we worked on two combinations of models.
Model 1 includes a combination of DistilBERT and ELECTRA small. Model 2 includes Twitter
RoBERTa Large and TwHIN-BERT. The ensemble models underwent training for a total of 20
epochs. Evaluation of the two models on test dataset is reported in Table 3.</p>
        <p>We see that Model 2 performed better than Model 1. Upon the completion of the final epoch,
Model 2 yielded a training loss of 0.242 and achieved a training accuracy of 91.7%. Further
validation of the model on a separate validation set resulted in a validation loss of 1.18 and a
validation accuracy of 66.7%.</p>
        <p>A detailed performance evaluation of the TwHIN BERT and Twitter RoBERTa Large was
carried out by examining precision, recall, and f1-score metrics across diferent classes, as seen
in Table 4.</p>
        <p>The ’no influencer’ category saw a perfect precision of 1.00, indicating that when the model
predicted a ’no influencer’, it was always correct. However, the model only correctly
identiifed 33% of the actual ’no influencer’ instances, suggesting room for improvement in recall.
Consequently, the f1-score, which balances both precision and recall, stood at 0.50 for this
category. The model demonstrated a balanced performance for the ’nano’ category, achieving
both precision and recall of 0.67. This resulted in an f1-score of 0.67. For the ’micro’ category, the
model had a lower precision of 0.60, implying that 40% of the ’micro’ predictions were incorrect.
Nevertheless, it showed a perfect recall of 1.00, correctly identifying all actual ’micro’ instances.
This resulted in a relatively high f1-score of 0.75. The ’macro’ category exhibited the same
pattern as the ’no influencer’ class, with a precision of 1.00, but a lower recall of 0.33, leading to
an f1-score of 0.50. Finally, the ’mega’ category mirrored the ’micro’ class performance with a
precision of 0.60, a perfect recall of 1.00, and consequently, an f1-score of 0.75.</p>
        <p>The confusion matrix (Table 5) provides an in-depth look into the model’s predictions. It
shows that all instances of ’mega’ and ’micro’ categories were predicted correctly. On the
contrary, in the ’macro’ and ’no influencer’ categories, the model correctly classified only one
out of three instances, misclassifying the remaining ones. The ’nano’ category saw two correct
predictions, with one instance misclassified.</p>
        <p>In the shared-task oficial result, Model 2 reported a Macro F1 score of 50.21. Our model was
able to achieve a commendable performance, demonstrating its ability to efectively leverage
the unique strengths of the TwHIN-BERT and Twitter RoBERTa Large models in a synergistic
manner. This result further underscores the efectiveness of our novel approach to ensemble
modeling and the potential of the HeFiT method.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, we find from the accuracy and F1 scores that the fine-tuned two-body models are
better models than single fine-tuned models. Amongst the single models, we find that Electra
gave us significant performance as compared to the other models when using the majority
voting approach to classify users based on their tweets. Amongst the two-body models, we
obtained the best accuracy with the TwHIN BERT and Twitter RoBERTa large combination.
While the model showed robust performance for certain categories, the performance was lower
for others. These observations suggest possible avenues for further refinement of the model.
Specifically, eforts should be made to improve the recall for the ’no influencer’ and ’macro’
categories and the precision for the ’micro’ and ’mega’ categories. Addressing these issues will
likely lead to an improvement in the overall accuracy as well as individual precision and recall
metrics for each class in future iterations of the model.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>For the future work, we suggest using prompting methods on the fine tuned single and two body
models to check for increase in accuracy scores. Manual Prompting methods did not give us
good results on the pre-trained models and thus it would be interesting to see the improvements.
Additionally, higher parameter models like t5 can be worked with individually or in ensembles
to make a better classification model.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Our deepest appreciation goes to the Department of Computational Linguistics at the
University of Zurich, whose provision of essential technical infrastructure was instrumental in our
completion of the task. We are equally grateful to Simon Clematide and Andrianos Michail for
their insightful advice and technical recommendations, which significantly enriched our work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sawhney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Thakkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neerkaje</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guhathakurta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chava</surname>
          </string-name>
          ,
          <article-title>Tweet based reach aware temporal attention network for NFT valuation, in: Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics</article-title>
          , Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>6321</fpage>
          -
          <lpage>6332</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .findings-emnlp.
          <volume>471</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sawhney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wadhwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Shah</surname>
          </string-name>
          , FAST:
          <article-title>Financial news and tweet based time aware network for stock trading</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2164</fpage>
          -
          <lpage>2175</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .eacl-main.
          <volume>185</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .eacl-main.
          <volume>185</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chinea-Rios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L. D. la Peña</given-names>
            <surname>Sarracén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Franco-Salvador</surname>
          </string-name>
          ,
          <article-title>Zero and few-shot learning for author profiling</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2204</volume>
          .
          <fpage>10543</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Loureiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>Timelms: Diachronic language models from twitter</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2202</volume>
          .
          <fpage>03829</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:
          <year>1907</year>
          .
          <volume>11692</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Clark</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Electra:
          <article-title>Pre-training text encoders as discriminators rather than generators</article-title>
          ,
          <year>2020</year>
          . arXiv:
          <year>2003</year>
          .10555.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vu</surname>
          </string-name>
          , A. T. Nguyen,
          <article-title>BERTweet: A pre-trained language model for English Tweets</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ushio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sousa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <article-title>Named entity recognition in Twitter: A dataset and analysis on short-term temporal shifts, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th</article-title>
          <source>International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Online only,
          <year>2022</year>
          , pp.
          <fpage>309</fpage>
          -
          <lpage>319</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .aacl-main.
          <volume>25</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ushio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <surname>T-NER</surname>
          </string-name>
          :
          <article-title>An all-round python library for transformer-based named entity recognition, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          . eacl-demos.7. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .eacl-demos.
          <volume>7</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Malkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Florez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McWilliams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            .
            <surname>El-Kishky</surname>
          </string-name>
          ,
          <article-title>Twhin-bert: A socially-enriched pre-trained language model for multilingual tweet representations</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2209</volume>
          .
          <fpage>07562</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , T. Wolf,
          <article-title>Distilbert, a distilled version of bert: smaller, faster, cheaper</article-title>
          and lighter,
          <year>2020</year>
          . arXiv:
          <year>1910</year>
          .01108.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <article-title>Gaussian error linear units (gelus</article-title>
          ),
          <source>arXiv: Learning</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Michail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Konstantinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          , Uzh_clyp at semeval
          <article-title>-2023 task 9: Head-first finetuning and chatgpt data generation for cross-lingual learning in tweet intimacy prediction</article-title>
          ,
          <source>arXiv preprint arXiv:2303.01194</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <year>2017</year>
          . arXiv:
          <volume>1412</volume>
          .
          <fpage>6980</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>