<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Classification of the Economic Activity of a Company Using ML and DL Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roi Santos-Ríos</string-name>
          <email>roi.santos.rios@udc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidade da Coruña, Centro de Investigación CITIC - Grupo LYS, Facultad de Informática, Campus de Elviña</institution>
          ,
          <addr-line>15071 A Coruña (España)</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We present in this work our contribution to the CIDMEFEO project, developed in collaboration with the Spanish National Statistics Institute (INE). Our work focuses on the development of a text classification prototype for the identification and labeling of the different economic activities performed by Spanish companies. The classification is made according to the so-called CNAE standard, which defines 629 hierarchically ordered economic activities, taking as input the descriptions given by the companies. The great variability in the length and quality of these descriptions, together with the unbalanced nature of the datasets available for the task, makes this a very difficult problem.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Data Generation</kwd>
      </kwd-group>
      <conference>
        <conf-name>Doctoral Symposium on Natural Language Processing</conf-name>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        Text classification, a core task in NLP, has advanced significantly with both traditional and modern
methods. Earlier approaches such as Naive Bayes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], SVM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and decision trees [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] relied on
handcrafted features and bag-of-words or TF-IDF models. These techniques performed reasonably well,
but often struggled with the deeper semantics of language. FastText [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], developed by
Facebook AI, improved on these by using subword information and word embeddings, offering faster
and more efficient classification, especially for large datasets. However, while it enhanced generalization,
it still fell short in handling complex contextual meanings.
      </p>
      <p>https://dunque.github.io/ (R. Santos-Ríos)</p>
      <p>CEUR Workshop Proceedings (ISSN 1613-0073)</p>
      <p>
        The introduction of modern neural networks, particularly recurrent neural networks (RNN) and
convolutional neural networks (CNN), further improved text classification by capturing sequential
and local patterns in text. However, transformer-based models such as BERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], RoBERTa [6], and
GPT [7] revolutionized the field by learning contextualized word representations and handling
long-range dependencies. Pre-trained on massive datasets, these models have become the state-of-the-art,
delivering superior performance across various classification tasks, from sentiment analysis to topic
categorization, with minimal fine-tuning. Their ability to generalize with limited labeled data has
established transformers as the leading architecture in NLP.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and Methods</title>
      <p>Three different approaches have been considered: a baseline classifier using FastText; a more advanced
Deep Learning (DL) approach using transformers (BERT-type models); and, finally, a system using
Large Language Models (LLM), both small local models and API-based ones.</p>
      <p>However, before continuing, we first need to take a close look at our working dataset, analyzing it
and applying the necessary preprocessing techniques in order to prepare it for the models to be used.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>As explained above, the CNAE (from Clasificación Nacional de Actividades Económicas, the Spanish
National Classification of Economic Activities) is the Spanish standard classification system for economic
activities. Most of its structure is inherited from the European Community standard, the so-called
Statistical Classification of Economic Activities (NACE, from its French initials). The current working
version is CNAE-2009,2 which is structured in four hierarchical levels, as shown in Figure 1:</p>
        <p>[Figure 1: Hierarchical structure of a CNAE-2009 code: Section, Division, Group and Class.]</p>
        <p>1. Section: represented by a letter (21 categories).
2. Division: represented by 2 digits (88 categories).
3. Group: represented by 3 digits, and where the first 2 digits indicate the corresponding division
(272 categories).
4. Class: represented by 4 digits, and where the first 3 digits indicate the corresponding group (629
categories).</p>
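        <p>Since the hierarchy above is encoded directly in the digits of a class code, the division and group of a class can be derived by taking prefixes; only the section requires a separate lookup table. A minimal sketch (the function name and return format are our own illustration, not part of the CNAE standard):</p>
        <preformat>
```python
def cnae_levels(class_code):
    """Derive the division and group of a 4-digit CNAE-2009 class code.

    The section letter cannot be derived from the digits alone and would
    require a lookup table, so it is omitted here.
    """
    assert len(class_code) == 4 and class_code.isdigit()
    return {
        "division": class_code[:2],  # first 2 digits
        "group": class_code[:3],     # first 3 digits
        "class": class_code,
    }
```
        </preformat>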
        <p>The regulations state that every company must be associated with a single category on each level. This
means that, if the company develops more than one economic activity, the most relevant one must be
chosen. This, together with the fact that the descriptions are usually incomplete or ambiguous, and
that low-level categories are sometimes complex and full of exceptions, makes it a hard classification
problem, even for domain expert humans.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Preprocessing</title>
        <p>The first step was to preprocess the data, bringing it into a workable shape. For this purpose, several
datasets provided by INE for the task were joined into a single cohesive file. Next, several preprocessing
and data augmentation techniques were applied to improve and increase the amount of data available.
2https://www.ine.es/daco/daco42/clasificaciones/cnae09/notasex_cnae_09.pdf (visited on May 2025).</p>
        <p>In order to homogenize the descriptions while keeping as much useful textual information as possible,
we applied the following techniques:
• Converted all text into lowercase.
• Replaced all accented and non-UTF characters by their UTF equivalents, except for the ñ character,
which is very common in Spanish.
• Removed all numbers.
• Removed all punctuation.</p>
        <p>• Removed line breaks and extra whitespace.</p>
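        <p>The normalization steps above can be sketched as follows (our own illustration; the actual pipeline used in the project may differ in details such as the exact punctuation set):</p>
        <preformat>
```python
import re
import string
import unicodedata

def normalize(text):
    """Apply the normalization steps described above to one description."""
    text = text.lower()
    # Strip accents, keeping the common Spanish letter "ñ".
    out = []
    for ch in text:
        if ch == "ñ":
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    text = "".join(out)
    text = re.sub(r"\d+", "", text)                               # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()                      # line breaks / extra whitespace
    return text
```
        </preformat>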
        <p>Next, we built synthetic descriptions based on the titles and explanatory notes of each category of the
CNAE-2009. We applied data augmentation to enlarge this subset by taking a specialized dictionary of
synonyms (already used by INE for manual classification) and replacing the original words with their
equivalents. This process allows us to increase diversity without affecting the meaning of the samples.
By doing this, we obtained a training set of approximately 3.2 million instances, distributed as shown in
Figure 2.</p>
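        <p>The synonym-replacement augmentation can be sketched as below; the synonym dictionary shown is a toy stand-in for INE's specialized dictionary, and the sampling strategy is our own simplification:</p>
        <preformat>
```python
import random

# Toy synonym dictionary; a stand-in for INE's specialized dictionary.
SYNONYMS = {"venta": ["comercio", "distribucion"], "ropa": ["prendas", "vestimenta"]}

def augment(description, n_variants=3, seed=0):
    """Generate up to n_variants paraphrases by swapping words for synonyms."""
    rng = random.Random(seed)
    words = description.split()
    variants = set()
    for _ in range(n_variants * 10):  # bounded number of attempts
        candidate = " ".join(rng.choice([w] + SYNONYMS.get(w, [])) for w in words)
        if candidate != description:
            variants.add(candidate)
        if len(variants) >= n_variants:
            break
    return sorted(variants)
```
        </preformat>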
        <p>However, with CNAE-2009 containing 629 classes, the resulting dataset remains imbalanced,
even after adding synthetic data: the most common class accounts for 10.1% of the instances while,
in contrast, the least common accounts for only 0.00013%. As shown in Table 3, there is a long tail of
minority classes, with almost half of them (45.63%) containing fewer than 1,000 samples. The main reason
behind this severe imbalance is the structure of the Spanish economy itself, where some activities are
much more common than others.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation methods</title>
        <p>Due to the imbalance present in the dataset, we chose the weighted F1 score as the evaluation
metric to measure the performance of our classification models. This is an extension of the traditional F1
score to multi-class classification problems, where the traditional F1 score is the harmonic mean of two
key metrics: Precision (the fraction of true positive predictions out of all positive predictions made by the
model) and Recall (the fraction of true positive predictions out of all actual positive instances). Thus, the F1
score balances the trade-off between these two metrics.</p>
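        <p>For reference, the weighted F1 score described above can be computed from scratch as follows; this is a self-contained sketch of the metric (in practice a library implementation such as scikit-learn's would be used):</p>
        <preformat>
```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of the per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        pred_pos = sum(1 for p in y_pred if p == cls)
        precision = tp / pred_pos if pred_pos else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score
```
        </preformat>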
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Development</title>
      <sec id="sec-4-1">
        <title>4.1. First Approach: FastText</title>
        <p>First, we implemented a FastText-based model, which is used as our baseline. After a preliminary
tuning of the hyperparameters,3 we obtained the results shown in Table 2 for the validation set. We also
tried initializing the weights of the model with Spanish pre-trained ones, but no improvement was
obtained.</p>
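        <p>fastText's supervised mode expects one sample per line, with the label prefixed by the literal string __label__. A small helper to produce that format; the commented training call is only illustrative (it requires the fasttext package, and mapping the max. n-gram length to wordNgrams is our assumption):</p>
        <preformat>
```python
def to_fasttext_line(cnae_code, description):
    """Format one (label, text) sample for fastText supervised training."""
    return f"__label__{cnae_code} {description}"

# Illustrative training call, assuming "train.txt" was written with the
# helper above and the hyperparameter values mentioned in this section:
#   import fasttext
#   model = fasttext.train_supervised("train.txt", epoch=10, lr=0.1, wordNgrams=3)
```
        </preformat>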
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Second Approach: Transformers</title>
        <p>
          After implementing the FastText-based classifier, a baseline model was available for comparison.
Next, we proceeded with our transformer-based approach by fine-tuning several BERT-type models,
some of them pre-trained on Spanish texts:
• bert-base-uncased [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: The first BERT model, pretrained on a large corpus of English data in a
self-supervised fashion.
• roberta-base [6]: A BERT variant improved by training on larger data, removing next
sentence prediction, and using larger batches. This resulted in better performance on language
understanding benchmarks.
• PlanTL-GOB-ES/roberta-base-bne:4 This Spanish language model is based on the RoBERTa
base model and has been pre-trained using the largest Spanish corpus known to date. This corpus
contains 570 GB of clean and deduplicated text, compiled from the web crawls performed by the
National Library of Spain from 2009 to 2019.
• PlanTL-GOB-ES/roberta-large-bne:5 Based on the roberta-large architecture, and trained on
the same data as roberta-base-bne.
• PlanTL-GOB-ES/roberta-large-bne-massive:6 An intent classification model for the Spanish
language, fine-tuned from a RoBERTa-based model pre-trained on MASSIVE 1.1, a parallel
dataset of more than 1M utterances across 52 languages, annotated for the Natural Language
Understanding tasks of intent prediction and slot annotation.
        </p>
        <p>For the fine-tuning of these models, the dataset, containing more than 3.2 million samples, was split
into two parts: 90% for training and 10% for validation. Taking into account the imbalanced nature of the
data, we decided to use stratified cross-validation with 5 folds. This way, we ensure that the model is
trained and validated over all available examples in the dataset. Furthermore, by averaging
the results of the folds we obtain a better estimate of the performance of our model.</p>
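        <p>The idea behind the stratified split can be sketched without external libraries (in practice a library routine such as scikit-learn's StratifiedKFold would be used); this toy version assigns the samples of each class round-robin to the folds, so every fold preserves the per-class proportions:</p>
        <preformat>
```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Return k folds (lists of sample indices) preserving class proportions."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for lab, idxs in by_class.items():
        # Deal this class's samples round-robin across the folds.
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds
```
        </preformat>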
        <p>Table 2 presents the results obtained for the validation dataset. As we can see, bert-base-uncased
severely underperformed, as expected, since it is a smaller model trained on English texts.
3No. of epochs: 10. Learning rate: 0.1. Max. n-gram length: 3.
4https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne (visited on May 2025).
5https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne (visited on May 2025).
6https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne-massive (visited on May 2025).</p>
        <p>Surprisingly, roberta-base slightly outperformed roberta-base-bne, despite being trained on English
texts while roberta-base-bne was trained on Spanish ones. The best performance was obtained with
roberta-large-bne, which has a bigger architecture, even outperforming roberta-large-bne-massive; this
suggests that roberta-large-bne-massive generalizes worse due to being pre-trained on a more specific
dataset.</p>
        <p>In Figure 3 we can see the F1 score per class obtained by each model. On the x-axis we plotted each
class, and on the y-axis its F1 score. The classes are ordered by F1 score within each model, from lowest to
highest. This is only meant as a graphical representation of performance across all classes, not a
strict comparison of how each class performed with each model.</p>
        <p>Next, we chose to delve deeper into roberta-large-bne, as it had achieved the highest score. We studied
the individual F1 score that it obtained for each class, as shown in Table 3. An F1 score between 0 and
0.50 means that the classification is worse than random, and an F1 score of 0.50 is as good as a random
guess. Higher scores, around 0.75, represent a decent classification, and those higher than 0.90 indicate
a very good one. These results are related to the number of samples present per class: 287 classes have
fewer than 1,000 samples, and out of those classes, only 54 achieved an F1 score higher
than 0. Overall, the model fails to classify around 41% of all classes present in the CNAE, which is not a
desirable result.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Third Approach: LLMs</title>
        <p>Finally, we intended to try out API-based LLMs and some smaller local models that could be fine-tuned
with our available dataset. Regarding the API-based implementation, we tried two strategies, both of
them using the free version of ChatGPT:7
1. A multi-class approach using zero-shot learning. We asked the LLM to categorize a text according to
CNAE-2009 without further information or instructions. This first, naive strategy was unsuccessful,
since the system was not able to classify at all.
2. A multi-label approach through hierarchical classification. The LLM was asked to provide the
right section, then the division (choosing from the ones included in the given section), and so on.
We added the titles of the categories of each level to the prompt, which could be considered few-shot
learning. The performance was again unsatisfactory, as the system was not able to classify, even
with the given instructions.</p>
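        <p>The hierarchical strategy above can be illustrated with a small prompt builder, called once per level with the candidate categories of that level. The function name and wording are our own; the actual prompts used in the project were phrased differently and in Spanish:</p>
        <preformat>
```python
def build_level_prompt(description, options):
    """Build a prompt asking the LLM to pick one category at the current level.

    options: list of (code, title) pairs for the candidate categories, e.g.
    the divisions contained in the section chosen at the previous step.
    """
    lines = [
        "Classify the following company activity description:",
        f'"{description}"',
        "Answer with exactly one of the following category codes:",
    ]
    for code, title in options:
        lines.append(f"- {code}: {title}")
    return "\n".join(lines)
```
        </preformat>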
        <p>Notice that we could not include the whole explanatory notes of the candidate categories in the prompt,
since they exceeded the token limit of the free version of ChatGPT we were using. However, it is likely
that, even with a paid version of any LLM, the prompts would not be able to fit all the information
pertinent to each CNAE code.</p>
        <p>In the case of local models, we tried the following:
• meta-llama/Llama-2-7b [8]: Llama 2 is an auto-regressive language model that uses an
optimized transformer architecture. We chose the 7-billion-parameter version, which supports a wide
variety of languages.
• mistralai/Mistral-7B-v0.1 [9]: Another pre-trained generative LLM with 7 billion parameters
that allegedly surpasses Llama 2's performance.
• meta-llama/Llama-3.2-3B: An updated version of Llama with a reduced number of parameters,
to fit in less powerful hardware.</p>
        <p>Unfortunately, in spite of our attempts, none of these models fit in our currently available GPUs.
Thus, we could not fine-tune or test any of them.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Future Research</title>
      <p>As we expected, due to the imbalanced distribution of samples in the dataset, our FastText model was
able to outperform the more complex BERT-based approaches. Transformer-based models require more
data to be properly trained; thus, we will need to improve the dataset or try a different
classification approach. Currently, we are working on several fronts to improve upon these previous
experiments.</p>
      <sec id="sec-5-1">
        <title>5.1. Expanded datasets</title>
        <p>Our first experiment will be to reduce the number of samples in the dominating classes, and find the
threshold where their performance starts to decrease. This would make training take less time, and may
allow the models to focus more on the minority classes. Another improvement would be
to generate entries to reduce the imbalance gap in the dataset. For this purpose, we have created the
base_filtered dataset.</p>
        <p>By making all classes reach at least 2,000–5,000 examples, we expect to improve the ability of the
model to classify them. For this purpose, a prototype version of a data generation tool is in the works.
By prompting LLMs with the information that defines each class, alongside some examples, we can
make them produce working samples, although these tend to be "too perfect". This generation
method still needs to be refined in order to produce more realistic examples. For now, we
have managed to generate 100 samples per class.</p>
        <p>• base_data (3.24M samples): This is the dataset explained in detail in Section 3.1. It is severely
imbalanced.
• base_filtered (2.23M samples): Dataset created from base_data, where clustering was
applied to reduce the sample size of the majority classes in an attempt to combat the imbalance in
the results. Clustering was performed using SentenceTransformers (the
distiluse-base-multilingual-cased-v2 model), and the 10 majority classes were reduced to 20,000 elements each. A
severe imbalance still exists, but less so than in base_data.
• generated_data (63K samples): Dataset generated with Llama 70B, containing 100 entries per
CNAE class. Its entries consist exclusively of better-written sentences and are generally longer
than the vast majority of real-life examples of CNAE descriptions provided by companies.
• generated_data_2 (74K samples): A mixture of generated_data and a small dataset consisting
of 10,000 entries provided by INE. By mixing these two datasets, the class balance present in
generated_data was lost, but it is not as significantly unbalanced as base_data.
• codauto_data (3.30M samples): New version of base_data which features some more examples,
but whose data distribution is very similar to that of base_data. It is being used to train the production
version of the Codauto classifier, an internal classifier that INE is developing.</p>
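        <p>The reduction of the majority classes can be approximated with a simple per-class cap, as sketched below. Note that this sketch just keeps the first cap samples per label; the actual base_filtered dataset used clustering over sentence embeddings to choose which samples to keep:</p>
        <preformat>
```python
from collections import defaultdict

def cap_majority_classes(samples, cap=20000):
    """samples: list of (label, text) pairs; keep at most `cap` per label."""
    kept = []
    seen = defaultdict(int)
    for label, text in samples:
        if seen[label] >= cap:
            continue  # this class already reached the cap
        seen[label] += 1
        kept.append((label, text))
    return kept
```
        </preformat>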
        <p>INE also developed a test dataset in order to standardize the comparisons between models. This set is
made up of 1,654 samples distributed among four categories:
• confident_learning (747 samples): This subset has been developed using confident learning
algorithms on the base_data dataset to extract the most relevant and varied samples.
• handwritten (31 samples): This subset contains a few entries that have been hand-written
by experts at INE. They specifically target categories that have very similar entries, in order to
determine whether the models are able to classify samples with subtle differences.
• queries (246 samples): This subset is made up of real-world examples of submitted company
activity descriptions. It is the most important subset, as the models' performance on it reflects
most closely their expected performance when deployed.
• train_set (629 samples): This subset is made up of examples from base_data; thus, when the
models are trained with either base_data or base_filtered, they should show solid results.
The entries in this subset can be the result of data augmentation, so their quality is not expected
to be the best.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Further model tests</title>
        <p>We ran the FastText and RoBERTa models with the new datasets, as well as a new LLM:
Salamandra-2B [10]. This time we were able to train it with better hardware, alongside some code
optimizations. Table 4 shows the results of the multiple experiments we performed.</p>
        <p>As we can see, FastText still outperforms the rest of the models. Even so, our experiments show
that the transformer models perform well with the generated datasets, which, upon further expansion,
could yield the best-performing model. We will focus our efforts on improving our dataset generation
tools, to create a more definitive set.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Future work</title>
        <p>As future work, we intend to expand the transformer-based classifiers, as we think this is the approach
with the most potential. The two main directions we plan to work on are:
• A hierarchical classifier that takes advantage of the format of the CNAE codes. As the
codes are already hierarchical, this could allow for partial classifications following the Section,
Division, Group and, lastly, Class of every code.
• A ranked classifier, with which we would obtain the top k most relevant classifications for a given
example; given the nature of the problem, this could result in a more lenient system.</p>
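        <p>The ranked classifier essentially amounts to returning the k classes with the highest predicted probability. A minimal sketch over a class-to-probability mapping (our own illustration of the idea):</p>
        <preformat>
```python
def top_k(probs, k=3):
    """probs: dict mapping CNAE class code to predicted probability.

    Returns the k class codes with the highest probability, best first.
    """
    return sorted(probs, key=probs.get, reverse=True)[:k]
```
        </preformat>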
        <p>Finally, as the results of our LLM-based approach were underwhelming, we are interested in testing
these models in more depth. We need to become more familiar with the literature on this topic, in order
to choose the future approaches to take.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is part of the CIDMEFEO project, funded by the Spanish National Statistics Institute (INE),
with additional funding from the Xunta de Galicia (ED431C 2024/02) and the Centro de Investigación de
Galicia "CITIC", funded by the Xunta de Galicia through the collaboration agreement between the
Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities
for the reinforcement of the research centres of the Galician University System (CIGUS).</p>
      <p>CITIC, as a center accredited for excellence within the Galician University System and a member
of the CIGUS Network, receives subsidies from the Department of Education, Science, Universities,
and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the
FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>We have used ChatGPT for minor copy-editing and Grammarly for grammar and spelling checks. After
using these tools, we have reviewed and edited the content as needed and take full responsibility for
the publication's content.</p>
      <p>[6] Y. Liu, M. Ott, N. Goyal, J. Du, RoBERTa: A robustly optimized BERT pretraining approach, 2019.
URL: https://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[7] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving language understanding by
generative pre-training, OpenAI Blog (2018). URL: https://openai.com/index/language-unsupervised/.
[8] H. Touvron, L. Martin, K. Stone, Llama 2: Open foundation and fine-tuned chat models, 2023. URL:
https://arxiv.org/abs/2307.09288. arXiv:2307.09288.
[9] A. Q. Jiang, A. Sablayrolles, A. Mensch, Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825.
arXiv:2310.06825.
[10] A. Gonzalez-Agirre, M. Pàmies, J. Llop, I. Baucells, S. D. Dalt, D. Tamayo, J. J. Saiz, F. Espuña,
J. Prats, J. Aula-Blasco, M. Mina, A. Rubio, A. Shvets, A. Sallés, I. Lacunza, I. Pikabea, J. Palomar,
J. Falcão, L. Tormo, L. Vasquez-Reina, M. Marimon, V. Ruíz-Fernández, M. Villegas, Salamandra
technical report, 2025. URL: https://arxiv.org/abs/2502.08489. arXiv:2502.08489.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Kibriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          , G. Holmes,
          <article-title>Multinomial naive Bayes for text categorization revisited</article-title>
          ,
          <source>in: Australasian Joint Conference on Artificial Intelligence</source>
          , Springer,
          <year>2004</year>
          , pp.
          <fpage>488</fpage>
          -
          <lpage>499</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <article-title>Text categorization with Support Vector Machines: Learning with many relevant features</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Nédellec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rouveirol</surname>
          </string-name>
          (Eds.),
          <source>Machine Learning: ECML-98</source>
          <source>Machine Learning: ECML-98</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>1998</year>
          , pp.
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          ,
          <article-title>Induction of decision trees</article-title>
          ,
          <source>Machine learning 1</source>
          (
          <year>1986</year>
          )
          <fpage>81</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics 5</source>
          (
          <year>2017</year>
          )
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          . doi:10.1162/tacl_a_00051.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>