Labelling companies referred to in newspaper articles

Amine Nahid, Előd Egyed-Zsigmond, Sylvie Calabretto
amine.nahid@insa-lyon.fr, elod.egyed-zsigmond@insa-lyon.fr, sylvie.calabretto@insa-lyon.fr
Université de Lyon; LIRIS UMR 5205, Lyon, France

ABSTRACT

There are several domains where establishing links between newspaper articles and companies is useful. In this paper, we present the first elements of our solution to predict links between a newspaper article written in French and a list of companies identified by their name and activity domain. We base our study on a semi-automatically annotated article corpus and the almost complete list of official French company names. We combine statistical linguistic methods with acronym generation and filtering techniques to propose a global score that predicts a distance between a text and a company. The main objective of the study presented in this paper is the creation of a usual name list for each company in order to improve the labelling of newspaper articles. This usual name list will then be combined with other methods into a global score that predicts a distance between a text and a company, so as to end up with a model that labels a newspaper article with its corresponding companies from our list.

CCS CONCEPTS

• Information systems → Document topic models; Relevance assessment.

KEYWORDS

natural language processing, named entity recognition, information retrieval, text mining, text tagging

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

1 INTRODUCTION

Businesses have always had an interest in assessing their performance and evaluating their financial and public relations situation. Hence, information contained in press articles, client feedback, etc. may be of strategic importance.

Our main problem is to link articles with companies, for a very large number of companies registered in France, each identified by its unique national identifier (SIREN code) and its legal name. However, companies are seldom referenced in the press by their legal names, which are often long. Our project is to design a solution to link economic press articles written in French with a set of companies. We have a semi-automatically annotated article ground-truth corpus and the list of the official denominations of around 30,000 companies registered in France. Our main contribution in this paper is a protocol to construct the common names of companies given their legal names and the set of annotated articles.

We carry out our experiments and develop our tools on French-language texts, but most of the methods used can easily be adapted to other languages.

2 STATE OF THE ART

Matching press articles with the companies they mention is part of the Named Entity Recognition (NER) domain. The term NER appeared for the first time at the MUC-6 conference [5]. The task of recognising company mentions in texts is hence a sub-problem of NER, in which we are interested only in entities representing companies. The issue can be addressed with different approaches. A baseline approach would be to search for the official name of the company in the text. Nonetheless, searching for the official name of a company within a newspaper article may prove inefficient, given that most companies have usual or common names that differ slightly from their legal ones. Working on a German corpus, [3] proposed the use of dictionaries of colloquial names from various sources, as well as an alias generator that derives an alias from an official denomination (it goes through some classic NLP data cleaning: removal of legal designations, special characters and geographic indications, and token normalisation). Other works have elaborated rule-based systems, built on heuristics and/or hand-crafted rules at the morphological level [4, 6, 7]. Unfortunately, rule-based methods are domain- and language-specific, and are therefore not portable. There have recently been attempts to perform generic NER tasks using deep learning [2], but these usually need many more training examples than we have, annotated more precisely. We are also experimenting with CRF (Conditional Random Fields) based techniques, with promising results; these experiments will be reported in a future paper.

In the following section we propose a statistics-based protocol to tackle the company recognition problem through common name dictionary generation.

3 PROPOSED APPROACH

In this section we present our company usual name creation method, first based on the official names and then on generated acronyms.

3.1 Hypothesis

Since companies are rarely referred to by their legal names and are rather known by one or more common names, we need to provide an accurate automatic protocol to generate these common names. Through observation of the legal names of a set of French companies, we made the following hypotheses:
• The common name of a company might simply be its legal name.
• The common name of a company might be a contiguous sequence of terms from the legal name (a sub-word ngram of the legal name).
• The common name might be an acronym of the legal name or of some part of it.
With these hypotheses, we aim to implement a common name generator that operates in two steps: as a first step it generates the sub-sequences and then the acronyms; the second step is the search for the best subset of ngrams and acronyms to compose the common name set.

3.2 Pre-processing

For our study, we have two data sets.
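The two-step generator hypothesised in Section 3.1 can be sketched as follows. This is a simplified illustration of our own, not the authors' implementation: the French stop-word list is a tiny sample, and acronyms built over partial names are omitted for brevity.

```python
def ngrams(legal_name):
    """All contiguous term sequences of the legal name (hypotheses 1 and 2)."""
    terms = legal_name.split()
    return {" ".join(terms[i:j]) for i in range(len(terms))
            for j in range(i + 1, len(terms) + 1)}

def acronyms(legal_name, stop_words=frozenset({"DE", "DU", "LA", "LE", "DES", "ET"})):
    """Acronyms built from the first one or two letters of each remaining term
    (hypothesis 3), over the whole name only. The stop-word set is illustrative."""
    terms = [t for t in legal_name.split() if t not in stop_words]
    if len(terms) < 2:
        return set()
    out = set()
    # every combination of taking 1 or 2 leading letters from each term
    for picks in range(2 ** len(terms)):
        out.add("".join(t[: 1 + ((picks >> k) & 1)] for k, t in enumerate(terms)))
    return out
```

For example, `ngrams("COMPAGNIE DU RHONE")` yields six candidates, from "COMPAGNIE" to the full name, and `acronyms("SOCIETE FRANCAISE DU RADIOTELEPHONE")` contains "SFR" once the stop word "DU" is dropped.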
The first one catalogues around 30k French companies identified by their SIREN code (the unique French identifier for businesses and not-for-profit organisations) and their legal names, written in capital letters with no accents. The second data set contains around 120 thousand annotated French newspaper article URLs, manually labelled with the SIREN codes of the companies they are talking about. Its elements are listed in accordance with the following scheme: id, SIREN code, legal name of the company, URL address of the article. We developed a scraper that collected the title and body of the articles when available, and finally included only the articles whose content (title and text) we managed to scrape. That gave us a dataset of around 58k articles.

We cleaned the official names from the first data set by removing punctuation marks, especially dots, commas and parentheses. However, we chose to keep hyphens, as their use is very common in French for compound names, which are considered single terms in our model. Examples:
• For companies without the special characters mentioned above, e.g. ELECTRICITE DE FRANCE, nothing is removed.
• For CA INDOSUEZ WEALTH (FRANCE), the parentheses are irrelevant and would be problematic for the ngram and acronym generation; the name therefore has to be transformed into CA INDOSUEZ WEALTH FRANCE before any further processing.
• There is a company registered in France with the official name CASINO, GUICHARD-PERRACHON. The comma is removed (hence CASINO GUICHARD-PERRACHON) as it is useless for any further processing. However, as written before, the hyphen is kept because GUICHARD-PERRACHON is actually one name and should not be considered as two separate terms.
• Dots are also removed so as to normalise the acronyms in use within the legal names, e.g. SARL and S.A.R.L.
For the second data set, we concatenate the titles and bodies under a unique attribute we call corpus. We also normalise the corpus by removing non-printable Unicode characters. The articles are then put into an Elasticsearch index.

[Figure 1: Number of companies per number of referencing articles]

Since not all companies have the privilege of being talked about very often in the press, our ground truth is similarly unbalanced. For our dataset, the graph (cf. Figure 1) shows the number of companies as a function of the number of articles labelled as talking about them. 2357 French companies have more than 6 articles labelled as talking about them. We shall call these well-documented companies and focus our study on them. We consider that for the other, less-documented companies, it is difficult to generate usual names based on annotated articles.

3.3 ngram generator

We call ngrams all the contiguous sequences of terms contained in an expression. Our approach here is, for each company, to generate all possible ngrams of its legal denomination. For instance, for the company "COMPAGNIE DU RHONE", we should generate the following ngrams: "COMPAGNIE", "DU", "RHONE", "COMPAGNIE DU", "DU RHONE", "COMPAGNIE DU RHONE".

In order to filter out potentially irrelevant ngrams we introduced two rules: filter out one-character ngrams, and filter ngrams based on their frequency in the official name list.

3.3.1 Occurrence frequency. For a given ngram we compute an inverse occurrence frequency score of_score(ngram) depending on the number of times it occurs in the set of company legal names. The higher of_score(ngram) is, the more unique the ngram is.

    of_score(ngram) = 1 − count({f ∈ C | contains(ln_f, ngram)}) / count(C)    (1)

Where:
• ngram is a subsequence of the legal name of the company c, containing n words, 0 < n ≤ word_count(ln_c)
• ln_c and ln_f are the legal names of the companies c and f
• C is the set of all the companies we have
• count({f ∈ C | contains(ln_f, ngram_c)}) is the size of the subset of companies from C containing ngram_c in their official name

3.3.2 Threshold. Once we have the of_score for an ngram, we define a threshold value threshold_ngram that determines the ngrams to keep and those to discard according to their of_score. This is our second ngram filter:

    if of_score(ngram) ≥ threshold_ngram then keep ngram, else discard it.

After implementation and study we empirically set the threshold_ngram value to 0.999. At the end of this step, we end up with a key-value dictionary, called dict_ngram, where the keys are the SIREN codes of the companies and the values are the lists of potentially relevant ngrams for the given company.

3.4 Acronym generator

Many companies are referred to by an acronym in the press. We therefore complete our list of ngrams derived from the legal names with acronyms. For each legal name, we want to generate every potential acronym that might be in use. For instance:
• SOCIETE FRANCAISE DU RADIOTELEPHONE is often referred to as SFR
• SOCIETE DE DISTRIBUTION DE PAPIER would be better known as SODIPA
The acronym generation is based on the first letter or the first two letters of each term of the whole legal name or of a part of it. Let us consider for example the company registered as TONNELLERIE FRANCOIS FRERES. Our acronym generator would provide the following possibilities:
• Based on the whole name: TFF, TOFRFR, TOFRF, TOFFR, TFFR, TFRF, TFRFR, TOFF
• Based on a part of the name: TF, FF, FFR, FRF, TFR, TOF, TOFR
Given our acronym generation protocol, we end up with a considerable number of acronyms for every company: the more words a legal name has, the longer the list of acronyms. We have therefore defined some rules to filter the potentially relevant ones:
• Any non-alphabetic symbol is removed, as part of the data cleaning for this process.
• French stop words are removed from legal names before generating any acronym.
• No acronym is generated for one-word legal names.
• For each generated acronym, we verify in our ground truth, among the articles tagged with the concerned company, whether there is at least one occurrence in the corpora. If an acronym occurs at least once, we keep it in our acronym dictionary; otherwise we discard it.
• For performance reasons, no acronym is generated if the legal name is strictly more than 5 words long. This concerns 29 companies out of the list of 2357 well-documented companies.
Similarly to the previous step, we end up with a key-value dictionary, called dict_acr, where the keys are the SIREN codes of the companies and the values are the lists of retained acronyms for the given companies.

3.5 F-measure

Having these two lists dict_ngram and dict_acr of potential usual names, we have to create a method to keep those that are actually useful for linking press articles to the companies. The first thing to do is to merge dict_ngram and dict_acr into a unique dictionary of "potential common names": dict_pcn. In this dictionary, every company, referred to by its SIREN code, has a list of unique ngrams and acronyms retained after applying the filters mentioned above.

The following step is to generate, for every "potential common name" list, all its sub-sets. The aim is to select the sub-set that contains the most relevant common names for every company. To do so, we compute the F-measure of every sub-list against our indexed ground truth. The latter contains the articles from the second data set described in the pre-processing part; for each article we know to which companies it refers. For each company c, we try to find one ngram set, which we call UsualNames_c, that maximises the F-measure when retrieving articles that contain at least one ngram ∈ UsualNames_c while looking for company c.

The F-measure, introduced at the MUC-4 conference [1], is the harmonic mean of precision and recall. We define as relevant_articles_c the articles annotated as talking about company c in our dataset. We define as retrieved_articles_i the set of articles containing at least one of the common names in the i-th sub-set of the set of potential common names of company c, as found in dict_pcn(c). The precision is the proportion of relevant_articles_c amongst retrieved_articles_i, whereas the recall is the proportion of relevant_articles_c that were actually retrieved.

In practice, we query the press article Elasticsearch index for articles that contain at least one of the ngrams of the input ngram sub-set (subset_ngrams_c). The documents returned by the query are called retrieved_articles(subset_ngrams_c). The ideal case would be that the query returns exactly the articles about the company c the input sub-set belongs to; these are what we define as relevant_articles_c. E.g., for BANQUE PALATINE, let us suppose that we have, among others, the potential common name sub-set [PALATINE, BP]; we would then retrieve any article containing PALATINE or BP at least once. Meanwhile, the relevant articles are obtained by querying all the articles tagged with the SIREN code of BANQUE PALATINE in our ground-truth dataset.

The formulae for the precision, the recall and the F-measure are:

    F(subset_c,i) = 2 · (precision_c,i · recall_c,i) / (precision_c,i + recall_c,i)    (2)

if precision_c,i + recall_c,i ≠ 0, and F(subset_c,i) = 0 if precision_c,i + recall_c,i = 0. Here subset_c,i is the i-th sub-set of the set of potential common names of company c, as found in dict_pcn(c).
The index i ranges over 0 < i ≤ L_c, where L_c is the number of sub-sets of ngrams for the company c. The precision and the recall are computed as:

    precision_c,i = |{relevant_articles_c} ∩ {retrieved_articles_i}| / |{retrieved_articles_i}|    (3)

    recall_c,i = |{relevant_articles_c} ∩ {retrieved_articles_i}| / |{relevant_articles_c}|    (4)

Finally, once the computation is finished, we keep, for every company, the subset that maximises the F-measure as the definitive list of relevant common names of the concerned company. Recall that for the 29 companies excluded from acronym generation in Section 3.4, we set their usual names manually.

The computation of the F-measures is important, as the most relevant list of common names is not necessarily the largest one. In fact, adding an ngram or an acronym does not improve the F-measure in all cases. Sometimes, adding a new keyword (ngram or acronym) to a query adds noise to the results: even if the recall improves or stays constant, the precision might worsen and hence decrease the F-measure.

4 EXPERIMENTATION RESULTS

F-measures on dict_subpcn and comparison with legal names

We ran the protocol described in the previous section on the 2357 well-documented companies, i.e. those having at least 6 articles within the corpus labelled as talking about them. We calculated dict_subpcn as described in the previous section. We defined dict_first, which only keeps, for each company, the subset that maximises the F-measure, together with its F-measure, precision and recall values. We joined dict_first with dict_legal to compare the F-measures obtained with either the legal name alone or the best common name subset. Table 1 summarises the mean values of the results obtained for this experiment.

Table 1: Mean values of the F-measure computation on the well-documented companies using their common names (first row) and legal names only (second row)

                                 F-measure   Precision   Recall
    best subset of common names  0.562       0.537       0.765
    legal names only             0.446       0.476       0.560
    difference                   0.116       0.061       0.205

• We obtained an average F-measure of 0.56 with the best common name subset for every company, whereas the average when tagging only through legal names for the same sample did not exceed 0.45. Our protocol thus achieved an average F-measure improvement of 11.6 points. It also improves the recall by 20.5 points, which means that we retrieve more relevant articles using the generated potential common names.
• The minimum difference in F-measure is negligible and can be considered nought. We can therefore affirm that our protocol either improves article tagging or leaves it at a stable level.
• The precision improvement is quite low compared to the recall's. This is mainly due to complex scraping issues (such as URL redirections) that brought noise into the data, or to omitted annotations on some articles. The latter issue is a matter of interesting debate: should an article be labelled whenever it mentions a company, or only when the main topic is about it?

5 CONCLUSION AND PERSPECTIVES

In this study we proposed a method to build common, usual names for companies, based on an official or legal name list and a large set of annotated news articles. Our experiments show that the method significantly improves the F-score of the retrieval of relevant articles for a company when using our improved usual name set instead of only the official name.

The filtering of the best common name sets through a greedy F-measure based approach is complete, but we could eventually give an individual score to each potential usual name and filter names according to that score, instead of studying all possible potential usual name sets.

As our final goal is to label articles with the companies they are talking about, we propose to work on other company-article distance measures. One is based on Part-Of-Speech-tagging-based NER (Named Entity Recognition) analysis combined with machine learning methods (for instance Conditional Random Fields) to predict the words in a phrase that have a high probability of being company names. This can eliminate some false positives when company names are also common words, like Orange or But. Another idea is to train a classifier to guess the activity sector an article is talking about and to check whether it is close to the activity sector of the proposed companies. This method would help increase the precision of our retrieval.

ACKNOWLEDGMENTS

To Infolégale, for providing the ground truth dataset. To G. Benturquia, Y. Latreche, G. Meddour, T. E. Mekhalfa and A. E. Pereyra, for the most helpful work they did during their fifth-year research module.

REFERENCES
[1] Nancy Chinchor. 1992. MUC-4 evaluation metrics. In Proceedings of the 4th Conference on Message Understanding (MUC4 '92). Association for Computational Linguistics, McLean, Virginia, 22–29. https://doi.org/10.3115/1072064.1072067
[2] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2018. A Survey on Deep Learning for Named Entity Recognition. CoRR abs/1812.09449 (2018). arXiv:1812.09449 http://arxiv.org/abs/1812.09449
[3] Michael Loster, Zhe Zuo, Felix Naumann, Oliver Maspfuhl, and Dirk Thomas. 2017. Improving Company Recognition from Unstructured Text by using Dictionaries. (2017), 10 pages.
[4] Andrei Mikheev, Claire Grover, and Marc Moens. 1998. Description of the LTG System Used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 – May 1, 1998. Fairfax, Virginia. https://www.aclweb.org/anthology/M98-1021
[5] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticæ Investigationes 30, 1 (Jan. 2007), 3–26. https://doi.org/10.1075/li.30.1.03nad
[6] L. F. Rau. 1991. Extracting company names from text. In Proceedings of the Seventh IEEE Conference on Artificial Intelligence Applications (Miami Beach, FL, USA, Feb. 1991), Vol. i. IEEE, 29–32. https://doi.org/10.1109/CAIA.1991.120841
[7] GuoDong Zhou and Jian Su. 2002. Named Entity Recognition using an HMM-based Chunk Tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 473–480. https://doi.org/10.3115/1073083.1073163