Carolina’s Methodology: building a large corpus with provenance and typology information

Mariana Lourenço Sturzeneker[0000-0002-0878-3463], Maria Clara Ramos Morales Crespo[0000-0003-1500-2749], Maria Lina de Souza Jeannine Rocha[0000-0001-9526-4786], Marcelo Finger[0000-0002-1391-1175], Maria Clara Paixão de Sousa[0000-0002-8422-417X], Vanessa Martins do Monte[0000-0002-4929-5298] and Cristiane Namiuti[0000-0002-1451-8391]

Center for Artificial Intelligence, University of São Paulo. Av. Prof. Lúcio Martins Rodrigues, 370 - 05508-020 - Butantã, São Paulo, Brazil. lavihd@usp.br

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. This paper presents the salient aspects of the WaC-wiPT methodology, developed for the construction of the Carolina Open Corpus for Linguistics and Artificial Intelligence, a large corpus of contemporary Brazilian Portuguese. Both the corpus and the methodology are under development at the Center for Artificial Intelligence of the University of São Paulo. The paper describes the paths we have taken thus far in the making of the Carolina Corpus, presents its current state and discloses the future agenda of the project.

Keywords: Open Corpus, Brazilian Portuguese, Provenance, Typology.

1 Introduction

The Carolina Open Corpus for Linguistics and Artificial Intelligence is a general corpus of contemporary Brazilian Portuguese texts written after 1970 that carries provenance and typology information. It has been under development since September 2020 as part of the Natural Language Processing for Portuguese (NLP2) project of the Center for Artificial Intelligence of the University of São Paulo (C4AI-USP).

With Carolina, we expect to build a large and reliable resource for research in both Linguistics and Computer Science, with more than a billion tokens. Unlike other corpora built under the “Web as Corpus” view, which gather large amounts of text for language modeling by retrieving it from multiple untraceable origins, Carolina is intended to curate its sources while still collecting large quantities of text. In doing so, we expect to provide further information about the texts, especially on provenance and typology [1], which benefits linguistic research.

By not tailoring the corpus to any specific linguistic application, we aim to avoid restraining the possibilities of future projects or posing obstacles to researchers interested in studying a wide range of language aspects, such as typological characteristics, word collocation, language detection and Historical Linguistics. To reach this goal, we developed the WaC-wiPT (Web-as-Corpus with Provenance and Typology) methodology, which combines the automation and large scale of language-modeling corpora with the careful text-information curatorship of smaller linguistic corpora.

2 Related works

Over the last decades, corpus-building initiatives have increasingly resorted to the Web as their main source. As part of this endeavor, the WaCky (Web-As-Corpus Kool Yinitiative) methodology was developed [2, 3, 4]. As it proved to be a relatively easy and not resource-demanding method, this framework quickly became popular. There have been undertakings to apply it to the Portuguese language, such as the Brazilian Portuguese Web as Corpus (brWaC), considered to be the “biggest Brazilian Portuguese corpus available” at the time [5], with 2.68 billion tokens.
One example of a Brazilian Portuguese corpus of significant size is the Brazilian Corpus [6], with approximately one billion syntactically annotated words. There are also other important corpora outside the envisioned scope of language or size, such as the Oscar Corpus [7] and the Corpus do Português: Web/Dialects [8]. Other corpora with provenance and typology information are known, such as ReLi [9] and CETENFolha [10], albeit with smaller sizes, in accordance with their specific goals.

3 Methodology

During the first stages of our research, some effort was put into investigating the possibility of implementing pre-existing corpus-building frameworks, such as the WaCky methodology [2]. However, it does not ensure the transparency of the content that is scraped, so a post-hoc investigation is necessary [2]. This poses a challenge for provenance tracking, quality control and rights-of-use compliance, which are at the core of Carolina's objectives. Therefore, we built on the knowledge provided by this investigation to develop WaC-wiPT, a Web-as-Corpus with Provenance and Typology methodology, which is constantly being improved.

3.1 Web text prospection

The fundamental steps of the method are based on broad types of domains, the Carolina broad typology, which is not intended to reflect the textual content of our documents but rather to provide a macro-structure that guides corpus development. These broad types were defined after the surveying process described below. Currently, we are working with eight: datasets and other corpora, Brazilian Judicial branch and Legislative branch, journalistic texts, public domain works, social media (only open access content was incorporated, in compliance with each domain's license), university domains and wikis. The Carolina broad typology contrasts with the Carolina narrow typology, which aims to effectively reflect the textual types of our documents and is not yet defined, since it requires further analysis of the texts and may depend on linguistic theory; it also contrasts with the source typology, a simple typological organization that reflects exclusively what is declared by the sources and is therefore recorded only when available.

We began by conducting prospective surveys, which consist of in-depth research into each Web domain, prioritizing open access content available online. In these surveys, we verified whether the texts were within our scope, in addition to searching for metadata and mapping the basic directory structure of each domain. This first step is therefore important to help us systematize metadata for future automatic annotation, so that no important information from our sources is lost, and also to organize the provenance of the data. These surveys also facilitate the download and extraction stages, for they allow the downloaded content to be mostly deliberate and not randomly crawled.

At the beginning of the data collection step, we mirrored some websites to keep raw copies of our sources, but even in those cases we could directly extract the desired texts because of the directory-structure mapping made previously. However, most websites were not obtained with this method, since some had defense mechanisms against automated download and others could only be used partially, as they contained texts that were out of our scope or under restrictive licenses. A sketch of this survey-guided collection is given below.
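To make the survey-guided collection step more concrete, the following sketch shows how a list of URLs mapped during a prospective survey might be downloaded deliberately, keeping a raw copy of each document together with basic provenance (source URL and retrieval date). It is a minimal illustration only, under assumptions of our own: the example domain, URL list, output directory and delay value are hypothetical, and the sketch does not reproduce the project's actual collection scripts.

import json
import time
import urllib.request
from datetime import date
from pathlib import Path

# Hypothetical list of documents mapped during the prospective survey of one domain;
# in practice this would come from the mapping of the domain's directory structure.
SURVEYED_URLS = [
    "https://example.gov.br/acordaos/2021/documento-0001.html",
    "https://example.gov.br/acordaos/2021/documento-0002.html",
]

OUTPUT_DIR = Path("raw/judicial_branch")  # directory layout mirroring the broad typology
DELAY_SECONDS = 2.0                       # polite delay between requests (assumed value)


def collect(urls, output_dir):
    """Download each surveyed URL and record its provenance alongside the raw file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        raw_path = output_dir / f"doc_{i:04d}.html"
        meta_path = raw_path.with_suffix(".json")
        with urllib.request.urlopen(url) as response:
            raw_path.write_bytes(response.read())
        # Keep the provenance record next to the raw copy so the information
        # gathered in the survey survives until the extraction stage.
        provenance = {"source_url": url, "retrieved_on": date.today().isoformat()}
        meta_path.write_text(json.dumps(provenance, ensure_ascii=False, indent=2))
        time.sleep(DELAY_SECONDS)


if __name__ == "__main__":
    collect(SURVEYED_URLS, OUTPUT_DIR)

Storing the provenance record next to each raw file is one possible way of guaranteeing that nothing found in the surveys is lost before extraction; the actual pipeline may organize this differently.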
In all cases, special care was taken to verify the rights of use: during the data collection step, we downloaded exclusively open access texts and later verified whether they allowed derivative works. Should any data become the object of a copyright claim, our methodology enables the easy removal of any set of texts from the corpus.

3.2 Metadata and extraction

As we aim to build a corpus with provenance and typology information, each text is embedded in an XML file with a header carrying annotated metadata, such as source URL and license, following the TEI (Text Encoding Initiative) guidelines. To gather information for the header, after the download stage some surveys are complemented by opening a small sample of the raw documents and searching for any additional metadata. However, as we have a large number of texts, information that cannot be automatically annotated is not mandatory; thus, most categories, such as Author and Regional Origin, are optional. To make this process easier, we extract the texts in batches arranged by shared information, usually grouped by the downloaded directory structure, which mirrors that of the broad typology.

This means that, for the extraction of a batch, we provide the metadata that holds for the whole set, prioritizing the filling of the header's mandatory categories. Metadata collection and insertion are therefore made carefully, to prevent errors from being repeated across the whole batch. We centralize this process with an extraction module developed in Python 3, which obtains some metadata by input and others automatically, organizes them and generates the XML file with the clean text embedded under the header (an example of a generated XML header can be accessed at: https://sites.usp.br/corpuscarolina/exemplo/). It also verifies whether the text is valid by assessing, for example, its language and size. In addition to the traditional search by words, the structured header allows searches by tags, facilitating metadata recovery and thus providing further query tools.

In order to test our tools, we processed a portion of our raw data before our first official extraction, totaling over a billion tokens and about 24 hours of CPU time. This first test version will not be made publicly available; however, it sheds some light on what is to be expected of the first official publication in terms of size and typology distribution. The texts obtained were sampled randomly (590 files in total) by Carolina broad typology, of which only four broad types were included.

We carefully looked for problems concerning not only the cleaning process but also the metadata provided, and implemented computational solutions to improve textual quality, such as removing remaining blank lines and corrupted characters. Some recurring issues were over- or under-cleaned texts, as well as the formatting of the automatically provided data, which sometimes did not match our chosen standards. After this process, we established the importance of human inspection of the files, as some problems would not be easily identified and fixed without it. Therefore, this method of examining samples of the extracted files will be kept in the future, and machine inspections will also be made to verify that all files are well formatted. A simplified sketch of the cleaning, validation and header-generation steps is given below.
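To illustrate the kind of processing described in this subsection, the sketch below cleans a raw text (removing blank lines and control characters), applies simple validity checks on language and size, and wraps the result in a TEI-style XML file whose header records title, license and source URL. This is a rough approximation under stated assumptions: the minimum-size threshold, the exact header layout and the use of the langdetect package are placeholders, not the project's actual extraction module.

import unicodedata
import xml.etree.ElementTree as ET

from langdetect import detect  # stand-in for whatever language assessment the module uses

TEI_NS = "http://www.tei-c.org/ns/1.0"
MIN_CHARS = 500  # assumed minimum size for a text to be considered valid


def clean(raw_text):
    """Remove blank lines and control (corrupted) characters from an extracted text."""
    lines = []
    for line in raw_text.splitlines():
        line = "".join(ch for ch in line if unicodedata.category(ch)[0] != "C")
        if line.strip():
            lines.append(line.strip())
    return "\n".join(lines)


def is_valid(text):
    """Very rough validity check on size and language (Portuguese expected)."""
    return len(text) >= MIN_CHARS and detect(text) == "pt"


def build_tei_document(text, metadata):
    """Embed the clean text in a TEI-style file whose header holds the batch metadata."""
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    header = ET.SubElement(tei, f"{{{TEI_NS}}}teiHeader")
    file_desc = ET.SubElement(header, f"{{{TEI_NS}}}fileDesc")
    title_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}titleStmt")
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = metadata["title"]
    pub_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}publicationStmt")
    availability = ET.SubElement(pub_stmt, f"{{{TEI_NS}}}availability")
    ET.SubElement(availability, f"{{{TEI_NS}}}licence").text = metadata["license"]
    source_desc = ET.SubElement(file_desc, f"{{{TEI_NS}}}sourceDesc")
    bibl = ET.SubElement(source_desc, f"{{{TEI_NS}}}bibl")
    ET.SubElement(bibl, f"{{{TEI_NS}}}ref", attrib={"target": metadata["source_url"]})
    body = ET.SubElement(ET.SubElement(tei, f"{{{TEI_NS}}}text"), f"{{{TEI_NS}}}body")
    ET.SubElement(body, f"{{{TEI_NS}}}p").text = text
    return ET.ElementTree(tei)


if __name__ == "__main__":
    batch_metadata = {  # metadata that holds for the whole batch (hypothetical values)
        "title": "Sample document",
        "license": "CC BY 4.0",
        "source_url": "https://example.gov.br/acordaos/2021/documento-0001.html",
    }
    cleaned = clean(open("doc_0001.txt", encoding="utf-8").read())
    if is_valid(cleaned):
        build_tei_document(cleaned, batch_metadata).write(
            "doc_0001.xml", encoding="utf-8", xml_declaration=True
        )

In a batch setting, the metadata dictionary would be filled once per batch, matching the procedure described above, while per-document fields would be obtained automatically when available.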
4 Current state

At present, Carolina is at a prototypical stage. After the test extraction, a considerable yet expected size reduction took place in relation to the raw crawled content, which encompasses everything downloaded from each domain, including media files, source metadata and content outside the project's current scope. The main reason is that files were stripped of tags and non-content elements, aiming at texts as clean as possible, and many files did not reach a significant number of characters and were thus discarded. Table 1 illustrates these reductions.

Table 1. Test extraction results. A list of the content available in the first version of the Carolina Corpus can be accessed at: https://sites.usp.br/corpuscarolina/repositorios/

Broad typology               Raw size (GB)   XML size (GB)   Raw words        Clean-text words   CPU time (h)
Judicial branch              71              1.7             2,543,098,232    191,635,110        20.32
Datasets and other corpora   31              7.5             2,870,318,559    327,927,677        0.72
Public domain works          0.17            0.024           4,774,114        3,170,682          0.23
Wikis                        741             20              89,664,268,580   665,638,761        3.00
Total                        843.17          29.224          95,082,459,485   1,188,372,230      24.27

Based on this initial testing of our methods' performance, we have already begun the preparations for the first version, Carolina 1.0 (Ada), to be released in March 2022, following the steps of this under-development methodology.

5 Conclusion and future steps

Both the preliminary surveys and the close analysis of the samples extracted for the test version proved essential to keep track of the metadata collection and to sustain such complex text headers. Thus, our methodology helped to guarantee the provenance and typology information we aim to preserve in the automatic processes of text extraction.

As for future intentions, we wish to develop language verification tools capable of distinguishing Brazilian Portuguese from other variants and of assessing the percentage of each language used in a text, which we hope will allow studies on language contact and cultural influences. Additionally, there are plans to build a historical corpus for Philological and Historical Linguistics research based on our methodology.

For future versions of the Carolina corpus, we will work to ensure a better balancing of text types, which requires further effort on new surveys. In our understanding, this continuous development of the methodology is an essential part of the labor involved in the construction of such an ever-growing corpus.

References

1. Finger, M., Paixão de Sousa, M. C., Namiuti, C., Monte, V. M., Costa, A. S., Serras, F. R., Sturzeneker, M. L., Guets, R. P., Mesquita, R. M., Crespo, M. C. R. M., Rocha, M. L. S. J., Palma, M. F., Silva, M. M., Brasil, P.: Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance and Typology Information. Language Resources and Evaluation, submitted paper (2021).
2. Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3), 209-226 (2009).
3. Bernardini, S., Baroni, M., Evert, E.: A WaCky introduction. In: Baroni, M., Bernardini, S. (eds.) WaCky! Working papers on the web as corpus, pp. 9-40. GEDIT, Bologna (2006).
4. Ferraresi, A., Bernardini, S., Picci, G., Baroni, M.: Web corpora for bilingual lexicography: A pilot study of English/French collocation extraction and translation. In: Using Corpora in Contrastive and Translation Studies, pp. 337-362. Cambridge Scholars Publishing, Newcastle (2010).
5. Boos, R., Prestes, K., Villavicencio, A., Padró, M.: brWaC: a wacky corpus for Brazilian Portuguese. In: Baptista, J., Mamede, N., Candeias, S., Paraboni, I., Pardo, T. A. S., Volpe Nunes, M. G. (eds.) PROPOR 2014. LNCS, vol. 8775, pp. 201–206. Springer, Heidelberg (2014).
6. Sardinha, T. B., Filho, J. L. M., Alambert, E.: Manual Córpus Brasileiro, https://www.linguateca.pt/Repositorio/manual_cb.pdf, last accessed 2021/12/13.
7. Suárez, P. J. O. S., Sagot, B., Romary, L.: Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures. In: Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019, pp. 9-16. Leibniz-Institut für Deutsche Sprache, Mannheim (2019).
8. Davies, M., Ferreira, M.: Corpus do Português: Web/Dialects, https://www.corpusdoportugues.org/web-dial/, last accessed 2021/12/13.
9. Corpus ReLi, https://www.linguateca.pt/Repositorio/ReLi/, last accessed 2022/03/10.
10. CETENFolha, https://www.linguateca.pt/CETENFolha/, last accessed 2022/03/10.