=Paper= {{Paper |id=Vol-2029/paper16 |storemode=property |title=Data Science Approach to Analysis of Lattes CV Data |pdfUrl=https://ceur-ws.org/Vol-2029/paper16.pdf |volume=Vol-2029 |authors=Thiago Luís Viana de Santana,Rafael Santos |dblpUrl=https://dblp.org/rec/conf/simbig/SantanaS17 }} ==Data Science Approach to Analysis of Lattes CV Data== https://ceur-ws.org/Vol-2029/paper16.pdf
                     Data Science Approach to Analysis of Lattes CV Data


               Thiago Luís Viana de Santana                Rafael Santos
             Applied Computing Graduate Program Applied Computing Graduate Program
             INPE – Nat. Inst. for Space Research INPE – Nat. Inst. for Space Research
               thiago.santana@inpe.br               rafael.santos@inpe.br




Abstract

The Lattes Platform is an online database of academic records. It is used by
the research and educational community of Brazil (and of some other
countries) and is of great value for identifying researchers and their
relationships with other researchers; for that reason, it can be considered a
specialized kind of social network.

In spite of its usefulness, the main interface for accessing its data does not
allow any type of analysis, just basic reports. In this paper, we present a
new tool and approach for the analysis of groups of records from the Lattes
Platform, which is simpler and more flexible than similar tools proposed in
the past.

1       Introduction

Evaluating the productivity of researchers and students, either individually
or in groups, is an important task for universities, research centers, and
funding agencies, and also for the students and researchers themselves:
students may be interested in knowing more about the achievements, research
areas and experience of prospective advisors. Teachers and admission deans
may also want to know about the academic history of candidates for a graduate
program, for example. The usual method of evaluating academics and students is
to analyze the achievements and publications listed on their curriculum vitae
(CV).

The Brazilian National Council of Technological and Scientific Development
(Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq)
maintains an online system, the Lattes Platform¹ (named in honor of César
Lattes, a Brazilian physicist), that provides a unified interface to a
database used to collect, store and process information about academic
achievements. Any researcher can create and maintain his or her own Lattes
CV, using the taxonomy, formats and fields defined by the platform. The
unification of some fields and categories makes it easy to fill in the forms
that feed the database.

The Lattes Platform is also used by CNPq to generate reports about the current
status of the academic production of researchers and students, and to
evaluate applications for several different types of grants. The data on the
platform is also used by other government funding agencies and by the
Ministry of Education to evaluate the production of professors and students
in graduate programs.

The public interface of the Lattes Platform is a web-based system that allows
CVs to be edited by their owners and to be searched and retrieved by anyone
who knows either the researcher’s name or ID (a 16-digit unique identifier).
At the end of 2016 there were more than 3,500,000 CVs stored in the
database². Of those, almost 1,500,000 were CVs of students.

    ¹ http://lattes.cnpq.br
    ² http://estatico.cnpq.br/painelLattes/

Data in the Lattes Platform can also be used for other academic purposes:
analysis of the evolution of academic indicators (Perez-Cervantes et al.,
2012), identification of communities based on similar interests or
collaborations (Mena-Chalco et al., 2014; Araújo et al., 2014; Alves et al.,
2011b), changes of research areas based on publication records, etc. Analyses
considering groups of researchers or students can be done at different
scales, from the whole academic community to small groups, such as
researchers in a group or students in a college. But although the Lattes CV
data is considered public (being provided by the




researchers and students themselves), retrieval of the data is limited: it is
possible to download a full individual CV as an XML file with all the data
entered on that CV, but it is not possible to retrieve subsets of the data for
more than one CV at a time. This makes it hard to perform specific types of
analysis that require the extraction of certain categories from several CVs
at once.

In this paper we present our work on tools and techniques that allow the
extraction and analysis of data from collections of Lattes CVs. One
characteristic of these techniques is that they treat the extraction of data
from a set of Lattes CVs as part of a data science process (Schutt and O’Neil,
2013): raw data (the Lattes CV XMLs) is collected, processed and cleaned,
allowing an analyst to use exploratory data analysis techniques and apply
statistical or other models to it. The analyst has access to the data and to
the tools to process it in a simple but flexible environment, and is
therefore not limited to a prepackaged set of tools. Results of the analysis
are presented as charts, plots, reports and data products derived from the
set of Lattes CVs considered in the analysis, which may be used as input to
other analysis tasks.

This paper is organized as follows: Section 2 presents related work, mainly on
other tools used to extract information from Lattes CVs. Section 3 sets
requirements for a Lattes CV exploration tool and proposes a data science
based approach to build this tool. In Section 4, the Lattes CV exploration
tool is used and a few Exploratory Data Analysis visualization examples are
shown. Finally, Section 5 enumerates problems outside the Exploratory Data
Analysis framework that can be solved by extending the LattesLab tool.

2   Related Work

Access to subsets of data on Lattes CVs is desirable for several different
types of analysis, and often these analyses must be done considering
collections and not individual CVs. Considering this need, several different
tools were created in the past to process and analyze data from the Lattes
Platform.

One of the first tools that allowed the extraction of information from a
collection of Lattes CVs is scriptLattes (Mena-Chalco et al., 2009). This tool
allowed the extraction of data from the Lattes CVs of groups of researchers,
creating reports, maps, graphs and other information from the collections of
CVs. The tool could be deployed as a local application on computers running
Linux, and its authors made the tool open so other groups could use it. Other
research groups used scriptLattes as a basis to create different types of
analyses (Alves et al., 2016; Perez-Cervantes et al., 2012).

Initially scriptLattes downloaded the data from CNPq’s servers as HTML files,
but with the later adoption of a CAPTCHA (“Completely Automated Public Turing
test to tell Computers and Humans Apart”) access control system, automatic
download was made very difficult, so it was modified to use a local set of
files that must be downloaded in advance.

scriptLattes is probably the most referenced tool in the bibliography we
surveyed. It is quite complete, but the data is parsed from the HTML version
of the Lattes CVs, which has changed in the past and may change in the
future. Reports and graphics are also preprogrammed, so extensions and
different layouts must be programmed separately.

LattesExtrator³ is a tool developed by CNPq that allows the download of
several Lattes XML CVs in batches. Although it seems to solve part of the
problem at hand, namely, how to obtain subsets of the data, it is not as open
or as flexible: only registered organizations can retrieve data with this
tool, and organizations can only access data that is related to the
organization itself. For example, a university may be able to download all
the XML files with the CVs of its staff, teachers and students, but will not
be able to download CVs of collaborators that don’t currently work or study
at the university. Similarly, when a student graduates and leaves the
university, his or her CV will no longer be available (since he or she is no
longer part of the university). This tool also does not perform any kind of
analysis, providing only the XML files.

    ³ http://lattesextrator.cnpq.br/lattesextrator/

SUCUPIRA (Alves et al., 2011b) was developed as a tool that allowed both the
semi-automatic extraction of the XML files from the Lattes Platform and the
creation of reports and graphics that could answer questions about
collaboration between researchers, their geographical location, their
scientific production and its evolution, etc. SUCUPIRA is a web-based
application that uses a list of names of researchers or students, managed by
the system’s user, to download the




Lattes CVs (as HTML files), parse those and create reports based on the data
extracted from the CVs on that list.

It seems that the development of that tool was discontinued, and changes on
the Lattes Platform may have made it unusable: the structure of the HTML
files changed, so parsers written for one version of the HTML generated by
the platform had their usefulness restricted by the new layout. Additionally,
SUCUPIRA was written when access to the Lattes CVs was unhindered by
CAPTCHAs. For some time the platform used simple CAPTCHAs that could be
solved with tools such as Tesseract (Kay, 2007), but recently the CAPTCHAs
were made more difficult to solve automatically.

SUCUPIRA used another tool developed by the same researchers: LattesMiner
(Alves et al., 2011a). LattesMiner is a Domain-Specific Language (DSL)
implemented as a set of Java classes that allows the manipulation of data on
a set of Lattes CVs, defined by a programmer, with modules for data discovery
(association of names and IDs), data extraction (parsing of the HTML files
corresponding to the CVs with regular expressions), storage of data in a
local database, and visualization and analysis tools. As with SUCUPIRA,
LattesMiner’s development has stopped, since changes on the Lattes Platform
rendered some aspects of the tool unusable.

Another tool that is concerned with processing data from the Lattes Platform
is XMLattes (Fernandes et al., 2011), which converts the Lattes CVs in HTML
to XML for further processing, and which was rendered unnecessary since the
present version of the Lattes Platform already exports the XML version of the
curriculum vitae (although requiring CAPTCHAs for download of individual
CVs).

As part of this research we reviewed several papers related to the analysis
of Lattes CV data (Digiampietri et al., 2012; Mena-Chalco et al., 2014;
Araújo et al., 2014; Perez-Cervantes et al., 2012). Most of these papers used
a database with detailed information on academics, which was extracted from
the Lattes Platform when it was possible to do so without the limitations
imposed by the CAPTCHA currently in use. There were no references on whether
that database was kept up to date.

3     A Data Science-based Approach to Analysis of Lattes CV Data

So far, we have listed a number of tools for exploring Lattes CVs. Some of
them cannot be used as designed for the following reasons:

• The Lattes Platform has updated its data publishing technology from HTML to
  XML. Therefore, all the tools that relied on the previous file distribution
  (often requiring complex procedures to parse HTML) no longer work properly.
  It seems a trivial technical issue, but the tools that extracted
  information from the Lattes CVs formatted as HTML documents had to deal
  with complex HTML structures and had to detect, from the HTML content
  itself, the categories of information being extracted (e.g. articles in
  journals, conferences, titles, authors, etc.), while the data represented
  as XML is properly formatted and tagged with this information. Therefore,
  even though the move to XML restricted the use of the HTML-based tools, it
  provided structural elements for the development of new, more robust tools.

• Some of these tools were designed to automatically download a list of CVs
  from CNPq’s site. This is not possible today due to the implementation of
  the CAPTCHA test, both to view and to download the Lattes CVs.

• Some of the solutions are only available on specific platforms, such as
  Linux, and require a specific setup before use.

Given the present status of the existing tools for Lattes CV analysis, we
consider that a new, functioning tool to extract, interpret, analyze and
visualize the data in a simple but flexible way ought to comply with the
following requirements:

1. Work with an offline set of Lattes CVs. Due to the current impossibility
   of automated batch download of Lattes CVs, the CVs must be obtained either
   manually or automatically through other authorized tools, such as
   LattesExtrator.

2. Be able to transparently transform the list of Lattes CVs’ XML files into
   table-like data structures for further processing. To make this
   transformation, it is necessary to know



   the structure of the Lattes CV XML, identify the parameters of interest to
   the researcher and migrate these parameters to the data structures. This
   transformation is made easier since CNPq publishes the XML Schema
   Definition (XSD) of the Lattes CV files.

3. Be agnostic with respect to the operating system used to run the tool.
   Each potential user has limited resources, and requiring the user to learn
   a new programming language, install new software or even a whole new
   operating system should be avoided if the tool is to be widely used.

4. Require no specialized knowledge for operation. In the same way that
   requiring a specific environment limits the use of the tool, if
   specialized knowledge is not required to operate it, the tool will be
   easier to use and may therefore be used by a larger number of individuals.
   At the same time the tool must be extensible so more advanced users can do
   more with it.

5. The results produced by this tool should be reproducible by any user
   interested in the analysis of a set of Lattes CVs. Reproducibility is
   ensured by the use of a common set of instructions that can be easily
   shared and built upon.

The first three requirements on that list are strictly technical, and must be
met by a Lattes analysis tool to deal with the complexity of the Lattes CVs’
data access and representation issues. More important are the fourth and
fifth requirements, which ensure that such a tool can be extended for
different ways to explore the data and that the results can be reproduced and
shared.

By designing a tool that follows these requirements it is possible to apply a
data science process to the problem of analysis of collections of Lattes CVs.
The data science process (Schutt and O’Neil, 2013) is shown in Figure 1.

Figure 1: Data Science Process (Schutt and O’Neil, 2013).

The requirements listed above are directly related to the Data Science
Process: the first requirement refers to the collection of data, and the
second requirement is related to the processing and cleaning of data as well
as the storage of the “clean” data, so that it can be easily and
transparently accessed. Even though the fourth and fifth requirements are not
directly associated with the steps shown in the Data Science process
(Figure 1), they serve as a guideline to ensure that the results attained by
the tool are easy to acquire and customize, and also reproducible.

The fourth requirement listed in this section is directly related to the
concept of Exploratory Data Analysis (EDA), whose intent is to allow the
researcher to discover patterns in data by using visualization tools and
statistics to understand “what is going on with this data”.

Analysis of Lattes CVs can be done in different ways, using different metrics
and algorithms, but if we consider that most of the analyses will be done
considering thematic groups of researchers (e.g. researchers in a specific
area of knowledge, or professors and students of a specific department) it
becomes clear that methods and techniques applied to a particular analysis
can be used in different contexts, depending on the group. A tool for
analysis of Lattes CV collections must make its results easy to reproduce;
reproducible research is also a concept closely linked to Data Science.

According to Peng (2011), there is a spectrum of reproducibility of research,
which ranges from a non-reproducible result to a fully reproducible one
(Figure 2).

Figure 2: Reproducibility spectrum of a scientific publication (Peng, 2011).

Considering Figure 2, it is highly desirable that a tool performing Lattes CV
analysis allows the



full reproduction of the data analysis experiments with the corresponding
code being publishable and applicable to different datasets of the same
nature.

3.1 LattesLab

In order to address the requirements listed in the previous section, we
propose a software stack solution, based on concepts and principles of Data
Science, to tackle the generic problem of analyzing Lattes CVs. Our tool,
named LattesLab, is based on the following components:

• A library that is able to scan a collection of Lattes CVs (stored as local
  files) and create a set of data frames (table-like structures) from that
  collection.

• A deployment mechanism for that library that allows its use with minimal
  software installation requirements.

• A set of live documents that show how to perform basic statistical
  analysis, visualizations and reports.

LattesLab is being developed in Python (VanRossum and Drake, 2010), a widely
used language and platform, taking into account its large developer and user
community. It is also one of the most used programming languages for solving
Data Science problems, due to the large number of free and open analysis and
visualization libraries.

Other languages were considered for implementation. A previous version was
developed in Java, but we found that distributing the library for use in
other derived projects was too complex. R (Ihaka and Gentleman, 1996) was
also considered: even though the R language is strongly supported by its
community and is heavily used by the scientific community, Python was chosen
for the readability of its code (over R code, at least) and its gradual
learning curve.

To use LattesLab, it is necessary to have all the Lattes CV files of interest
stored in a folder and to pass the folder name as a variable to the LattesLab
main library. Then, the tool reads the CV files as downloaded from the Lattes
Platform, parsing the XMLs and storing the data available in data frames.

To extract data from the Lattes CV, XPath (the query language used by XSLT, a
language for transforming XML documents) was used. One of the major
advantages of the Lattes CV is its XML structure (which allows the extraction
of semantic information from it) and the fact that this structure is
described in a public XML Schema Definition. By accessing the XML file
through the XPath language one can extract the desired information and use it
accordingly. In the presented case, the extracted information is used to
generate a data frame which will then be used to perform a few basic
analyses.

To obtain meaningful results from the analysis, it is desirable that the set
of Lattes CVs share some characteristics. For example, to perform an analysis
of the students and researchers of one institution, it is necessary to have
the Lattes CVs of members of that institution stored and subject to analysis
by LattesLab. Other thematic collections could be researchers of all
institutions sharing some CNPq classification (e.g. CNPq grantees), or
researchers that stated that they work in a specific knowledge area.

The LattesLab main library is packaged as a Python package, which simplifies
its deployment. It can be downloaded and used in a standalone way; in this
case the user must be able to at least install a Python IDE and then install
and import the package.

Another, more straightforward deployment solution is to deploy LattesLab as a
Jupyter notebook (Kluyver et al., 2016). Jupyter notebooks are web
applications that allow the creation and sharing of documents that contain
live code, equations, visualizations and explanatory text. The non-static
parts of these documents are created by the execution of Python code.

Notebooks are an interesting solution not only because they make access to
the information easier, but also because they run on any operating system
with a browser installed, and because they work not only with Python but also
with R, Scala, Julia and others, over 40 programming languages in all.

Jupyter notebooks allow not only the deployment of code but also, in the same
document, formatted text (with different styles and allowing the use of
hypertext, graphics, etc.) related to that code. Therefore, by using a
Jupyter notebook, it is possible to run LattesLab code, to perform analyses
and visualizations, to explain what was done and to give instructions to the
users, so that they do not have to know the tool beforehand to use it and can
change parameters or modify analyses to suit specific needs. Figure 3 shows a
Jupyter



notebook running LattesLab.

Figure 3: Example of a Jupyter Notebook window running the LattesLab tool.

When considering how to develop and deploy a tool for analysis of Lattes CV
data we could opt for a standalone, GUI-based application. The reason for
working on a tool driven by interactive lines of code is the flexibility that
such a tool provides. Consider a simple analysis that requires the selection
of a date range on a set of CVs: a GUI must provide a widget to allow the
input of an initial and an end year, which is quite simple to implement and
use, and a programming approach would require one or two lines of code to
implement the same functionality. But for more complex filters, e.g. to
select a non-contiguous range of dates, a GUI dialog would be more complex
for the user (probably implemented as a list of checkboxes, one for each
year) than one or two lines of code that filter a data set by a list of
years.

A programming environment, while more complex, gives more freedom to the user
to implement filters, apply visual effects on graphics and use third-party
tools, but the most important reason to avoid a GUI-based approach is the
ease of reproducibility: the chain of commands that produces an analysis from
a dataset can be expressed in code, which can be documented and read by
users, while GUI-based applications would require gestures (clicks, scrolls,
inputs) that must be preserved somehow to allow reproduction of the analysis.

4   Examples of Analysis Reports

For the following examples, a group of 876 Lattes CVs was downloaded from the
Lattes Platform. These CVs belong to participants in the Scientific
Initiation grant program at the Brazilian National Institute for Space
Research (INPE). These grants are given to undergraduate students to
participate in research and development at the institute. The LattesLab
library scanned and parsed these CV files to create a data frame used in the
analysis and examples in this section.

Lattes CVs are created and updated by the users, and some may stop updating
theirs, especially if they are no longer involved in academic environments. A
basic but interesting question we could ask about our data is: how old are
the CVs, i.e. when were they last updated?

We used LattesLab to create a histogram of the age of the CVs (the “last
updated” information is present on the CVs).

Figure 4: Age of Lattes CV in months.

In order to give an idea of the simplicity of using Jupyter notebooks and the
LattesLab library, the plot in Figure 4 was created with three lines of code,
once the data frame was created and loaded (to import data from the Lattes
CVs and create the data frame, approximately two hundred lines of code were
used, but these are not shown to users).

Scientific Initiation grants are given for a period of 12 months. It is
possible for an undergraduate student to reapply for a grant, as long as he
or she is enrolled in an undergraduate program. We knew that some of the
students had held more than one grant, but wanted to get some statistics on
it. A simple histogram was created, and it is shown



                                                     173
in Figure 5. Surprisingly, there were students that
held grants for five and six years – that was unex-
pected since the average of the duration of under-
graduate technical courses in Brazil is five years.
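   The filtering argument made earlier in this section (a non-contiguous
selection of years takes about as much code as a contiguous range) can be
illustrated with a minimal sketch. It is not LattesLab code: the record
layout and the field names `cv_id` and `year` are assumptions, and a plain
Python list stands in for the data frame.

```python
# Hypothetical records standing in for rows of a CV data frame; the field
# names ("cv_id", "year") are assumptions made for this illustration.
records = [
    {"cv_id": "A", "year": 2009},
    {"cv_id": "B", "year": 2010},
    {"cv_id": "C", "year": 2012},
    {"cv_id": "D", "year": 2013},
    {"cv_id": "E", "year": 2015},
]


def filter_by_range(rows, first_year, last_year):
    """Keep rows whose year lies in the closed range [first_year, last_year]."""
    return [row for row in rows if first_year <= row["year"] <= last_year]


def filter_by_years(rows, years):
    """Keep rows whose year is in an arbitrary, possibly non-contiguous, list."""
    wanted = set(years)
    return [row for row in rows if row["year"] in wanted]


# The contiguous case a GUI handles with two input fields.
in_range = filter_by_range(records, 2010, 2013)

# The non-contiguous case that would need one checkbox per year in a GUI.
selected = filter_by_years(records, [2009, 2012, 2015])

print([row["cv_id"] for row in in_range])
print([row["cv_id"] for row in selected])
```

Both filters are a single expression, and either call can be saved in a
notebook cell and rerun later, which is the reproducibility point made
above.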




                                                         Figure 6: Academic Level of the Researchers.
Figure 5: Number of Scientific Research Grants
per Student.

   The plot above was created with five lines of
code in a Jupyter notebook.
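   A histogram of this kind needs only a few lines once the grants are
available in tabular form. The sketch below is an illustration, not the
notebook code used for the figure: the `grants` list and its (student,
year) layout are assumptions, and a row of `#` characters stands in for
each plotted bar.

```python
from collections import Counter

# Hypothetical input: one (student_id, grant_year) pair per 12-month grant.
grants = [
    ("s1", 2005), ("s1", 2006),
    ("s2", 2005),
    ("s3", 2007), ("s3", 2008), ("s3", 2009),
    ("s4", 2010),
]

# Step 1: how many grants each student held.
grants_per_student = Counter(student for student, _year in grants)

# Step 2: how many students held 1, 2, 3, ... grants (the histogram bins).
histogram = Counter(grants_per_student.values())

# Step 3: a text rendering in place of the plotted bars.
for n_grants in sorted(histogram):
    n_students = histogram[n_grants]
    print(f"{n_grants} grant(s): {'#' * n_students} ({n_students})")
```

Students with an unexpectedly high number of grants can then be listed by
filtering `grants_per_student` on its values.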
Figure 7: Master Degrees obtained per year.

   With that thematic set of Lattes CV data it is possible to analyze the
academic achievements and get a glimpse of the careers of the students who
held Scientific Initiation grants. How many of them decided on an academic
career after their undergraduate studies? Figure 6 shows, for the Lattes
CVs in our data set, how many individuals took part in different academic
activities and which academic degrees they achieved. An individual can be
counted in more than one bar of the graph, so individuals with
Post-doctorate degrees are also found in the PhD degrees bar.
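   The overlap between bars described above can be made explicit with a
short sketch. It is not taken from LattesLab: the level names and the
`achievements` mapping are hypothetical, and each CV is simply counted once
in every level it declares.

```python
from collections import Counter

# Hypothetical data: the set of academic levels declared in each CV.
achievements = {
    "cv1": {"Undergraduate"},
    "cv2": {"Undergraduate", "Masters"},
    "cv3": {"Undergraduate", "Masters", "PhD"},
    "cv4": {"Undergraduate", "Masters", "PhD", "Post-doctorate"},
    "cv5": {"Undergraduate", "PhD"},
}

# Each CV contributes to every bar whose level it declares, so the bars
# are not mutually exclusive.
bar_counts = Counter(
    level for levels in achievements.values() for level in levels
)

for level in ["Undergraduate", "Masters", "PhD", "Post-doctorate"]:
    print(f"{level}: {bar_counts[level]}")

# Because of the overlap, the bars add up to more than the number of CVs.
print(sum(bar_counts.values()), "bar entries for", len(achievements), "CVs")
```

This is why the heights of the bars in such a chart should not be read as
a partition of the data set.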
   Many different types of visualizations can be easily achieved using
LattesLab: Figure 7 shows how many of the students in our thematic data set
obtained their Master's degrees in each year. Of course, the interpretation
of these graphics depends heavily on the environment where the data was
collected: one could infer from Figure 7 that the number of degrees awarded
is increasing over the years, or that in general more degrees are awarded
in even years, but the data itself does not explain why this is happening.
It must be pointed out that these are only some examples of the Exploratory
Data Analysis and visualizations that can be achieved with LattesLab.
   Data frames generated with the LattesLab library also contain statistics
on publications as declared in the Lattes CVs. Publication counts are
stored by type and year of publication. One indicator of interest would be
the evolution of the number of papers of a given researcher, or of the
entire research group, over the years.
   We could consider that there are different profiles for Scientific
Initiation grantees, depending on whether the grantee was able to publish
his or her work some time after receiving the grant, and on the number of
papers published per year.
   In order to perform this type of analysis, a feature vector (counting
the number of papers published in conferences and journals in each year
after the student received his or her first grant)
was used. This is one of the many possible ways to analyze individual and
group publication indexes, and it can be easily extracted from the data
frame obtained from the LattesLab library.
   We know that some Scientific Initiation grantees did not pursue further
academic activities; it is to be expected that these students did not
publish their results or, if they did, did so one or two years after being
awarded the grant. This is one profile we expect from our data and feature
vector. Are there others? What are the most interesting or unexpected
profiles?
   We used the Fuzzy C-Means clustering algorithm (Bezdek et al., 1984) to
group the 876 profiles extracted from the grantees' CVs into nine different
groups (there are metrics that can be used to indicate the best number of
groups for clustering a data set, but these metrics are sometimes
conflicting and inconclusive (Morais et al., 2015)).
   The centroids obtained from the profiles are shown in Figure 8. That
figure shows some interesting patterns. One, already expected, shows
grantees who did not publish any paper between being awarded the grant and
12 years after the award. Other patterns are also interesting: the
clustering algorithm identified six patterns where the grantee did not
publish in the first year (also expected), published one or more papers in
the second year and fewer and fewer in subsequent years.

Figure 8: Model with all data classified in nine centroids.

   Some of the patterns shown in Figure 8 indicate that the grantees kept
publishing for years after being awarded the grant.
   It is possible to see a large cluster corresponding to grantees that had
no publications over the years considered in this example, the one pattern
we knew was in our data. To keep with the Exploratory Data Analysis
approach we eliminated from the data frame (again, a simple operation since
the Lattes data was converted to a data frame) those grantees who had never
published, and ran the Fuzzy C-Means algorithm with eight clusters. The
resulting centroids are shown in Figure 9.

Figure 9: Model with all non-zero data classified in eight centroids.

   As expected, the profiles (centroids) in Figures 8 and 9 are very
similar, indicating that the removal of the profile with zero publications
did not change the clustering results much. Further analysis could be
performed to better characterize the remaining profiles, or to investigate
different profiles that may appear when we consider only publications made
later than four years after the grant was awarded.

5    Conclusion and Future Work
As shown in the previous section, the LattesLab library has the capability
to generate a data frame containing most of the metadata and quantitative
data (counts for categories and years) of a local, thematic collection of
Lattes CVs. Different types of analysis can be easily done when combining
the library with other Python libraries in a standalone application or
Jupyter notebook. Other interesting reports and analyses that can be done
with this kind of data are:

  • List CVs from the local thematic collection that must be downloaded
    again (based on the age of the CV).

  • Cluster CVs by different criteria using different methods more suited
    to Exploratory
    Data Analysis, such as the Kohonen Self-Organizing Networks for
    visualization (Morais et al., 2015).

  • Generate a histogram of all the publications, per category and year,
    of the researchers in a specific group.

  • Based on the previous task, create simulations that include or exclude
    certain members of the groups, to evaluate what-if scenarios of
    researchers leaving groups or departments.

  • Again based on the previous tasks, compare two subsets of Lattes CVs by
    yearly production, averaged by the number of researchers in each group,
    for evaluation of publications between departments or universities.

   It must be pointed out that these are actual requests from some
coordinators of graduate programs and heads of departments who are acting
as beta testers/evaluators of the tool.
   Lattes CVs can be considered social networks in the sense that
co-publications, co-orientations and participation in events as organizers
or in committees can be extracted from the data, since coauthors are listed
and sometimes identified by a unique ID used in the Lattes database. Future
versions of the library will have the capability of extracting a
co-occurrence matrix of IDs that identify categories and times of
collaborations between members of a group. This could be used to explore
the social network aspect of the Lattes CVs (to find cliques, temporal
changes between groups, etc.).
   Another improvement being considered is the creation of another type of
data frame that represents the textual information associated with each
researcher's publications: paper titles, names of conferences, etc. This
could be used to identify areas of interest and keywords through text
mining techniques, making it possible to explore other ways to consider
similarity between researchers.

References

Alexandre D Alves, Horacio H Yanasse, and Nei Y Soma. 2011a. Lattesminer:
  a multilingual dsl for information extraction from lattes platform. In
  Proceedings of the compilation of the co-located workshops on DSM’11,
  TMC’11, AGERE! 2011, AOOPES’11, NEAT’11, & VMIL’11. ACM, pages 85–92.

Alexandre Donizeti Alves, Horacio Hideki Yanasse, and Nei Yoshihiro Soma.
  2011b. Sucupira: a system for information extraction of the lattes
  platform to identify academic social networks. In Information Systems
  and Technologies (CISTI), 2011 6th Iberian Conference on. IEEE, pages
  1–6.

Wonder AL Alves, Saulo D Santos, and Pedro HT Schimit. 2016. Hierarchical
  clustering based on reports generated by scriptlattes. In IFIP
  International Conference on Advances in Production Management Systems.
  Springer, pages 28–35.

Eduardo B Araújo, André A Moreira, Vasco Furtado, Tarcisio HC Pequeno, and
  José S Andrade Jr. 2014. Collaboration networks from a large cv
  database: dynamics, topology and bonus impact. PloS one 9(3):e90537.

James C Bezdek, Robert Ehrlich, and William Full. 1984. Fcm: The fuzzy
  c-means clustering algorithm. Computers & Geosciences 10(2-3):191–203.

L Digiampietri, J Mena-Chalco, J de Jésus Pérez-Alcázar, Esteban F Tuesta,
  K Delgado, and Rogério Mugnaini. 2012. Minerando e caracterizando dados
  de currículos lattes. In Brazilian Workshop on Social Network Analysis
  and Mining (BraSNAM).

Gustavo de O Fernandes, Jonice de O Sampaio, and JM Souza. 2011. Xmlattes
  a tool for importing and exporting curricula data. In International
  Conference on Information and Knowledge Engineering.

Ross Ihaka and Robert Gentleman. 1996. R: a language for data analysis and
  graphics. Journal of computational and graphical statistics
  5(3):299–314.

Anthony Kay. 2007. Tesseract: an open-source optical character recognition
  engine. Linux Journal 2007(159):2.

Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger,
  Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick,
  Jason Grout, Sylvain Corlay, et al. 2016. Jupyter notebooks – a
  publishing format for reproducible computational workflows. Positioning
  and Power in Academic Publishing: Players, Agents and Agendas page 87.

Jesús Pascual Mena-Chalco, Luciano Antonio Digiampietri, Fabrício Martins
  Lopes, and Roberto Marcondes Cesar. 2014. Brazilian bibliometric
  coauthorship networks. Journal of the Association for Information
  Science and Technology 65(7):1424–1445.

Jesús Pascual Mena-Chalco, Cesar Junior, and Roberto Marcondes. 2009.
  Scriptlattes: an open-source knowledge extraction system from the lattes
  platform. Journal of the Brazilian Computer Society 15(4):31–39.

Alessandra Marli M Morais, Rafael DC Santos, and M Jordan Raddick. 2015.
  Visualization of citizen science volunteers’ behaviors with data from
  usage logs. Computing in Science & Engineering 17(4):42–50.

R. D. Peng. 2011. Reproducible research in computational science. Science
  334(6060):1226–1227. https://doi.org/10.1126/science.1213847.

Evelyn Perez-Cervantes, Jesús P Mena-Chalco, and Roberto M Cesar. 2012.
  Towards a quantitative academic internationalization assessment of
  brazilian research groups. In E-Science (e-Science), 2012 IEEE 8th
  International Conference on. IEEE, pages 1–8.

Rachel Schutt and Cathy O’Neil. 2013. Doing data science: Straight talk
  from the frontline. O’Reilly Media, Inc.

Guido VanRossum and Fred L Drake. 2010. The python language reference.
  Python Software Foundation Amsterdam, Netherlands.