=Paper=
{{Paper
|id=Vol-2029/paper16
|storemode=property
|title=Data Science Approach to Analysis of Lattes CV Data
|pdfUrl=https://ceur-ws.org/Vol-2029/paper16.pdf
|volume=Vol-2029
|authors=Thiago Luís Viana de Santana,Rafael Santos
|dblpUrl=https://dblp.org/rec/conf/simbig/SantanaS17
}}
==Data Science Approach to Analysis of Lattes CV Data==
Data Science Approach to Analysis of Lattes CV Data
Thiago Luís Viana de Santana, Rafael Santos
Applied Computing Graduate Program, INPE – National Institute for Space Research
thiago.santana@inpe.br, rafael.santos@inpe.br

Abstract

The Lattes Platform is an online database of academic records. It is used by the research and educational community of Brazil (and some other countries), being of great value for the identification of researchers and their relationships with other researchers; for that reason, it can be considered a specialized kind of social network.

In spite of its usefulness, the main interface for access to its data does not allow any type of analysis, just basic reports. In this paper, we present a new tool and approach for the analysis of groups of records from the Lattes Platform which is simpler and more flexible than other similar tools proposed in the past.

1 Introduction

Evaluation of the productivity of researchers and students, either individually or in groups, is an important task for universities, research centers, and funding agencies, and also for the students and researchers themselves: students may be interested in knowing more about the achievements, research areas and experiences of prospective advisors. Teachers and admission deans may also want to know about the academic history of candidates to a graduate program, for example. The usual method to evaluate academics and students is by analysis of the achievements and publications listed on their curriculum vitae (CV).

The Brazilian National Council of Technological and Scientific Development (Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq) maintains an online system, the Lattes Platform (http://lattes.cnpq.br, named in honor of César Lattes, a Brazilian physicist), that provides a unified interface to a database used to collect, store and process information about academic achievements. Any researcher can create and maintain his or her own Lattes CV, using the taxonomy, formats and fields defined by the platform. The unification of some fields and categories makes it easy to fill the forms that feed the database.

The Lattes Platform is also used by CNPq to generate reports about the current status of the academic production of researchers and students, and to evaluate applications to several different types of grants. The data on the platform is also used by other government funding agencies and by the Ministry of Education for the evaluation of the production of professors and students in graduate programs.

The Lattes Platform public interface is a web-based system that allows the edition of the CVs by their owners and the search and retrieval of the CVs by anyone who knows either the researchers' names or IDs (a 16-digit unique identifier). At the end of 2016 there were more than 3,500,000 CVs stored in the database (http://estatico.cnpq.br/painelLattes/). Of those, almost 1,500,000 were CVs of students.

Data in the Lattes Platform can also be used for other academic purposes: analysis of the evolution of academic indicators (Perez-Cervantes et al., 2012), identification of communities based on similar interests or collaborations (Mena-Chalco et al., 2014; Araújo et al., 2014; Alves et al., 2011b), changes of research areas based on the publication records, etc. Analysis considering groups of researchers or students can be done at different scales, from the whole academic community to small groups, such as researchers in a group or students in a college.
But although the Lattes CV data is considered public (being provided by the researchers and students themselves), retrieval of the data is limited: it is possible to download a full individual CV as an XML file with all the data entered on that CV, but it is not possible to retrieve subsets of the data for more than one CV at a time. This makes it hard to perform some specific types of analysis that require the extraction of certain categories from several CVs at once.

In this paper we present our work on tools and techniques that allow the extraction and analysis of data from collections of Lattes CVs. One characteristic of these techniques is that they consider the extraction of data from a set of Lattes CVs as part of a data science process (Schutt and O'Neil, 2013): raw data (the Lattes CV XMLs) is collected, processed and cleaned, allowing an analyst to use exploratory data analysis techniques and apply statistical or other models on it. The analyst has access to the data and the tools to process it in a simple but flexible environment, and therefore he or she isn't limited to a packaged set of tools. Results of the analysis are presented as charts, plots, reports and data products, derived from the set of Lattes CVs considered in the analysis, and may be used as input to other analysis tasks.

This paper is divided as follows: Section 2 presents related work, mainly on other tools used to extract information from Lattes CVs. Section 3 sets requirements for a Lattes CV exploration tool and proposes a data science based approach to build this tool. In Section 4, the Lattes CV exploration tool is used and a few Exploratory Data Analysis visualization examples are exhibited. Finally, Section 5 enumerates problems outside the Exploratory Data Analysis framework that can be solved by extending the LattesLab tool.

2 Related Work

Access to subsets of data on Lattes CVs is desirable for several different types of analysis, and often these analyses must be done considering collections and not individual CVs. Considering this need, several different tools were created in the past to process and analyze data from the Lattes Platform.

One of the first tools that allowed the extraction of information from a collection of Lattes CVs is scriptLattes (Mena-Chalco et al., 2009). This tool allowed the extraction of data from the Lattes CVs of groups of researchers, creating reports, maps, graphs and other information from the collections of CVs. The tool could be deployed as a local application on computers running Linux, and its authors made the tool open so other groups could use it. Other research groups used scriptLattes as a basis to create different types of analyses (Alves et al., 2016; Perez-Cervantes et al., 2012).

Initially scriptLattes downloaded the data from CNPq's servers as HTML files, but with the later adoption of a CAPTCHA ("Completely Automated Public Turing test to tell Computers and Humans Apart") access control system, automatic download was made very difficult, so the tool was modified to use a local set of files that must be downloaded in advance.

scriptLattes is probably the most referenced tool in the bibliography we surveyed. It is quite complete, but the data is parsed from the HTML version of the Lattes CVs, which has changed in the past and may change in the future. Reports and graphics are also preprogrammed, so extensions and different layouts must be programmed separately.

LattesExtractor (http://lattesextrator.cnpq.br/lattesextrator/) is a tool developed by CNPq that allows the download of several Lattes XML CVs in batches. Although it seems to solve part of the problem at hand, namely how to obtain subsets of the data, it is not as open or as flexible: only registered organizations can retrieve data with this tool, and organizations can only access data that is related to the organization itself. For example, a university may be able to download all the XML files with the CVs of its staff, teachers and students, but will not be able to download CVs of collaborators that don't currently work or study at the university. Similarly, when a student graduates and leaves the university, his or her CV will no longer be available after graduation (since he or she is no longer part of the university). This tool also does not perform any kind of analysis, providing only the XML files.

SUCUPIRA (Alves et al., 2011b) was developed as a tool that allowed both the semi-automatic extraction of the XML files from the Lattes Platform and the creation of reports and graphics that could answer questions about collaboration between researchers, their geographical location, their scientific production and its evolution, etc.
SUCUPIRA is a web-based application that uses a list of names of researchers or students, managed by the system's user, to download the Lattes CVs (as HTML files), parse them and create reports based on the data extracted from the CVs on that list.

It seems that the development of that tool was discontinued, and changes on the Lattes Platform may have made it unusable: the structure of the HTML files changed, therefore parsers that could parse a previous version of the HTML generated by the platform had their usefulness restricted by the new layout. Additionally, SUCUPIRA was written when access to the Lattes CVs was unhindered by CAPTCHAs – for some time the platform used simple CAPTCHAs that could be solved with tools such as Tesseract (Kay, 2007), but recently the CAPTCHAs were made more difficult to solve automatically.

SUCUPIRA used another tool developed by the same researchers: LattesMiner (Alves et al., 2011a). LattesMiner is a Domain-Specific Language (DSL) implemented as a set of Java classes that allows the manipulation of data on a set of Lattes CVs, defined by a programmer, with modules for data discovery (association of names and IDs), data extraction (parsing of the HTML files corresponding to the CVs with regular expressions), storage of data in a local database, and visualization and analysis tools. As with SUCUPIRA, LattesMiner's development has stopped, since changes on the Lattes Platform rendered some aspects of the tool unusable.

Another tool that is concerned with processing data from the Lattes Platform is XMLattes (Fernandes et al., 2011), which converts the Lattes CVs from HTML to XML for further processing, and which was rendered unnecessary since the present version of the Lattes Platform already exports the XML version of the curriculum vitae (although requiring CAPTCHAs for the download of individual CVs).

As part of this research we reviewed several papers related to the analysis of Lattes CV data (Digiampietri et al., 2012; Mena-Chalco et al., 2014; Araújo et al., 2014; Perez-Cervantes et al., 2012). Most of these papers used a database with detailed information on academics, which was extracted from the Lattes Platform when it was possible to do so without the limitations imposed by the CAPTCHA currently in use. There were no references on whether that database was kept up to date.

3 A Data Science-based Approach to Analysis of Lattes CV Data

Up to now, a number of Lattes CV exploration tools have been listed. Some of them cannot be used as designed, for the following reasons:

• The Lattes CV platform has updated its data publishing technology from HTML to XML. Therefore, all the tools that relied on the previous file distribution (often requiring complex procedures to parse HTML) no longer work properly. It seems a trivial technical issue, but the tools that extracted information from the Lattes CVs formatted as HTML documents had to deal with complex HTML structures and had to detect, from the HTML content itself, the categories of information being extracted (e.g. articles in journals, conferences, titles, authors, etc.), while the data represented as XML is properly formatted and tagged with this information. Therefore, even though the XML platform restricted the use of the HTML-based tools, it provided structural elements for the development of new, more robust tools.

• Some of these tools were designed to automatically download a list of CVs from CNPq's site. This is not possible today due to the implementation of the CAPTCHA test, both to view and to download the Lattes CVs.

• Some of the solutions are only available on specific platforms – such as Linux – and require a specific setup before use.

Considering the present status of the existing tools for Lattes CV analysis, we consider that a new, functioning tool to extract, interpret, analyze and visualize the data in a simple but flexible way ought to comply with the following requirements:

1. Work with an offline set of Lattes CVs. Due to the current impossibility of automated batch download of Lattes CVs, the CVs must either be obtained manually or automatically through other authorized tools – such as LattesExtrator.
2. Be able to transparently transform the list of Lattes CVs' XML files into table-like data structures for further processing. To make this transformation, it is necessary to know the structure of the Lattes CV XML, identify the parameters of interest to the researcher and migrate these parameters to the data structures. This transformation is made easier since CNPq publishes the XML Schema Definition (XSD) of the Lattes CV files.

3. Be agnostic with respect to the operating system used to run the tool. Each potential user has a limited number of resources, and requiring the user to learn a new programming language, install new software or even a whole new operating system should be avoided if the tool is to be widely used.

4. Require no specialized knowledge for operation. In the same way that requiring a specific environment limits the utilization of the tool, if specialized knowledge is not required to operate the tool, it will be easier to use and, for that reason, may be used by a larger number of individuals. At the same time, the tool must be extensible so that more advanced users can do more with it.

5. The results produced by this tool should be reproducible by any user interested in the analysis of a set of Lattes CVs. Reproducibility is ensured by the use of a common set of instructions that can be easily shared and built upon.

The first three requirements on that list are strictly technical, and must be met by a Lattes analysis tool to deal with the complexity of the Lattes CVs' data access and representation issues. More important are the fourth and fifth requirements, which ensure that such a tool can be extended for different ways to explore the data and that the results can be reproduced and shared.

By designing a tool that follows these requirements it is possible to apply a data science process to the problem of analysis of collections of Lattes CVs. The data science process (Schutt and O'Neil, 2013) is shown in Figure 1.

Figure 1: Data Science Process (Schutt and O'Neil, 2013).

The requirements listed above are directly related to the Data Science Process: the first requirement makes reference to the collection of data; the second requirement is related to the processing and cleaning of data as well as the storage of the "clean" data, so that it can be easily and transparently accessed. Even though the fourth and fifth requirements are not directly associated with the steps shown in the Data Science process (Figure 1), they serve as a guideline to ensure that the results attained by the tool are easy to acquire – and customize – and also reproducible.

The fourth requirement listed in this section is directly related to the concept of Exploratory Data Analysis (EDA), whose intent is to allow the researcher to discover patterns in data by using visualization tools and statistics to understand "what is going on with this data".

Analysis of Lattes CVs can be done in different ways, using different metrics and algorithms, but if we consider that most of the analyses will be done considering thematic groups of researchers (e.g. researchers in a specific area of knowledge, or professors and students of a specific department), it becomes clear that methods and techniques applied to a particular analysis can be used in different contexts, depending on the group. A tool for the analysis of collections of Lattes CVs must make the reproduction of its results easy – reproducible research is also a concept closely linked to Data Science.

According to (Peng, 2011), there is a spectrum of reproducibility of research that goes from a non-reproducible result to a fully reproducible one (Figure 2).

Figure 2: Reproducibility spectrum of a scientific publication (Peng, 2011).
Considering Figure 2, it is highly desirable that a tool performing Lattes CV analysis allows the full reproduction of the data analysis experiments, with the corresponding code being publishable and applicable to different datasets of the same nature.

3.1 LattesLab

In order to address the requirements listed in the previous section, we propose a software stack solution, based on concepts and principles of Data Science, to tackle the generic problem of analyzing Lattes CVs. Our tool, named LattesLab, is based on the following components:

• A library that is able to scan a collection of Lattes CVs (stored as local files) and create a set of data frames (table-like structures) from that collection.

• A deployment mechanism for that library that allows its use with minimal software installation requirements.

• A set of live documents that show how to perform basic statistical analysis, visualizations and reports.

LattesLab is being developed in the Python language (VanRossum and Drake, 2010), a widely used platform – taking into account its large developer and user community. It is also one of the most used programming languages for solving Data Science problems, due to the large number of free and open analysis and visualization libraries. Other languages were considered for the implementation – a previous version was developed in Java, but we found out that the distribution of the library for use in other derived projects was too complex. R (Ihaka and Gentleman, 1996) was also considered: even though the R language is strongly supported and heavily used by the scientific community, Python was chosen for the readability of its code (over R code, at least) and its gradual learning curve.

To use LattesLab, it is necessary to have all the Lattes CV files of interest stored in a folder and pass the folder name as a variable to the LattesLab main library. Then, the tool reads the CV files as downloaded from the Lattes Platform, parsing the XMLs and storing the available data in data frames.

To extract data from the Lattes CV, XPath (part of XSLT, a language for transforming XML documents) was used. One of the major advantages of the Lattes CV is its XML structure (which allows the extraction of semantic information from it) and the fact that its structure is available in a public XML Schema Definition. By accessing the XML file through the XPath language one can extract the desired information and use it accordingly. In the presented case, the extracted information is used to generate a data frame which will then be used to perform a few basic analyses.
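As an illustration of this workflow (not the actual LattesLab implementation), a minimal sketch that scans a folder of Lattes XML CVs and builds a data frame could look like the following; the element and attribute names (CURRICULO-VITAE, DADOS-GERAIS, NOME-COMPLETO, DATA-ATUALIZACAO, ARTIGO-PUBLICADO) are only illustrative of the kind of fields defined in the published XSD and should be checked against it.

<pre>
# Sketch (not the actual LattesLab code): scan a folder of Lattes CV XML
# files and build a data frame. Element/attribute names are illustrative
# and should be checked against the XSD published by CNPq.
import glob
import os
import xml.etree.ElementTree as ET

import pandas as pd

def frame_from_folder(folder):
    rows = []
    for xml_file in glob.glob(os.path.join(folder, "*.xml")):
        root = ET.parse(xml_file).getroot()      # e.g. CURRICULO-VITAE
        general = root.find(".//DADOS-GERAIS")   # XPath-like query
        rows.append({
            "id": root.get("NUMERO-IDENTIFICADOR"),
            "name": general.get("NOME-COMPLETO") if general is not None else None,
            "last_update": root.get("DATA-ATUALIZACAO"),
            # count of journal papers, also via an XPath-like query
            "journal_papers": len(root.findall(".//ARTIGO-PUBLICADO")),
        })
    return pd.DataFrame(rows)

cvs = frame_from_folder("lattes_cvs/")  # folder with the downloaded CVs
print(cvs.head())
</pre>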
To obtain meaningful results from the analysis, it is desirable that the set of Lattes CVs shares some characteristics. For example, to perform an analysis of the students and researchers of one institution, it is necessary to have the Lattes CVs of members of that institution stored and subject to analysis by LattesLab. Other thematic collections could be researchers of all institutions sharing some CNPq classification (e.g. CNPq grantees), or researchers that stated that they work in a specific knowledge area.

The LattesLab main library is packed as a Python package, which simplifies its deployment. It can be downloaded and used in a standalone way; in this case the user must be able to at least install a Python IDE and then install and import the package.

Another deployment solution, which is more straightforward, is to deploy LattesLab as a Jupyter notebook (Kluyver et al., 2016). Jupyter notebooks are web applications that allow the creation and sharing of documents that contain live code, equations, visualizations and explanatory text. The non-static parts of these documents are created by the execution of Python code.

Notebooks are an interesting solution not only to make access to the information easier, but also due to the fact that they run in any operating system with a browser installed, and that they work not only with Python, but with R, Scala, Julia, and over 40 other programming languages.
Jupyter notebooks allow not only the deployment of code but also, in the same document, formatted text (with different styles and allowing the use of hypertext, graphics, etc.) related to that code. Therefore, by using a Jupyter notebook, it is possible to run LattesLab code, to perform analyses and visualizations, to explain what was done and to give instructions to the users, so that they don't have to know the tool beforehand to use it and can change parameters or modify the analysis to suit specific needs. Figure 3 shows a Jupyter notebook running LattesLab.

Figure 3: Example of a Jupyter Notebook window running the LattesLab tool.

When considering how to develop and deploy a tool for analysis of Lattes CV data we could have opted for a standalone, GUI-based application. The reason to work on a tool with interactive lines of code is the flexibility that such a tool provides. Consider a simple analysis that requires the selection of a date range on a set of CVs: a GUI must provide a widget to allow the input of an initial and an end year, which is quite simple to implement and use, and a programming approach would require one or two lines of code to implement the same functionality. But for more complex filters, e.g. to select a non-continuous range of dates, a GUI dialog would be more complex for the user (probably implemented as a list of checkboxes, one for each year) than one or two lines of code that filter a data set by a list of years.

A programming environment, while more complex, gives more freedom to the user to implement filters, apply visual effects on graphics and use third-party tools, but the most important reason to avoid a GUI-based approach is the ease of reproducibility: the chain of commands that produces an analysis from a dataset can be expressed in code, which can be documented and read by users, while GUI-based applications would require gestures (clicks, scrolls, inputs) that must be preserved somehow to allow the reproduction of the analysis.

4 Examples of Analysis Reports

For the following examples, a group of 876 Lattes CVs was downloaded from the Lattes Platform. These CVs belong to participants in the Scientific Initiation grant program at the Brazilian National Institute for Space Research (INPE) – these grants are given to undergraduate students to participate in research and development at the institute. The LattesLab library scanned and parsed these CVs' files to create a data frame used in the analysis and examples in this section.

Lattes CVs are created and updated by the users, and some may stop updating theirs, especially if they are no longer involved in academic environments. A basic but interesting question we could ask about our data is: how old are the CVs, i.e. when was the last time they were updated? We used LattesLab to create a histogram of the age of the CVs (the "last updated" information is present on the CVs).

Figure 4: Age of Lattes CV in months.

In order to give an idea of the simplicity of using Jupyter notebooks and the LattesLab library, the plot in Figure 4 was created with three lines of code, once the data frame is created and loaded (to import data from the Lattes CVs and create the data frame, approximately two hundred lines of code were used, but these are not shown to users).
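The three lines themselves are not reproduced in the paper; a plausible equivalent, assuming the data frame produced by LattesLab is called cvs and already holds a hypothetical age_months column, would be:

<pre>
# Sketch of a plot such as Figure 4, assuming the LattesLab data frame
# "cvs" has a (hypothetical) "age_months" column with the CV age in months.
import matplotlib.pyplot as plt

cvs["age_months"].plot.hist(bins=30)
plt.xlabel("Age of Lattes CV in months")
plt.ylabel("Number of CVs")
plt.show()
</pre>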
Scientific Initiation grants are given for a period of 12 months. It is possible for an undergraduate student to reapply for a grant, as long as he or she is enrolled in an undergraduate program. We knew that some of the students had held more than one grant, but wanted to get some statistics on it. A simple histogram was created, and it is shown in Figure 5.
Surprisingly, there were students that held grants for five and six years – that was unexpected, since the average duration of undergraduate technical courses in Brazil is five years.

Figure 5: Number of Scientific Research Grants per Student.

The plot above was created with five lines of code in a Jupyter notebook.
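Those five lines are likewise not listed in the paper; a sketch with the same effect, assuming a hypothetical grants column holding the number of Scientific Initiation grants held by each person, could be:

<pre>
# Sketch of a plot such as Figure 5, assuming a hypothetical "grants"
# column with the number of Scientific Initiation grants per person.
import matplotlib.pyplot as plt

cvs["grants"].value_counts().sort_index().plot.bar()
plt.xlabel("Number of Scientific Initiation grants")
plt.ylabel("Number of students")
plt.show()
</pre>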
With that thematic set of Lattes CV data it is possible to analyze the academic achievements and get a glimpse of the careers of the students that held Scientific Initiation grants. How many of those decided on an academic career after their undergraduate studies? Figure 6 shows, for the Lattes CVs in our data set, how many individuals took part in different academic activities and which academic degrees they achieved. An individual can be part of more than one bar in the graph, so individuals with post-doctorate degrees are also counted in the PhD degrees bar.

Figure 6: Academic Level of the Researchers.
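A sketch of how such a "one person may appear in several bars" count could be computed, assuming hypothetical boolean columns for each academic level in the data frame, is:

<pre>
# Sketch of a chart such as Figure 6: each bar counts the individuals with
# a given degree, and one person may be counted in more than one bar.
# The boolean columns (has_msc, has_phd, has_postdoc) are hypothetical.
import matplotlib.pyplot as plt

cvs[["has_msc", "has_phd", "has_postdoc"]].sum().plot.bar()
plt.ylabel("Number of individuals")
plt.show()
</pre>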
Many different types of visualizations can easily be achieved using LattesLab: Figure 7 shows how many of the students in our thematic data set obtained their Master's degree in each year. Of course the interpretation of the results of these graphics is heavily dependent on the environment where the data has been collected: it could be possible to infer from Figure 7 that the number of degrees awarded is increasing over the years, or that in general more degrees are awarded in even years, but the data itself does not answer why this is happening. It must be pointed out that we're only showing some examples of analyses/visualizations as examples of the Exploratory Data Analysis that can be achieved with LattesLab.

Figure 7: Master Degrees obtained per year.

Data frames generated with the LattesLab library also contain statistics on publications as declared in the Lattes CVs. Publication counts are stored by type and year of publication. One indicator of interest would be the evolution of the number of papers of a given researcher, or of the entire researcher group, over the years.

We could consider that there are different profiles for Scientific Initiation grantees, depending on whether the grantee was able to publish his/her work some time after receiving the grant, and depending on the number of papers published per year.
In order to perform this type of analysis, a feature vector – which counted the number of papers published in conferences and journals in each year after the student received his or her first grant – was used. This is one of the many possible ways to analyze individual and group publication indexes, and it can be easily extracted from the data frame obtained from the LattesLab library.
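The paper does not show how this feature vector is built; a sketch, under the assumption that publications are available as a per-publication data frame (pubs, with id and year columns) and that the year of the first grant is known per person (first_grant), could be:

<pre>
# Sketch: for each grantee, a vector with the number of papers published
# 0, 1, 2, ... years after the first grant. "pubs" (one row per publication,
# with "id" and "year" columns) and "first_grant" (first grant year per id,
# a pandas Series indexed by id) are hypothetical names.
import numpy as np

N_YEARS = 13  # number of years after the grant kept in the profile

def publication_profiles(pubs, first_grant):
    profiles = {}
    for cv_id, year0 in first_grant.items():
        offsets = pubs.loc[pubs["id"] == cv_id, "year"] - year0
        vector = np.zeros(N_YEARS, dtype=int)
        for offset in offsets:
            if 0 <= offset < N_YEARS:
                vector[int(offset)] += 1
        profiles[cv_id] = vector
    return profiles
</pre>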
We know that some Scientific Initiation grantees did not pursue further academic activities – it is to be expected that these students did not publish their results, or, if they did, did so one or two years after being awarded the grant. This is one profile we expect from our data and feature vector – are there others? What are the most interesting or unexpected profiles?

We used the Fuzzy C-Means clustering algorithm (Bezdek et al., 1984) to group the 876 profiles extracted from the grantees' CVs into nine different groups (there are metrics that can be used to indicate the best number of groups for clustering a data set, but these metrics are sometimes conflicting and inconclusive (Morais et al., 2015)).
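The clustering code is not shown in the paper; a sketch using the scikit-fuzzy package (an assumption – any Fuzzy C-Means implementation would do) on the profiles from the previous sketch could be:

<pre>
# Sketch: cluster the publication profiles into nine groups with Fuzzy
# C-Means, using the scikit-fuzzy package (an assumption); "profiles" is
# the dictionary built in the previous sketch.
import numpy as np
import skfuzzy as fuzz

data = np.array(list(profiles.values()), dtype=float).T  # features x samples
centers, membership, *rest = fuzz.cluster.cmeans(data, 9, 2.0, 0.0001, 1000)

# hard assignment: the cluster with the highest membership for each grantee
labels = np.argmax(membership, axis=0)
</pre>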
The centroids obtained from the profiles are shown in Figure 8. That figure shows some interesting patterns – one, already expected, shows that the grantees did not publish any paper between being awarded the grant and 12 years after the award. Other patterns are also interesting: the clustering algorithm identified six patterns where the grantee did not publish in the first year (also expected), published one or more papers in the second year and fewer and fewer in subsequent years.

Figure 8: Model with all data classified in nine centroids.

Some of the patterns shown in Figure 8 indicate that the grantees kept publishing for years after being awarded the grant.

It is possible to see a large cluster corresponding to grantees that had no publications over the years considered in this example – the one pattern we knew was in our data. To keep with the Exploratory Data Analysis approach we eliminated from the data frame (again, a simple operation since the Lattes data was converted to a data frame) those grantees who had never published, and ran the Fuzzy C-Means algorithm with eight clusters. The resulting centroids are shown in Figure 9.

Figure 9: Model with all non-zero data classified in eight centroids.

As expected, the profiles (centroids) in Figures 8 and 9 are very similar, indicating that the removal of the profile with zero publications did not change the clustering results much. Further analysis could be performed to better characterize the remaining profiles, or to investigate different profiles that may appear when we consider only publications after four years of being awarded the grant.

5 Conclusion and Future Work

As shown in the previous section, the LattesLab library has the capability to generate a data frame containing most of the metadata and quantitative data (counts for categories and years) of a local, thematic collection of Lattes CVs. Different types of analysis can be easily done when combining the library with other Python libraries in a standalone application or Jupyter notebook. Other interesting reports and analyses that can be done with this kind of data are:

• List CVs from the local thematic collection that must be downloaded again (based on the age of the CV).

• Cluster CVs by different criteria, using different methods more suited to Exploratory Data Analysis, such as the Kohonen Self-Organizing Networks for visualization (Morais et al., 2015).
• Generate a histogram of all the publications, per category and year, of the researchers in a specific group.

• Based on the previous task, create simulations that include or exclude certain members of the groups, to evaluate what-if scenarios of researchers leaving groups or departments.

• Again based on the previous tasks, compare two subsets of Lattes CVs by yearly production, averaged by the number of researchers in each group, for the evaluation of publications between departments or universities.

It must be pointed out that these are actual requests from some coordinators of graduate programs and heads of departments who are acting as beta testers/evaluators of the tool.

Lattes CVs can be considered social networks in the sense that co-publications, co-orientations and participation in events as organizers or in committees can be extracted from the data, since coauthors are listed and sometimes identified by a unique ID used in the Lattes database. Future versions of the library will have the capability of extracting a co-occurrence matrix of IDs that identifies categories and times of collaborations between members of a group. This could be used to explore the social network aspect of the Lattes CVs (to find cliques, temporal changes between groups, etc.).
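This extraction is future work; a minimal sketch of what a coauthorship co-occurrence count could look like, assuming a hypothetical list with the coauthor IDs of each publication, is:

<pre>
# Sketch of a coauthorship co-occurrence count: given, for each publication,
# the list of Lattes IDs of its authors (the "publications" argument is
# hypothetical), count how often each pair of IDs appears together.
from collections import Counter
from itertools import combinations

def cooccurrence(publications):
    pairs = Counter()
    for author_ids in publications:
        for a, b in combinations(sorted(set(author_ids)), 2):
            pairs[(a, b)] += 1
    return pairs

# toy example with three publications
print(cooccurrence([["id1", "id2"], ["id1", "id2", "id3"], ["id2", "id3"]]))
</pre>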
Another improvement being considered is the creation of another type of data frame that represents the textual information associated with each researcher's publications – paper titles, names of conferences, etc. This could be used to identify areas of interest and keywords through text mining techniques, making it possible to explore other ways to consider similarity between researchers.

References

Alexandre D Alves, Horacio H Yanasse, and Nei Y Soma. 2011a. LattesMiner: a multilingual DSL for information extraction from Lattes platform. In Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE! 2011, AOOPES'11, NEAT'11, & VMIL'11. ACM, pages 85–92.

Alexandre Donizeti Alves, Horacio Hideki Yanasse, and Nei Yoshihiro Soma. 2011b. SUCUPIRA: a system for information extraction of the Lattes platform to identify academic social networks. In Information Systems and Technologies (CISTI), 2011 6th Iberian Conference on. IEEE, pages 1–6.

Wonder AL Alves, Saulo D Santos, and Pedro HT Schimit. 2016. Hierarchical clustering based on reports generated by scriptLattes. In IFIP International Conference on Advances in Production Management Systems. Springer, pages 28–35.

Eduardo B Araújo, André A Moreira, Vasco Furtado, Tarcisio HC Pequeno, and José S Andrade Jr. 2014. Collaboration networks from a large CV database: dynamics, topology and bonus impact. PloS one 9(3):e90537.

James C Bezdek, Robert Ehrlich, and William Full. 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10(2-3):191–203.

L Digiampietri, J Mena-Chalco, J de Jesús Pérez-Alcázar, Esteban F Tuesta, K Delgado, and Rogério Mugnaini. 2012. Minerando e caracterizando dados de currículos Lattes. In Brazilian Workshop on Social Network Analysis and Mining (BraSNAM).

Gustavo de O Fernandes, Jonice de O Sampaio, and JM Souza. 2011. XMLattes: a tool for importing and exporting curricula data. In International Conference on Information and Knowledge Engineering.

Ross Ihaka and Robert Gentleman. 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5(3):299–314.

Anthony Kay. 2007. Tesseract: an open-source optical character recognition engine. Linux Journal 2007(159):2.

Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, et al. 2016. Jupyter notebooks—a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas, page 87.

Jesús Pascual Mena-Chalco, Luciano Antonio Digiampietri, Fabrício Martins Lopes, and Roberto Marcondes Cesar. 2014. Brazilian bibliometric coauthorship networks. Journal of the Association for Information Science and Technology 65(7):1424–1445.

Jesús Pascual Mena-Chalco and Roberto Marcondes Cesar Junior. 2009. scriptLattes: an open-source knowledge extraction system from the Lattes platform. Journal of the Brazilian Computer Society 15(4):31–39.
Alessandra Marli M Morais, Rafael DC Santos, and M Jordan Raddick. 2015. Visualization of citizen science volunteers' behaviors with data from usage logs. Computing in Science & Engineering 17(4):42–50.

R. D. Peng. 2011. Reproducible research in computational science. Science 334(6060):1226–1227. https://doi.org/10.1126/science.1213847.

Evelyn Perez-Cervantes, Jesús P Mena-Chalco, and Roberto M Cesar. 2012. Towards a quantitative academic internationalization assessment of Brazilian research groups. In E-Science (e-Science), 2012 IEEE 8th International Conference on. IEEE, pages 1–8.

Rachel Schutt and Cathy O'Neil. 2013. Doing Data Science: Straight Talk from the Frontline. O'Reilly Media, Inc.

Guido VanRossum and Fred L Drake. 2010. The Python Language Reference. Python Software Foundation, Amsterdam, Netherlands.