Specializations for the Peruvian Professional in Statistics: A Text
                             Mining Approach


Luis Cajachahua Espinoza            Andrea Ruiz Guerrero                       Tomás Nieto Agudo
        UNI, Perú                      UC, Colombia                              UCLM, España
 lcajachahua@gmail.com                randreag@gmail.com               Tomas.nieto.agudo@gmail.com


Abstract

    The objective of this study was to identify the specialization
    profiles most in demand among companies and organizations in Lima,
    through the analysis of job postings published on the Internet.
    Text mining techniques were used to extract relevant information
    and to identify some generic skills for Peruvian statisticians.

    For the purposes of this study, we analyzed 2,809 job postings
    published in the blog “Estadísticos de Perú” [2] between 2009 and
    2014. We identified many requirements, areas of knowledge and
    specific skills that companies and organizations were looking for.
    The job postings were then segmented using the Singular Value
    Decomposition (SVD) of the terms and documents matrix. Five
    segments were discovered, corresponding to specific competency
    profiles of statisticians, each with different types of knowledge
    and specific skills.

    Keywords: Job postings, Statistician, Professional, Competencies,
    Abilities, SVD, Clustering, Text Mining.

1    Introduction

Employment trends have changed considerably in recent years. A report
published by the social network LinkedIn in 2014, after analyzing 259
million professional profiles, identified ten professions that did not
exist five years earlier but are very popular today [11, 10]. This
produces great uncertainty about the future job opportunities of young
people.

On the other hand, many careers have experienced accelerated growth in
recent years. One of those careers is Statistics. According to reports
from several countries around the world, the annual demand for
professionals in Statistics has been increasing, to the point of having
the highest employment rate. One example is Spain, where Statistics is
the career with the second-lowest unemployment rate in the country [6].

Statisticians are also in demand in Brazil [8], the United States [1]
and many other countries. According to another report, by LinkedIn,
statistical and data analysis skills are at the top of the 25 skills
most sought by companies in the majority of the countries considered in
the study [9].

Considering these facts, some very interesting questions arise: What
kind of statistics professionals are companies and organizations
seeking? Have these requirements changed in recent years? Is there a
single statistician profile, or are there several types? Where can we
find useful information to answer these questions? We tried to answer
them through the analysis of job postings.

2    Background

To understand the demand for professionals and the skills required, we
need to find useful information sources. Previous research on the issue
was carried out through in-depth studies based on conversations with
subject experts [14].

On the other hand, a group of Italian students developed a segmentation
technique based on centroids [4] and applied it to the job database of
SOUL (University Orientation and Job System, a network that contains
jobs posted by 8 different universities in Italy), from which they took
more than 1,650 job postings. All kinds of postings were analyzed,
resulting in segments that cover all university careers.


Another related work comes from the iSchool of Illinois, which
performed a segmentation analysis of Indeed job postings in order to
find the profiles most in demand for its students [15]. In this case,
15,000 job postings were analyzed, all of them related to professionals
in the data analysis field. However, the segmentation was performed on
the contents within each job posting, so the resulting segments refer
to generic skills for all professionals.

The last two studies aimed not only to identify the most requested
profiles, but also to assess the status of the current job market and
its evolution over time, finding important patterns that can be turned
into actions within either the company or the college.

2.1    Objectives

The main objectives of this study are:
- Identify the most important requirements, competencies and demands
  that companies include in their job postings.
- Detect the existence of professional profiles across all the
  available job postings through text mining techniques.
- Compare the evolution of the requirements and skills by dividing the
  dataset into two periods (2009-2011 and 2012-2014).

Once these goals are achieved, we can make some recommendations to the
agents involved in the job market: companies, educational institutions
and potential employees (statisticians).

2.2    Limitations

By the nature of the study, the following limitations should be noted:
- The main information source is the blog where the job postings are
  published. Any errors or omissions in the posts will affect the
  accuracy of the results.
- Some job opportunities are never published, which biases the results
  of the analysis. Moreover, many leadership and senior positions are
  handled by headhunting companies and consequently could not be
  included in this analysis.
- The postings are mostly from companies and organizations located in
  the city of Lima. Peru is still a very centralized country (nearly a
  third of the Peruvian population lives in Lima), so the results
  cannot be extrapolated to the whole country.

3    Methodology

According to the literature reviewed, there are several methods of text
analysis, but these methods work well mainly in other languages, so we
needed to adapt some tools to Spanish. Moreover, our aim, unlike that
of previous studies, is to segment the job postings themselves in order
to identify the different types of specialties for a statistician.

3.1    Study scope

The population considered consisted of the 2,809 job postings published
in the blog "Estadísticos de Perú" [2]. All the postings were analyzed,
so it was unnecessary to use sampling techniques. The number of
postings published per year is shown in the following graph.

    Fig. 1. Job postings per year

3.2    Text Mining

As a part of data mining, text mining is an intensive process of
information extraction in which a user interacts with a collection of
documents using specialized analysis tools. As a process, it deals with
the discovery of knowledge in the content of texts after passing
through several stages.

Text mining seeks to extract useful information from multiple data
sources through the identification and exploration of interesting
patterns. One notable difference from numeric data analysis is that the
documents analyzed do not have a defined structure. That is why
pre-processing tasks are very important in text mining. These
operations focus on identifying and extracting features from natural
language, and are responsible for transforming unstructured data into a
structured intermediate format.

Text mining is used to:
- Classify and organize documents based on their content: With the
  information overload in companies, a method is needed to facilitate
  the classification of the documents that enter the system daily. Text
  mining offers several algorithms to do this automatically using index
  classification.
- Organize repositories for search and retrieval: This problem
  highlights the need for an efficient search system, in which a
  request is submitted to retrieve specific information. The query
  sends keywords that help identify the documents that fit best, sorts
  them by relevance and displays the best matches. There are techniques
  that measure the similarity between documents in order to return the
  relevant information.
- Aggregate and compare information automatically: Often, when
  researchers have many documents on the same subject, the information
  must be grouped automatically to facilitate analysis. Text clustering
  is a useful technique for building the groups in these cases.
- Extract relevant information from a document: Text mining has methods
  that deal with unstructured texts, analyze them and identify groups
  of concepts. That is, they transform plain text into valuable and
  relevant knowledge.
- Predict and evaluate: One of the more sophisticated concerns of text
  mining is to create predictive and evaluation models from the
  available textual information. These models are built on an existing
  set of documents and are then used to predict, for new documents
  entering the collection, the most suitable categories or groups
  according to their contents. This type of problem is one of the most
  common in text mining.

3.3    Text Mining Elements

Text mining, like many other disciplines, has some recognizable
elements that characterize it.
- Repository of documents: Any set of documents containing text,
  regardless of size; it can hold 10 or 100 billion texts. One of the
  main sources of documents, with more than 12 million items open to
  the public, covering a wide variety of subjects and in different
  languages, is PubMed. These characteristics have made it one of the
  databases most used by computing professionals, data analysts and
  others interested in implementing text mining tasks on a large scale.
  This collection is dynamic: over 40,000 biomedical items are added
  each month [17]. In a collection of this size, trying to correlate
  the data between documents, map relationships or identify trends
  could be extremely complex and demanding in terms of time and
  computing resources. However, there are techniques that perform these
  tasks automatically and improve the speed and efficiency of the
  analysis.
- Document: For practical purposes, a document is a unit of text data
  (e.g. a news item, a business report, an email, a research article, a
  manuscript, a story, a tweet, a book, among others).
- Corpus: A collection of documents, usually stored electronically, on
  which the analysis is performed. Its elements are known as documents,
  which store the actual text and the local metadata.
- Terms and documents matrix: The most common way to represent text for
  later comparisons. This matrix has document IDs as rows and terms as
  columns. Its elements are the frequencies of each term within each
  document.
- Vector space model: A matrix whose coefficients are functions of term
  frequency.

3.4    Text Mining Tools

In this study, we used R libraries and SAS Text Miner to obtain the
results, because each one offers some advantages and useful tasks that
the other lacks. Another reason for choosing these platforms is that
the others do not have stemming and lemmatization tools for Spanish. A
comparison of these tools is shown in the following diagram:

    Fig. 2. Comparison of R and SAS Text Miner Tasks

Following this comparison, we decided to use both packages: R to clean
the data and generate word clouds for the segments, and SAS Text Miner
for the SVD decomposition and segmentation.
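To illustrate the terms and documents matrix and the vector space model listed among the text mining elements, the following is a minimal Python sketch. It is not the study's implementation (the study used R and SAS Text Miner). The three postings are invented, and the TF-IDF weighting is only one possible "function of term frequency", chosen here as an assumption.

```python
import math
import re
from collections import Counter

# Toy corpus (hypothetical postings, not taken from the study's data).
corpus = [
    "analista de datos con experiencia en SPSS",
    "analista de riesgos con manejo de SQL",
    "practicante con manejo de Excel y SPSS",
]

def tokenize(text):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"[a-záéíóúñü]+", text.lower())

def term_document_matrix(docs):
    """Return (terms, matrix) with one row per document and one column per term."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

def tf_idf(matrix):
    """Vector space model: weight each frequency by the inverse document frequency."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]

terms, tdm = term_document_matrix(corpus)
weighted = tf_idf(tdm)
```

In this sketch a term that occurs in every posting (such as "con") receives weight zero, which is one reason weighting schemes of this kind are often preferred over raw counts.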


The scheme of the text mining process is shown in the following image:

    Fig. 3. Tasks and tools used in the analysis

In the terms filtering step, some stopwords were used in order to avoid
obvious findings, such as statistics, statistician, job, salary,
enterprise, etc. (“estadística”, “estadístico”, “empleo”, “salario”,
“empresa”, etc.). Then we performed the SVD decomposition and, finally,
the text clustering step. This process yielded some interesting
findings, which are explained in the next section.

4    Results

After the textual analysis, we can answer the research questions. For
example: What requirements and skills are asked of students and
professionals in Statistics in the published job postings?

To answer this first question, we can look at the word cloud of the
complete database in order to discover the main requirements found.

    Fig. 4. WordCloud considering the entire Corpus
    (Total: 2,809 job postings)

As observed, the most prevalent and relevant terms in the job postings
appear larger. That is, these words appeared in a high percentage of
postings, which leads us to believe that one of the first things
required of a statistician is experience (“Experiencia”). We can also
see some basic and generic skills: data analysis and information
management (“datos”, “análisis”, “información” and “manejo”). Other
words refer to specific skills, such as SPSS or Excel. It is therefore
necessary to use clustering techniques, since there are several groups
of words representing different capabilities related to statistical
profiles.

    Fig. 5. Distribution of job postings by level
    (Total: 2,809 job postings)

It is clear that analyst positions dominate because, as noted above,
the job postings correspond to basic or intermediate positions.

    Fig. 6. Job postings distribution by Requirements and period
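The terms filtering and SVD steps described at the end of Section 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure run in SAS Text Miner: the postings are invented, the function words added to the stopword list are an assumption, and the SVD is computed with a simple power iteration that is adequate only for small matrices.

```python
import math
import random
import re
from collections import Counter

# Obvious terms named in the text, plus a few Spanish function words
# (the function words are an assumption added for this example).
STOPWORDS = {"estadística", "estadístico", "empleo", "salario", "empresa",
             "de", "en", "con", "y", "para"}

# Hypothetical postings, not taken from the study's data.
docs = [
    "Analista de riesgos con experiencia en SQL y SPSS",
    "Analista de riesgos para empresa financiera con manejo de SQL",
    "Practicante de estadística con manejo de Excel",
    "Practicante con experiencia en Excel y reportes",
]

def tokenize(text):
    """Lowercase, split into words and drop stopwords."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def term_document_matrix(texts):
    """One row per document, one column per term, raw frequencies."""
    counts = [Counter(tokenize(t)) for t in texts]
    terms = sorted(set().union(*counts))
    return terms, [[float(c[t]) for t in terms] for c in counts]

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def truncated_svd(a, k, iters=500, seed=0):
    """Top-k singular values and right singular vectors of `a`, found by
    power iteration with deflation on the Gram matrix a^T a."""
    n = len(a[0])
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]
    rng = random.Random(seed)
    sigmas, vectors = [], []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = matvec(gram, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # remaining matrix is numerically zero
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, matvec(gram, v)))
        sigmas.append(math.sqrt(max(lam, 0.0)))
        vectors.append(v)
        # Deflation: remove the component just found.
        gram = [[gram[i][j] - lam * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return sigmas, vectors

terms, tdm = term_document_matrix(docs)
sigmas, vectors = truncated_svd(tdm, k=2)
# Coordinates of each posting in the reduced space (rows of U * Sigma = A * V).
coords = [[sum(x * y for x, y in zip(row, v)) for v in vectors] for row in tdm]
```

The rows of `coords` are low-dimensional representations of the postings; a clustering step would then operate on these coordinates instead of on the full matrix.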


It is remarkable that 81% of job postings mention the word
“Experience”. This means that it is one of the most important
requirements (along with knowledge and intermediate or advanced
levels). Furthermore, these requirements have gained importance in
recent years.

    Fig. 7. Job postings distribution by Competencies and period

As for the competencies, we highlight the analytical character or
profile, along with other basic business skills such as responsibility
and communication skills. The increase in demand for good
communication, responsibility and strategic thinking is noteworthy.
Clearly, organizations seek statisticians who are not only good at the
technical level, but who are also able to think about the best solution
for the organization as a whole.

    Fig. 8. Job postings distribution by required background and period

Regarding the background required, reporting tasks or report writing
weigh heavily (24%). One in four job postings contains the term
"database", which makes clear that the SQL language has become very
important in Lima. Organizations do not just need someone who can
produce statistics or models; they value professionals who can extract
the data from the sources themselves. Other tasks in high demand are
process control and indicator development.

    Fig. 9. Most required Statistical Software

The importance of SPSS in the Lima area has also clearly grown in
recent years (almost doubling its appearance in the ads). Others such
as R or SAS are still not much in demand, perhaps because of the cost
of acquisition or the time required to learn the software (SPSS is
easier).

    Fig. 10. Most required Database Management Software
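Percentages such as the 81% of postings that mention “Experience” are shares of postings containing a given term, compared across the two periods. A minimal Python sketch of that computation follows. The postings and the whole-word matching rule are assumptions made for illustration; they are not the study's data or its exact counting rule.

```python
import re

# Hypothetical postings as (year, text) pairs; not taken from the study's data.
postings = [
    (2009, "Se requiere experiencia en SPSS"),
    (2010, "Analista con manejo de Excel"),
    (2011, "Experiencia en bases de datos SQL"),
    (2012, "Experiencia en reportes y Excel"),
    (2013, "Experiencia con SPSS y Excel"),
    (2014, "Practicante con manejo de Excel"),
]

def period_of(year):
    """Map a year to one of the two periods compared in the study."""
    return "2009-2011" if year <= 2011 else "2012-2014"

def share_by_period(items, term):
    """Fraction of postings in each period that mention `term` as a whole word."""
    totals, hits = {}, {}
    for year, text in items:
        p = period_of(year)
        words = set(re.findall(r"[a-záéíóúñü]+", text.lower()))
        totals[p] = totals.get(p, 0) + 1
        hits[p] = hits.get(p, 0) + (term.lower() in words)
    return {p: hits[p] / totals[p] for p in totals}

excel_share = share_by_period(postings, "Excel")
experience_share = share_by_period(postings, "experiencia")
```

In this toy data, "Excel" appears in one of three postings in the first period and in all three in the second, which mirrors the kind of between-period comparison reported in the figures.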


Regarding the database software, SQL Server predominates over Access or
Oracle.

    Fig. 11. Most required Office Software

Notice the importance of Excel (it appears in four out of ten job
postings) and its large increase in recent years.

Finally, it is important to determine the existence of specialization
profiles, that is, segments that meet specific characteristics and
differ from the others. For this, we used SAS Enterprise Miner to
compare the results for four, five and nine segments. We chose five
segments because this solution showed better indicators of distance
between clusters and better possibilities for interpretation. The
distribution of each segment is shown in the next figure:

    Fig. 12. Term-based Segmentation in SAS Enterprise Miner

After segmenting the postings into these groups, we performed a
characterization, that is, we found the most common expressions in each
cluster in order to get a better idea of the composition of each
segment:

    Fig. 13. Segments Characterization

Based on the descriptive terms of the five clusters finally formed, and
considering the results of the characterization through word clouds,
the following professional profiles were obtained:

Risk managers (Cluster 1): Professionals with experience in portfolio
and risk management (both credits and investments), preferably analysts
and engineers. They are sought by the financial and banking sector.
Proficiency in SQL and SPSS, mainly, is also requested.

    Fig. 14. Word cloud of the Cluster 1

Analysts with reporting tasks (Cluster 2): Analysts with good
statistical knowledge, required for reporting and report-writing tasks,
mainly related to the areas of marketing and sales. The most required
software is the Office suite, more specifically Excel.

    Fig. 15. Word cloud of the Cluster 2

Business Intelligence Professionals (Cluster 3): Profiles that manage
and analyze databases, generally related to marketing and related areas
(customers, sales, campaigns). Experience in campaign management and
business intelligence is also requested. The software required is Excel
and SQL.
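The segmentation and characterization steps behind these profiles can be sketched in Python as follows. This is a sketch, not the SAS Enterprise Miner procedure: it substitutes plain k-means with a deterministic farthest-point initialization, and both the postings (already reduced to key terms) and the 2-D coordinates standing in for SVD projections are invented.

```python
import re
from collections import Counter

# Hypothetical postings reduced to key terms, and hypothetical 2-D
# coordinates standing in for their SVD projections.
docs = [
    "riesgos SQL SPSS cartera",
    "riesgos SQL créditos",
    "encuestas Excel SPSS mercado",
    "encuestas Excel marketing",
]
points = [[2.0, 0.1], [1.9, 0.0], [0.1, 1.8], [0.0, 2.1]]

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(pts, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [list(pts[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        far = max(pts, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(pts)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in pts]
        for c in range(k):
            members = [p for p, l in zip(pts, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def top_terms(texts, labels, n=2):
    """Characterization: the n most frequent terms in each cluster."""
    per_cluster = {}
    for text, label in zip(texts, labels):
        words = re.findall(r"[a-záéíóúñü]+", text.lower())
        per_cluster.setdefault(label, Counter()).update(words)
    return {c: [t for t, _ in cnt.most_common(n)]
            for c, cnt in per_cluster.items()}

labels, centers = kmeans(points, k=2)
profile = top_terms(docs, labels)
```

Running the same functions with different values of `k` and comparing the distances between the resulting `centers` is one simple way to mimic the comparison of four, five and nine segments described above.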


    Fig. 16. Word cloud of the Cluster 3

Students or graduates in trainee programs (Cluster 4): Young people at
the end of their studies (generally engineering), with knowledge of
analysis tools and expected to be proactive. They are required to
master Excel and SPSS.

    Fig. 17. Word cloud of the Cluster 4

Market researchers (Cluster 5): Professionals in the field of market
research (both quantitative and qualitative analysis). Experience in
the processing and analysis of surveys and marketing knowledge (for
research applications) are also required. Excel and SPSS are required
as well.

    Fig. 18. Word cloud of the Cluster 5

These are the profiles we set out to find. As we have seen, each one
implies that the professional should have certain characteristics
specific to the job in question.

5    Conclusions

According to the results, we can conclude that statisticians have
relative success in Lima. In addition, we reached the following
conclusions:

1.   The main goal (to identify key competencies and requirements) has
     been successfully achieved. It was possible to detect the main
     (technical and personal) requirements that companies often include
     in their job postings. Moreover, thanks to the separation into two
     time periods, we also found interesting differences in how the
     demand for these requirements has changed.

2.   The second goal (identification of professional profiles) has also
     been achieved. We identified five types of professionals, each
     group different from the rest, and characterized them accurately
     and clearly.

3.   The results obtained in this analysis may be useful for three
     agents involved in the labor market: companies, potential workers
     (statisticians) and educational institutions:

-    Business: Companies can improve their job postings, making it
     easier to reach the desired profiles. On the other hand, they
     could obtain certain advantages in areas such as employee
     training, based on the specific profiles found.

-    Statisticians: This analysis can help them improve their CV
     writing, increasing their chances of obtaining a good employment
     opportunity. They can also focus their training in the same
     direction as the requirements of companies.

-    Education: Universities, training centers and other institutions
     can adjust their academic offerings in order to meet the needs of
     the market.


References

[1] AMSTAT (2015). "Statistics is the fastest-growing undergraduate
   degree". [Consulted: February 3, 2015]. Available at:
   http://bit.ly/1uvCn4F

[2] Cajachahua, L. (2008). “Estadísticos de Perú”. Blog de empleo y
   prácticas. [Consulted: February 15, 2015]. Available at:
   http://bit.ly/1FZVfuV

[3] Cox, A., and Corral, S. (2013). "Evolving Academic Library
   Specialties". Journal of the American Society for Information
   Science and Technology. 64 (8): 1526-1542.

[4] Domenica, F., Mastrangelo, M., and Sarlo, S. (2012). "Text
   Clustering Based on Centrality Measures: An Application on Job
   Advertisements". [Consulted: June 1, 2015]. Available at:
   http://bit.ly/1HO6uVv

[5] ElPais.com (2014). "Las carreras con mayor tasa de empleo".
   [Consulted: October 29, 2014]. Available at: http://bit.ly/1rSot5P

[6] ElPais.com (2015). "¿Cuáles son los estudios con menos paro? ¿Y los
   que más tienen?" [Consulted: May 7, 2015]. Available at:
   http://bit.ly/1Jt25K3

[7] Han, J., and Kamber, M. (2001). Data Mining: Concepts and
   Techniques. Morgan Kaufmann.

[8] IPEA (2014). Radar: Technology, produção and Foreign Trade (2013)
   27. Institute of Applied Economic Research. Setoriais Diretoria of
   Studies and Policies, of Inovação, Regulação and Infrastructure.
   [Consulted: June 1, 2015]. Available at: http://bit.ly/1SZHL9j

[9] LinkedIn (2014). "The 25 Hottest Skills That Got People Hired in
   2014". [Consulted: December 17, 2015]. Available at:
   http://linkd.in/1x0LQBT

[10] LinkedIn (2014). "Top 10 Job Titles That Did not Exist 5 Years
   Ago". [Consulted: June 1, 2015]. Available at: http://linkd.in/KtpUbI

[11] Merca20.com (2014). "Infografía: 10 populares empleos que no
   existían hace 5 años". [Consulted: June 1, 2015]. Available at:
   http://bit.ly/1abEw6c

[12] Parr Rud, O. (2001). "Data Mining Cookbook". John Wiley & Sons,
   New York, NY.

[13] RPP.com (2015). "Conoce cuáles serán los empleos más demandados en
   los próximos 10 años". [Consulted: March 4, 2015]. Available at:
   http://bit.ly/1EYJH7k

[14] Swan, A., and Brown, S. (2008). "The Skills, Role and Career
   Structure of Data Scientists and Curators: An Assessment on Current
   Practices and Future Needs". Report to the JISC.

[15] Thompson, Cheryl A., and Craig Willies. (2015). "Data Workforce
   Needs: Disambiguation of Roles Using Clustering and Topic Modeling".
   [Consulted: June 1, 2015]. Available at: http://bit.ly/1QaPDpu

[16] Witten, I. H., Frank, E., and Hall, M. A. (2011). "Data mining:
   Practical machine learning tools and techniques". San Francisco:
   Morgan Kaufmann. 3rd edition.

[17] National Institutes of Health (2015). PubMed: US National Library
   of Medicine. [Consulted: June 1, 2015]. Available at:
   http://1.usa.gov/1brVEaa

