Specializations for the Peruvian Professional in Statistics: A Text Mining Approach

Luis Cajachahua Espinoza (UNI, Perú), lcajachahua@gmail.com
Andrea Ruiz Guerrero (UC, Colombia), randreag@gmail.com
Tomás Nieto Agudo (UCLM, España), Tomas.nieto.agudo@gmail.com

Abstract

The objective of this study was to identify the specialization profiles most required by companies and organizations in Lima, through the analysis of job postings published on the Internet. Text mining techniques were used to extract relevant information and to identify some generic skills expected of Peruvian statisticians.

For the purposes of this study, we analyzed 2,809 job postings published in the blog "Estadísticos de Perú" [2] between 2009 and 2014. We identified many requirements, areas of knowledge and specific skills that companies and organizations were looking for. The job postings were then segmented using the Singular Value Decomposition (SVD) of the term-document matrix. Five segments were discovered, corresponding to specific competency profiles of statisticians, each with different types of knowledge and specific skills.

Keywords: Job postings, Statistician, Professional, Competencies, Abilities, SVD, Clustering, Text Mining.

1 Introduction

Employment trends have changed a great deal in recent years. A report published by the social network LinkedIn in 2014, after analyzing 259 million professional profiles, identified ten professions that did not exist five years earlier but are very popular today [11, 10]. This creates great uncertainty about the future job opportunities of young people.

On the other hand, many careers have shown accelerated growth in recent years. One of them is Statistics. According to reports from several countries around the world, the annual demand for professionals in Statistics has been increasing, to the point of reaching the highest employment rates. One example is Spain, where Statistics is the career with the second-lowest unemployment rate in the country [6].

Statisticians are also in demand in Brazil [8], the United States [1] and many other countries. According to another report by LinkedIn, statistical and data analysis skills are at the top of the 25 skills most sought by companies in the majority of the countries considered in that study [9].

Considering these facts, some very interesting questions arise: What kind of statistics professionals are companies and organizations seeking? Have these requirements changed in recent years? Is there a single statistician profile, or are there several types? Where can we find useful information to answer these questions? We tried to answer them through the analysis of job postings.

2 Background

To understand the demand for professionals and the skills required, we need to find useful information sources. Previous research on this issue was carried out through in-depth studies based on interviews with subject experts [14].

On the other hand, a group of Italian students developed a segmentation technique based on centroids [4] and applied it to the job database of SOUL (University Orientation and Job System, a network that contains jobs posted by 8 different universities in Italy), from which they took more than 1,650 job postings. All kinds of postings were analyzed, so the resulting segments covered all university careers.

Another related work comes from the iSchool of Illinois, where a segmentation analysis of Indeed job postings was performed in order to find the profiles most in demand for their students [15]. In this case, 15,000 job postings were analyzed, all of them related to professionals in the data analysis field. However, the segmentation was performed within the contents of each job posting, so the resulting segments refer to generic skills for all professionals.

These last two studies aimed not only to identify the most requested profiles, but also to assess the status of the current job market and its evolution over time, finding important patterns that can be turned into actions within either the company or the college.

2.1 Objectives

The main objectives of this study are:
- Identify the most important requirements, competencies and demands that companies include in their job postings.
- Detect the existence of professional profiles across all the available job postings, using text mining techniques.
- Compare the evolution of the requirements and skills by dividing the dataset into two periods (2009-2011 and 2012-2014).

Once these goals are achieved, we can make some recommendations to the agents involved in the job market: companies, educational institutions and potential employees (statisticians).

2.2 Limitations

Given the nature of the study, the following limitations should be noted:
- The main information source is the blog where the job postings are published. Any errors or omissions in the posts will affect the accuracy of the results.
- Some job opportunities are never published, which biases the results of the analysis. Moreover, many leadership and senior positions are handed to headhunting companies and therefore could not be included in this analysis.
- The postings come mostly from companies and organizations located in the city of Lima. Peru is still a very centralized country, and nearly a third of the Peruvian population lives in Lima, so the results cannot be extrapolated to the whole country.

3 Methodology

According to the literature reviewed, there are several methods of text analysis, but these methods work well in other languages, so we needed to adapt some tools to Spanish. In addition, our aim, unlike that of previous studies, is to segment the job postings themselves, in order to identify the different types of specialties for a statistician.

3.1 Study scope

The population considered consisted of the 2,809 job postings published in the blog "Estadísticos de Perú" [2]. All the postings were analyzed, so it was unnecessary to use sampling techniques. The number of postings published per year is shown in the next graph.

Fig. 1. Job postings per year

3.2 Text Mining

As a part of Data Mining, Text Mining is an intensive process of information extraction in which a user interacts with a collection of documents using specialized analysis tools. As a process, it deals with the discovery of knowledge in the content of several texts, after passing through several stages.

Text Mining seeks to extract useful information from multiple data sources through the identification and exploration of interesting patterns. One remarkable difference from numeric data analysis is that the documents analyzed do not have a defined structure. That is why pre-processing tasks are very important in text mining. These operations focus on identifying and extracting features of natural language, and they are responsible for transforming unstructured data into a structured intermediate format.

Text mining is used to:
- Classify and organize documents based on their content: Given the information overload in companies, a method is needed to facilitate the classification of the documents that enter the system every day. Text mining has several algorithms that do this automatically using index classification.
- Organize repositories for search and retrieval: This problem reflects the need for an efficient search system in which a request is submitted to retrieve specific information. The query sends keywords that help identify the documents that fit best; the results are sorted by relevance and the best matches are displayed. There are techniques that measure the similarity between documents in order to return the relevant information.
- Aggregate and compare information automatically: When researchers have many documents on the same subject, it is often necessary to group the information automatically to facilitate the analysis. Text clustering is a useful technique for building the groups in these cases.
- Extract relevant information from a document: Text mining has methods that deal with unstructured texts, analyze them and identify groups of concepts. That is, it transforms plain text into valuable and relevant knowledge.
- Predict and evaluate: One of the more sophisticated aims of text mining is to build and evaluate predictive models from the available textual information. These models rely on established modeling and ensemble techniques to predict, for new documents entering the collection, the most suitable categories or groups according to their contents. This type of problem is one of the most common in text mining.

3.3 Text Mining Elements

Text Mining, like many other disciplines, has some recognizable elements that characterize it.
- Repository of documents: Any set of documents containing text, regardless of size; it can hold 10 texts or 100 billion. One of the main sources of documents is PubMed, with more than 12 million items open to the public, covering a wide variety of subjects and written in different languages. These characteristics have made it one of the databases most used by computing and data analysis professionals, and by those interested in implementing text mining tasks on a large scale. The collection is dynamic: over 40,000 biomedical items are added each month [17]. In a collection of this size, trying to correlate data between documents, map relationships or identify trends can be extremely complex and demanding in terms of time and computing resources. However, there are techniques that perform these tasks automatically and improve the speed and efficiency of the analysis.
- Document: For practical purposes, a document is a unit of text data (e.g. news items, business reports, emails, research articles, manuscripts, stories, tweets, books, among others).
- Corpus: A collection of documents, usually stored electronically, on which the analysis is performed. Its elements are known as documents, which store the actual text and the local metadata.
- Term-document matrix: The most common way to represent text for later comparisons. This matrix has document IDs as rows and terms as columns. Its elements are the frequencies of each term within each document.
- Vector space model: A matrix whose coefficients are functions of the term frequency.

3.4 Text Mining Tools

In this study, we used R libraries and SAS Text Miner to obtain the results, because each one offers advantages and useful tasks that the other lacks. Another reason for choosing these platforms is that the others do not have stemming and lemmatization tools for Spanish. A comparison of these tools is shown in the next diagram:

Fig. 2. Comparison of R and SAS Text Miner Tasks

Following this comparison, we decided to use both packages: R to clean the data and generate word clouds for the segments, and SAS Text Miner for the SVD and the segmentation.

The scheme of the text mining process is shown in the following image:

Fig. 3. Tasks and tools used in the analysis

In the term filtering step, some stopwords were used in order to avoid obvious findings, such as statistics, statistician, job, salary, enterprise, etc. ("estadística", "estadístico", "empleo", "salario", "empresa", etc.). Then we performed the SVD and, finally, the text clustering step. This process yielded some interesting findings, which are explained in the next section.

4 Results

After the textual analysis, we can answer the research questions. For example: What requirements and skills are requested of students and professionals in Statistics in the published job postings? For a first answer, we can look at the word cloud of the complete database in order to discover the main requirements found.

Fig. 4. Word cloud considering the entire corpus (Total: 2,809 job postings)

As observed, the most prevalent and relevant terms in the job postings appear larger. That is, these words appeared in a high percentage of postings, which leads us to believe that one of the first things required of a statistician is experience ("Experiencia"). We can also see some basic and generic skills: data analysis and information management ("datos", "análisis", "información" and "manejo"). Other words refer to specific skills, such as SPSS or Excel. It is therefore necessary to use clustering techniques, since there are several groups of words representing different capabilities related to statistical profiles.

Fig. 5. Distribution of job postings by level (Total: 2,809 job postings)

Analyst positions clearly dominate because, as noted in the limitations, the job postings correspond mostly to basic or intermediate positions.

Fig. 6. Job postings distribution by Requirements and period

It is remarkable that 81% of job postings mention the word "Experience". This means that it is one of the most important requirements (along with knowledge at intermediate or advanced levels). Furthermore, these requirements have become increasingly important in recent years.

Fig. 7. Job postings distribution by Competencies and period

As for the competencies, we highlight the analytical profile, along with other basic business skills such as responsibility and communication skills. The increase in good communication, responsibility and strategic thinking is notable. Clearly, organizations seek statisticians who are not only good at a technical level, but also able to think about the best solution for the organization as a whole.

Fig. 8. Job postings distribution by required background and period

Regarding the background required, reporting tasks or report writing weigh heavily (24%). One in four job postings contains the term "database", which makes clear that the SQL language has become very important in Lima. It is not enough to be someone who can produce statistics or models; organizations value professionals who can extract the data from the sources themselves. Other tasks in high demand are Process Control and Indicators Development.

Fig. 9. Most required Statistical Software

The importance of SPSS in the Lima area has also clearly grown in recent years (its appearance in the ads almost doubled). Other packages such as R or SAS are still not in much demand, perhaps because of the cost of acquisition or the time required to learn the software (SPSS is easier).

Fig. 10. Most required Database Management Software

Regarding database software, SQL Server predominates over Access and Oracle.

Fig. 11. Most required Office Software

Note the importance of Excel (it appears in four out of ten job postings) and its great increase in recent years.

Finally, it is important to determine whether specialization profiles exist, that is, segments that meet specific characteristics and differ from the others. For this, we used SAS Enterprise Miner to compare the results obtained with four, five and nine segments. We chose five segments because this solution showed better indicators of distance between clusters and better possibilities of interpretation. The distribution of each segment is shown in the next figure:

Fig. 12. Term-based Segmentation in SAS Enterprise Miner

After segmenting the postings into these groups, we performed a characterization, that is, we found the most common expressions in each cluster in order to get a better idea of the composition of each segment:

Fig. 13. Segments Characterization

Based on the descriptive terms of the five clusters finally formed, and considering the results of the characterization through word clouds, the following professional profiles were obtained:

Risk managers (Cluster 1): Professionals with experience in portfolio and risk management (both credits and investments), preferably analysts and engineers. They are sought by the financial and banking sector. Command of SQL and SPSS is mainly requested.

Fig. 14. Word cloud of Cluster 1

Analysts with reporting tasks (Cluster 2): Analysts with good statistical knowledge, required for reporting and report-writing tasks, mainly in the areas of marketing and sales. The most required software is the Office suite, more specifically Excel.

Fig. 15. Word cloud of Cluster 2

Business Intelligence Professionals (Cluster 3): Profiles that manage and analyze databases, generally related to marketing and related areas (customers, sales, campaigns). Experience in campaign management and business intelligence is also requested. The software required is Excel and SQL.

Fig. 16. Word cloud of Cluster 3

Students or graduates in trainee programs (Cluster 4): Young people at the end of their studies (generally engineering) with knowledge of analysis tools, who are required to be proactive. They are required to master Excel and SPSS.

Fig. 17. Word cloud of Cluster 4

Market researchers (Cluster 5): Professionals in the field of market research (both quantitative and qualitative analysis). Experience in processing and analyzing surveys and marketing knowledge (for research applications) are also required. Excel and SPSS are required as well.

Fig. 18. Word cloud of Cluster 5

These are the profiles we set out to find. As we have seen, each one implies that the professional sought should have specific features suited to the job in question.

5 Conclusions

According to the results, we can conclude that statisticians enjoy relative success in the Lima job market. In addition, we have reached the following conclusions:

1. The main goal (to identify key competencies and requirements) has been successfully achieved. It was possible to detect the main technical and personal requirements that companies most often include in their job postings. Thanks to the temporal split into two periods, we also found interesting differences in how the demand for these requirements has changed.
2. The second goal (identification of professional profiles) has also been achieved. We identified five types of professionals; each group is different from the rest, and we characterized them accurately and clearly.
3. The results obtained in this analysis may be useful for three agents involved in the labor market: companies, potential workers (statisticians) and educational institutions:
- Business: Companies can improve their job postings, making it easier to reach the desired profiles. They could also gain advantages in areas such as employee training, based on the specific profiles found.
- Statisticians: This analysis can help them improve their CV writing, increasing their chances of obtaining a good employment opportunity. They can also focus their training in the same direction as the requirements of companies.
- Education: Universities, training centers and other institutions can adjust their academic offer in order to meet the needs of the market.

References

[1] AMSTAT (2015). "Statistics is the fastest-growing undergraduate degree". [Consulted: February 3, 2015]. Available at: http://bit.ly/1uvCn4F

[2] Cajachahua, L. (2008). "Estadísticos de Perú". Job and internship blog. [Consulted: February 15, 2015]. Available at: http://bit.ly/1FZVfuV

[3] Cox, A., and Corral, S. (2013). "Evolving Academic Library Specialties". Journal of the American Society for Information Science and Technology, 64 (8): 1526-1542.

[4] Domenica, F., Mastrangelo, M., and Sarlo, S. (2012). "Text Clustering Based on Centrality Measures: An Application on Job Advertisements". [Consulted: June 1, 2015]. Available at: http://bit.ly/1HO6uVv

[5] ElPais.com (2014). "Las carreras con mayor tasa de empleo". [Consulted: October 29, 2014]. Available at: http://bit.ly/1rSot5P

[6] ElPais.com (2015). "¿Cuáles son los estudios con menos paro? ¿Y los que más tienen?". [Consulted: May 7, 2015]. Available at: http://bit.ly/1Jt25K3

[7] Han, J., and Kamber, M. (2001). "Data Mining: Concepts and Techniques". Morgan Kaufmann.

[8] IPEA (2014). "Radar: Technology, Production and Foreign Trade", No. 27 (2013). Institute of Applied Economic Research, Directorate of Sectoral Studies and Policies on Innovation, Regulation and Infrastructure. [Consulted: June 1, 2015]. Available at: http://bit.ly/1SZHL9j

[9] LinkedIn (2014). "The 25 Hottest Skills That Got People Hired in 2014". [Consulted: December 17, 2015]. Available at: http://linkd.in/1x0LQBT

[10] LinkedIn (2014). "Top 10 Job Titles That Did not Exist 5 Years Ago". [Consulted: June 1, 2015]. Available at: http://linkd.in/KtpUbI

[11] Merca20.com (2014). "Infografía: 10 populares empleos que no existían hace 5 años". [Consulted: June 1, 2015]. Available at: http://bit.ly/1abEw6c

[12] Parr Rud, O. (2001). "Data Mining Cookbook". John Wiley & Sons, New York, NY.

[13] RPP.com (2015). "Conoce cuáles serán los empleos más demandados en los próximos 10 años". [Consulted: March 4, 2015]. Available at: http://bit.ly/1EYJH7k

[14] Swan, A., and Brown, S. (2008). "The Skills, Role and Career Structure of Data Scientists and Curators: An Assessment on Current Practices and Future Needs". Report to the JISC.

[15] Thompson, C. A., and Willies, C. (2015). "Data Workforce Needs: Disambiguation of Roles Using Clustering and Topic Modeling". [Consulted: June 1, 2015]. Available at: http://bit.ly/1QaPDpu

[16] Witten, I. H., Frank, E., and Hall, M. A. (2011). "Data Mining: Practical Machine Learning Tools and Techniques". 3rd edition. San Francisco: Morgan Kaufmann.

[17] National Institutes of Health (2015). PubMed: US National Library of Medicine. [Consulted: June 1, 2015]. Available at: http://1.usa.gov/1brVEaa