=Paper=
{{Paper
|id=Vol-1318/SIMBig2014_proceedings
|storemode=property
|title=None
|pdfUrl=https://ceur-ws.org/Vol-1318/SIMBig2014_proceedings.pdf
|volume=Vol-1318
}}
==None==
<pdf width="1500px">https://ceur-ws.org/Vol-1318/SIMBig2014_proceedings.pdf</pdf>
<pre>
1st Symposium on Information Management
             and Big Data


          8th, 9th and 10th September 2014
                     Cusco - Peru


        PROCEEDINGS
            © SIMBig, Andina University of Cusco
            and the authors of individual articles.
Proceeding Editors: J. A. Lossio-Ventura and H. Alatrista-Salas
	
  
TABLE OF CONTENTS

SIMBig 2014 Conference Organization Committee
SIMBig 2014 Conference Reviewers
SIMBig 2014 Paper Contents
SIMBig 2014 Keynote Speaker Presentations
Full and Short Papers


The SIMBig 2014 Organization Committee confirms that full and concise papers
accepted for this publication:
• Meet the definition of research in relation to creativity, originality, and increasing
   humanity's stock of knowledge;
• Are selected on the basis of a peer review process that is independent, qualified
   expert review;
• Are published and presented at a conference having national and international
   significance as evidenced by registrations and participation; and
• Are made available widely through the Conference web site.
	
  
Disclaimer: The SIMBig 2014 Organization Committee accepts no responsibility for omissions
and errors.


                                            3
SIMBig 2014 Organization Committee
	
  
GENERAL CHAIR

•      Juan Antonio LOSSIO VENTURA, Montpellier 2 University, LIRMM,
       Montpellier, France
•      Hugo ALATRISTA SALAS, UMR TETIS, Irstea, France


LOCAL CHAIR

•      Cristhian GANVINI VALCARCEL Andina University of Cusco, Peru
•      Armando FERMIN PEREZ, National University of San Marcos, Peru
•      Cesar A. BELTRAN CASTAÑON, GRPIAA, Pontifical Catholic University
       of Peru
•      Marco RIVAS PEÑA, National University of San Marcos, Peru


                                      4
SIMBig 2014 Conference Reviewers

•   Salah Ait-Mokhtar, Xerox Research Centre Europa, FRANCE
•   Jérôme Azé, LIRMM - University of Montpellier 2, FRANCE
•   Cesar A. Beltrán Castañón, GRPIAA - Pontifical Catholic University of Peru,
    PERU
•   Sandra Bringay, LIRMM - University of Montpellier 3, FRANCE
•   Oscar Corcho, Ontology Engineering Group - Polytechnic University of Madrid,
    SPAIN
•   Gabriela Csurka, Xerox Research Centre Europa, FRANCE
•   Mathieu d'Aquin, Knowledge Media institute (KMi) - Open University, UK
•   Ahmed A. A. Esmin, Federal University of Lavras, BRAZIL
•   Frédéric Flouvat, PPME Labs - University of New Caledonia, NEW CALEDONIA
•   Hakim Hacid, Alcatel-Lucent, Bell Labs, FRANCE
•   Dino Ienco, Irstea, FRANCE
•   Clement Jonquet, LIRMM - University of Montpellier 2, FRANCE
•   Eric Kergosien, Irstea, FRANCE
•   Pierre-Nicolas Mougel, Hubert Curien Labs - University of Saint-Etienne,
    FRANCE
•   Phan Nhat Hai, Oregon State University, USA
•   Thomas Opitz, University of Montpellier 2 - LIRMM, FRANCE
•   Jordi Nin, Barcelona Supercomputing Center (BSC) - Technical University of
    Catalonia (BarcelonaTECH), SPAIN
•   Gabriela Pasi, Information Retrieval Lab - University of Milan Bicocca, ITALY
•   Miguel Nuñez del Prado Cortez, Intersec Labs - Paris, FRANCE
•   José Manuel Perea Ortega, European Commission - Joint Research Centre
    (JRC)- Ispra, ITALY
•   Yoann Pitarch, IRIT - Toulouse, FRANCE
•   Pascal Poncelet, LIRMM - University of Montpellier 2, FRANCE
•   Julien Rabatel, LIRMM, FRANCE
•   Mathieu Roche, Cirad - TETIS and LIRMM, FRANCE
•   Nancy Rodriguez, LIRMM - University of Montpellier 2, FRANCE
•   Hassan Saneifar, Raja University, IRAN
•   Nazha Selmaoui-Folcher, PPME Labs - University of New Caledonia, NEW
    CALEDONIA
•   Maguelonne Teisseire, Irstea - LIRMM, FRANCE
•   Boris Villazon-Terrazas, iLAB Research Center - iSOCO, SPAIN
•   Pattaraporn Warintarawej, Prince of Songkla University, THAILAND
•   Osmar Zaïane, Department of Computing Science, University of Alberta,
    CANADA


                                        5
SIMBig 2014 Paper Contents

Title and Authors                                                     Page

Keynote Speaker Presentations                                           7
Pascal Poncelet
Mathieu Roche
Hakim Hacid
Mathieu d’Aquin

Long Papers
Identification of Opinion Leaders Using Text Mining Technique in        8
Virtual Community, Chihli Hung and Pei-Wen Yeh
Quality Metrics for Optimizing Parameters Tuning in Clustering         14
Algorithms for Extraction of Points of Interest in Human Mobility,
Miguel Nunez Del Prado Cortez and Hugo Alatrista-Salas
A Cloud-based Exploration of Open Data: Promoting Transparency         22
and Accountability of the Federal Government of Australia, Richard
Sinnott and Edwin Salvador
Discovery Of Sequential Patterns With Quantity Factors, Karim          33
Guevara Puente de La Vega and César Beltrán Castañón
Online Courses Recommendation based on LDA, Rel Guzman                 42
Apaza, Elizabeth Vera Cervantes, Laura Cruz Quispe and Jose
Ochoa Luna

Short Papers
ANIMITEX project: Image Analysis based on Textual Information,         49
Hugo Alatrista-Salas, Eric Kergosien, Mathieu Roche and
Maguelonne Teisseire
A case study on Morphological Data from Eimeria of Domestic            53
Fowl using a Multiobjective Genetic Algorithm and R&P for
Learning and Tuning Fuzzy Rules for Classification, Edward
Hinojosa Cárdenas and César Beltrán Castañón
SIFR Project: The Semantic Indexing of French Biomedical Data          58
Resources, Juan Antonio Lossio-Ventura, Clement Jonquet,
Mathieu Roche and Maguelonne Teisseire

Spanish Track
Mathematical modelling of the performance of a computer system,        62
Felix Armando Fermin Perez
Clasificadores supervisados para el análisis predictivo de muerte y    67
sobrevida materna, Pilar Hidalgo


                                      6
SIMBig 2014 Keynote Speakers


Mining Social Networks: Challenges and Directions
Pascal Poncelet, Lirmm, Montpellier, France
Professor Pascal Poncelet talk was focused on analysis of data associate to
social network. In his presentation, Professor Poncelet summarized several
techniques through knowledge discovery on databases process.

NLP approaches: How to Identify Relevant Information in
Social Networks?
Mathieu Roche, UMR TETIS, Irstea, France
In this talk, doctor Roche outlined last tendencies in text mining task. Several
techniques (sentiment analysis, opinion mining, entity recognition, ...) on
different corpus (tweets, blogs, sms,...) were detailed in this presentation.

Social Network Analysis: Overview and Applications
Hakim Hacid, Zayed University, Dubai, UAE
Doctor Hacid presented a complete overview concerning techniques
associated to explote social web data. Techniques as social network analysis,
social information retrieval and mashups full completion were presented.

Putting intelligence in Web Data With Examples Education
Mathieu d’Aquin, Knowledge Media Institute, The Open University, UK
Doctor d'Aquin presented several web mining approaches addressed to
education issues. In this talk doctor d'Aquin detailed themes like intelligent
web information and knowledge processing, the semantic web, among others.


                                        7
     Identification of Opinion Leaders Using Text Mining Technique in
                             Virtual Community


                Chihli Hung                                            Pei-Wen Yeh
    Department of Information Management                   Department of Information Management
       Chung Yuan Christian University                        Chung Yuan Christian University
           Taiwan 32023, R.O.C.                                   Taiwan 32023, R.O.C.
          chihli@cycu.edu.tw                                     mogufly@gmail.com


                                                           significantly and consumers are further influenced
                      Abstract                             by other consumers without any geographic
    Word of mouth (WOM) affects the buying
                                                           limitation (Flynn et al., 1996).
    behavior of information receivers stronger than           Nowadays, making buying decisions based on
    advertisements. Opinion leaders further affect         WOM becomes one of collective decision-making
    others in a specific domain through their new          strategies. It is nature that all kinds of human
    information, ideas and opinions. Identification        groups have opinion leaders, explicitly or
    of opinion leaders has become one of the most          implicitly (Zhou et al., 2009). Opinion leaders
    important tasks in the field of WOM mining.            usually have a stronger influence on other
    Existing work to find opinion leaders is based         members through their new information, ideas and
    mainly on quantitative approaches, such as             representative opinions (Song et al., 2007). Thus,
    social network analysis and involvement.               how to identify opinion leaders has increasingly
    Opinion leaders often post knowledgeable and
    useful documents. Thus, the contents of WOM
                                                           attracted the attention of both practitioners and
    are useful to mine opinion leaders as well. This       researchers.
    research proposes a text mining-based approach            As opinion leadership is relationships between
    to evaluate features of expertise, novelty and         members in a society, many existing opinion leader
    richness of information from contents of posts         identification tasks define opinion leaders by
    for identification of opinion leaders. According       analyzing the entire opinion network in a specific
    to experiments in a real-world bulletin board          domain, based on the technique of social network
    data set, this proposed approach demonstrates          analysis (SNA) (Kim, 2007; Kim and Han, 2009).
    high potential in identifying opinion leaders.         This technique depends on relationship between
                                                           initial publishers and followers. A member with
                                                           the greatest value of network centrality is
1    Introduction                                          considered as an opinion leader in this network
This research identifies opinion leaders using the         (Kim, 2007).
technique of text mining, since the opinion leaders           However, a junk post does not present useful
affect other members via word of mouth (WOM)               information. A WOM with new ideas is more
on social networks. WOM defined by Arndt (1967)            interesting. A spam link usually wastes readers'
is an oral person-to-person communication means            time. A long post is generally more useful than a
between an information receiver and a sender, who          short one (Agarwal et al., 2008). A focused
exchange the experiences of a brand, a product or a        document is more significant than a vague one.
service based on a non-commercial purpose.                 That is, different documents may contain different
Internet provides human beings with a new way of           influences on readers due to their quality of WOM.
communication. Thus, WOM influences the                    WOM documents per se can also be a major
consumers more quickly, broadly, widely,                   indicator for recognizing opinion leaders. However,
                                                           such quantitative approaches, i.e. number-based or


                                                       8
SNA-based methods, ignore quality of WOM and                 network hubs usually contain six aspects, which
only include quantitative contributions of WOM.              are ahead in adoption, connected, travelers,
   Expertise, novelty, and richness of information           information-hungry, vocal, and exposed to media
are three important features of opinion leaders,             more than others (Rosen, 2002). Ahead in adoption
which are obtained from WOM documents (Kim                   means that network hubs may not be the first to
and Han, 2009). Thus, this research proposes a text          adopt new products but they are usually ahead of
mining-based approach in order to identify opinion           the rest in the network. Connected means that
leaders in a real-world bulletin board system.               network hubs play an influential role in a network,
   Besides this section, this paper is organized as          such as an information broker among various
follows. Section 2 gives an overview of features of          different groups. Traveler means that network hubs
opinion leaders. Section 3 describes the proposed            usually love to travel in order to obtain new ideas
text mining approach to identify opinion leaders.            from other groups. Information-hungry means that
Section 4 describes the data set, experiment design          network hubs are expected to provide answers to
and results. Finally, a conclusion and further               others in their group, so they pursue lots of facts.
research work are given in Section 5.                        Vocal means that network hubs love to share their
                                                             opinions with others and get responses from their
2    Features of Opinion Leaders                             audience. Exposed to media means that network
                                                             hubs open themselves to more communication
The term “opinion leader”, proposed by Katz and
                                                             from mass media, and especially to print media.
Lazarsfeld (1957), comes from the concept of
                                                             Thus, a network hub or an opinion leader is not
communication. Based on their research, the
                                                             only an influential node but also a novelty early
influence of an advertising campaign for political
                                                             adopter, generator or spreader. An opinion leader
election is lesser than that of opinion leaders. This
                                                             has rich expertise in a specific topic and loves to be
is similar to findings in product and service
                                                             involved in group activities.
markets. Although advertising may increase
                                                                As members in a social network influence each
recognition of products or services, word of mouth
                                                             other, degree centrality of members and
disseminated via personal relations in social
                                                             involvement in activities are useful to identify
networks has a greater influence on consumer
                                                             opinion leaders (Kim and Han, 2009). Inspired by
decisions (Arndt, 1967; Khammash and Griffiths,
                                                             the PageRank technique, which is based on the link
2011). Thus, it is important to identify the
                                                             structure (Page et al., 1998), OpinionRank is
characteristics of opinion leaders.
                                                             proposed by Zhou et al. (2009) to rank members in
   According to the work of Myers and Robertson
                                                             a network. Jiang et al. (2013) proposed an
(1972), opinion leaders may have the following
                                                             extended version of PageRank based on the
seven characteristics. Firstly, opinion leadership in
                                                             sentiment analysis and MapReduce. Agarwal et al.
a specific topic is positively related to the quantity
                                                             (2008) identified influential bloggers through four
of output of the leader who talks, knows and is
                                                             aspects, which are recognition, activity generation,
interested in the same topic. Secondly, people who
                                                             novelty and eloquence. An influential blog is
influence others are themselves influenced by
                                                             recognized by others when this blog has a lot of in-
others in the same topic. Thirdly, opinion leaders
                                                             links. The feature of activity generation is
usually have more innovative ideas in the topic.
                                                             measured by how many comments a post receives
Fourthly and fifthly, opinion leadership is
                                                             and the number of posts it initiates. Novelty means
positively related to overall leadership and an
                                                             novel ideas, which may attract many in-links from
individual’s social leadership. Sixthly, opinion
                                                             the blogs of others. Finally, the feature of
leaders usually know more about demographic
                                                             eloquence is evaluated by the length of post. A
variables in the topic. Finally, opinion leaders are
                                                             lengthy post is treated as an influential post.
domain dependent. Thus, an opinion leader
                                                                Li and Du (2011) determined the expertise of
influences others in a specific topic in a social
                                                             authors and readers according to the similarity
network. He or she knows more about this topic
                                                             between their posts and the pre-built term ontology.
and publishes more new information.
                                                             However both features of information novelty and
   Opinion leaders usually play a central role in a
                                                             influential position are dependent on linkage
social network. The characteristics of typical
                                                             relationships between blogs. We propose a novel


                                                         9
text mining-based approach and compare it with               3.3   Novelty
several quantitative approaches.
                                                             We       utilize     Google      trends      service
3     Quality Approach-Text Mining                           (http://www.google.com/trends) to obtain the first-
                                                             search time tag for significant words in documents.
Contents of word of mouth contain lots of useful             Thus, each significant word has its specific time
information, which has high relationships with               tag taken from the Google search repository. For
important features of opinion leaders. Opinion               example, the first-search time tag for the search
leaders usually provide knowledgeable and novel              term, Nokia N81, is 2007 and for Nokia Windows
information in their posts (Rosen, 2002; Song et al.,        Phone 8 is 2011. We define three degrees of
2007). An influential post is often eloquent (Keller         novelty evaluated by the interval between the first-
and Berry, 2003). Thus, expertise, novelty, and              search year of significant words and the collected
richness     of     information     are   important          year of our targeted document set, i.e. 2010. This
characteristics of opinion leaders.                          significant word belongs to normal novelty if the
                                                             interval is equal to two years. A significant word
3.1    Preprocessing                                         with an interval of less than two years belongs to
This research uses a traditional Chinese text                high novelty and one with an interval greater than
mining process, including Chinese word                       two years belongs to low novelty. We then
segmenting, part-of-speech filtering and removal             summarize all novelty values based on significant
of stop words for the data set of documents. As a            words used by a member in a social network. The
single Chinese character is very ambiguous,                  equation of novelty for a member is shown in (2).
segmenting Chinese documents into proper
Chinese words is necessary (He and Chen, 2008).                       eh  0.66  em  0.33  el
                                                             novi                               ,          (2)
This research uses the CKIP service                                          eh  em  el
(http://ckipsvr.iis.sinica.edu.tw/)   to     segment
                                                             where eh , em and el is the number of words that
Chinese documents into proper Chinese words and
                                                             belong to the groups of high, normal and low
their suitable part-of-speech tags. Based on these
                                                             novelty, respectively.
processes, 85 words are organized into controlled
vocabularies as this approach is efficient to capture
the main concepts of document (Gray et al., 2009).           3.4   Richness of Information
3.2    Expertise                                             In general, a long document suggests some useful
                                                             information to the users (Agarwal et al., 2008).
This can be evaluated by comparing their posts
                                                             Thus, richness of information of posts can be used
with the controlled vocabulary base (Li and Du,
                                                             for the identification of opinion leaders. We use
2011). For member i, words are collected from his
                                                             both textual information and multimedia
or her posted documents and member vector i is
                                                             information to represent the richness of
represented as fi=(w1, w2, …wj, …, wN), where wj
                                                             information as (3).
denotes the frequency of word j used in the posted
documents of user i. N denotes the number of                 ric=d + g,                                     (3)
words in the controlled vocabulary. We then
normalize the member vector by his or her
                                                             where d is the total number of significant words
maximum frequency of any significant word. The               that the user uses in his or her posts and g is the
degree of expertise can be calculated by the
                                                             total number of multimedia objects that the user
Euclidean norm as show in (1).
                                                             posts.

          fi
exp i       ,                                  (1)          3.5   Integrated Text Mining Model
          mi
                                                             Finally, we integrate expertise, novelty and
where  is Euclidean norm.                                   richness of information from the content of posted
                                                             documents. As each feature has its own


                                                        10
distribution and range, we normalize each feature             number of documents that a member initiates plus
to a value between 0 and 1. Thus, the weights of              the number of derivative documents by other
opinion leaders based on the quality of posts                 members is treated as involvement.
become the average of these three features as (4).               Thus, we have one qualitative model, i.e. ITM,
                                                              and four quantitative models, i.e. DEG, CLO, BET
        Norm ( nov )  Norm (exp)  Norm ( ric )              and INV. We put top ten rankings from each model
ITM                                             . (4)
                          3                                   in a pool of potential opinion leaders. Duplicate
                                                              members are removed and 25 members are left.
                                                              We request 20 human testers, which have used and
4     Experiments                                             are familiar with Mobile01.
                                                                 In our questionnaire, quantitative information is
4.1    Data Set                                               provided such as the number of documents that the
                                                              potential opinion leaders initiate and the number of
Due to lack of available benchmark data set, we               derivative documents that are posted by other
crawl WOM documents from the Mobile01                         members. For the qualitative information, a
bulletin board system (http://www.mobile01.com/),             maximum of three documents from each member
which is one of the most popular online discussion            are provided randomly to the testers. The top 10
forums in Taiwan. This bulletin board system                  rankings are also considered as opinion leaders
allows its members to contribute their opinions               based on human judgment.
free of charge and its contents are available to the
public. A bulletin board system generally has an              4.3   Results
organized structure of topics. This organized                 We suppose that ten of 9460 members are
structure provides people who are interested in the
                                                              considered as opinion leaders. We collect top 10
same or similar topics with an online discussion              ranking members from each models and remove
forum that forms a social network. Finding opinion            duplicates. We request 20 human testers to identify
leaders on bulletin boards is important since they
                                                              10 opinion leaders from 25 potential opinion
contain a lot of availably focused WOM. In our                leaders obtained from five models. According to
initial experiments, we collected 1537 documents,
                                                              experiment results in Table 1, the proposed model
which were initiated by 1064 members and                      outperforms others. This presents the significance
attracted 9192 followers, who posted 19611                    of documents per se. Even INV is a very simple
opinions on those initial posts. In this data set, the
                                                              approach but it performs much better than social
total number of participants is 9460.                         network analysis models, i.e. DEG, CLO and BET.
                                                              One possible reason is the sparse network structure.
4.2    Comparison                                             Many sub topics are in the bulletin board system so
                                                              these topics form several isolated sub networks.
As we use real-world data, which has no ground
truth about opinion leaders, a user centered                                                   F-
evaluation approach should be used to compare the                       Recall   Precision           Accuracy
                                                                                             measure
difference between models (Kritikopoulos et al.,              DEG        0.45       0.50      0.48     0.56
2006). In our research, there are 9460 members in             CLO        0.36       0.40      0.38     0.48
this virtual community. We suppose that ten of                BET        0.64       0.70      0.67     0.72
them have a high possibility of being opinion                 INV        0.73       0.80      0.76     0.80
leaders.                                                      ITM        0.82       0.90      0.86     0.88
   As identification of opinion leaders is treated to
be one of important tasks of social network                     Table 1: Results of models evaluated by recall,
analysis (SNA), we compare the proposed model                         precision, F-measure and accuracy
(i.e. ITM) with three famous SNA approaches,
which are degree centrality (DEG), closeness
centrality (CLO), betweenness centrality (BET).
Involvement (INV) is an important characteristic
of opinion leaders (Kim and Han, 2009). The


                                                         11
5     Conclusions and Further Work                              Flynn, L. R., Goldsmith, R. E. and Eastman, J. K. 1996.
                                                                   Opinion Leaders and Opinion Seekers: Two New
    Word of mouth (WOM) has a powerful effect                      Measurement Scales. Academy of Marketing
on consumer behavior. Opinion leaders have                      He, J. and Chen, L. 2008. Chinese Word Segmentation
stronger influence on other members in an opinion                 Based on the Improved Particle Swarm Optimization
society. How to find opinion leaders has been of                  Neural Networks. Proceedings of IEEE Cybernetics
interest to both practitioners and researchers.                   and Intelligent Systems, 695-699.
Existing models mainly focus on quantitative
                                                                Jiang, L., Ge, B., Xiao, W. and Gao, M. 2013. BBS
features of opinion leaders, such as the number of                 Opinion Leader Mining Based on an Improved
posts and the central position in the social network.              PageRank      Algorithm    Using    MapReduce.
This research considers this issue from the                        Proceedings of Chinese Automation Congress, 392-
viewpoints of text mining. We propose an                           396.
integrated text mining model by extracting three
                                                                Katz, E. and Lazarsfeld, P. F. 1957. Personal Influence,
important features of opinion leaders regarding                   New York: The Free Press.
novelty, expertise and richness of information,
from documents. Finally, we compare this                        Keller, E. and Berry, J. 2003. One American in Ten
proposed text mining model with four quantitative                 Tells the Other Nine How to Vote, Where to Eat and,
approaches, i.e., involvement, degree centrality,                 What to Buy. They Are The Influentials. The Free
                                                                  Press.
closeness centrality and betweenness centrality,
evaluated by human judgment. In our experiments,                Khammash, M. and Griffiths, G. H. 2011. Arrivederci
we found that the involvement approach is the best                CIAO.com Buongiorno Bing.com- Electronic Word-
one among the quantitative approaches. The text                   of-Mouth (eWOM), Antecedences and Consequences.
mining approach outperforms its quantitative                      International Journal of Information Management,
                                                                  31:82-87.
counterparts as the richness of document
information provides a similar function to the                  Kim, D. K. 2007. Identifying Opinion Leaders by Using
qualitative features of opinion leaders. The                      Social Network Analysis: A Synthesis of Opinion
proposed text mining approach further measures                    Leadership Data Collection Methods and Instruments.
opinion leaders based on features of novelty and                  PhD Thesis, the Scripps College of Communication,
                                                                  Ohio University.
expertise.
   In terms of possible future work, some                       Kim, S. and Han, S. 2009. An Analytical Way to Find
integrated strategies of both qualitative and                     Influencers on Social Networks and Validate their
quantitative approaches should take advantages of                 Effects in Disseminating Social Games. Proceedings
both approaches. For example, the 2-step                          of Advances in Social Network Analysis and Mining,
integrated strategy, which uses the text mining-                  41-46.
based approach in the first step, and uses the                  Kritikopoulos, A., Sideri, M. and Varlamis, I. 2006.
quantitative approach based on involvement in the                 BlogRank: Ranking Weblogs Based on Connectivity
second step, may achieve the better performance.                  and Similarity Features. Proceedings of the 2nd
Larger scale experiments including topics, the                    International Workshop on Advanced Architectures
                                                                  and Algorithms for Internet Delivery and
number of documents and testing, should be done
                                                                  Applications, Article 8.
further in order to produce more general results.
                                                                Li, F. and Du, T. C. 2011. Who Is Talking? An
                                                                   Ontology-Based Opinion Leader Identification
References                                                         Framework for Word-of-Mouth Marketing in Online
                                                                   Social Blogs. Decision Support Systems, 51,
Agarwal, N., Liu, H., Tang, L. and Yu, P. S. 2008.                 2011:190-197.
    Identifying the Influential Bloggers in a Community.        Myers, J. H. and Robertson, T. S. 1972. Dimensions of
    Proceedings of WSDM, 207-217.
                                                                 Opinion Leadership. Journal of Marketing Research,
Arndt, J. 1967. Role of Product-Related Conversations            4:41-46.
  in the Diffusion of a New Product. Journal of
                                                                Page, L., Brin, S., Motwani, R. and Winograd, T. 1998.
  Marketing Research, 4(3):291-295.                               The PageRank Citation Ranking: Bringing Order to
                                                                  the Web. Technical Report, Stanford University.


                                                           12
Rosen, E. 2002. The Anatomy of Buzz: How to Create
  Word of Mouth Marketing, 1st ed., Doubleday.
Song, X., Chi, Y., Hino, K. and Tseng, B. L. 2007.
  Identifying Opinion Leaders in the Blogosphere.
  Proceedings of CIKM’07, 971-974.
Zhou, H., Zeng, D. and Zhang, C. 2009. Finding
  Leaders from Opinion Networks. Proceedings of the
  2009 IEEE International Conference on Intelligence
  and Security Informatics, 266-268.


                                                       13
       Quality Metrics for Optimizing Parameters Tuning in Clustering
       Algorithms for Extraction of Points of Interest in Human Mobility


            Miguel Nuẽz del Prado Cortez                               Hugo Alatrista-Salas
                     Peru I+D+I                                        GRPIAA Labs., PUCP
                   Technopark IDI                                           Peru I+D+I
           miguel.nunez@peruidi.com                                   halatrista@pucp.pe


                       Abstract                                On one hand, heuristics rely on the dwell time, which
                                                               is the lost of signal when user gets into a building.
     Clustering is an unsupervised learning tech-
                                                               Another used heuristic is the residence time, which
     nique used to group a set of elements into non-
     overlapping clusters based on some predefined
                                                               represents the time that a user spends at a particu-
     dissimilarity function. In our context, we rely           lar place. On the other hand, clustering algorithms
     on clustering algorithms to extract points of             group nearby mobility traces into clusters.
     interest in human mobility as an inference at-               In particular, in the context of POI extraction, it is
     tack for quantifying the impact of the privacy            important to find a suitable set of parameters, for a
     breach. Thus, we focus on the input param-                specific cluster algorithm, in order to obtain a good
     eters selection for the clustering algorithm,             accuracy as result of the clustering. The main contri-
     which is not a trivial task due to the direct im-
                                                               bution of this paper is a methodology to find a “op-
     pact of these parameters in the result of the
     attack. Namely, if we use too relax parame-               timal” configuration of input parameters for a clus-
     ters we will have too many point of interest              tering algorithm based on quality indices. This op-
     but if we use a too restrictive set of parame-            timal set of parameters allows us to have the appro-
     ters, we will find too few groups. Accordingly,           priate number of POIs in order to perform another
     to solve this problem, we propose a method to             inference attack. This paper is organized as follows.
     select the best parameters to extract the opti-           First, we present some related works on parameters
     mal number of POIs based on quality metrics.
                                                               estimation techniques in Section 2. Afterwards, we
                                                               describe the clustering algorithms used to perform
1   Introduction                                               the extraction of points of interests (POIs) as well as
The first step in inference attacks over mobility              the metrics to measure the quality of formed clusters
traces is the extraction of the point of interest (POI)        in sections 3 and 4, respectively. Then, we introduce
from a trail of mobility traces. Indeed, this phase            the method to optimize the choice of the parameters
impacts directly the global accuracy of an inference           in Section 5. Finally, Section 6 summarizes the re-
attack that relies on POI extraction. For instance,            sults and presents the future directions of this paper.
if an adversary wants to discover Alice’s home and
                                                               2   Related works
place of work the result of the extraction must be as
accurate as possible, otherwise they can confuse or            Most of the previous works estimate the parameters
just not find important places. In addition, for a more        of the clustering algorithms for the point of interest
sophisticated attack such as next place prediction, a          extraction by using empirical approaches or highly
mistake when extracting POIs can decrease signif-              computationally expensive methods. For instance,
icantly the global precision of the inference. Most            we use for illustration purpose two classical cluster-
of the extraction techniques use heuristics and clus-          ing approaches, K-means (MacQueen et al., 1967)
tering algorithms to extract POIs from location data.          and DBSCAN (Ester et al., 1996). In the former


                                                          14
clustering algorithm, the main issue is how to deter-         a cluster C composed of all the consecutive points
mine k, the number of clusters. Therefore, several            within distance d from each other. Afterwards, the
approaches have been proposed to address this issue           algorithm checks if the accumulated time of mobil-
(Hamerly and Elkan, 2003; Pham et al., 2005). The             ity traces between the youngest and the oldest ones
latter algorithm relies on OPTICS (Ankerst et al.,            is greater than the threshold t. If it is the case, the
1999) algorithm, which searches the space of param-           cluster is created and added to the list of POIs. Fi-
eters of DBSCAN in order to find the optimal num-             nally as a post-processing step, DT-Cluster merges
ber of clusters. The more parameters the clustering           the clusters whose medioids are less than d/3 far
algorithm has, the bigger the combinatorial space of          from each other.
parameters is. Nevertheless, the methods to calibrate
cluster algorithm inputs do not guarantee a good ac-          3.3    Time Density (TD-Cluster)
curacy for extracting meaningful POIs. In the next            Introduced in (Gambs et al., 2011), TD-Cluster is
section, we described the cluster algorithms used in          a clustering algorithm inspired from DT Cluster,
our study.                                                    which takes as input parameters a radius r, a time
                                                              window t, a tolerance rate τ , a distance threshold d
3   Clustering algorithms for extraction of                   and a trail of mobility traces M . The algorithm starts
    points of interest                                        by building iteratively clusters from a trail M of mo-
To perform the POI extraction, we rely on the fol-            bility traces that are located within the time window
lowing clustering algorithms:                                 t. Afterwards, for each cluster, if a fraction of the
                                                              points (above the tolerance rate τ ) are within radius
3.1 Density Joinable Cluster (DJ-Cluster)                     r from the medoid, the cluster is integrated to the list
DJ-Cluster (Zhou et al., 2004) is a clustering algo-          of clusters outputted, whereas otherwise it is simply
rithm taking as input a minimal number of points              discarded. Finally, as for DT Cluster, the algorithm
minpts, a radius r and a trail of mobility traces             merges the clusters whose medoids are less than d
M . This algorithm works in two phases. First, the            far from each other.
pre-processing phase discards all the moving points           3.4    Begin-end heuristic
(i.e. whose speed is above , for  a small value)
and then, squashes series of repeated static points           The objective of the begin and end location finder
into a single occurrence for each series. Next, the           inference attack (Gambs et al., 2010) is to take as
second phase clusters the remaining points based              meaningful points the first and last of a journey.
on neighborhood density. More precisely, the num-             More precisely, this heuristic considers that the be-
ber of points in the neighborhood must be equal or            ginning and ending locations of a user, for each
greater than minpts and these points must be within           working day, might convey some meaningful infor-
radius r from the medoid of a set of points. Where            mation.
medioid is the real point m that minimizes the sum of            Since we have introduced the different clustering
distances from the point m to the other points in the         algorithms to extract points of interest, we present in
cluster. Then, the algorithm merges the new cluster           the next section the indices to measure the quality of
with the clusters already computed, which share at            the clusters.
least one common point. Finally, during the merg-             4     Cluster quality indices
ing, the algorithm erases old computed clusters and
only keeps the new cluster, which contains all the            One aspect of the extraction of POIs inference at-
other merged clusters.                                        tacks is the quality of the obtained clusters, which
                                                              impacts on the precision and recall of the attack.
3.2 Density Time Cluster (DT-Cluster)                         In the following subsection we describe some met-
DT-Cluster (Hariharan and Toyama, 2004) is an iter-           rics to quantify how accurate or “how good“ is the
ative clustering algorithm taking as input a distance         outcome of the clustering task. Intuitively, a good
threshold d, a time threshold t and a trail of mobil-         clustering is one that identifies a group of clusters
ity traces M . First, the algorithm starts by building        that are well separated one from each other, compact


                                                         15
and representative. Table 1 summarizes the notation                        4.2    Additive margin
used in this section.                                                      Inspired by the Ben-David and Ackerman (Ben-
 Symbol     Definition                                                     David and Ackerman, 2008) k-additive Point Mar-
    C       An ensemble of clusters.                                       gin (K-AM) metric , which evaluates how well cen-
    ci      The ith cluster of C.                                          tered clusters are. We measure the difference be-
    nc      The number of clusters in C.                                   tween the medoid mi and its two closest points m0
   mi       The medoid point of the ith cluster.                           and m00 of a given group ci belonging to a cluster C
 d(x, y)    The Euclidean distance between x and y.
                                                                           (Equation 5).
   |ci |    The number of points in a cluster ci .
   m0       The closest point to the medoid mi .                                 K − AM (ci ) = d(mi , m00i ) − d(mi , m0i )     (5)
   m00      The second closest point to the medoid mi .
   |C|      The total number of points in a set of C.                      Since the average of the k-additive point margins for
                                                                           all groups ci in a cluster C is computed, we take the
           Table 1: Summary of notations
                                                                           ratio between the average k-additive Point Margin
                                                                           and the minimal inter-cluster distance (Equation 1)
4.1 Intra-inter cluster ratio                                              as shown in Equation 6.
The intra-inter cluster ratio (Hillenmeyer, 2012)                                                    1 Pnc
                                                                                                    nc    ci ∈C K − AM (ci )
measures the relation between compact (Equation 1)                           AM (C) = minci ∈C                               (6)
                                                                                                             DIC(ci )
and well separated groups (Equation 3). More pre-
cisely, we first take the inter-cluster distance, which                    The additive margin method has a linear complexity
is the average distance from each point in a cluster                       in the number of clusters. This metric gives a high
ci to its medoid mi .                                                      value for a well centered clusters.
                                  |ci |
                      1           X                                        4.3    Information loss
   DIC(ci ) =                                  d(xj , mi )      (1)
                 |ci | − 1
                             xj ∈ci ,xj 6=mi
                                                                           The information loss ratio is a metric inspired by the
                                                                           work of Sole and coauthors (Solé et al., 2012). The
Then, the average intra-cluster distance (DIC) is                          basic idea is to measure the percent of information
computed using Equation 2.                                                 that is lost while representing original data only by
                      1 X
                                      |C|                                  a certain number of groups (e.g., when we represent
        AV G DIC(C) =     DIC(ci )                              (2)        the POIs by the cluster medoids instead of the whole
                      nc
                                     ci ∈C                                 set of points). To evaluate the percent of information
Afterwards, the mean distance among all medoids                            loss, we compute the sum of distance of each point
(DOC) in the cluster C is computed, using Equation                         represented by xi to its medoid mi for all clusters
3.                                                                         ci ∈ C as we shown in Equation 7.
                                           |C|     |C|
                  1          X                     X                                                   |c|
                                                                                                    nc X
                                                                                                    X
DOC(C) =                                                     d(mi , mj )               SSE(C) =                    d(xj , mi )   (7)
         |nC | × (|nC | − 1)
                                          ci ∈C cj ∈C,i6=j                                          ci ∈C xj ∈ci
                                                       (3)
Finally, the ratio intra-inter cluster rii is given by the                 Then, we estimate the accumulated distance of all
Equation 4 as the relationship between the average                         points of a trail of mobility traces in the cluster C to
intra cluster distance divided by the inter-cluster dis-                   a global centroid (GC) using the following equation
tance.                                                                     Equation 8.
                          AV G DIC(C)
              rii(C) =                                 (4)                                             |C|
                             DOC(C)                                                                    X
                                                                                         SST (C) =            d(xi , GC)         (8)
The intra-inter ratio has an approximate linear com-
                                                                                                      xi ∈C
plexity in the number of points to be computed and
gives low values to well separated and compact clus-                       Finally, the ratio between aforementioned distances
ter.                                                                       is computed using Equation 9, which results in the


                                                                      16
information loss ratio.                                       Afterwards, a similarity measure between two clus-
                                                              ters ci and cj , called R-similarity, is estimated, based
                           SSE(C)
                IL(C) =                           (9)         on Equation 14.
                           SST (C)
                                                                                            S(ci ) + S(cj )
The computation of this ratio has a linear complex-                         R(ci , cj ) =                           (14)
                                                                                              M (ci , cj )
ity. The lowest is the value of this ratio, the more
representative the clusters are.                              After that, the most similar cluster cj to ci is the
                                                              one maximizing the result of the function Rall (ci ),
4.4 Dunn index
                                                              which is given by Equation 15 for i 6= j.
This quality index (Dunn, 1973; Halkidi et al., 2001)
attempts to recognize compact and well-separated                        Rall (ci ) = maxcj ∈C,i6=j R(ci , cj )      (15)
clusters. The computation of this index relies on
a dissimilarity function (e.g. Euclidean distance d)          Finally, the DBI is equal to the average of the simi-
between medoids and the diameter of a cluster (c.f,           larity between clusters in the clustering set C (Equa-
Equation 10) as a measure of dispersion.                      tion 16).
                                                                                                n
        diam(ci ) = maxx,y∈ci ,x6=y d(x, y)      (10)                                       1 Xc

                                                                           DBI(C) =              Rall (ci )         (16)
                                                                                            nc
                                                                                              ci ∈C
Then, if the clustering C is compact (i.e, the diam-
eters tend to be small) and well separated (distance          Ideally, the clusters ci ∈ C should have the mini-
between cluster medoids are large), the result of the         mum possible similarity to each other. Accordingly,
index, given by the Equation 11, is expected to be            the lower is the DB index, the better is the cluster-
large.                                                        ing formed. These indices would be used to maxi-
                                                              mize the number of significant places a cluster algo-
     DIL(C) = minci ∈C [mincj ∈C,j=i+1                        rithm could find. More precisely, in the next section
                      d(mi , mj )                (11)         we evaluate the cluster algorithm aforementioned as
                [                    ]]
                  maxck ∈C diam(ck )                          well as the method to extract the meaningful places
                                                              using the quality indices.
The greater is this index, the better the performance
of the clustering algorithm is assumed to be. The             5     Selecting the optimal parameters for
main drawbacks of this index is the computational                   clustering
complexity and the sensitivity to noise in data.
                                                              In order to establish how to select the best set
4.5 Davis-Bouldin index                                       of parameters for a given clustering algorithm, we
The objective of the Davis-Bouldin index (DBI)                have computed the precision, recall and F-measure
(Davies and Bouldin, 1979; Halkidi et al., 2001) is to        of all users of LifeMap dataset (Chon and Cha,
evaluate how well the clustering was performed by             2011). One of the unique characteristic of this
using properties inherent to the dataset considered.          dataset is that the POIs have been annotated by
First, we use a scatter function within the cluster ci        the users. Consequently, given a set of clusters
of the clustering C (Equation 12).                            ci ∈ C such that C = {c1 , c2 , c3 , . . . , cn } and a
                                                              set of points of interest (POIs) defined by the users
                    v
                    u
                    u1 X    n                                 Ppoi = {ppoi 1 , ppoi 2 , ppoi 3 , . . . , ppoi n } we were
           S(ci ) = t           d(xj , mi )2     (12)         able to compute the precision, recall and f-measure
                       nc x ∈c
                           j   i                              as we detail in the next subsection.

Then, we compute the distance between two differ-             5.1    Precision, recall and F-measure
ent clusters ci and cj , given by Equation 13.                To compute the recall (c.f. Equation 17), we take as
                             q                                input a clustering set C, the ground truth represented
              M (ci , cj ) = d(mi , mj )       (13)           by the vector Ppoi (which was defined manually by


                                                         17
each user) as well as a radius to count all the clus-                          Characteristics                                            LifeMap
ters c ∈ C that are within the radius of ppoi ∈ Ppoi ,                       Total nb of users                                               12
which represents the ”good clusters”. Then, the ratio                  Collection period (nb of days)                                       366
of the number of good clusters compared to the total                     Average nb of traces/user                                         4 224
number of found clusters is computed. This measure                           Total nb of traces                                            50 685
illustrates the ratio of extracted cluster that are POIs                  Min #Traces for a user                                            307
divided by the total number of extracted clusters.                        Max #Traces for a user                                           9 473

                         good clusters                          Table 2:             Main characteristics of the LifeMap
 P recision =                                   (17)
                total number extracted clusters                 dataset.
   To compute the recall (c.f. Equation 18), we take
as input a clustering set C, a vector of POIs Ppoi as           5.3     Experimental results
well as a radius to count the discovered POIs ppoi ∈
Ppoi within a radius of the clusters c ∈ C, which               This section is composed of two parts, in the first
represents the ”good POIs”. Then, the ratio between             part we compare the performance of the previously
the number of good POIs and the total number of                 described clustering algorithms, with two base-
POIs is evaluated. This metric represents the percent           line clustering algorithms namely k-means and DB-
of the extracted unique POIs.                                   SCAN. In the second part, a method to select the
                                                                most suitable parameters for a clustering algorithm
                                                                is presented.
                        good P OIs
      Recall =                                     (18)
                  total number of P OIs


                                                                                                                                                       DT cluster


                                                                                                                                                                                TD cluster
                                                                                                                                 DBSCAN

                                                                                                                                          DJ cluster


                                                                                                                                                                    K-means
  Finally, the F-measure is defined as the weighted              Input parameters                   Possible values
                                                                 Tolerance rate (%)             {0.75, 0.8, 0.85, 0.9}            y        Y            y            y           y
average of the precision and recall as we can see in             Tolerance rate (%)             {0.75, 0.8, 0.85, 0.9}            7        7            7            7           3
                                                                 Minpts (points)            {3, 4, 5, 6, 7, 8, 9, 10, 20, 50}     3        3            7            7           7
Equation 19.                                                     Eps (Km.)                   {0.01, 0.02, 0.05, 0.1, 0.2}         3        3            3            7           3
                                                                 Merge distance (Km.)        {0.02, 0.04, 0.1, 0.2, 0.4}          7        7            3            7           3
                                                                 Time shift (hour)                 {1, 2, 3, 4, 5, 6}             7        7            3            7           3
                                                                 K (num. clusters)                   {5, 6, 7,8, 9}               7        7            7            3           7
                           precision × recall
  F − measure = 2 ×                                (19)
                           precision + recall                   Table 3: Summary of input parameters for clustering
We present the dataset used for our experiments in              algorithms.
the next subsection.

5.2 Dataset description                                                         Precision       Recall      F-measure       Time(s)
                                                                                                                                                Number of
                                                                                                                                                parameters
                                                                                                                                                                              Complexity

                                                                 DBSCAN             0,58         0,54          0,48             316                    3                       O(n2 )
In order to evaluate our approach, we use the                    DJ-Cluster         0,74         0,52          0,52             429                    3                       O(n2 )
LifeMap dataset (Chon et al., 2012), which is com-               DT-Cluster         0,38         0,47          0,39             279                    3                       O(n2 )
                                                                  k-means           0,58         0,51          0,49             299                    2                       O(n)
posed of mobility traces of 12 user collected for a              TD-Cluster         0,43         0,54          0,44             362                    4                       O(n2 )

year in Seoul, Korea. This dataset comprises lo-
cation (latitude and longitude) collected with a fre-           Table 4: The characteristics of the clustering algo-
quency between 2 to 5 minutes with the user defined             rithms.
point of interest as true if the mobility trace is con-
sidered as important or meaningfull for each user.                 In order to compare the aforementioned clustering
Table 2 summarizes the main characteristics of this             algorithms, we have take into account the precision,
dataset, such as the collect period, the average num-           recall, F-measure obtained, average execution time,
ber of traces per user, the total number of mobility            number of input parameters and time complexity.
traces in the dataset, the minimal and maximal num-             To evaluate these algorithms, we used the LifeMap
ber of users’ mobility traces.                                  dataset with POIs annotation and a set of different
   Since we have described our dataset, we present              parameters configurations for each algorithm, which
the results of our experiments in the next subsection.          are summarized in Table 3. After running these con-


                                                           18
figurations, we obtained the results shown in Table                                       different configurations of distinct algorithms us-
4 for the different input values.                                                         ing the previously described quality indices. After-
                                                                                          wards, we were able to estimate the precision, recall
                   DBSCAN             DJ Cluster           DT Cluster                     and F-measure using the manual annotation of POIs
                                                                                          by the users in the LifeMap dataset.
        0.5
                                                                                             Regarding the relation between the quality indices
                                                                                          and the F-measure, we studied the relationship be-
        0.0
                                                                                          tween these factors, in order to identify the indices
                                                                                          that are highly correlated with the F-measure, as
        -0.5
                                                                                          can be observed in Figure 1. We observe that the
                                                                            Index
                                                                               DBI
                                                                                          two best performing indices, except for k-means,
        -1.0                                                                   DI
                                                                                          are IL and DBI. The former shows a negative cor-
Value


                   k_means             Resume              TD Cluster          IL
                                                                               kAM        relation with respect to the F-measure. While the
        0.5                                                                    RII
                                                                                          latter, has a positive dependency to the F-measure.
                                                                                          Our main objective is to be able to identify the rela-
        0.0
                                                                                          tionship between quality and F-measure among the
                                                                                          previous evaluated clustering algorithms. Accord-
        -0.5                                                                              ingly, we discard the inter-intra cluster ratio (RII)
                                                                                          and the adaptive margin (AM), which only perform
        -1.0                                                                              well when using k-means and the DJ clustering al-
               A   B    C    D   A     B     C     D   A    B     C     D
                                     Correlation                                          gorithms. Finally, we observe that the Dunn index
                                                                                          has a poor performance. Based on these observa-
Figure 1: Correlation of quality indices with the                                         tions, we were able to propose an algorithm to auto-
computed F-measure. Where A) is the correlation                                           matically choose the best configuration of input pa-
measured between the user annotation and the cen-                                         rameters.
troid at 20 m of radius B) at 35 m radius C) at
50 m radius, D) at 100 m radius and DBI=Davis-                                            5.4    Parameter selection method
Bouldin index, DI=Dunn index, IL=Information                                              Let us define a vector of parameters pi ∈ P and P a
loss, kAM=Additive margin and RII= Ratio intra-                                           set of vectors, such that P = {p1 , p2 , p3 , . . . , pn },
inter cluster.                                                                            a trail of mobility traces M of a user. From
                                                                                          previous sections we have the clustering function
   It is possible to observe that the precision of DJ-                                    C(pi ) and the quality metrics Information Loss
Cluster out performs better than the other cluster-                                       IL(C) and Davis-Bouldin index DBI(C). Thus,
ing algorithms. In terms of recall DBSCAN and                                             for each vector of parameters we have a tuple com-
TD-Cluster perform the best but DJ-Cluster is just                                        posed of the trail of mobility traces, the result
behind them. Moreover, DJ-Cluster has the best                                            of the clustering algorithm and the quality metrics
F-measure. Regarding the execution time, DT-                                              (pi , M, Cpi , ILCpi , DBICpi ). When we compute
Clustering the fastest one while DJ-Cluster is the                                        the clustering algorithm and the quality metrics for
slowest algorithm due to the preprocessing phase.                                         each vector of parameter for a given user u. We de-
Despite the high computational time of DJ-Cluster,                                        fine also a χ0u matrix, which the matrix χu sorted by
this algorithm performs well in terms of F-measure.                                       IL ascending. Finally, the result matrix χu is of the
   In the following, we describe our method to                                            form:
choose “optimal” parameters for obtaining a good
F-measure. We have used the aforementioned algo-                                                       p1    M     Cp1    ILCp1     DBICp1
rithms with a different set of input parameters con-                                                   p2    M     Cp2    ILCp2     DBICp2
figurations for users with POIs annotations in the                                              χu =   p3    M     Cp3    ILCp3     DBICp3
LifeMap dataset (Chon and Cha, 2011). Once clus-                                                       ...
ters are built, we evaluate the clusters issued from                                                   pn    M     C pn   ILCpn     DBICpn


                                                                                     19
                                    DBSCAN                 DJ cluster                DT cluster


                        0.6


                        0.4


                        0.2


                                                                                                      Heuristic
            F_measure


                        0.0                                                                               DBI
                                                                                                          IL
                                    k means                Resume                    TD cluster
                                                                                                          IL_DBI
                                                                                                          MAX

                        0.6


                        0.4


                        0.2


                        0.0

                              u1 u2 u3 u4 u5 u6 u7   u1 u2 u3 u4 u5 u6 u7      u1 u2 u3 u4 u5 u6 u7
                                                             User


Figure 2: Comparison between F-measure and parameters selection based on schema in Figure ??. Where
DBI=Davis-Bouldin index, IL=Information loss, IL DBI= combination of IL and DBI and MAX is the maximal
computed F-measure (taken as reference to compare with IL DBI). Resume is the average of all the results using
different clustering algorithms.


  Therefore, the parameter selection function                            set of input parameters. In the second situation, both
S(χu ) could be defined as:                                              IL and DBI refer each one to a different set of pa-
                                                                         rameters. In this case, the algorithm sorts the values
           (
            pi ,     if maxpi (DBI) & minpi (IL)                         by IL in the ascending order (i.e., from the small-
  S(χu ) =                                                               est to the largest information loss value). Then, it
            p0i ,    if maxp0i (DBI) in 1st quartile
                                                                         chooses the set of parameters with the greatest DBI
                                                  (20)
                                                                         in the first quartile.
   In detail, the function S takes as input a χ matrix
containing the parameters vector pi , a trail of mobil-                     For the sake of evaluation, our methodology was
ity traces M , the computed clustering C(pi , M ) as                     tested using the LifeMap dataset to check if the cho-
well as the quality metrics, such as Information loss                    sen parameters are optimal. We have tested the
(IL(C)) and the Davis-Bouldin index (DBI(C)).                            method with the seven users of LifeMap that have
Once all these values have been computed for each                        annotated manually their POIs. Consequently, for
evaluated set of parameters, two cases are possible.                     every set of settings of each clustering algorithm, we
In the first case, both IL and DBI agree on the same                     have computed the F-measure because we have the


                                                                    20
ground truth as depicted in Figure 2. The ”MAX”                   Dunn, J. C. (1973). A fuzzy relative of the ISO-
bar represents the best F-measure for the given user                DATA process and its use in detecting compact well-
and it is compare to the F-measures obtained when                   separated clusters. Journal of Cybernetics, 3(3):32–
using the ”DBI”, ”IL” or ”IL DBI” as indicators to                  57.
                                                                  Ester, M., peter Kriegel, H., Sander, J., and Xu, X.
choose the best input parameters configuration. Fi-
                                                                    (1996). A density-based algorithm for discovering
nally, this method has a satisfactory performance ex-               clusters in large spatial databases with noise. Knowl-
tracting a good number of POIs for maximizing the                   edge Discovery and Data Mining, 2(4):226–231.
F-measure achieving a difference of only 9% with                  Gambs, S., Killijian, M.-O., and Núñez del Prado Cortez,
respect to the F-measure computed from the data                     M. (2010).       GEPETO: A GEoPrivacy-Enhancing
with the ground truth.                                              TOolkit. In Advanced Information Networking and
                                                                    Applications Workshops, pages 1071–1076, Perth,
6   Conclusion                                                      Australia.
                                                                  Gambs, S., Killijian, M.-O., and Núñez del Prado Cortez,
In the current paper, we have presented a method                    M. (2011). Show me how you move and I will tell
to extract the a optimal number of POIs. Conse-                     you who you are. Transactions on Data Privacy,
quently, based on the method described in tis paper,                2(4):103–126.
we are able to find an appropriate number of POIs                 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2001).
relying only on the quality metrics of the extracted                On clustering validation techniques. Journal of Intel-
                                                                    ligent Information Systems, 17(3):107–145.
clusters and without the knowledge of the ground
                                                                  Hamerly, G. and Elkan, C. (2003). Learning the k in K-
truth. Nonetheless, we are aware of the small size                  means. In In Neural Information Processing Systems,
of dataset but the results encourage us to continue in              pages 1–8, Vancouver, Canada.
this direction.                                                   Hariharan, R. and Toyama, K. (2004). Project lachesis:
   Therefore, in the future we plan to test our method              Parsing and modeling location histories. Lecture notes
in a larger dataset and in presence of noise like                   in computer science - Geographic information science,
downsamplig or random distortion. Another idea is                   3(1):106–124.
to evaluate the impact of this method in more com-                Hillenmeyer, M. (2012).           Intra and inter clus-
                                                                    ter distance.       http://www.stanford.edu/
plex attacks like prediction of future locations or de-
anonymization to verify if this step can affect the                 ˜maureenh/quals/html/ml/node82.html.
                                                                  MacQueen, J. et al. (1967). Some methods for classifi-
global result of a chain of inference attacks.                      cation and analysis of multivariate mbservations. In
                                                                    Proceedings of the fifth Berkeley Symposium on Math-
                                                                    ematical Statistics and Probability, pages 281–297,
References                                                          Berkeley, CA, USA.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander,          Pham, D. T., Dimov, S. S., and Nguyen, C. D. (2005).
  J. (1999). Optics: ordering points to identify the clus-          Selection of K in K-means clustering. Journal of Me-
  tering structure. ACM SIGMOD Record, 28(2):49–60.                 chanical Engineering Science, 205(1):103–119.
Ben-David, S. and Ackerman, M. (2008). Measures of                Solé, M., Muntés-Mulero, V., and Nin, J. (2012). Effi-
  clustering quality: A working set of axioms for clus-             cient microaggregation techniques for large numerical
  tering. In Proceedings of the Annual Conference on                data volumes. International Journal of Information
  Neural Information Processing Systems, pages 121–                 Security - Special Issue: Supervisory control and data
  128, Vancouver, Canada.                                           acquisition, 11(4):253–267.
Chon, Y. and Cha, H. (2011). LifeMap: A smartphone-               Zhou, C., Frankowski, D., Ludford, P., Shekhar, S., and
  based context provider for location-based services.               Terveen, L. (2004). Discovering personal gazetteers:
  Pervasive Computing, IEEE, 10(2):58–67.                           an interactive clustering approach. In Proceedings
                                                                    of the annual ACM international workshop on Ge-
Chon, Y., Talipov, E., Shin, H., and Cha,
                                                                    ographic information systems, pages 266–273, New
  H. (2012).           CRAWDAD data set yon-
                                                                    York, NY, USA.
  sei/lifemap (v. 2012-01-03).       Downloaded from
  http://crawdad.cs.dartmouth.edu/yonsei/lifemap.
Davies, D. L. and Bouldin, D. W. (1979). A cluster sepa-
  ration measure. IEEE Transactions on Pattern Analy-
  sis and Machine Intelligence, PAMI-1(2):224–227.


                                                             21
              A Cloud-based Exploration of Open Data:
Promoting Transparency and Accountability of the Federal Government of
                             Australia

                          Edwin Salvador                                Richard Sinnott
                   Department of Computing and                    Department of Computing and
                       Information Systems                            Information Systems
                     University of Melbourne                        University of Melbourne
                  Melbourne, Australia     Melbourne, Australia
             edwin.salvador@epn.edu.au rsinnott@unimelb.edu.au

                                                              national security issues are not challenged.
                       Abstract                               According to the Open Government Data
                                                              definition ("Welcome to Open Government
      The Open Data movement has become more                  Data," 2014), there are three main benefits that
      popular since governments such as USA, UK,              governments can obtain by opening up their data:
      Australia and New Zealand decided to open               transparency, participation and collaboration.
      up much of their public information. Data is
                                                              Acquiring and processing the amount of data
      open if anyone is free to use, reuse and
      redistribute it. The main benefits that a               generated by Governments may lead to
      government can obtain from Open Data                    workloads that are beyond the capacity of a
      include transparency, participation and                 single computer. Fortunately, the emergence of
      collaboration. The aim of this research is to           new technologies, such as Cloud Computing,
      promote transparency and accountability of              makes it easier to scale the data processing
      the Federal Government of Australia by using            demands in a seamless and scalable manner
      Cloud-related technologies to transform a set           (Buyya, Yeo, Venugopal, Broberg, & Brandic,
      of publicly available data into human-friendly          2009). Whilst for some disciplines and domains
      visualizations in order to facilitate its               where finer grained security is an impediment to
      analysis. The datasets include details of
                                                              adoption of Cloud computing, e.g. medicine,
      politicians, parties, political opinions and
      government contracts among others. This
                                                              open data has by its very nature, no such
      paper describes the stages involved in                  impediments. Cloud computing also encourages
      transforming an extensive and diverse                   the creation of more innovative services
      collection of data to support effective                 including those based on processing and
      visualization that helps to highlight patterns          analyzing datasets         made     available   by
      in the datasets that would otherwise be                 governments. The sharing of technologies as
      difficult or impossible to identify.                    open source solutions also goes hand in hand
                                                              with open data initiatives.
  1    Introduction                                           The aim of this paper is to describe an approach
  In recent years, the Open Data movement has                 taken to leverage the benefits provided by Open
  become increasingly popular since the                       Data from the Australian government using
  governments of various countries such as USA,               Cloud-related      technologies     through    the
  UK, Australia, New Zealand, Ghana amongst                   Australian national cloud facility: National
  many others decided to open up (some of) their              eResearch Collaboration Tools and Resources
  data sets. In order to consider data as open, it            (NeCTAR – www.nectar.org.au) and specifically
  should ideally be available preferably online in            the NeCTAR Research Cloud. The paper begins
  formats that are easy to read by computers and              with a brief introduction to Open Data, providing
  anyone must be allowed to use, reuse and                    its definition, its benefits and also its potential
  redistribute it without any restriction (Dietrich et        disadvantages. We then describe the advantages
  al., 2012). Furthermore, in the Open Data                   of using Cloud Computing to deal with Open
  Handbook (Dietrich et al., 2012), the authors               Data. The details of the approach taken to
  state that most of the data generated by                    harvest, clean and store publicly available data
  governments are public data by law, and                     from Australian government resources followed
  therefore they should be made available for                 by their analyses and visualizations of these
  others to use where privacy of citizens and                 datasets is given. Finally, the paper concludes by


                                                         22
pointing out the importance of Open Government             and analyze these data. However, according to
Data and the role of Cloud Computing to                    (Davies, 2010; Dietrich et al., 2012; Lathrop &
leverage the benefits offered by Open Data. It is          Ruma, 2010; Robinson, Yu, Zeller, & Felten,
emphasized that there is no causality implied in           2008), most of the data collected by government
this paper regarding the analysis of the data              is public by law and therefore, it should be made
offered. However we strongly believe that open             open and available for everyone to use. In some
discussions about causality are an essential               cases, when governments have failed to make
element in the transparency of Government more             data easily accessible, citizens have had to find
generally.                                                 alternative ways to harvest and process these
                                                           data to give it a meaningful use. A well-known
2    Open Data                                             case is the portal GovTrack.us which was
                                                           launched in 2004 by a student who harvested a
2.1 Definition
                                                           set of government data and published it in more
The Open Knowledge Foundation defines Open                 accessible formats. This kind of initiatives have
Data as ‘any data that can be freely used, reused          influenced in governments’ decisions to make
and redistributed by anyone – subject only, at             government data publicly available (Brito, 2007;
most, to the requirement of attribute and/or               Hogge, 2010; Lathrop & Ruma, 2010). It should
share-alike’ (Doctorow et al., 2014). We                   be noted also that government does not always
emphasize two important conditions that are not            share data effectively across its own departments
clearly mentioned in this short definition. First,         – here the data includes both open and non-open
data can be considered as open if it is easily             data. The government departments of
accessible which means that data should be                 immigration, employment, education, health,
available on the Internet and in formats that are          transport, etc. all have subsets of the total
machine readable. Second, the terms reuse and              “government” data, but the use of this data in
redistribute include the possibility of intermixing        integrated frameworks by government is
two or more datasets in order to discover                  currently lacking.
relations that would not be visible when having               Since 2009, various countries have started
the datasets separated. The full definition                Open Data initiatives by launching portals in
provided by the Open Knowledge Foundation                  which they publish government datasets to be
(Doctorow et al., 2014) gives further details of           downloaded by anyone. Among these countries
the conditions that should be satisfied by data to         are the USA (data.gov), the UK (data.gov.uk),
be considered as open. The final purpose of all            Australia (data.gov.au), Ghana (data.gov.gh) and
these conditions required by Open Data is to               New Zealand (data.govt.nz). These sources of
ensure the potential interoperability of datasets,         data are useful but do not include the tools to
i.e. it is possible to combine any number of these         compare all of the data sets in any meaningful
datasets and subsequently identify their inter-            manner. Instead they are typically large
relationships. Ideally this should be part of a            collections of documents and individual
larger system as opposed say to having many                (distinct) data sets. Often they are available as
individual data sets (e.g. spreadsheets). The true         spreadsheets, CSV files with no means for direct
power of Open Data is derived from the                     comparison or analysis across the data sets.
analytical tools and capabilities used to identify
patterns that would otherwise remain hidden                2.3 Benefits
across multiple, diverse data sets.                        Many authors (Brito, 2007; Davies, 2010;
                                                           Dietrich et al., 2012; Hogge, 2010; Lathrop &
2.2 Open Government Data
                                                           Ruma, 2010; Robinson et al., 2008) agree about
Governments are constantly gathering data from             the benefits that can be obtained by governments
many types of sources: the population, taxes,              when they decide to open up their data, namely:
quality of life indicators and indeed anything that        transparency, participation and collaboration.
could help the government to monitor and                   These benefits are directly derived from the
improve the management and governance of                   corresponding Open Data requirements: freedom
their country. Historically, only governmental             of use, reuse and redistribution. In this context,
entities (departments) have had access to process          the fact that anyone is free to use government


                                                      23
data leads to an increment in government                    2010).
transparency. Hogge (2010), in her study
mentions that transparency is not only about                2.4 Barriers
citizens trying to find irregularities in                   According to (Davies, 2010; Lathrop & Ruma,
government actions, it is also about citizens               2010), transparency should not be focused only
constantly monitoring their governments’                    on the accountability and transparency of
activities and providing feedback to improve                government. In fact, this could generate an
processes and public services, and according to             excessive attention to government’s mistakes and
the Open Government Data definition                         consequently, create an image of government as
("Welcome to Open Government Data," 2014),                  corrupt. This is clearly a reason why
this is what defines a well-functioning                     governments might not want to open up their
democratic society.                                         data. However, the authors state that instead of
   Open Data not only requires data to be                   reducing transparency, this problem could be
accessible, but it requires the freedom to reuse            addressed by creating a culture of transparency
these data for different purposes. This allows              that not only judges when public entities behave
citizens to combine two or more datasets to                 badly, but a culture that is also capable to register
create mash-ups and highlight potentially hidden            approval when governments successfully solve
relations between different datasets (Brito, 2007;          public problems or deliver services in a cost
Davies, 2010; Lathrop & Ruma, 2010). This                   effective manner.
improves the participation of citizens from                    Furthermore, many governments and indeed
different fields such as developers, scientists and         individuals are concerned about the privacy of
indeed journalists. This is particularly important          citizens. Although, it is possible to anonymize
to governments since citizens can experiment in             datasets before they are made publicly available,
the creation of new services based on                       it requires considerable time, effort and expense
government data and the government is                       of public workers and sometimes it is not
subsequently able to evaluate the most useful               possible to guarantee that the data will be fully
services and where appropriate shape future                 anonymized (Lathrop & Ruma, 2010). For this
policy based on new knowledge. This has the                 reason, some governments prefer to keep the data
added value of encouraging the participation of             private. However it is the case that often terms
more citizens in government activities and                  such as protecting national security or citizen
increases the number of new services that could             privacy are used as a blanket to deny access to
benefit the government.                                     many other data sets that are not contentious.
   The third key benefit of Open Data is                       Additional barriers that stop governments
collaboration which is directly derived from the            making data publicly available is the fact that
freedom of users to redistribute government data,           many data sets are stored on older forms of data
e.g. combining two or more datasets for a                   storage media such as paper files and proprietary
specific purpose and making the resulting dataset           databases which do not allow for easy extraction
available for others to use. In this way, citizens          and publication. Furthermore open data also
are collaborating with each other while they are            requires appropriate (rich) metadata to describe
contributing to the government by creating                  it: the context in which it was collected, by
services and solving problems. In some cases,               whom and when. In some cases, this additional
this model of combining data sets to develop                information is not directly available.
new, targeted solutions has spurned a range of
start-ups and industries, e.g. San Francisco and            2.5      Disadvantages
the        Civic        Innovation        activities        Data can be open to misinterpretation, which can
(http://innovatesf.com/category/open-data/)                 subsequently generate civic confusion and extra
   Although the process of making data publicly             problems for governments. For instance,
available can be seen as laborious and cost                 (Lathrop & Ruma, 2010) mentions a case where
intensive to the government agencies involved, it           people correlated locations of crimes in a city
brings further economic benefits to governments             with the number of immigrants in that location
since it will improve the participation of people           and make conclusions like “This is a high crime
in the creation of innovative services (Hogge,              neighborhood because many immigrants live


                                                       24
here”. Something which is not necessarily true as           particularly important when working with
many other aspects must be taken into account to            government data as they can become quite
determine the reasons of high levels of crimes in           voluminous, they can change over time, they
a location.                                                 require veracity of information to be checks, and
Another disadvantage of publicly available data             when comparisons and analyses are made
is for the potential for it to be manipulated with          between data sets these can result in
the intention of satisfying personal interests. This        computationally expensive requirements. Cloud
is difficult for a government to control and could          Computing is especially suited to this
be problematic since people often do not always             environment since it is possible to scale out
verify data before making conclusions. Key to               resources to satisfy needs and (in principle) pay
tackling this is the spirit of open data: it should         for those extra resources only for the time that
be possible to verify or refute the conclusions             are actually being used. This is convenient
that are drawn by access to the original data sets.         specially for people considered ‘civil hackers’
Thus for any data that is accessed (harvested) it           who create services based on government data
should always be possible to go back to the                 and most often without financial reward (Davies,
original (definitive) sources of the data (since it         2010; Hogge, 2010). This contributes to the
is open).                                                   emergence of new questions and reduces the
                                                            time needed to answer these questions, which
3    Cloud Computing                                        encourages people to collect more data and
Open data benefits greatly by access to open data           create more innovative services.
processing platforms. Cloud computing offers                   The Cloud provider utilized here is the
one approach that is directly suited to the                 NeCTAR Research Cloud, which is an
processing of open data. The National Institute of          Australian government funded project that offers
Standards and Technology (NIST) (Mell &                     an IaaS platform with free access to Australian
Grance, 2011), points out five essential                    academics, or more precisely members of
characteristics that define the Cloud Computing             organizations subscribed to the Australian
model: on-demand self-service, broad network                Access Federation (AAF – www.aaf.edu.au)
access, resource pooling, rapid elasticity, and             such as the University of Melbourne. This work
measured service. In order to adapt to different            utilised two virtual machines (VMs) each with 2
types of users, Cloud providers offer three levels          cores, 8GB RAM and 100GB of storage. While
of abstraction: Software as a Service (SaaS) with           the VMs were located in different zones, both
examples being Salesforce’s CRM and Google                  have the same architecture (Figure 1). This
Docs; Platform as a Service (PaaS) with                     allowed them to act as master at any time
examples being Microsoft Azure and Google                   providing high availability to the system.
App Engine, and Infrastructure as a Service
(IaaS) with examples being Amazon EC2,
Rackspace and the Australian NeCTAR
Research Cloud. There are also many different
Cloud deployment models: Private Clouds,
Public Clouds, Community Clouds, and Hybrid
Clouds (Mell & Grance, 2011; Sriram & Khajeh-
Hosseini, 2010; Velte, Velte, & Elsenpeter,
2009; Zhang, Cheng, & Boutaba, 2010). Ideally
open data should be processed on open Clouds
and the applications and interpretation of the data
utilizing open sources data models for complete
transparency of the data and the associated data
processing pipelines.
   One of the main reasons for the success of
Cloud computing is the capacity to rapidly scale
up or scale down on demand, at an affordable
cost and ideally in an automated fashion. This is                        Figure 1. Architecture.


                                                       25
4   Implementation     of  the    Open                    that drove and shaped the work were based on:
    Government Data Access, Process and                   political donations, election results, historical
    Visualise Platform                                    contracts data and political speeches. These data
                                                          were selected following with researchers at the
4.1 Data Layer                                            Centre for Advanced Data Journalism at the
The key focus of the work is on access to and use         University of Melbourne. The Data Layer itself
of open government data. A Data Layer that                was divided into three stages: data harvesting,
harvested and processed these data was key to             data cleansing and data storage which are
this. This layer was responsible for dealing with         described here.
raw data coming from external sources. The data
                                                          Data Harvesting
sets that were harvested and used as the basis for
the work described here included:                         It should be noted that most of the datasets that
 Australian Electoral Commission                         were harvested satisfy the requirements of Open
     (www.aec.gov.au)                                     Data, i.e. they are downloadable and are
     o Annual Returns (2003 - 2013) (includes:            provided in machine-readable formats such as
         party returns, political donations,              CSV and XML. It is also noted that there are
         Associated Entities, Political                   other data that do not satisfy all of these
         Expenditure)                                     requirements. For instance, the lobbyist registers
     o Election Returns (2003 -                           for all the Australian States are available only in
         2013) (includes: donations to candidates,        HTML (via web pages). In this case, it was
         donors details, senate groups)                   necessary to implement web scrapers for
     o Election Results (2004 - 2010) (includes:          webpages to extract the data and then store it in
         Details of Federal election results              the database. This technique is inefficient and
         divided in general, house and senate)            has several disadvantages for how data can be
     o Federal electoral boundary GIS data                released as open data and subsequently used and
         (Census 2011)                                    interpreted. Firstly, it is error prone because a
 Portal data.gov.au                                      scraper may assume that a webpage follows a
     o Historical Australian Government                   standard but there is the possibility of mistakes in
         Contract Data (1999 - 2013)                      the scraped HTML, which would cause the
     o Members of Parliament webpages and                 scraper to obtain erroneous data. Furthermore, it
         social networks                                  is a tedious task since it is almost impossible to
     o Portfolio of Responsibilities                      build a scraper that works with many webpages
 Parliament of Australia                                 as different sites use completely different
     (www.aph.gov.au/Parliamentary_Business/H             designs. Lastly, the design of a webpage can
     ansard)                                              change without any notice, which would render a
     o House Hansard                                      given scraper totally useless and require a new
     o Senate Hansard                                     scraper to be produced. Nevertheless, it is an
 Lobbyists Details                                       effective technique when used carefully and after
     o Australian Government                              ensuring that all data obtained is verified before
         (www.lobbyists.pmc.gov.au)                       performing further analyses and interpretations.
     o Victoria                                           The information should also include metadata on
         (www.lobbyistsregister.vic.gov.au)               when the data was accessed and scraped.
     o Queensland                                            Additionally, even when data is made
         (www.lobbyists.integrity.qld.gov.au)             available in a more accessible (downloadable)
     o Western Australia                                  format, further work is often required. For
         (www.lobbyists.wa.gov.au)                        example, despite the fact that the Hansard
     o Tasmania                                           political speeches of Australia are provided as
         (www.lobbyists.dpac.tas.gov.au)                  downloadable XML files, there is no way to
     o New South Wales                                    download the whole collection of speeches or the
         (www.dpc.nsw.gov.au)                             possibility of selecting a range of speech dates
     o South Australia (www.dpc.sa.gov.au)                that could be downloaded. Consequently, it is
  The analyses and visualizations of these data           often necessary to download one file at a time,


                                                     26
which makes the process inefficient taking into             Data storage
account that there are thousands of files. As a             Due to the variety of sources and the lack of a
result, whilst the data is open, the way in which it        well-defined schema, CouchDB was selected as
is made available is not really conducive to                an appropriate database to store all the harvested
further processing without computational                    data. CouchDB is a schema-free NoSQL and
approaches to overcome these limitations, e.g.              document-oriented        database        (Anderson,
implementing processes to download all of the               Lehnardt, & Slater, 2010). It stores its documents
XML files.                                                  as JSON objects. In this model, each row of each
   It is recognized that the difficulties faced in          dataset was stored as an independent JSON
harvesting public data are understandable since             document adding an extra field “type”, and in
many governments (including the Australian                  some cases “subtype”, in order to facilitate the
government) are still in the process of opening             exploration of different datasets in the database.
their datasets and learning about how best to do              Although both VMs were set up to act as a
this. These lessons are often spotlighted through           master at any given time, in this stage one VM
important initiatives such as organized events              could be considered as master and the other one
used to receive feedback from data enthusiasts              as slave because only one of them could harvest
about how to improve the available datasets or              data from external sources at a time while it
which new datasets could be made available. For             replicated all the new data to the other. CouchDB
instance, GovHack is an annual event which                  provides strong replication processes that allow
brings together people from government,                     setting up a bi-directional replication between the
industry, academia and general public to                    databases in each VM. This allowed having both
experiment with government data and encourage               databases up to date while only one of them was
open government and open data in Australia.                 harvesting data.
Additionally, there exist various open data
portals around Australia including the national             4.2 Analysis Layer
portal data.gov.au, portals for every State such as         In addition to the flexibility provided by
the http://data.nsw.gov.au and even some cities
                                                            CouchDB to store schema-free documents, one
like Melbourne have launched their own open                 of the main reasons to choose this database was
data                   portals,                 e.g.        its support for MapReduce based views.
https://data.melbourne.vic.gov.au/.                         MapReduce is one of most effective approaches
Data Cleansing                                              to deal with large-scale data problems and allows
                                                            to separate what computations are performed and
Every dataset collected will typically contain              how those computations are performed (Buyya et
some extra and/or useless data that needs to be             al., 2009; Dean & Ghemawat, 2008; Ekanayake,
removed in order to improve the quality of data             Pallickara, & Fox, 2008; Lin & Dyer, 2010;
and increase the consistency between different              Segaran & Hammerbacher, 2009; White, 2012).
datasets allowing them to be combined and                   Therefore, to analyze the data the developer only
interpreted more easily. To aid in data                     needs to focus on the first part which consists on
consistency, datasets from different formats such           writing two functions: a map function and a
as CSV or XML were converted to JavaScript                  reduce function. The run-time system handles
Object Notation (JSON) objects. Although, this              how those computations are performed by
process was simple, there were some difficulties            managing         failures,     schedules       and
to overcome in specific datasets. For instance,             intercommunication. The complexity of map and
the XML files of the Hansard political speeches             reduce functions can be diverse and depends on
have different structures over different time               the type of analysis to be performed on the data.
periods, which made the process of parsing the                 Furthermore, CouchDB documents and views
whole collection more complex. However, it was              are indexed using a B-Trees data structures,
possible to find certain levels of consistency in           which are very efficient for storing large
most of the datasets, which allowed use of                  amounts of data (Anderson et al., 2010; Bayer,
Python scripts to convert hundreds of datasets              1997). The index for a view is created only the
and then store them in the database.                        first time that the view is queried and allows to
                                                            retrieve large amount of data very quickly. In


                                                       27
order reflect the current state of the database, the        spite of the document storage capabilities of
index of a view only needs to introduce the                 ElasticSearch, it is mainly used as an extension
documents that have changed. Although this                  for NoSQL databases thanks to a range of
process is very efficient, it can introduce high            available plugins. For instance, it offers the
latency to queries when a large amount of                   possibility of indexing any CouchDB database
documents have changed (Anderson et al., 2010).             through a plugin that listens to the changes API
This is a common problem faced by applications              of CouchDB making the database searchable and
where documents in the database tend to be                  allowing to perform more complex queries and
updated frequently. However, since the type of              more complete analyses of the data (Gormley &
data used in this project is largely historical and         Tong, 2014). Using CouchDB in conjunction
not changing dynamically, CouchDB views were                with ElasticSearch allows taking advantage of
used successfully.                                          the most important features provided by each
   Most of the data analyses where performed                technology, namely durability and advanced
using CouchDB views, these analyses included                search capabilities respectively. The number of
political donations over time, data aggregation of          features offered by ElasticSearch is vast and
donations such as retrieving the largest donation           more details can be found in (Gormley & Tong,
in certain election period and correlation between          2014).
different datasets, for instance, donations vs                 ElasticSearch was also useful to build
votes received by a political party. However,               visualizations that correlate the political
there were some cases where it was not possible             donations and the government contracts by
to perform more complex analyses using only                 searching all the occurrences of the donors’
CouchDB views. For example, despite the fact                names in the dataset of government contracts.
that CouchDB owes many of its advantages to B-              This was done through the Python API client
Trees, it also inherits one of its main drawbacks           provided by ElasticSearch and a Python script
which is the inability to perform multi-                    which returned a list of donor names that
dimensional queries (Bayer, 1997). In other                 appeared in the government contracts dataset
words, CouchDB views are excellent to process               indicating the field where it was found, this
queries such as the sum of donations received in            helped to show the correlation of both datasets in
the year 2004 (point queries) or the sum of                 a timeline.
donations received between 2004 and 2010
(range queries). However, for multi-dimensional             4.3 Presentation Layer
queries such as the sum of donations received by            Visualisation is essential when dealing with
a candidate from a specific donor in 2004 (4-               large-scale heterogeneous data sets. Indeed all of
dimensional), there were challenges that required           the data analyses would have limited value if it
support for other data processing capabilities.             were not possible to visualize them in a human-
For this kind of query it was required to provide           friendly way. This is especially important in
a visualization that showed an overview of the              open government data initiatives where the focus
political donations. This visualization was                 is less on the detailed scientific model of
required to group, color and filter donations in            discovery and more on the big picture questions
multiple ways and shows a summary for every                 that can be illustrated through the data itself. The
group of donations. The summary includes the                presentation layer was based mainly in
total sum of donations, the number of donations             JavaScript using the D3.js library, Google Charts
in that group, details of the largest donation and          API and jQuery. In this section we illustrate
the top 3 donations received by candidates,                 some of the visualizations for the analyses
parties and States. In order to solve this multi-           mentioned previously.
dimensional      query limitation,        CouchDB              Figure 2 shows an example of the multiple
functionalities were extended with ElasticSearch.           ways of visualizing political donations through
   ElasticSearch is a distributed search engine             one of the visualizations that were built. Each
built on top of Apache Lucene, which among                  bubble in the figure represents a donation
other features provides full text search                    received by a candidate, the size of the bubble
capabilities whilst hiding the complexity of                represents the amount of money donated, the
Lucene behind a simple and coherent API. In                 color in this case, represents a political party and


                                                       28
each group of bubbles is an election period. This          group (including the main title) are clickable and
is an interactive visualization so, donations could        they contain the summary for every group of
be grouped, colored and filtered by all the                donations and the main title contains the
features contained in the dataset which include            summary for all the four groups.
election period, candidate, party, electorate,
electorate state, donor, donor state, donor suburb,
and nil return. Furthermore, the labels for each


                                   Figure 2. Overview of Donations.

   This visualization facilitates a detailed
exploration of the donations dataset and could be
considered as a starting point for further
analyses.
   Another way of visualizing the donations is on
a timeline as exposed in Figure 3. This shows the
total number of donations received by date.
Something interesting to point out here is how
we can see that most of the peaks are in the years
2004, 2007 and 2010, years in which federal
elections have taken place. This pattern of
                                                           Figure 4. Individual Donations made by a given
donations increasing in election years is also
                                                                           entity over time.
visible when looking at donations made by
individual entities. Figure 4 illustrates all the            An additional scenario of interest is the
donations made by an entity over time and                  correlation between political donations and
highlights the tendency of making more                     government contracts, i.e. grants/award made to
donations in election years.                               entities (most often companies). With the results
                                                           obtained from the ElasticSearch analysis
                                                           described in the previous section, donations and
                                                           contracts were displayed in a timeline to explore
                                                           whether the number of donations or the amount
                                                           of money donated by an entity influenced (was
                                                           correlated with) the number and the value of
                                                           contracts that they subsequently         obtained.
                                                           Figure 5 shows this scenario for a specific entity.
                                                             It can be seen that there are several cases
  Figure 3. Summation of Political Donations               where contracts are obtained right before or after
           Visualised over Timeline.


                                                      29
some donations have been made. In addition to              both terms tend to be used in the same dates.
the graph showed in Figure 5, this visualization           This visualization is useful to get an idea of what
also provides the details of the donations and             topics are being discussed by the members of the
contracts related with the entity being analyzed.          parliament in different periods.
Thus one can see the persons involved as well as
political parties and governmental agencies and
try to find more patterns to perform further
investigations. For instance, a next step might be
to investigate who is on the board of the
companies making donations and to see if there
exists a direct or indirect relation with the
governmental agency that is offering the
contract. It is emphasized that this is only one of        Figure 6. Political Speeches: correlation of words
the many scenarios that can be visualized with                               used over time.
this analysis and there did not exist a clear
correlation between the two datasets in many of               An additional visualization using word clouds
the cases. However, this specific scenario helps           (Figure 7) was implemented to explore the most
us to demonstrate how mash-ups highlight                   popular terms used by politicians in their
hidden relations between apparently unrelated              speeches. This visualization allows to see a
datasets. For transparency of government it is             general word cloud for all the politicians in the
important to ensure that where major grants are            collection of speeches and provides the
awarded, independent review of political                   possibility of filtering by year, month and day as
donations prior to the award can be scrutinized to         well as selecting up to three politicians to show a
ensure independence of government.                         comparison of the terms used by each of them
                                                           over time. These word clouds provide a simple
                                                           overview of the topics discussed by each
                                                           politician. For instance, in Figure 7 the word
                                                           cloud on the left belongs to the Shadow Minister
                                                           for Transport and Infrastructure and so we can
                                                           see that the most popular words are highly
                                                           related to this charge such as “infrastructure”,
                                                           “transport”, “highway”, “safety”, and “airport”.
                                                           The word cloud on the right shows the words
                                                           used by the Prime Minister in his speeches in
                                                           May 2014 which is the month when the federal
                                                           budget was presented to the parliament. In this
                                                           case, we can see that the words “budget”,
                                                           “deficit”, “spending”, and “tax” are amongst the
                                                           most popular ones. This demonstrates that word
                                                           clouds give us an idea of the topics that are dealt
                                                           in parliament in different periods of time by
                                                           different politicians. The combination of terms
Figure 5. Colour-coded Correlation of Donations            used in speeches and decisions made in award of
    (A, B, E, F, Q-W, Y, Z) vs Government                  contracts are also essential to correlate, e.g.
       Contracts/Awards (C, D, G, H, X).                   speeches about the important of the Australian
                                                           car industry should not be correlated/associated
  A further visualization is illustrated in Figure         with political donations from car manufacturers
6, which shows the correlation of terms used in            for example if government is to be truly
political speeches over time. The figure                   transparent and ultimately accountable for the
demonstrate the correlation between the terms              independence of the decisions it makes.
“boats” and “immigration” and it indicates how


                                                      30
                   Figure 7. Word clouds comparing terms used by two politicians.

                                                           given government activity – this is the
5    Conclusions                                           responsibility of others, e.g. investigative
This paper presents an introduction to Open Data           journalists. However for truly democratic and
and points out how it could help governments to            transparent governments it is essential that the
improve transparency and accountability.                   data can be reviewed and analysed and stand up
Furthermore, it describes some reasons why                 to public scrutiny. We strongly encourage this
governments refuse to engage in Open Data                  endeavor. All of the software and systems used
initiatives as well as the existing disadvantages          in this work are also available. The existing
encountered if they are not managed correctly.             prototype      system       is  available     at
The work described how and why Cloud                       http://130.56.249.15/proj/.
Computing provide an appropriate environment
                                                           Acknowledgments
for working with Open Data and identified and
presented one of the many approaches that can                The authors are thankful to the feedback and
be taken to set up this environment and the                discussions that have shaped this work.
associated technologies involved. It also                  Specifically we would like to thank Prof.
identified some of the common challenges faced             Margaret Simons (Director of the Centre for
by projects that deal with publicly available data         Advanced Journalism at the University of
and the methods used to overcome these                     Melbourne).
challenges. Moreover, it showed multiple ways
of visualizing data and how different datasets             References
could be correlated to explore a portfolio of              Anderson, J. C., Lehnardt, J., & Slater, N. (2010). CouchDB: the
government data that is openly available on the                definitive guide: O'Reilly Media, Inc.
                                                           Bayer, R. (1997). The universal B-tree for multidimensional
web.                                                           indexing: General concepts Worldwide Computing and Its
  This work has many refinements that are                      Applications (pp. 198-209): Springer.
                                                           Brito, J. (2007). Hack, mash & peer: Crowdsourcing government
currently ongoing. Incorporation of further data,              transparency.
e.g.     membership        of    companies       by        Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., & Brandic, I.
politicians/their families/associates, as well as              (2009). Cloud computing and emerging IT platforms: Vision,
                                                               hype, and reality for delivering computing as the 5th utility.
exploring social media use. The use of Twitter in              Future Generation computer systems, 25(6), 599-616.
particular offers a rich source of Open Data that          Davies, T. (2010). Open data, democracy and public sector reform.
                                                               A look at open government data use from data. gov. uk. Über:
can be accessed and used to help promote the                   http://practicalparticipation.co.uk/odi/report/wp-
overall information of government. Who is                      content/uploads/2010/08/How-is-open-governmentdata-being-
                                                               used-in-practice. pdf.
following whom on Twitter; who tweets on what              Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data
topics; what is their sentiment on particular                  processing on large clusters. Communications of the ACM,
                                                               51(1), 107-113.
topics and how does this change over time are all          Dietrich, D., Gray, J., McNamara, T., Poikola, A., Pollock, R., Tait,
on-going activities that are being pursued.                    J., et al. (2012). The Open Data Handbook. 2014, from
  In all of this, it is emphasized that the purpose            http://opendatahandbook.org/en/
                                                           Doctorow, C., Suber, P., Hubbard, T., Murray-Rust, P., Walsh, J.,
of this work is not to draw conclusions on any


                                                      31
    Tsiavos, P., et al. (2014). The Open Definition. 2014, from           Mell, P., & Grance, T. (2011). The NIST Definition of Cloud
    http://opendefinition.org/                                                 Computing.
Ekanayake, J., Pallickara, S., & Fox, G. (2008). Mapreduce for            Robinson, D., Yu, H., Zeller, W. P., & Felten, E. W. (2008).
    data intensive scientific analyses. Paper presented at the                 Government data and the invisible hand. Yale JL & Tech., 11,
    eScience, 2008. eScience'08. IEEE Fourth International                     159.
    Conference on.                                                        Segaran, T., & Hammerbacher, J. (2009). Beautiful data: the stories
Gormley, C., & Tong, Z. (2014). Elasticsearch: The Definitive                  behind elegant data solutions: O'Reilly Media, Inc.
    Guide: O'Reilly Media, Inc.                                           Sriram, I., & Khajeh-Hosseini, A. (2010). Research agenda in cloud
Hogge, B. (2010). Open data study. a report commissioned by the                technologies. arXiv preprint arXiv:1001.3259.
    Transparency and Accountability Initiative, available for             Velte, T., Velte, A., & Elsenpeter, R. (2009). Cloud computing, a
    download                                                   at:             practical approach: McGraw-Hill, Inc.
    http://www.soros.org/initiatives/information/focus/communicat         Welcome to Open Government Data. (2014). 2014, from
    ion/articles_publications/publications/open-data-study-                    http://opengovernmentdata.org/
    20100519.                                                             White, T. (2012). Hadoop: The definitive guide: " O'Reilly Media,
Lathrop, D., & Ruma, L. (2010). Open government: Collaboration,                Inc.".
    transparency, and participation in practice: O'Reilly Media,          Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud computing:
    Inc.                                                                       state-of-the-art and research challenges. Journal of internet
Lin, J., & Dyer, C. (2010). Data-Intensive Text Processing with                services and applications, 1(1), 7-18.
    MapReduce: Morgan and Claypool Publishers.


                                                                     32
              Discovery of Sequential Patterns with Quantity Factors

        Karim Guevara Puente de la Vega                                    Cesar Beltrán Castañón
     Universidad Católica de Santa María /Arequipa                           Departamento de Ingeniería
    Universidad Nacional de San Agustín / Arequipa                Pontificia Universidad Católica del Perú / Lima
            kguevara@ucsm.edu.pe                                            cbeltran@pucp.pe

                                                                 In addition, the algorithms for mining sequen-
                        Abstract                              tial patterns of dealing with the previous sequen-
                                                              tial patterns in a uniform manner, despite the fact
      The sequential pattern mining stems from the            that these patterns individually in a sequence can
      need to obtain patterns that are repeated in            have important differences such as the associated
      multiple transactions in a database of se-              amount to each item that make up each pattern.
      quences, which are related to time, or another             For the foregoing reasons, in the present paper
      type of criterion. This work presents the pro-
                                                              proposes a technique by which it is intended to
      posal of a new technique for the discovery of
      sequential patterns from a database of se-              exploit these intrinsic relationships of the se-
      quences, where the patterns not only provide            quential patterns, in this specific case the rela-
      information on how these relate to the time,            tionship to the amount of each of the items. The
      but also, that in the mining process itself             inclusion of this aspect in the sequential pattern
      should be included the quantity factors asso-           mining, you can afford to get a set of sequential
      ciated with each of the items that are part of a        patterns that are not only common but also let us
      sequence, and as a result of this process can           know how these amounts associated with each
      be obtain information relating to how they re-          item that is included in a sequential pattern fre-
      late these items with regard to the amounts             quent relates. The inclusion of the restriction of
      associated. The proposed algorithm uses di-
                                                              quantity within the extraction process of the fre-
      vide and conquer techniques, as well as in-
      dexing and partitioning of the database.                quent sequential patterns we could provide in-
                                                              formation much more meaningful.
1      Credits                                                   The article is organized as follows: Section 2
                                                              is on the previous work. Section 3 gives a de-
This document was written as part of the devel-               scription of the problem. Section 4 introduces the
opment of the 1st Symposium on Information                    technical proposal. Section 5 shows the experi-
Management and Big Data, SIMBig 2014. It has                  ments and results. The conclusions and future
been adapted from the instructions for earlier                work are shown in section 6 and finally the ref-
ACL.                                                          erences.

2      Introduction                                           3     Previous works
   The sequential pattern mining is the process                  The techniques of discovery of association
by which you get the relationships between oc-                rules are essentially boolean, due to which are
currences of sequential events, to find if there is           discarded the quantities of the items purchased
a specific order in which these events occur. In              and only pay attention to if something was pur-
relation to this area of study there are many in-             chased or not. An important area of study is the
vestigations, all of them makes use of the re-                sequential pattern mining that involves the ex-
striction of minimal support, some include other              traction of patterns that are repeated in multiple
restrictions, such as for example the time interval           transactions in a transactional database, which
in which it is required that the events happen,               are related to time or another type of sequence.
also the use of taxonomies as defined by the user,               The problem of the sequential pattern mining
and the fact of allowing the items in a sequence              was introduced by Agrawal and Srikant (1995)
not necessarily must have occurred in a single                set the example of the typical income of clients
transaction, but could be in two or more, always              in a rental shop videos. Customers often rent
and when their times of each of these transac-                "Star Wars", then "Empire Strikes Back" and
tions is within some small window of time de-                 then "Return of the Jedi". All these incomes not
termined by the user.                                         necessarily should have been made consecutive-
                                                              ly, that is to say, there could be customers that


                                                         33
leased any other video in the middle of the pre-            means of task execution of projection of the ini-
vious sequence, so that these sequences of trans-           tial data base and the obtaining of patterns, with-
actions also fall into the same pattern.                    out involving a process of lead generation. Using
   The researches on mining sequential patterns             this technique and approach under the divide and
are based on events that took place in an orderly           rule, Han et al. (1996) proposed the algorithm
fashion at the time.                                        FreeSpan (Frecuent Pattern-Project Sequential
   Most of the implemented algorithms for the               Pattern mining), and Pei et al. (2001) proposes
extraction of frequent sequences, using three dif-          PrefixSpan (Prefix-projected Sequential Pattern
ferent types of approaches according to the form            mining). In these algorithms the database of se-
of evaluating the support of the candidate se-              quences is projected recursively in a set of small
quential patterns. The first group of algorithms is         databases from which the fragments of sub se-
based on the ownership apriori, introduced by               quences grow based on the current set of fre-
Agrawal and Srikant (1994) in the mining of as-             quent sequences, where the patterns are extract-
sociation rules. This property suggests that any            ed.
sub pattern from a frequent pattern is also fre-               Han et al. (1996)] show that FreeSpan extracts
quent, allowing pruning sequences candidates                the full set of patterns and is more efficient and
during the process of lead generation. Based on             considerably faster than the algorithm GSP.
this heuristics, Agrawal and Srikant (1995) pro-            However, a sub sequence can be generated by
posed algorithms as the AprioriAll and Apri-                the combinations of sub strings in a sequence,
oriSome. The substantial difference between the-            while the projection in FreeSpan must follow the
se two algorithms is that the AprioriAll generates          sequence in the initial database without reducing
the candidates from all the large sequences                 the length. In addition, it is very expensive the
found, but that might not be lowest panning val-            fact that the growths of a sub sequence it will be
ues, however, AprioriSome only counts those                 explored in any point of the division within a
sequences that are large but lowest panning val-            candidate sequence. As an alternative to this
ues, thus reducing the search space of the pat-             problem, Pei (2001) proposes PrefixSpan. The
terns.                                                      general idea is to examine only the prefixes for
   In subsequent work, Srikant and Agrawal                  the sub project only sequences and their corre-
(1996) propose the same algorithm GSP (Gener-               sponding sub sequences postfijas within data-
alization Sequential Patterns), also based on the           bases planned. In each of these databases
technical apriori, surpassing previous in 20 mag-           planned, it will find the sequential patterns ex-
nitudes of time. Until this time, the algorithms            panded exploration only local patterns frequent-
that had been proposed for mining sequential                ly. PrefixSpan extracts the full set of patterns and
patterns focused on obtaining patterns taking into          their efficiency and implementation are consid-
account only the minimal support given by the               erably better both GSP and FreeSpan.
user. But these patterns could fit into transactions           The third group is formed by algorithms that
that had been given at intervals of time very dis-          kept in memory only information necessary for
tant, which was not convenient for the purposes             the evaluation of the bracket. These algorithms
of mining. So, in this paper, we propose the idea           are based on the calls of occurrence lists that
that in addition to the minimal support, the user           contain the description of the location where the
could be in the ability to specify your interest in         patterns occur in the database. Under this ap-
obtaining patterns that fit into transactions that          proach, Zaki (2001) proposes the SPADE algo-
have been given in certain periods of time, and             rithm (Sequential Pattern Discovery using
this is made from the inclusion of restrictions on          Equivalence classes) where he introduces the
the maximum and minimum distance, the size of               technical processing of the data base to vertical
the window in which the sequences and the in-               format, in addition there is a difference from the
heritance relationships - taxonomies, which are             algorithms based on apriori, it does not perform
cross-relations through a hierarchy.                        multiple passes on the database, and you can ex-
   In these algorithms based on the principle of            tract all the frequent sequences in only three
apriori, the greater effort focused on developing           passes. This is due to the incorporation of new
specific structures that allow sequential patterns          techniques and concepts such as the list of identi-
represent the candidates and in this way make the           fiers (id-list) with vertical format that is associat-
counting operations support more quickly.                   ed with the sequences. In these lists by means of
   The second group is the algorithms that seek             temporary unions can be generated frequent se-
to reduce the size of the set of scanned data, by           quences. Also used the grid based approach to


                                                       34
break down the search space in small classes that           consequently, its performance is better than
can be processed independently in the main                  Wang et al.
memory. Also, uses the search in both breadth                  Fiot (2008) in her work suggests that an item
and depth to find the frequent sequences within             quantitative is partitioned into several fuzzy sets.
each class.                                                 In the context of fuzzy logic, a diffuse item is the
   In addition to the techniques mentioned earli-           association of a fuzzy set b to its corresponding
er, Lin and Lee (2005) proposes the first algo-             item x, i.e. [x,b]. In the DB each record is asso-
rithm that implements the idea of indexing called           ciated with a diffuse item [x,b] according to their
Memisp memory (Memory Indexing for sequen-                  degree of membership. A set of diffuse items
tial pattern mining). The central idea of Memisp            will be implicated by the pair (X,B), where X is
is to use the memory for both the data streams as           the set of items, and B is a set of fuzzy sets.
to the indexes in the mining process and imple-                In addition, it argues that a sequence g-k-
ment a strategy of indexing and search to find all          sequence (s1, s2,…, sp) is formed by g item sets
frequent sequences from a sequence of data in               diffuse s=(X,B) grouped to diffuse k items [x,b],
memory, sequences that were read from the da-               therefore the sequential pattern mining diffuse
tabase in a first tour. Only requires a tour on the         consists in finding the maximum frequency dif-
basis of data, at most, two for databases too               fuse g-k-sequence.
large. Also avoids the generation of candidates                Fiot (2008), provides a general definition of
and the projection of database, but presented as            frequency of a sequence, and presents three algo-
disadvantage a high CPU utilization and                     rithms to find the fuzzy sequential patterns:
memory.                                                     SpeedyFuzzy, which has all the objects or items
   The fourth group of algorithms is composed of            of a fuzzy set, regardless of the degree, if it is
all those who use fuzzy techniques. One of the              greater than 0 objects have the same weight,
first work performed is the Wang et al. (1999),             MiniFuzzy is responsible for counting the objects
who propose a new data-mining algorithm,                    or items of a fuzzy set, but supports only those
which takes the advantages of fuzzy sets theory,            items of the sequence that candidate have a
to enhance the capability of exploring interesting          greater degree of belonging to a specified thresh-
sequential patterns from the databases with quan-           old; and TotallyFuzzy that account each object
titative values. The proposed algorithm integrates          and each sequence. In this algorithm takes into
concepts of fuzzy sets and the AprioriAll algo-             account the importance of the set or sequence of
rithm to find interesting sequential patterns and           data, and is considered the best grade of mem-
fuzzy association rules from transaction data.              bership.
The rules can thus predict what products and
quantities will be bought next for a customer and           4    Description of the Problem
can be used to provide some suggestions to ap-
                                                                A sequence s, denoted by <e1e2… in>, is an or-
propriate supervisors.
                                                            dered set of n elements, where each element ei is
   Wang et al. (1999) propose fuzzy quantitative
                                                            a set of objects called itemset. An itemset, which
sequential patterns (FQSP) algorithm, where an
                                                            is denoted by (x1 [c1], x2 [c2] , …, Xq[cq] ), is a
item’s quantity in the pattern is represented by a
                                                            non-empty set of elements q, where each element
fuzzy term rather than a quantity interval. In their
                                                            xj is an item and is represented by a literal, and cj
work an Apriori-like algorithm was developed to
                                                            is the amount associated with the item xj that is
mine all FQSP, it suffers from the same weak-
                                                            represented by a number in square brackets.
nesses, including: (1) it may generate a huge set
                                                            Without loss of generality, the objects of an ele-
of candidate sequences and (2) it may require
                                                            ment are supposed to be found in lexicographical
multiple scans of the database. Therefore, an
                                                            order by the literal. The size of the sequence s,
Apriori-like algorithm often does not have a
                                                            denoted by |s|, is the total number of objects of
good performance when a sequence database is
                                                            all elements of the s, so a sequence s is a k-
large and/or when the number of sequential pat-
                                                            sequence, if |s|=k.
terns to be mined is large.
                                                                For       example,        <(a[5])(c[2])(a[1])>,
   Chen et al. (2006) propose divide-and-conquer
                                                            <(a[2],c[4])(a[3])> and <(b[2])(a[2],e[3])> are
fuzzy sequential mining (DFSM) algorithm, to
                                                            all 3-sequences. A sequence s = <e1e2… in> is a
solve the same problem presented by Hong using
                                                            sub-sequence of another sequence of s'=<e1'e2'…
the divide-and-conquer strategy, which possesses
                                                            em'> if there are 1≤i1<i2<…<in≤m such that e1⊆
the same merits as the PrefixSpan algorithm;
                                                            ei1', e2⊆ei2 ', ... , and en⊆ ein'. The sequence s'


                                                       35
contains the sequence s if s is a sub-sequence of                Consider the sequence C6, which consists of
s'.                                                           three elements, the first has the objects b and c,
    Similarly,       <(b,c)(c)(a,c,e)>      contains          the second has the object c, and the third has the
<(b)(a,e)> where the quantities may be different.             objects a, c, and e. Therefore, the support of
    The support (sup) of a sequential pattern X is            <(b)(a)> is 4/6 since all the sequences of data
defined as the percentage on the fraction of rec-             with the exception of C2 and C3 contain a
ords that contains X the total number of records              <(b)(a)>. The sequence <(a,d)(a)> is a sub se-
in the database. The counter for each item is in-             quence of both C1 and C4; and therefore,
creased by one each time the item is found in                 <(a,d)(a)>.sup=2/6.
different transactions in the database during the                Given the pattern <(a)> and the frequent item
scanning process. This means that the counter of              b, gets the pattern type-1 <(a)(b)> adding (b) to
support does not take into account the quantity of            <(a)>, and the pattern type-2 <(a,b)> by the
the item. For example, in a transaction a custom-             extension of <(a)> with b.
er buys three bottles of beer, but only increases                Similarly, <(a)> is the prefix pattern (_pat)
the number of the counter to support {beer} by                which in turn is a frequent sequence, and b is the
one; in other words, if a transaction contains an             stem of both: <(a)(b)> and <(a,b)>.
item, then, the support counter that item only is                Note that the sequence null, denoted by <>, is
incremented by one.                                           the _pat of any 1-frequent sequence. Therefore,
    Each sequence in the database is known as a               a k-sequence is like a frequent pattern type-1 or
sequence of data. The support of the sequence s,              type-2 of a (k-1)-frequent sequence.
is denoted as s.sup, and represents the number of
sequences of data that contain s divided by the               5    Algorithm for the discovery of se-
total number of sequences that there is in the da-                 quential patterns with quantity fac-
tabase. minSup threshold is the minimum speci-                     tors - MSP-QF
fied by the user. A sequence s is frequent if
s.sup≥minSup, therefore it will be a sequential                  The algorithm for mining sequential patterns
pattern frequently.                                           with quantity factors, arises from the need to dis-
    Then, given the value of the minSup and a da-             cover from a database of sequences, the set of
tabase of sequences, the problem of the sequen-               sequential patterns that include the amounts as-
tial pattern mining is to discover the set of all             sociated with each of the items that are part of
sequential patterns whose supports are greater                the transactions in the database, since having this
equal to the value of the minimum support                     additional information can be known with greater
(s.sup≥ minSup).                                              precision not only what is the relationship with
    Definition: given a  pattern and a frequent              respect to the time that exists between the vari-
item x in the database of sequences, ' is a:                 ous items involved in the transactions of a se-
     Pattern Type-1: if ' can be formed by add-             quence, but also as is the relationship to the
      ing to  the itemset that contains the item x,          amount of these items.
                                                                 The algorithm MSP-QF, it is based on the idea
      as a new element of .
                                                              of the use of prefixes, and the creation of indexes
     Pattern Type-2: if ' can be formed by the              from the database of sequences or other indices
      extension of the last element of  with x.              that are generated during the mining process,
    The item x is called stem of the sequential pat-          where recursively searching for frequent pat-
tern ', and  prefix is the pattern of '.                   terns. As a result of the exploration of a particu-
    That is, the following database of sequences of           lar index, fewer and shorter sequences of data
figure 1, which includes amounts for the items                need to be processed, while the patterns that are
and that, has six sequences of data.                          found will be made longer.
                                                                 In addition, if the database is very large se-
                       Sequences
       C1 = <(a[1],d[2]) (b[3],c[4]) (a[3],e[2])>
                                                              quence uses the techniques of partitioning in a
       C2 = <(d[2],g[1]) (c[5],f[3]) (b[2],d[1])>             manner that the algorithm is applied to each of
       C3 = <(a[5],c[3]) (d[2]) (f[2]) (b[3])>                the partitions as if it were a database of lesser
       C4 = <(a[4],b[2],c[3],d[1]) (a[3]) (b[4])>
       C5 = <(b[3],c[2],d[1]) (a[3],c[2],e[2]) (a[4])>        size.
       C6 = <(b[4],c[3]) (c[2]) (a[1],c[2],e[3])>

            Figure 1. Database of sequences


                                                         36
5.1   Procedure of the algorithm MSP-QF                         Step 5: When there is no more stems, filtered
                                                             index stems all those who are frequent. For all
  Listed below are the steps of the proposed al-
                                                             stems (sequences) frequently, we proceed to dis-
gorithm.
                                                             cretize the quantities that were associated with
                                                             each item and stored in the set of frequent pat-
   Step 1: Partitioning and scanning of the data-
                                                             terns. For this, before adding it to the set of fre-
base of sequences. Depending on the size of the
                                                             quent patterns, we proceed to verify that the
database are applicable to so it can be partitioned
                                                             common pattern found recently has not already
and formatted through and then to scan each of
                                                             been added before this set as a result of applying
the partitions of independently. For each parti-
                                                             the algorithm to a partition of the database that
tion, the sequences are constructed and stored in
                                                             was processed with previously. If frequent pat-
the structure DBSeq. At the same time generates
                                                             tern already exists in the set of frequent patterns,
the index of items where is stored the support for
                                                             the discretization process is again applied to the
each one of them, which is found during the
                                                             set of quantities associated with the sequence is
scanning process.
                                                             stored as a frequent pattern and set of quantities
                                                             of newly discovered frequent pattern; otherwise,
   Step 2: The index of items are filtered out
                                                             the common pattern found in the current partition
those that are frequent, i.e., whose support is
                                                             is added directly to the set of frequent patterns.
greater than or equal to minSup determined by
                                                                Then we proceed to perform recursively steps
the user. All these items come to form sequences
                                                             3, 4 and 5 with each one of the frequent patterns
of size |s| =1, therefore, form the set of 1-
                                                             that are found in the process.
sequences. For all these sequences frequent item
is to write the amounts associated with each item
                                                                Discretization Function: This function is re-
to the time it is saved in the whole of frequent
                                                             sponsible for making the set of quantities associ-
patterns.
                                                             ated with an item, the range of values given by
                                                             the mean and standard deviation of this set. For
   Step 3: For each one of the frequent patterns
                                                             example, given the sequences of the figure 1, the
, found in step 2, or as a result of the step 4, the
                                                             set of quantities associated with the item <(a)>
index is constructed _idx, with inputs (ptr_ds,             is: 1,5,4,3,1, which after being discretized would
pos), where ptr_ds refers to a sequence of the DB            be the interval formed by: [ 2.8±1.6 ]
in which appears the  pattern, and pos is the
pair (posItemSet, posItem), where posItemSet is                 To summarize the steps carried out in the pro-
the position of the itemset in the sequence and              posed algorithm, figure 3 shows a schematic of
posItem the position of the item in the itemset              the entire procedure.
from the sequence where the pattern appears.
The values of pos allow the following scans are
performed only on the basis of these positions in            5.2      Algorithm specification MSP-QF
a certain sequence.                                            Here we show the specification of the pro-
                                                             posed algorithm MSP-QF.
   Step 4: Find the stems of type-1 and/or type-2
for each  pattern and its corresponding index                Algorithm MSP-QF
_idx generated in the previous step, considering             In: DB = database sequences
only those items of the sequences referred to in                   minSup = minimum support
                                                                   partition = number of sequences included in each of the partitions
_idx and the respective values of pos. At the                Out: set of all sequential patterns with quantity factors.
same time as are the stems are calculated their               Procedure:
                                                              1. Partitioning the DB
supports, and in addition is added to the list of             2. Each partition scan it in main memory and:
quantities of the item that is part of the stem the                (i) build sequences and store them in DBSeq structure.
amount referred to in the item of the sequence of                  (ii) index the items and determine the support of each item.
                                                                   (iv) associate the quantities of each item in a sequence list of item
the DB which is being examined. The infor-                                quantities in the index.
mation of the stems and their quantities are                  3. Find the set of frequent items
                                                              4. For each frequent item x,
stored in another index of stems. This step is re-                 (i) form the sequential pattern  = <(x)>
peated for the same pattern, until they were no                    (ii) call Discretize() to discretize the set of quantities associated
longer more stems from this.                                              with each item x.
                                                                   (iii) storing  in the set frequent patterns.
                                                                   (iv) call Indexing (x, <>, DBSeq) to build the -idx index.
                                                                   (v) call Mining (, -idx) to obtain patterns from index -idx.


                                                        37
Subrutine Indexing (x, , set_Seq)                                                     3. For each stem x found in the previous step,
Parameters:                                                                                 (i) form a sequential pattern '' from the prefix pattern -pat and
     x = one stem type-1 or type-2;                                                                the stem x.
      = prefix pattern (-pat);                                                            (ii) call Discretize(') to discretize the amounts associated with
     set_Seq = set of data sequences                                                               the items of '.
     / * If set_Seq is an index, then each data sequence in the index is                    (iii) call Index(', , ’-idx) to build the index ’-idx.
     referenced by the element ptr_ds¸ which is formed at the input                         (iv) call Mining(’, '-idx) to discover sequential patterns from
     (ptr_ds, pos) index * /                                                                       index ’-idx
Out: índex '-idx, where ' represents the pattern formed by the stem x
and prefix pattern -pat.
Procedure:
1. For each data sequence ds of set_Seq                                                Subrutina Discretize()
     (i) If set_Seq = DBSeq the pos_inicial = 0, else pos_inicial = pos.               Parameters:
     (ii) Find the stem in each sequence ds from the position                                = a pattern that is a sequence;
     (pos_inicial + 1),                                                                Output: the arithmetic mean and standard deviation of the amounts
         1. If the stem x is in position pos in ds, then insert a pair (ptr_ds,        associated with each item  pattern.
            pos) in '-idx index, where ptr_ds reference to ds.                        Procedure:
         2. If the stems x is equal to the item x’ of the ds sequence, add-            1. For each itemset    do
            ed the quantity q associated with the item x’, to the list of                   a) For each item x  do
            quantities related to x.                                                           (i) Calculate the arithmetic mean and standard deviation of
2. Return the '-idx index.                                                                        the set of quantities associated with the item x
                                                                                               (ii) storing the arithmetic mean and standard deviation in the
                                                                                                   pattern 
Subrutine Mining(,-idx)
Parameters:
       = a pattern;
      -idx = an índex.
Procedure:
1. For each data sequence ds referenced by ptr_ds of input (ptr_ds,
      pos) in -idx,
      (i) Starting from the (pos +1) position until |ds|, determining poten-
            tial stems and increase in one support each of these stems.
2. Filter those stems that have a large enough support.


                                                Figure 2. Specification of the algorithm MSP-QF


                                       Figure 3. Schema of the procedures of the algorithm MSP-QF


                                                                                  38
                                                                                                   In the test with minSup=2% were obtained
6            Experiments and Result                                                              824 sequential patterns with quantity factors,
                                                                                                 some of which are:
  The experiments to test the technical proposal                                                   Olive[ 1.06±0.51 ]$
were implemented in two different scenarios,                                                       Olive[ 1.03±0.5 ]$ Paprika[ 0.37±0.23 ]$
which are described below.                                                                         Olive[ 1.01±0.5 ]$ Porro[ 0.57±0.27 ]$
                                                                                                   Celery[ 0.56±0.25 ]$ Lettuce[ 0.53±0.24 ], Lentils[ 0.54±0.23 ]$
                                                                                                   Celery[ 0.56±0.26 ]$ Lettuce[ 0.55km Air ±0.24 ],
6.1                            Scenario 1: Real data                                                                                              Paprika[ 0.34±0.17 ]$
                                                                                                   Lettuce[ 0.61±0.25 ], Lentils[ 0.56±0.26 ]$ Porro[ 0.59±0.24 ],
   The technique was applied in the analysis of                                                                                                   Paprika[ 0.33±0.15 ]$
the market basket of a supermarket. These tests                                                    Porro[ 0.54±0.27 ], Lentils[at 0.62±0.25 ]$ Lentils[ 0.58±0.25 ]$
consisted of obtain the set of frequent sequential                                                                                               Paprika[ 0.35±0.17 ]$

patterns from the basis of data obtained in the
course of three non-consecutive periods. The                                                        Of these sequential patterns we can clarify the
first period goes from mid-December of 1999                                                      following with regard to purchases made by cus-
until mid-January 2000. The second period goes                                                   tomers:
from early 2000 until the beginning of June of                                                     • Customers buy only olives in a quantity of
                                                                                                       1.06±0.51.
the same year. The third period goes from late
                                                                                                  • Customers who have purchased a first time only
August 2000 until the end of November 2000.
                                                                                                    olive, returning a next time for chili or by porro,
This database consists of 88163 transactions,                                                       with quantities of 0.37±0.23 and 0.57±0.27 re-
3000 items unique to approximately 5133 cus-                                                        spectively . Those who buy after pepper, pur-
tomers.                                                                                             chased before olives in a quantity equal to
   The purpose of testing is to discover patterns                                                   1.03±0.5, while those who acquire porro did so
of customer usage in the supermarket, plus get                                                      with an amount equal to 1.01±0.5.
the amount of each of the items that will be pur-                                                 • Those who buy lettuce at the same time buy lentils
chased by these customers as a result of applying                                                   in amounts equal to 0.61±0.25 and 0.56±0.26 re-
the proposed technique, which will allow us to                                                      spectively. Later, these same customers buy porro
have more accurate and significant in terms of                                                      and paprika with amounts equal to 0.59±0.24 and
                                                                                                    0.33±0.15.
the quantity purchased of each of the items.
   Seven tests were carried out with minimum                                                      • Those who buy porro, in the same transaction also
                                                                                                    buy lentils. Later return to buy only lentils, and a
media 10 %, 2%, 1.5%, 1.0%, 0.75%, 0.50% and                                                        next time buy only paprika, in the amounts listed
0.25%, which were observed in figure 4. These                                                       in the pattern.
results were compared with results of the tech-
nical Memisp.                                                                                    6.2      Scenario 2: Synthetic data

                                             MEMISP                        MSP-QF
                                                                                                    This second scenario is generated multiple da-
      minSup                                                                                     tabases (datasets) of synthetic form by means of
                                      Exe.Time       No.            Exe.Time      No.
        (%)
                                       (seg.)     Patterns           (seg.)     Patterns         the Synthetic Data Generator Tool.
           10.00                           4           50               5           50              The process followed to synthetic generation
            2.00                         12           824              15         824
            1.50                         16          1371              19        1371            of the dataset, it is the describing Agrawal and
            1.00                         22          2773              27        2773            Srikan (1995), and under the parameters referred
            0.75                         28          4582              35        4582            to in the work of Lin and Lee (2005).
            0.50                         39          9286              50        9286
            0.25                         72         30831              89       30831               In this scenario, tests were carried out both of
                                                                                                 effectiveness, efficiency and scalability.
                                                                                                    The evidence of effectiveness and efficiency
                                          Analysis of the market basket
                               100                                                               were made with dataset generated with the fol-
                                90
                                                                                                 lowing parameters: NI = 25000, NS = 5000, N =
       Execution time (seg.)


                                80
                               70                                                                10000, |S| = 4, |I| = 1.25, corrS = 0.25, crupS =
                               60                                                                0.75, corrI = 0.25 and crupI = 0.75.
                               50
                               40
                                                                                   MEMISP           The results of these tests were compared with
                               30                                                  MSP-QF        the results obtained for the algorithms Pre-
                               20
                               10
                                                                                                 fixSpan-1, PrefixSpan-2 and Memisp.
                                0
                                     10     2      1.5   1   0.75     0.5   0.25                   Efficiency Tests: Ran a first subset of tests for
                                                Minimal Support (%)
                                                                                                 |C|=10 and a database of 200,000 sequences,
                               Figure 4. Results of the tests for scenario 1                     with different values for minSup. The results are
                                                                                                 shown in figure 5.


                                                                                            39
                              1600                                                                                 Efficacy tests: Were carried out 92 efficacy
   Exxecution Tme (seg.)
                              1400                                                                              trials with the same dataset of the tests of effi-
                              1200                                                                              ciency. In four of the tests carried out with values
                              1000
                                                                                                                |C| =20 and |T| equal to 2.5, 5 and 7.5 respective-
                              800
                              600
                                                                                                                ly, it did not achieve the same amount of sequen-
                              400                                                                               tial patterns found with the algorithms Pre-
                              200                                                                               fixSpan-1 and PrefixSpan-2. These 4 tests repre-
                                 0
                                          0.25      0.5          0.75        1             1.5        2
                                                                                                                sent 4% of the total.
                                                            Minimal Support (%)
                                     PrefixSpan-1         PrefixSpan-2           MEMISP           MSP-QF           Scalability tests: The scalability tests were
                                                                                                                used datasets synthetically generated with the
                    Figure 5. Test results for |C|=10 and |DB|=200K                                             same values of the parameters of the first subset
                                                                                                                of tests of efficiency, and with minimal support
   The second subgroup of tests was conducted                                                                   equal to 0.75%. The amount of sequences in the
with a dataset with values for |C| and |T| of 20                                                                dataset for these tests ranged from |DB|=1000K
and 5 respectively. This value of |T| implies that                                                              to 10000K, i.e. of a million to 10 million se-
the number of items of transactions increases,                                                                  quences. In figure 8, you can watch the results of
which represents that the database is also larger                                                               these tests.
and more dense with respect to the number of
frequent sequential patterns that may be ob-                                                                     Execution Time (relative |DB| =
                                                                                                                                                   40
tained. The results of these tests are those seen in                                                                                               35
figure 6.                                                                                                                                          30
                                                                                                                                                   25
                                                                                                                            1000KB)


                              10000                                                                                                                20
      Execution Time (seg.)


                                                                                                                                                   15
                               8000
                                                                                                                                                   10
                               6000
                                                                                                                                                    5
                               4000
                                                                                                                                                    0
                               2000                                                                                                                        1      2    3     4      5     6    7       8   9     10
                                                                                                                                                                           Millions of sequences
                                     0
                                                                                                                                                        PrefixSpan-1       PrefixSpan-2       MEMISP           MSP-QF
                                            0.25     0.5         0.75        1            1.5     2
                                                             Minimal Support (%)
                                     PrefixSpan-1         PrefixSpan-2           MEMISP          MSP-QF                                            Figure 8, Results of scalability tests for min-
                                                                                                                                                   sup=0.75% and different sizes of datasets
                                Figure 6. Test results for |C|=20, |T|=5 and
                                              |DB|=200K                                                         7                                  Conclusion
                                                                                                                   We have proposed an algorithm for the dis-
  A last subset of efficiency tests were carried                                                                covery of sequential patterns that allows, as part
out under the same parameters of the subset                                                                     of the mining process, to infer from the amounts
above with the exception of |T| increased to 7.5.                                                               associated with each of the items of the transac-
The results are shown in figure 7.                                                                              tions that make up a sequence, the quantity fac-
                                                                                                                tors linked to the frequent sequential patterns.
                              20000                                                                                The technical proposal has been designed in
                              18000
                              16000
                                                                                                                such a way that uses a compact set of indices in
      Execution Time (seg.)


                              14000                                                                             which focuses the search for the sequential pat-
                              12000                                                                             terns from frequent patterns that have already
                              10000
                               8000
                                                                                                                been found earlier and that represent the prefixes
                               6000                                                                             of the patterns to find. That is why the size of the
                               4000                                                                             indexes is decreasing in accordance with the
                               2000
                                  0
                                                                                                                mining process progresses.
                                              0.5         0.75           1          1.5          2                 In addition, there has been that the information
                                                             Minimal Support (%)                                provided by the frequent patterns with factors of
                                     PrefixSpan-1         PerfixSpan-2           MEMISP          MSP-QF
                                                                                                                quantity, is much more accurate, since not only
                               Figure 7. Test results for |C|=20, |T|=7.5 and
                                                                                                                gives us information on how is the temporal rela-
                                               |DB|=200K                                                        tionship of the items in the various transactions,


                                                                                                           40
but also, what is the relationship of the quantities            Karel Filip. 2006. Quantitative and Ordinal Associa-
of some items to others, which enriches the se-                   tion Rules Mining (QAR Mining). 10th Internation-
mantics provided by the set of sequential pat-                    al Conference on Knowledge-Based & Intelligent
terns.                                                            Information & Engineering Systems (KES 2006).
                                                                  South Coast, UK: Springer, Heidelberg.
    Finally, the results obtained in section 5, we
can conclude by saying that the technical pro-                  Lin Ming-Yen and Lee Suh-Yin. 2005. Fast Discov-
posal meets the objectives of the mining process;                  ery of Sequential Patterns through Memory Index-
it is effective, is efficient and is scalable because              ing and Database Partitioning. Journal of Infor-
it has a linear behavior in accordance with the                    mation Science and Engineering.
sequence database grows, and that when applied                  Market-Basket      Synthetic     Data       Generator,
to large data bases his performance turned out to                 http://synthdatagen.codeplex.com/.
be better than the techniques discussed in this                 Molina L. Carlos. 2001. Torturando los Datos hasta que
work.                                                             Confiesen. Departamento de Lenguajes y Sistemas In-
                                                                  formáticos, Universidad Politécnica de Cataluña. Bar-
Reference                                                         celona, España.

Agrawal Rakesh and Srikant Ramakrishnan. 1994.                  Papadimitriou Stergios and Mavroudi Seferina. 2005.
  Fast algorithms for mining association rules. In                The fuzzy frequent pattern Tree. In 9th WSEAS Inter-
  Proceeding 20th International Conference Very                   national Conference on Computers. Athens, Greece:
  Large Data Bases, VLDB.                                         World Scientific and Engineering Academy and Soci-
                                                                  ety.
Agrawal Rakesh and Srikant Ramakrishnan. 1995.
                                                                Pei Jian, Han Jiawei, Mortazavi-Asl Behzad and Pinto
  Minning Pattern Sequential. 11th International
                                                                   Helen. 2001. PrefixSpan: Mining sequential patterns
  Conference on Data Engineering (ICDE’95), Tai-
                                                                   efficiently by prefix-projected pattern growth. In
  pei, Taiwan.                                                     ICDE '01 Proceedings of the 17th International Con-
Agrawal Rakesh and Srikant Ramakrishnan. 1996.                     ference on Data Engineering.
  Mining Quantitative Association Rules in Large                Srikant Ramakrishnan and Agrawal Rakesh. 1996.
  Relational Tables. In Proceeding of the 1996 ACM                 Mining Sequential Patterns: Generalizations and
  SIGMOD Conference, Montreal, Québec, Canada.                     Performance Improvements. In Proc.5th Int. Conf.
Alatas Bilal, Akin Erhan and Karci Ali. 2008.                      Extending Database Technology (EDBT’96), pages
MODENAR: Multi-objective differential evolution algo-              3–17, Avignon, France.
rithm for mining numeric association rules. Applied Soft
                                                                Takashi Washio, Yuki Mitsunaga and Hiroshi Moto-
Computing.
                                                                  da. 2005. Mining Quantitative Frequent Itemsets
Chen Yen-Liang and Haung T. Cheng-Kui. 2006. A                    Using Adaptive Density-Based Subspace Cluster-
  new approach for discovering fuzzy quantitative                 ing. In Fifth IEEE International Conference on Da-
  sequential patterns in sequence databases. Fuzzy                ta Mining (ICDM'05). Houston, Texas, USA:
  Sets and Systems 157(12):1641–1661.                             IEEE Computer Society.
Fiot Céline. 2008. Fuzzy Sequential Patterns for                Wang Shyue, Kuo Chun-Yn and Hong Tzung-Pei.
   Quantitative Data Mining. In Galindo, J. (Ed.),               1999. Mining fuzzy sequential patterns from quan-
   Handbook of Research on Fuzzy Information Pro-                titative data”, 1999 IEEE Internat. Conference Sys-
   cessing in Databases.                                         tems, Man, and Cybernetics, vol. 3, 1999, pp. 962–
                                                                 966.
Han Jiawei, Pei Jian, Mortazavi-Asl Behzad, Chen
  Qiming, Dayal Umeshwar and Hsu Mei-Chun.                      Zaki Mohammed J. 2001. SPADE: An efficient algo-
  1996. Freespan: Frequent pattern-projected se-                  rithm for mining frequent sequences. Machine
  quential pattern mining. Conference of the sixth                Learning, 42(1-2):31–60.
  ACM SIGKDD international conference on
  Knowledge discovery and data mining.


                                                           41
                     Online Courses Recommendation based on LDA


    Rel Guzman Apaza, Elizabeth Vera Cervantes, Laura Cruz Quispe, José Ochoa Luna
                                         National University of St. Agustin
                                                  Arequipa - Perú
             {r.guzmanap,elizavvc,lvcruzq,eduardo.ol}@gmail.com


                      Abstract                                are primarily focused. To do so, we rely on Topic
                                                              Models (Blei, 2012), an unsupervised probabilistic
     In this paper we propose a course recommen-
                                                              generative model, which given a set of documents
     dation system based on historical grades of
     students in college. Our model will be able to           and a number of topics as input, automatically re-
     recommend available courses in sites such as:            turns a relevant set of words probabilistically asso-
     Coursera, Udacity, Edx, etc. To do so, proba-            ciated for each topic. Why this scheme is valuable?,
     bilistic topic models are used as follows. On            consider for instance a huge number of digitalized
     one hand, Latent Dirichlet Allocation (LDA)              books of a public library, this algorithm can auto-
     topic model infers topics from content given in          matically discover main topic words and therefore
     a college course syllabus. On the other hand,
                                                              allows one to gain insights about content in books.
     topics are also extracted from a massive online
     open course (MOOC) syllabus. These two sets                 Currently educational systems and data mining
     of topics and grading information are matched            is an emerging research area (Romero and Ventura,
     using a content based recommendation system              2010), these systems use different recommendation
     so as to recommend relevant online courses to
                                                              techniques in order to suggest online learning ac-
     students. Preliminary results show suitability
     of our approach.                                         tivities, based on preferences, knowledge and data
                                                              from other students with similar interests (Romero
                                                              et al., 2007). In (Kuang et al., 2011) the author pro-
1   Introduction
                                                              vides resource recommendation for users in the e-
Nowadays, the amount of educational resources                 learning system based on contents and user log ac-
spread at Internet is huge and diverse (Martin, 2012).        tivities. There was proposed a method for resource
Massive Online Open Courses (MOOCs) such us                   recommendation based on topic modeling in an e-
Coursera, Udacity, EdX, to name a few, are gain-              learning system, that system used Latent Dirichlet
ing momentum (Fischer, 2014). It is possible to find          Allocation (LDA) to get a low dimension vector,
courses from almost every knowledge domain. This              and to do inference it used Gibbs sampling, then in
vast offer overwhelm any user willing to find courses         resource recommendation it applied cosine similar-
according his/her background. This task can be te-            ity in document topic distribution to find neighbor
dious because it involves access to each platform,            resources. The authors from (Haruechaiyasak and
search available courses, select some courses, read           Damrongrat, 2008) also recommended documents,
carefully each course syllabus, and choose appropri-          in this case it recommended articles from wikipedia
ate content. This process can be unmanageable if              by calculating the similarity measures among topic
we extend our search beyond online courses to edu-            distributions of the articles. The model proposed in
cational content.                                             (Sadikov and Bratko, 2011) is an hybrid recommen-
   In this work we propose a system for online                dation system where the core of the system is a lin-
courses recommendation, although MOOCs courses                ear regression model, based on stochastic gradient


                                                         42
descent. For predicting the rank of a lecture, they            each document is a combination of topics and each
used and compared the predictions made by content-             topic is a probability distribution over words (Blei
based and collaborative-based methods. In this pa-             et al., 2003). Topic models are a type of graphical
per they established manually the attributes that rep-         model based on Bayesian networks.
resent each video-lecture, unlike our paper, where                The generative process described by a topic model
the attributes for the courses are defined by the LDA          does not make any assumptions about the order of
algorithm. In (Sadikov and Bratko, 2011), to find a            words as they appear in documents. The only infor-
rank they measured the correlation between an old              mation relevant to the model is the number of times
lecture (a lecture the visitor has already seen), and          words are produced, this is known as the “bag-of-
the new lectures (lectures that visitor has not seen           words” assumption (Steyvers and Griffiths, 2007).
yet), and then they ordered theses measures in a list,            There are two main topic models: LDA (Blei et
where the lowest comes first, theses computations              al., 2003) and Probabilistic Latent Semantic Anal-
were used in the linear regression model. Also they            ysis (pLSA) (Hofmann, 1999). In this work we use
said that there was not to much difference between             LDA due to its general model. It is also worth noting
using content-based or collaborative-based methods,            that LDA has been previously used in recommenda-
but they said that their system could have been im-            tion systems (Romero and Ventura, 2010; Romero et
proved if they used textual attributes, which is our           al., 2007; Kuang et al., 2011).
case.
   In our proposal, Latent Dirichlet Allocation                2.2   Topics Modeling using Latent Dirichlet
(LDA) (Blei et al., 2003) topic model is mainly used                 Allocation
as feature descriptor of courses. Thus, we assume
                                                               Latent Dirichlet allocation (LDA) (Blei et al., 2003)
that each course has a set of inherent topics and
                                                               is widely used for identifying topics in a set of docu-
therefore relevant words that summarize them. In
                                                               ments, building on previous work by Hofmann (Hof-
our content-based recommendation setting those are
                                                               mann, 1999). The corresponding graphical model
input features that describe courses. We are con-
                                                               representation is depicted in Figure 1, where each
cerned in discovering the parameter vector of users,
                                                               document is represented as a mixture of a fixed num-
i.e., weights over topic words that denote user pref-                                                            (d)
                                                               ber of topics, with topic z receiving weight θz in
erences on courses. In order to infer this user vector,
                                                               document d, and each topic is a probability distribu-
we rely on supervised machine learning algorithms
                                                               tion over a finite vocabulary of words, with word i
thus, we assume grading obtained in college courses                                 (z)
as ratings, learn user weights and ratings are pre-            having probability φi in topic z.
dicted for unseen MOOCs courses. Preliminary re-
sults show suitability of this approach.                                                           (d)

   The paper is organized as follows. In Section 2                                   α             θ

background is given. In Section 3, our proposal is
presented. Section 4 shows experimental results. Fi-                                               z
nally, Section 5 concludes the paper.
                                                                                     (z)
2   Background                                                             β        Φ             w

                                                                                           T             Nd D
2.1 Probabilistic Topic Modeling
Topic models are probabilistic models that have                Figure 1: Graphical model for the topic modeling using
been mainly used to discover topics in a big col-              plate notation
lection of text documents. They are non supervised
learning (Duda et al., 2012) techniques that do not              Symmetric Dirichlet priors are placed on θ(d)
require any prior annotations or labeling of the doc-          and φ( j), with θ(d) ∼ Dirichlet(α) and φ( j) ∼
uments: the topics emerge from the analysis of the             Dirichlet(β), where α and β are hyper-parameters
original texts (Blei, 2012). To do so, they assume             that affect the sparsity of these distributions. The


                                                          43
hyper-parameter α can be interpreted as a prior ob-             zN \j indicates (z1 , . . . , zj−1 , zj+1 , . . . , zN )
servation count for the number of times a topic is              W is the size of the vocabulary
sampled in a document, and β as the prior observa-               (w )
                                                                       \j is the number of times a word wj is
                                                                     j
tion count on the number of times words are sampled             nzj ,N
from a topic before any word from the corpus is ob-                  assigned to topic zj
served. This smooths the word distribution in every              (·)
                                                                nzj ,N \j is the total number of words assigned to
topic, with the amount of smoothing determined by                    topic zj
β. The goal of inference in this model is to identify            (d )
the values of φ and θ, given a corpus of D documents            nzj j,N \j is the number of times a word in document
represented by a vocabulary of W words.                               dj is assigned to topic zj
   In our proposal, each course is a document d that             (d )
                                                                n·,Nj \j is the total number of words in document dj
has its related sequence of Nd word tokens, N words
in the overall corpus.                                            From this probability distribution it is possible
                                                               to make inference, in order to compute conditional
2.3 Gibbs Sampling Algorithm
                                                               probability of topic structure given the observed
There are many algorithms proposed to obtain                   document. The probability distribution of topics in
the main variables of interest θ and φ in the lit-             a document represents a feature vector for that doc-
erature, (Hofmann, 1999) used the expectation-                 ument.
maximization (EM) algorithm, this approach suffers
from problems involving local maxima of the like-              2.4     Recommender Systems
lihood function, which has motivated a search for              According to (Ricci et al., 2011), recommender
better estimation algorithms like the ones proposed            systems are software tools and techniques provid-
in (Blei et al., 2003; Buntine, 2002; Minka and Laf-           ing items suggestions for a given user. Suggestions
ferty, 2002).                                                  provided are aimed at supporting their users in var-
   Instead of directly estimating the variables for            ious decision-making processes, such as what items
each document, another approach is the algorithm               to buy, what music to listen, or what news to read.
called “Gibbs sampling” (Griffiths and Steyvers,                  As a rule, in a recommendation-system applica-
2004), which provides a relatively efficient method            tion there are two classes of entities, which we shall
of extracting a set of topics from a large corpus.             refer to as users and items. Users have prefer-
Gibbs sampling considers each word token in the                ences for certain items and these preferences must
text collection in turn, and estimates the probability         be teased out of the data (Rajaraman and Ullman,
of assigning the current word token to each topic,             2012). The data itself is represented as a utility ma-
conditioned on the topic assignments to all other              trix, giving for each user-item pair, a value that rep-
word tokens. From this conditional distribution,               resents what is known about the degree of preference
given a document, a topic is sampled and stored as             of that user for that item. Values come from an or-
the new topic assignment for this word token. We               dered set, e.g., integer 1 − 5 representing the number
write this conditional distribution as:                        of stars that the users gave as a rating for that item.
                          (w )j            (d )                We assume that the matrix is sparse, meaning that
                         nzj ,N \j + β   nzj j,N \j + α
P (zj |zN \j , wN ) =                  · (d )                  most entries are unknown. An unknown rating im-
                         (·)
                        nzj ,N \j + W β n·,Nj \j + T α         plies that we have no explicit information about the
                                                               user’s preference for the item. The goal of a rec-
where:                                                         ommendation system is to predict the blanks in the
 wN = (w1 , . . . , wN ) are the words in the entire           utility matrix.
    corpus                                                        There are two basic architectures for a recommen-
                                                               dation system (Rajaraman and Ullman, 2012):
 zN = (z1 , . . . , zN ) are the topic assignments of
    the words                                                    • Content-based systems focus on properties of
                                                                   items. Similarity of items is determined by


                                                          44
      measuring the similarity in their properties                 courses        c. features     profile user(1)—rating ...
                                                                                                   (1)       (1)
    • Collaborative-Filtering system focus on the re-              Calculus       x1 , . . . xn   θ1 , . . . , θn — 12
      lationship between users and items. Similarity               ···            ···             ···
                                                                                   0          0    (1)          (1)
      of items is determined by the similarity of the              ML(Mooc)       x1 , . . . xn   θ1 , . . . , θn —?
      ratings of those items by the users who have
                                                                             Table 1: Utility matrix for courses
      rated both items.

   In a content-based system, we must construct a                all items. The rest of this section describes our main
profile for each item, which is a record of collections          design choices.
of records representing important characteristics of                Consider the utility matrix in Table 1 used to
that item. In simple cases, the profile consist of some          represent a content-based recommendation system.
characteristics of the item that are easily discovered.          First column contains courses names (college and
For example, in a movie there are the set of actors,             MOOC’s courses). Second column contains feature
the director, the genre of general type of movie. In             descriptors for courses. Each row denotes a different
documents it is not immediately apparent what the                course, therefore each course has a different feature
values of features should be. There are many kinds               vector. Third column shows the user vector profile
of documents for which a recommendation system                   Θ(1) for user 1. This vector could comprise user 1
can be useful. For example, there are many news ar-              preferences about art, math, biology and social sci-
ticles published each day, and we cannot read all of             ences in general. In this same column is also showed
them. A recommendation system can suggest arti-                  user 1 ratings for each course (they are in fact grades
cles on topics a user is interested in. Unfortunately,           obtained in college for user 1, see for instance rating
documents do not tend to have available information              12 for calculus). Further columns for user 2, user
giving features. A substitute that has been useful in            3 and so on should be added accordingly. Our goal
practice is the identification of words that character-          is to predict missing ratings for MOOC’s courses (?
ize the topic of a document. An approach is to com-              symbol in last row) for user 1 (user 2, 3, etc.). In or-
pute the T F (Term frequency) - IDF (Inverse doc-                der to do so, we should perform the following steps:
ument frequency) score for words in the document.
The ones with the highest scores are the words that                • Extract item vectors for courses: item vectors
characterize the document. In this sense, documents                  are defined by courses content, i.e., text that
are represented by sets of words. In this paper we                   describes courses, such as “about the course“
have used a different approach which relies on find-                 information. In order to construct item vectors
ing document topic information by using topic mod-                   (features from documents), we rely on Latent
eling algorithms such as LDA.                                        Dirichlet Allocation algorithm which extracts
                                                                     topic information from text as probability dis-
3    Proposal                                                        tribution of words. Since we use a machine
                                                                     learning setting, item vectors are features of
In order to recommend online courses, each course                    a regression/classification problem, which we
is considered a document which has a given con-                      denote X = {X1 , X2 , . . . , Xn }.
tent. To characterize each course, LDA is used to
uncover the semantic structure hidden in the docu-                 • Learn user’s vector: interests about topic
ment. Since LDA allow us to get a topic distribution                 courses can be modeled by user’s vector which
for each course, this output is used as a feature vec-               should be learned for each user. To do
tor for courses (items according to a content-based                  so, we use a machine learning approach, all
recommendation setting). A recommendation sys-                       available ratings (grading information in col-
tem is built using item profiles and utility matrices                lege) are used to train a multilinear regression
and we treat the problem as one of machine learn-                    model (Bishop and others, 2006). The user’s
ing. Regard the given data as a training set, and for                vector is therefore the resulting set of param-
                                                                                                        (1)      (1)
each user, build a classifier that predicts the rating of            eters (or weights), Θ(1) = {θ1 , . . . , θn }


                                                            45
     learned from training data (for instance, all
     courses and gradings of user 1). There are m
     (number of users) set of parameters. In a multi-
     linear regression algorithm we want to find the
     values for Θ, that minimize the cost function:
                                 1 Pm          (i)
     J(Θ0 , Θ1 , . . . , Θn ) = 2m  i=1 (hΘ (x ) −
     y)i 2


     We define an hypothesis:
     hΘ (x) = ΘT x = Θ0 x0 +Θ1 x1 +Θ2 x2 +. . .+
     Θn xn
     Where Θ0 , Θ1 , . . . , Θn are the parameters we
     want to predict minimizing the cost function.
     One way to minimize the cost function is by
     using gradient descent method, where each it-
     eration of gradient descent makes the parame-
     ters θj come closer to the optimal values that          Figure 2: Block diagram for the recommendation system
     will minimize the cost function J(θ).
                                                             4   Experimental Results
     For n > 1                                               This section shows preliminary experimental results
     Repeat {                                                conducted on real world data sets. Courses and users
                   1 Pm           (i)      i    (i)
     Θj := Θj − α m     i=1 (hΘ (x ) − y )xj                 grading information where extracted from a Peru-
     (simultaneously update Θj for j = 0, ...n) }            vian university. Some MOOC’s courses were ex-
                                                             tracted from Coursera, the following categories were
                                                             considered: “business and management”, “computer
                                                             science - artificial intelligence”, “computer science -
  • Given item and user vectors the goal is to pre-          software engineering”, “computer science - systems
    dict a rating RC for a MOOC course C with                and security”, “computer science - theory”, “math-
    feature vector XC for user U, i.e., user vector          ematics”, “statistics and data analysis”. The most
    profile Θ(U ) , the resulting predicted rating is        significant information from each course is given by
    given by:                                                “Introduction”, “About the Course” and “FAQ” sec-
                                                             tions.
                                                                All extracted information has been preprocessed
                     RC = XCT Θ(U )                          according to the following process: remove non
                                                             ASCII characters, strip HTML tags, remove special
                                                             strings, remove multiple spaces and blank lines.
                                                                After that we built a corpus further used by the
                                                             LDA algorithm. The number of Coursera courses
                                                             considered was 69, while the number of college
   An overview of the recommendation system is de-           courses was 43, which gives rises to 112 courses.
picted in Figure 2 where we estimate the ratings for         The topic modeling algorithm used the gibbs sam-
a student and to recommend a course we consider a            pling inference procedure and according to (Blei,
“top-10 best recommendations” approach thus, each            2012) we set parameters α = 50/T , β = 0.01. The
student get always 10 recommended courses. Those             number of iterations was chosen to be large enough
are the most related MOOCs to courses in which a             to guarantee convergence, N = 200.
student get the 10 lowest grades.                               To measure performance, accuracy was consid-


                                                        46
ered by counting the number of correct matches be-
tween college courses and Coursera courses. Figure                                     0.9
3 illustrates the impact of the number of topics T in
the topic model. A higher accuracy is achieved when                                    0.8

we use a higher number of topics, then we set the
number of topics T = number of Coursera courses                                        0.7


                                                                            Accuracy
because of the precision.
                                                                                       0.6


                                                                                       0.5


                                                                                       0.4

              0.7

                                                                                                                 10              20              30                    40
                                                                                                                  Number of Top-Recommended course(top-N)
              0.6
   Accuracy


              0.5                                                     Figure 4: Recommendation system accuracy according to
                                                                      the number of recommended courses
              0.4


                                                                         In Figure 5, a comparison between ratings of
                                                                      “coursera courses” and “college courses” for one
              0.3


              0.2
                                                                      student is showed. We intend to show proximity of
                    0   20   40        60      80    100   120        predicted data (ratings on “coursera courses”) and
                                  Number of Topics
                                                                      provided data (ratings on college courses).
Figure 3: Accuracy of the recommendation system ac-
cording to the number of topics, a better precision and                                20
                                                                                                                                                  Predicted rating Coursera
also efficiency is obtained when the number of topics is                                                                                          Real rating University

equal to the number of coursera courses, T =69
                                                                                       15
                                                                             Rating


   The goal of our proposal is to recommend courses                                    10


for students who have received low grades in college
therefore, we are using grades as ratings. To keep                                      5
a recommendation system setting, we have decided
to invert grading information thus, 20 grade turns
out 0 rating and viceversa (this step might not be                                      0
                                                                                             1   2   3   4   5    6   7   8   9 10 11 12 13 14 15 16 17 18 19 20 21 22
necessary in other recommendation systems). Mean                                                                                  Courses


normalization is also used to get a more reliable rec-
ommendation for students with few grades available,                       Figure 5: Comparison chart of ratings for one student
for instance, first year students.
   For testing, we define a variable “top-N” which                    5         Conclusion
denotes the number of courses to recommend. For
instance, for student “a” we recommend the “top-                      We have introduced a novel approach for recom-
N” courses from Coursera where he/she has gotten                      mending online courses that combines the proba-
the greatest ratings. In Figure 4, the x-axis denotes                 bilistic topic model LDA and content-based recom-
several values for “top-N”, and the y-axis denotes                    mendation systems. In short, we use a machine
accuracy obtained. An cccuracy over 0.6 is achieved                   learning approach where LDA allow us to extract
for “top-N” greater than or equal to 10.                              feature descriptors from courses, rating prediction


                                                                 47
in this setting is performed by inferring user pro-              Thomas Minka and John Lafferty. 2002. Expectation-
file parameters using multilinear regression. Prelim-               propagation for the generative aspect model. In
inary experimental results show that our algorithm                  Proceedings of the Eighteenth conference on Uncer-
performs well when compared to a similar approach                   tainty in artificial intelligence, pages 352–359. Mor-
                                                                    gan Kaufmann Publishers Inc.
based on cosine similarity with LDA.
                                                                 Anand Rajaraman and Jeffrey David Ullman. 2012. Min-
   Although we have focused on MOOCs as source                      ing of massive datasets. Cambridge University Press,
of recommendation content, nothing prevent us from                  Cambridge.
using this approach beyond such domain. In fact,                 Francesco Ricci, Lior Rokach, and Bracha Shapira.
further domains can be included by performing fea-                  2011. Introduction to recommender systems hand-
ture topic extraction. Future work will be addressed                book. Springer.
to investigate scalability issues. In this sense, topic          C. Romero and S. Ventura. 2010. Educational data min-
models such as LDA, have scalable versions avail-                   ing: A review of the state of the art. Systems, Man, and
able. For instance, a MapReduce implementation is                   Cybernetics, Part C: Applications and Reviews, IEEE
                                                                    Transactions on, 40(6):601–618, Nov.
given in the Apache Mahout library1 . There are also
                                                                 Cristbal Romero, Sebastin Ventura, JoseAntonio Del-
scalable versions for multilinear regression.                       gado, and Paul De Bra. 2007. Personalized links rec-
                                                                    ommendation based on data mining in adaptive educa-
                                                                    tional hypermedia systems. 4753:292–306.
References
                                                                 Er Sadikov and Ivan Bratko. 2011. Recommending vide-
Christopher M Bishop et al. 2006. Pattern recognition               olectures with linear regression.
  and machine learning, volume 1. springer New York.             Mark Steyvers and Tom Griffiths. 2007. Probabilistic
David M. Blei, Andrew Y. Ng, and Michael I. Jordan.                 topic models. Handbook of latent semantic analysis,
  2003. Latent dirichlet allocation. J. Mach. Learn.                427(7):424–440.
  Res., 3:993–1022, March.
David M Blei. 2012. Probabilistic topic models. Com-
  munications of the ACM, 55(4):77–84.
Wray Buntine. 2002. Variational extensions to em and
  multinomial pca. In Machine Learning: ECML 2002,
  pages 23–34. Springer.
Richard O Duda, Peter E Hart, and David G Stork. 2012.
  Pattern classification. John Wiley & Sons.
Gerhard Fischer. 2014. Beyond hype and underestima-
  tion: identifying research challenges for the future of
  moocs. Distance Education, 35(2):149–158.
Thomas L Griffiths and Mark Steyvers. 2004. Finding
  scientific topics. Proceedings of the National academy
  of Sciences of the United States of America, 101(Suppl
  1):5228–5235.
Choochart Haruechaiyasak and Chaianun Damrongrat.
  2008. Article recommendation based on a topic model
  for wikipedia selection for schools. 5362:339–342.
Thomas Hofmann. 1999. Probabilistic latent semantic
  indexing. In Proceedings of the 22nd annual interna-
  tional ACM SIGIR conference on Research and devel-
  opment in information retrieval, pages 50–57. ACM.
Wei Kuang, Nianlong Luo, and Zilei Sun. 2011. Re-
  source recommendation based on topic model for edu-
  cational system. 2:370–374, Aug.
Fred G. Martin. 2012. Will massive open online courses
  change how we teach? Commun. ACM, 55(8):26–28,
  August.
   1
       https://mahout.apache.org/


                                                            48
                                     ANIMITEX project:
                          Image Analysis based on Textual Information


 Hugo Alatrista-Salas and Eric Kergosien and Mathieu Roche and Maguelonne Teisseire
                 TETIS (Irstea, Cirad, AgroParisTech), Montpellier, France
                 LIRMM (CNRS, Univ. Montpellier 2), Montpellier, France
                    firstname.lastname@teledetection.fr


                         Abstract                                  The ANIMITEX project has many application ar-
                                                                eas such as image annotation (Forestier et al. 2010).
        With the amount of textual data available on            For instance, identifying the precise type of culture
        the web, new methodologies of knowledge ex-             or the function of a building is not always possi-
        traction domain are provided. Some original             ble with the only use of images. Nevertheless, tex-
        methods allow the users to combine different            tual data could contain this kind of information and
        types of data in order to extract relevant
                                                                give additional meaning to the images. The develop-
        information. In this context, this paper draws
        the main objectives of the ANIMITEX project             ment of approaches based on image/text matching
        which combines spatial and textual data. The            becomes crucial in order to complete image analy-
        data preprocessing step is detailed.                    sis tasks (Alatrista-Salas and Béchet 2014). It also
                                                                enables a better classification of data.

  Keywords: Knowledge extraction, Text mining,                     Moreover, image-text matching will enrich Infor-
Satellite images, Spatial feature identification                mation Retrieval (IR) methods and it will provide
                                                                users a more global context of data (Sallaberry et al.
1       Aims of the ANIMITEX project                            2008). This can be crucial for the decision maker in
                                                                the context of land-use planning projects that have
A lot of high resolution satellite data are now avail-          to take into account opinions of experts related to a
able. This raises the issue of fast and effective satel-        territory (managers, scientists, associations, special-
lite image analysis as it still requires a costly hu-           ized companies, and so forth).
man implication. In this context, remote sensing ap-
proaches enable to tackle this challenge. The ex-                  In the context of the ANIMITEX project, we plan
ploratory and ambitious ANIMITEX project1 aims                  to investigate two specific scenarios: (i) The con-
at processing massive and heterogeneous textual                 struction of a road on the north of Villeveyrac (city
data (i.e. big data context) in order to provide cru-           close to Montpellier, France), (ii) A port activity
cial information to enrich the analysis of satellite im-        area, called Hinterland, in Thau area (near to Sète,
ages.                                                           France). The aim of this case studies is to enrich
   The large amount of data are associated to a tem-            images with information present in documents, e.g.
poral repetitivity that increases. For instance today           the opinions extracted in newspapers about land-use
around ten images are available per year (e.g. SPOT,            planning.
Landsat), and in 3 years, one image every 5 days
(based on Sentinel-2 satellites) will be available.
                                                                  The section 2 describes the proposed data prepro-
    1
    http://www.lirmm.fr/∼mroche/ANIMITEX/ (web site in          cessing step. The section 3 details the partners in-
French)                                                         volved in the project.


                                                           49
2       Data preprocessing process                                 and dedicated rules (Gaio et al. 2012) allows us to
                                                                   identify the absolute (e.g., Montpellier) and relative
The current work focuses on adapting of Natural                    (e.g., south of Montpellier) Spatial Features (ASF
Language Processing (NLP) techniques for recogni-                  and RSF) (Lesbegueries et al. 2006; Kergosien et al.
tion of Spatial Features (SF) and thematic/temporal                2014). A first framework based on sequential pattern
information (Gaio et al. 2012; Maurel et al. 2011).                mining (Cellier et al. 2010) has been proposed to
In the proposed approach, SF appearing in a text,                  discover relationships between SF (Alatrista-Salas
are composed of at least one Named Entity (NE) and                 and Béchet 2014). To this end, a two-step process
one or more spatial indicators specifying its location             has been defined (See Figure 3).
(Lesbegueries et al. 2006). For this, a set of articles               SF validation: for each identified ASF, we check
(i.e. 12000 documents) concerning Thau region be-                  on external resources if there is a corresponding spa-
tween the years 2010 and 2013 has been acquired. A                 tial representation. In particular, we have used layers
second part of the data set is composed of raster files            provided by the IGN3 (municipalities, roads, rail-
(image mosaics Pleiades - spatial resolution 2x2 m                 ways, buildings, etc.). In addition, if an ASF does
- 4 spectral bands) covering all regions of the Thau               not present on IGN ressources, we use gazetteers
lagoon (See Figure 1). Satellite images are available              (Geonames and Open Street Maps) to complet the
via the GEOSUD Equipex2 .                                          information. Concerning the representation of RSF,
                                                                   we use spatial indicators of topological order asso-
                                                                   ciates to ASF.
                                                                      Following the scopes proposed in (Sallaberry et
                                                                   al. 2008), the spatial indicators of topological order
                                                                   have been grouped in five categories:

                                                                      • Proximity: different indicators can be used
                                                                        in relationship of proximity, such as: near,
                                                                        around, beside, close to, periphery, etc..

                                                                      • Distance: the indicators used in this relation-
                                                                        ship are of the form: x km, x miles, etc.. Two
                                                                        representations are then proposed in our ap-
                                                                        proach: 1) calcul of distance from the centroid
                                                                        of the ASF and construction of a circular buffer
Figure 1: Mosaic images Pleiades around the Thau basin,
                                                                        of size x from the centroid; 2) regarding the
images on the right represent the superposition of a vector
classification on the raster file.                                      shape of the ASF and building a buffer of size
                                                                        x from the edge of the processed ASF .
   A detailed classification of the land occupation is
                                                                      • Inclusion: this binary operation allow us to
currently in progress. It will lead to a digital vec-
                                                                        check if an ASF is inside another taking into
tor layer where each SF (represented by a polygon)
                                                                        account indicators such as: center, in the heart,
belongs to a class of specific land use. The nomen-
                                                                        in, inside, etc.
clature of this classification is organized into four hi-
erarchical levels (See Figure 2). Moreover we inves-                  • Orientation: This unary relationship has been
tigate multi-scale information associated with differ-                  broadly studied in the literature. Different ap-
ent levels of classification of satellite images.                       proaches have been proposed to identify a car-
   From this corpus, NLP methods have been ap-                          dinal points of an ASF. We have chosen to use
plied in order to identify linguistic features con-                     the conical model proposed in (Frank 1991).
cerning spatial, thematic, and temporal information                     For this, we use the centroid of ASF and we
in the documents. The combined use of lexicons                        3
                                                                        Institut National de l’information Gographique et forestire,
    2
        http://www.equipex-geosud.fr/                              i.e. National Institute of Geography


                                                              50
                 Figure 2: Nomenclature of Thau region used to image classification (in French)


                     Figure 3: From document to footprint definition: the two-step process


     build a buffer around. The size of this buffer          the treated document can be mapped. In this pro-
     will be calculated taking into account the sur-         cess, two main problems have been identified. The
     face of the studied ASF. Then we decompose              first one is the persistent ambiguity of some NE con-
     the buffer into four equal areas (forming a ”X”)        tained in SF because of some NE (e.g. Montagnac)
     from the centroid. Each intersection between            corresponding to several places. To address this is-
     the buffer and cones thus formed represent the          sue, a configurable spatial filter based on predefined
     four cardinal points.                                   scenarios has been developed. For example, to iden-
                                                             tify events related to a specific land-use planning
  • Geometry: geometry relations are built from at           project occurred in a part of the area of the Thau
    least two ASF. These relationships are, for ex-          lagoon, only the SF contained in this area will be
    ample, the union, the adjacency, the difference          explored. The second issue is related to the use of
    or a position of an ASF with respect to other            external resources and the identification of the spa-
    ASF, for example, C between A and B (where               tial representation appropriate to each ASF. Taking
    A,B and C are ASF), etc.                                 into account the spatial indicator (e.g. town, road,
  Representation of the spatial footprint: after             etc.) preceding by the toponymic name is a first an-
the extraction step and spatial representation of the        swer because it allows us to specify the type of the
ASF and RSF, the spatial footprint associated with           SF and thus take into account the appropriate spatial


                                                        51
representation.                                                 Acknowledgments
   Thematic information is identified by semantic re-
                                                                The authors thank Midi Libre (French newspaper)
sources (i.e. AGROVOC thesaurus, nomenclature
                                                                for its expertise on the corpus and all the partners
resulting of image classifications ...) (Buscaldi et al.
                                                                of ANIMITEX project for their involvement. This
2013).
                                                                work is partially funded by Mastodons CNRS grant
   These linguistic features allow us to identify spe-          and GEOSUD Equipex.
cific phenomena in documents (e.g., land-use plan-
ning, environmental change, natural disasters, etc.).
The main idea is to link the phenomena identified               References
in images with subjects found in documents during               Alatrista Salas H., Béchet N. Fouille de textes : une ap-
the same period. Overall, the ANIMITEX project                    proche séquentielle pour découvrir des relations spa-
allows the users to integrate different information               tiales. In Cergeo Workshop - EGC, 2014
sources, i.e. both types of expressions (texts vs. im-          Buscaldi D., Bessagnet M.N., Royer A., Sallaberry C.
ages). The main objective is to enrich the informa-               Using the Semantics of Texts for Information Re-
tion conveyed by a text with images and vice versa.               trieval: A Concept and Domain Relation-Based Ap-
                                                                  proach. Proceedings of ADBIS (2) - Advances in
                                                                  Databases and Information Systems, pp. 257-266,
3   Consortium of the project                                     2013.
                                                                Cellier P., Charnois T., Plantevit M., Crémilleux B. Re-
The multidisciplinary consortium of the project in-
                                                                  cursive Sequence Mining to Discover Named Entity
volves three research domains: Computer Science,                  Relations Symposium on Intelligent Data Analysis,
Geography and Remote Sensing. More precisely,                     LNCS, pp. 30-41, 2010.
the expertise in remote sensing and complex min-                Forestier G., Puissant A., Wemmert C., Gançarski,
ing and heterogeneous spatio-temporal data, is one                Knowledge-based Region Labeling for Remote Sens-
of the foundations of the project.                                ing Image Interpretation Computers, Environment and
   TETIS (Territories, Environment, Remote Sens-                  Urban Systems, Vol. 36(5), pp. 470?480, 2012
ing and Spatial Information, Montpellier) aims                  Frank A. U. Qualitative spatial reasoning with car-
                                                                  dinal directions. In Seventh Austrian Conference
to produce and disseminate knowledge, concepts,
                                                                  on Artificial Intelligence, volume 287 of Informatik-
methods, and tools to characterize and understand                 Fachberichte, pages 157–167. Springer, Berlin Heidel-
the dynamics of rural areas and territories, and con-             berg, 1991.
trol spatial information on these systems. LIRMM                Gaio M., Sallaberry C., and Nguyen V.T.                Ty-
(Informatics, Robotics and Microelectronics, Mont-                page de noms toponymiques à des fins d’indexation
pellier) focuses on knowledge extraction. ICube                   geéographique. TAL, 53(2):143–176, 2012.
(Strasbourg) is specialized in image analysis and               Kergosien E., Laval B., Roche M., Teisseire M. Are opin-
complex data mining. ICube collaborates with ge-                  ions expressed in land-use planning documents? Inter-
ographers from LIVE laboratory (Image Labora-                     national Journal of Geographical Information Science,
                                                                  Vol. 28(4), pp.739-762, 2014.
tory, City, and Environment) and specialists in NLP
                                                                Lesbegueries J., Gaio M., and Loustau P. Geograph-
(LiLPa lab – Linguistics, language, speech). These                ical information access for non-structured data. In
two locations (Montpellier and Strasbourg) consti-                Proceedings of the 2006 ACM Symposium on Applied
tute a cluster of local skills related to all major as-           Computing, SAC ’06, pages 83–89, New York, NY,
pects of the project. LIUPPA (Pau) includes re-                   USA, 2006.
searchers specializing in Information Extraction (IE)           Maurel D., Friburger N., Antoine J.-Y., Eshkol-Taravella
and Information Retrieval (IR). The main work of                  I., and Nouvel D. Casen: a transducer cascade to
this partner is about extraction and managment of                 recognize french named entities. TAL, 52(1):69–96,
                                                                  2011.
geographical information. GREYC (Caen) brings
                                                                Sallaberry C., Gaio M., and Lesbegueries J. Fuzzying gis
researchers in data mining (e.g. mining sequences in              topological functions for gir needs. In Jones C. B. and
order to discover relationships between spatial enti-             Purves R., editors, 5th ACM Workshop On Geographic
ties) and NLP. For this aspect, a collaborations with             Information Retrieval, pages 1–8, 2008.
two other labs is developed (LIPN and IRISA).


                                                           52
A case study on Morphological Data from Eimeria of Domestic Fowl using a
Multiobjective Genetic Algorithm and R&P for Learning and Tuning Fuzzy
                         Rules for Classification

                  Edward Hinojosa C.                                   Cesar A. Beltran C.
             Dept. Informatics Engineering                        Dept. Informatics Engineering
          Pontifical Catholic University of Peru               Pontifical Catholic University of Peru
                       Lima, Peru                                           Lima, Peru
                ehinojosa@pucp.pe                                     ebeltran@pucp.pe


                       Abstract                               highly used of EAs. The FRBCSs are defined as Ge-
                                                              netic Fuzzy Rule-Based Systems (GFRBSs) when
     In this paper, we use fuzzy rule-based classifi-         the GAs are used to learn or tuning FRBCSs. The
     cation systems for classify cells of the Eime-           GFRBSs continue to be researched and used in re-
     ria of Domestic Fowl based on Morphological              cent years (Nojima and Ishibuchi, 2013), (Chen et
     Data. Thirteen features were extracted of the            al., 2013), (Jalesiyan et al., 2014).
     images of the cells, these features are geneti-
     cally processed for learning fuzzy rules and a              Generally a FRBCSs is composed of two com-
     method reward and punishment for tuning the              ponents (Herrera, 2005), the Knowledge Base (KB)
     weights of the fuzzy rules. The experimental             and the Inference Mechanism (IM). The KB is com-
     results show that our classifier based on inter-         posed of two components too, the Data Base (DB)
     pretability fuzzy rules has a similar classifica-        and the Rule Base (RB). This paper is concerned
     tion rate to that of a non-parametric and non-           with the genetic learning of the RB.
     interpretability method.
                                                                 The most commonly used approaches for rule
                                                              learning in FRBCSs using GAs are Pittsburgh,
1   Introduction                                              Michigan, Iterative Rule Learning (IRL) and Ge-
                                                              netic Cooperative-Competitive Learning (GCCL).
The fuzzy systems were proposed by Zadeh at                   In the Pittsburgh approach, each chromosome en-
1965 (Zadeh, 1965) and they are systems based on              codes a set of fuzzy rules, after the genetic pro-
the theory of the fuzzy sets and logic fuzzy. A of the        cess the RB is a better chromosome (De Jong et al.,
most important types of fuzzy systems are the Fuzzy           1993). In the Michigan approach, each chromosome
Rule Based Classification Systems (FRBCSs) (Her-              encodes a single rule, after the genetic process the
rera, 2005) (Herrera, 2008). Classification prob-             RB is the set of chromosomes or rules of the pop-
lem is studied in the machine learning, data min-             ulation (Holland and Reitman, 1978). In the IRL
ing, database, and information retrieval communi-             approach, each chromosome encodes a single rule
ties with applications in a several domains.                  too, but after the genetic process, the better rule is
   The rules are a paradigm for representing knowl-           selected and inserted to the RB, this process is re-
edge and they have the capacity to build a linguistic         peated iteratively until a condition is satisfied (Gon-
model interpretable to the users. The learning (or            zalez and Perez, 2012). The GCCL approach is a
automatic generation) and tuning of the fuzzy rules           hybrid of the Pittsburgh and Michigan approaches,
in FRBCSs from data sample is a difficult task (Her-          the rules or chromosomes cooperate among them-
rera, 2008). This task can be considered as an op-            selves based on Pittsburgh approach and the rules or
timization or search process that can be managed              chromosomes compete among themselves based on
by using Evolutionary Algorithms (EAs). The Ge-               Michigan approach (Giordana and Neri, 1995).
netic Algorithms (GAs) is one of the most know and               This paper is based in the IRL approach using a


                                                         53
Multiobjective Genetic Algorithms (MOGAs). We                  ample eq ∈ E with a class Cj ∈ C. Each eq is
use MOGAs because in the process of learning                   defined by a set of features or characteristics eq =
fuzzy rules in FRBCSs are considered two objec-                {aq1 , aq2 , ..., aqn }.
tives: accuracy and interpretability. This objectives             A FRCS resolves classification problems using
are considered contradictory (Casillas and Carse,              rules usually with the follow structures:
2009) and we search a trade-off of them. The ac-                   Ri : IF V1 IS T1l1 AND V2 IS T2l2 AND ... AND
curacy is measured by the classification rate and                      Tn IS Tnln THEN Class = Cj WITH CFi
the interpretability is measured for many features
of the FRBCSs, for example, quantity of the rules              where:
or quantity of the conditions of each rule. We use              Ri                         : Index of the fuzzy rule i.
specifically the well-known algorithm called Non-               V1 , V2 , ..., Vn          : Linguistic variables or featu-
dominated Sorting Genetic Algorithm II (NSGA-                                                res of each example eq .
II) (Deb et al., 2002). After the learning the                     T1l1 , T2l2 , ..., Tnln : Linguistic terms or fuzzy sets
fuzzy rules, we use a Reward and Punishment(R&P)                                             used for representing the class
method for the tuning the factors or weights of the                                          Cj .
rules (Nozaki et al., 1996) to improve the accuracy                Cj                      : The class of the fuzzy rule Ri .
of the FRBCS.                                                      CFi                     : The certainty grade (i.e. rule
                                                                                             weight) of the rule Ri .
   We use the proposed method for classify cells of
the Eimeria of Domestic Fowl. The Eimeria genus
comprises a group of protozoan parasites that infect              Usually a FRBCS has two main compo-
a wide range of hosts. A total of seven different              nents (Herrera, 2005): The Knowledge Base (KB)
Eimeria species infect the domestic fowl, causing              and the Inference Mechanism (IM), these are de-
enteritis with severe economic losses. We use three            tailed below:
groups of morphological features: geometric mea-
                                                                   1. The Knowledge Base: The KB is composed of
sures, curvature characterization, and internal struc-
                                                                      two components:
ture quantification (Beltran, 2007).
   This paper is organized as follows: we present                      (a) The Data Base: The DB contains the
in Section 2 the basic concept of classification and                       membership functions, fuzzy sets or lin-
FRBCSs employed in this paper. In Section 3 we                             guistic terms for each linguistic variable
describe the genetic algorithm multiobjetivo called                        of the classification problem.
NSGA-II used in this paper. The proposed method                        (b) The Rule Base: The RB contains the
for learning the RB and tuning the factor of each rule                     collection of fuzzy rules representing the
is detailed in Section 4. The Section 5 shows the re-                      knowledge.
sults of the classification on morphological features
of the Eimeria genus. The conclusions of this work                 2. The Inference Mechanism: The IM is the fuzzy
are presented in Section 6.                                           logic reasoning process that determines the out-
                                                                      puts corresponding to fuzzified inputs (Lakhmi
2   Fuzzy Rule Based Classification Systems                           and Martin, 1998). The most common fuzzy
                                                                      inference method for fuzzy classification prob-
Classification problem is studied in the machine                      lems are the classic and general reasing meth-
learning, data mining, database, and information re-                  ods (Cordon et al., 2013). This paper uses the
trieval communities with applications in a several                    classic method.
domains, such as medical (Kumar et al., 2013), tar-
                                                               3     Non-dominated Sorting Genetic
get marketing (Yongzhi et al., 2013), biology (Silla
                                                                     Algorithm Multiobjetive II
and Kaestner, 2013), among others.
   Any classification problem has a set of exam-               Is a new version of the NSGA (Srinivas and Deb,
ples E = {e1 , e2 , ..., ep } and a set of classes C =         1994), the NSGA-II was proposed by Deb in
{C1 , C2 , ..., Cm }, the objective is labeled each ex-        2002 (Deb et al., 2002) and it is computationally


                                                          54
more efficient, elitist and doesnt need to define addi-
tional parameters.
   In the NSGA-II, the population Qt (size N ) is
generated using the parent population Pt (size N ).
After this, the two populations are combined for
generating the population Rt (size 2N ). The popu-
lation Rt is sorted according the dominance of the
solutions in different Pareto fronts (Pareto, 1896)
and the crowding distance. A new population Pt+1
(size N ) is generated with the bests Pareto fronts F1 ,
F2 , F3 and so forth, until the Pt+1 size equals to the
value of N . The solutions in the Pareto fronts under
this limit are removed. After Pt+1 is a new Pt and
the process is repeated until a conditions is satisfied.
The figure 1 shows the process of evolutions of the
solutions in the NSGA-II. More details on NSGA-II
can be found at (Deb et al., 2002).                              Figure 2: Proposed Method for Learning Fuzzy Rules


                                                                   A set of examples is used as the set of training.
                                                                The proposed IRL method starts when is defined the
                                                                order of the class for learning. After that, a class is
                                                                selected and the module for generate the best rule
                                                                that used a MOGA is executed. The MOGA consid-
                                                                ers two objectives for minimization: accuracy and
                                                                interpretability. The accuracy is determined by the
                                                                integrity and consistency of each rule (Gonzalez and
                                                                Perez, 1999) and the interpretability is defined by the
                                                                quantity of conditions of each rule. When the best
                                                                rule in the Pareto front improves the rate of classi-
    Figure 1: Evolutions of the Solutions in the NSGA-II        fication of the RB, this rule is inserted into the RB,
                                                                some examples are marked and the process of learn-
4     Proposed Method                                           ing a fuzzy rule starts again. When the best rule in
                                                                the Pareto front doesnt improve the rate of classi-
This section presents the proposed methods for                  fication of the RB, the process verifies that all the
learning fuzzy rules using the IRL approach and a               sequence of class was learned, if the sequence is not
MOGA, and tuning the weights of the fuzzy rules                 learned a new class is selected and the process of
using a R&P method. In the next subsections each                learning a fuzzy rule starts again, else the process
method is detailed.                                             finishes and the set of the best rules is the RB.
4.1 Learning Fuzzy Rules                                           In the process detailed above all rules has a weight
                                                                equals to one. These weights can be tuning for im-
The proposed method for learning fuzzy rules is
                                                                prove the rate classification. This tuning is detailed
based in the iterative multiobjective genetic method
                                                                in the next subsection.
described in (Hinojosa and Camargo, 2012) and uses
a MOGA for learning a single fuzzy rule in each it-
                                                                4.2   Tunning Weights of the Fuzzy Rules
eration of the MOGA. The main difference with the
proposed method is the module for defining the or-              We use the method proposed in (Nozaki et al., 1996)
der of the class for learning. This method proposed             for this task. This method rewards or increases the
is illustrated in the Figure 2.                                 weight the a fuzzy rule Ri when a example eq is cor-


                                                           55
rectly classified by this rule according to the next                 Class Number    Class Name      # of Examples
equation:                                                                  1         E. acervulina         636
                                                                           2         E. maxima             321
                                                                           3         E. brunetti           418
                                                                         4         E. mitis              757
       CFinew = CFiold + n1 1 − CFiold            (1)                      5         E. praecox            747
                                                                           6         E. tenella            608
   And this method punishes or decreases the weight                        7         E. necatrix           404
of the fuzzy rule Ri when a example eq is misclas-
sifed by this rule according to the next equation:                          Table 1: Distribution of Classes

                                                                             Parameter               Value
           CFinew = CFiold − n2 CFiold            (2)                        Size the population      50.0
                                                                             Crossover rate           1.0
  In the experimental study detaild in Section 5 we                          Mutation rate            0.2
used the values n1=0.001 and n2=0.1 and the tuning                           Number of generations   500.0
procedure for 500 iterations.                                                Mark value               0.3

5   Experimental Study                                                Table 2: Parameters of the Proposed Method

The experimental study is aimed to show the appli-
cation of the proposed method and the comparation             rules has a high level of interpretability for the expert
with the classification with non-parametric method            users.
for classifying cells of the Eimeria of Domestic Fowl
based on Morphological Data. The Emeira genus
comprises a group of protozoan parasites that infect
a wide range of hosts, seven different Emeira species
infect the domestic fowl, causing enteritis with sev-
eral economic losses. This protozoan morphology
was represented by 13 features: mean of curvature,
standard deviation of curvature, entropy of curva-
ture, major axis (lenght), minor axis (width), sym-               Figure 3: Proposed Method for Learning Fuzzy Rules
metry through major axis, symmetry through minor
axis, area, entropy of internal structure, second an-            We compared the proposed method with the
gular moment, contrast, inverse difference moment,            method non-parametric classifier proposed in (Bel-
entropy of co-occurrence matrix; these features are           tran, 2007) with the same set of examples. The Ta-
used as the input pattern for the classification pro-         ble 4 shows the classification rate by each class of
cess.                                                         both classifiers. These results shows that the pro-
   The Table 1 shows the class and the number of              posed method (PM) has a similar rate classifica-
examples or instances of each class. More detail              tion (overall 77.17) that the non-parametric method
how the features were extracted or about the Eimeria          (NPM) (overall 80.24), but with a high degree of in-
genus can be found at (Beltran, 2007).                        terpretability. The non-parametric method does not
   The Table 2 shows the parameters for learning              consider the interpretability.
fuzzy rules and the NSGA-II (the MOGA used in
                                                              6     Conclusions
this paper).
   After the process of learning fuzzy rules, the pro-        In this article, we proposed a iterative multiobjective
cess of tuning the weights starts. The Table 3 shows          genetic method to learn fuzzy classification rules.
the result of the classification or dispersion matriz         The fuzzy rules are learned in each iteration depend
after the process tuning the weights.                         of the sequencia of class. After that, the weights of
   After the proposed method, the result is a set of          the each fuzzy rules are tuned using a R&P method.
rules similar of the set shows in the Figure 3. This          The results obtained have indicated that FRBCSs


                                                         56
         ACE      MAX     BRU     MIT     PRA     TEN     NEC          Holland, J. and Reitman, J. 1978. Cognitive Systems
 ACE     85.06    0.00    0.00    2.38     0.13   0.16     7.18
                                                                          Based on Adaptive Algorithms, ACM SIGART Bul-
 MAX     0.00     98.44   0.72    0.00     0.00   0.00     0.00
 BRU     0.00     1.56    87.08   0.00     6.29   1.48     0.74           letin
 MIT     1.73     0.00    0.00    86.79    4.95   1.64     4.70        Gonzalez, A. and Perez, R. 1999. SLAVE: A genetic
 PRA     2.67     0.00    5.50    6.87    69.34   10.53   24.01           learning system based on an iterative approach. IEEE
 TEN     7.86     0.00    6.70    1.19    16.06   82.73   37.62           Transactions on Fuzzy Systems, 7:176–191.
 NEC     2.67     0.00    0.00    2.77     3.21   3.45    25.74        Giordana, A. and Neri, F. 1995. Search-intensive con-
                                                                          cept induction. Evol Comput, 3:375–416.
       Table 3: Results of the Proposed Method                         Casillas, J. and Carse, B. 2009. Special issue on Genetic
                                                                          Fuzzy Systems: Recent Developments and Future Di-
                 Class Name        PM     NPM                             rections. Soft Comput., 13(5):417–418.
                 E. acervulina    85.06   87.70                        Deb, K. and Pratap, A. and Agarwal, S. and Meyarivan,
                 E. maxima        98.44   96.12                           T. 2002. A Fast and Elitist Multiobjective Genetic
                 E. brunetti      87.08   94.98                           Algorithm: NSGA-II. Trans. Evol. Comp., 6(2):182–
                 E. mitis         86.79   86.27                           197.
                 E. praecox       69.34   64.46                        Nozaki, k., Ishibuchi, H., Tanaka, H. 1996. Adaptive
                 E. tenella       82.73   76.53                           fuzzy rule-based classification systems. IEEE Trans.
                 E. necatrix      25.74   55.60                           Fuzzy Systems, 4(3):238–250.
                                                                       Beltran, C. 2007. Anlise e reconhecimento digital de for-
                 Table 4: The PM vs. NPM                                  mas biolgicas para o diagnstico automtico de parasitas
                                                                          do gnero Eimeria. PhD. Teses - USP - Brazil.
                                                                       Kumar, S.U., Inbarani, H.H., Kumar, S.S. 2013. Bijec-
have better interpretability and similar accuracy than                    tive soft set based classification of medical data. In-
a non-parametric method for classify the Eimeria of                       ternational Conference on Pattern Recognition, Infor-
domestic fowl.                                                            matics and Mobile Engineering, 1:517–521.
                                                                       Yongzhi, Ma., Hong Gao, Yi Ding, Wei Liu. 2013. Lo-
                                                                          gistics market segmentation based on extension clas-
References                                                                sification. International Conference on Information
                                                                          Management, Innovation Management and Industrial
Zadeh L. A. 1965. Fuzzy Sets. Information and Control,
                                                                          Engineering, 2:216–219.
   8(3):338–353.
                                                                       Silla, C.N. and Kaestner, C.A.A. 2013. Hierarchi-
Herrera F. 2005. Genetic fuzzy systems: Status,                           cal Classification of Bird Species Using Their Audio
   critical considerations and future directions. Inter-                  Recorded Songs. IEEE International Conference on
   national Journal of Computational Intelligence Re-                     Systems, Man, and Cybernetics, 1:1895–1900.
   search, 1(1):59–67.                                                 Lakhmi C. and Martin N. 1998. Fusion of Neural Net-
Herrera F. 2008. Genetic fuzzy systems: taxonomy, cur-                    works, Fuzzy Systems and Genetic Algorithms: Indus-
   rent research trends and prospects. Evolutionary Intel-                trial Applications (International Series on Computa-
   ligence, 1(1):27–46.                                                   tional Intelligence). CRC Press
Nojima, Y. and Ishibuchi, H. 2013. Multiobjective ge-                  Cordon, O.,Herrera, F., Hoffmann, F., Magdalena, L.
   netic fuzzy rule selection with fuzzy relational rules.                2001. Genetic Fuzzy Systems: Evolutionary Tuning
   IEEE International Workshop on Genetic and Evolu-                      and Learning of Fuzzy Knowledge Bases World Sci-
   tionary Fuzzy Systems (GEFS), 1:60–67                                  entific
Chen, S.-M, Chang, Y.-C., Pan, J.-S. 2013. Fuzzy Rules                 Srinivas, N. and Deb, K. 1994. Multiobjective Opti-
   Interpolation for Sparse Fuzzy Rule-Based Systems                      mization Using Nondominated Sorting in Genetic Al-
   Based on Interval Type-2 Gaussian Fuzzy Sets and Ge-                   gorithms. Evolutionary Computation, 2:221–248.
   netic Algorithms. IEEE Transactions on Fuzzy Sys-                   Pareto, V. 1896. Cours d’Economie Politique. Droz,
   tems, 21(3):412–425.                                                   Genve.
Jalesiyan, H., Yaghubi, M., Akbarzadeh, T.M.R. 2014.                   Hinojosa, E. and Camargo, H.A. 2012. Multiobjective
   Rule selection by Guided Elitism genetic algorithm in                  genetic generation of fuzzy classifiers using the itera-
   Fuzzy Min-Max classifier. Conference on Intelligent                    tive rule learning. International Conference on Fuzzy
   Systems, 1:1–6.                                                        Systems, 1:1–8.
                                                                       Gonzalez, A. and Perez, R. 2009. Improving the genetic
De Jong KA, Spears WM, Gordon DF. 1993. Using ge-
                                                                          algorithm of SLAVE. Mathware and Soft Computing,
   netic algorithms for concept learning. Mach Learn,
                                                                          16:59–70.
   13:161–188.


                                                                  57
          SIFR Project: The Semantic Indexing of French Biomedical Data
                                   Resources
              Juan Antonio Lossio-Ventura,                              Mathieu Roche,
                    Clement Jonquet                                  Maguelonne Teisseire
            LIRMM, CNRS, Univ. Montpellier 2                    TETIS, Cirad, Irstea, AgroParisTech
                   Montpellier, France                                 Montpellier, France
               fName.lName@lirmm.fr                            fName.lName@teledection.fr


                        Abstract                               riquecimiento de ontologı́as, con el fin de poblar
                                                               ontologı́as con los términos extraı́dos.
        The Semantic Indexing of French Biomed-                   El artı́culo es organizado como sigue. Primero
        ical Data Resources project proposes to in-            discutimos sobre la sobre la metodologı́a puesta
        vestigate the scientific and technical chal-           en marcha para este proyecto en la Sección 2.
        lenges in building ontology-based services             La evaluación de la precisión es presentada en
        to leverage biomedical ontologies and ter-             la Sección 3 seguida de las conclusiones en la
        minologies in indexing, mining and re-                 Sección 4.
        trieval of French biomedical data.
                                                               2     Metodologı́a
1       Introducción
                                                               Nuestro trabajo se divide en dos procesos princi-
Hoy en dı́a la gran cantidad de datos disponibles              pales: (i) la extracción de términos biomédicos, y
en lı́nea suele componerse de texto no es-                     (ii) el enriquecimiento de ontologı́as, explicados a
tructurado, por ejemplo reportes clı́nicos, in-                continuación.
formes de reportes adversos, historiales clı́nicos
                                                               2.1     Extracción Automática de Términos
electrónicos (Lossio-Ventura et al., 2013). Reg-
                                                                       Biomédicos
ularmente estos textos son escritos usando un
lenguaje especı́fico (expresiones y términos) us-             La extracción de términos es una tarea esencial
ados por una comunidad. Es por eso existe la                   en la adquisición de conocimiento de un dominio.
necesidad de formalizar e indexar términos o con-             En este trabajo presentamos las medidas creadas
ceptos técnicos. Lo cual implica un gran consumo              para este objetivo. Medidas que se basan en var-
de tiempo.                                                     ios criterios como lingüı́stico, estadı́stico, grafos
   Los términos relevantes son útiles para obtener           y web para mejorar el resultado de extracción
una mayor comprensión de la estructura con-                   de términos biomédicos. Las medidas presen-
ceptual de un dominio.            Estos pueden ser:            tadas a continuación son puestas a disposición de
(i) términos de una sola palabra (sencillo a ex-              la comunidad, bajo la aplicación llamada B IO -
traer), o (ii) términos de varias palabras (difı́cil).        T EX (Lossio-Ventura et al., 2014).
En el ámbito biomédico, hay una gran diferen-                2.1.1    Lingüı́stica
cia entre los recursos existentes (ontologı́as) en
inglés y francés. En Inglés hay cerca de 7 000 000          Estas técnicas intentan recuperar términos gracias
de términos asociados a 6 000 000 de conceptos,               a la formación de patrones. La idea principal es
tales como los de UMLS1 o BioPortal (Noy et al.,               la construcción de reglas para describir las es-
2009). Mientras que, en francés sólo hay alrede-             tructuras de los términos de un dominio medi-
dor de 330 000 términos asociados a 160 000 con-              ante el uso de caracterı́sticas ortográficas, léxicas
ceptos (Neveol et al., 2014). Por lo tanto, hay                o morfo-sintácticas. La idea principal es la con-
una necesidad de enriquecer terminologı́as u on-               strucción de reglas, normalmente de forma man-
tologı́as en francés. Por lo tanto, nuestro tra-              ual, que describen las estructuras comunes de
bajo se compone de dos pasos principales: (i) la               términos para ciertos campos. En muchos ca-
extracción de términos biomédicos, y (ii) el en-            sos también, diccionarios conteniendo términos
                                                               técnicos (e.g., prefijos, sufijos y acrónimos es-
    1
        http://www.nlm.nih.gov/research/umls                   pecı́ficos) son usados para ayudar a extraer


                                                          58
términos (Krauthammer et al., 2004).                          con LIDF-value, donde los nodos representan los
                                                               términos relacionados con otros términos gracias
2.1.2   Estadı́stica                                           a la co-ocurrencia en el corpus.
Las técnicas estadı́sticas se basan en la eviden-
cia presentada en el corpus a través de la infor-             2.1.4 Web
mación contextual. Tales enfoques abordan prin-               Diferentes estudios de Web Mining se enfocan en
cipalmente el reconocimiento de términos gen-                 la similitud semántica, relación semántica. Esto
erales (Van Eck et al., 2010). La mayorı́a de me-              significa para cuantificar el grado en el que al-
didas se basan en la frecuencia. La mayor parte                gunas palabras están relacionadas, teniendo en
de trabajos combinan la información lingüı́stica             cuenta no sólo similitud sino también cualquier
y estadı́stica, tal es el caso de C-value (Frantzi             posible relación semántica entre ellos.        La
et al., 2000) combina la información estadı́stica y           primera medida web creada fue WebR (Lossio-
lingüı́stica tanto para la extracción de términos de        Ventura et al., 2014), finalmente la mejora lla-
varias palabras como de términos largos y anida-              mada WAHI (Lossio-Ventura et al., 2014) (Web
dos. Es la medida más conocida en la liter-                   Association based on Hits Information). Nuestra
atura. En el trabajo de (Zhang et al., 2008), de-              medida basada en la Web tiene por objetivo volver
mostraron que C-value obtiene los mejores resul-               a clasificar la lista obtenida previamente con TeR-
tados comparado a otras medidas. Además del                   Graph. Demostramos con esta medida que la pre-
inglés, C-value también ha sido aplicado a otros             cisión de los k primeros términos extraı́dos su-
idiomas tales como japonés, serbio, esloveno, po-             peran los resultados de las medidas arriba men-
laco, chino (Ji et al., 2007), español (Barrón-              cionadas (ver Sección 3).
Cedeno et al., 2009), árabe. Es por eso, en nue-
                                                               2.2    Enriquecimiento de Ontologı́as
stro primer trabajo (Lossio-Ventura et al., 2013),
la modificamos y adaptamos para el francés.                   El objetivo de este proceso es enriquecer las ter-
   A partir de C-value, hemos creados otras me-                minologı́as u ontologı́as con los términos nuevos
didas, como F-TFIDF-C, F-OCapi, C-OKapi,                       extraı́dos en el proceso anterior. Los tres grandes
C-TFIDF (Lossio-Ventura et al., 2014), estas                   pasos a seguir en este proceso son:
medidas obtienen mejores resultados que C-                      (1) Determinar si un término es polisémico:
value. Finalmente una nueva medida basada                           con la ayuda del Meta-Learning, hemos po-
en la información lingüı́stica y estadı́stica es                  dido predecir con una confianza de 97% si un
LIDF-value (Lossio-Ventura et al., 2014) (pa-                       término es polisémico. Esta contribución será
trones Lingüı́sticos, IDF, and C-value informa-                    valorizada en la conferencia ECIR 2015.
tion), que mejora con gran diferencia los resulta-              (2) Identificar los posibles significados si el
dos obtenidos por las medidas antes citadas.                        término es polisémico: es nuestro trabajo
                                                                    actual, con la ayuda de clustering, cluster-
2.1.3   Grafos
                                                                    ing sobre los grafos tratamos de resolver este
El modelo de grafos es una alternativa al modelo                    problema.
de información, muestra claramente las relaciones              (3) Posicionar el término en una ontologı́a.
entre los nodos gracias a las aristas. Gracias a los
algoritmos de centralidad se puede aprovechar los              3     Experimentaciones
grupos de información en grafos. Existen aplica-
                                                               3.1    Datos, protocolo y validación
ciones de grafos para la Recuperación de Infor-
mación (RI) en el contexto de las redes sociales, de          En nuestros experimentos, hemos usado el corpus
colaboración y sistemas de recomendación (Noh                estándar GENIA2 , el cual es compuesto de 2 000
et al., 2009).                                                 tı́tulos y resúmenes de artı́culos de revistas que han
   Una medida basada en grafos creada para este                sido tomadas de la base de datos Medline, con-
proceso es TeRGraph (Lossio-Ventura et al., 2014)              tiene más de 400 000 palabras. GENIA corpus
(Terminology Ranking based on Graph informa-                   contiene expresiones lingüı́sticas que se refieren a
tion). Esta medida tiene como objetivo mejorar                 entidades con interés en biologı́a molecular tales
la precisión de los primeros k términos extraı́dos           como proteı́nas, genes y células.
después de haber aplicado LIDF-value. El grafo                  2
                                                                   http://www.nactem.ac.uk/genia/
es construido con la lista de términos obtenidos              genia-corpus/term-corpus


                                                          59
3.2 Resultados
Los resultados son evaluados en términos de pre-
cisión obtenidos sobre los primeros k términos
extraı́dos (P @k) para las medidas propuestas y
las medidas base (referencia) para la extracción
de términos compuestos de varias palabras. En
las subsecciones siguientes, limitamos los resul-
tados para la medida basada en grafos con sólo los
primeros 8 000 términos extraı́dos y los resulta-
dos para la medida basada en la web con sólo los
primeros 1 000 términos.                                  Figure 3: Comparación de la precisión de WAHI y
3.2.1   Resultados lingüı́sticos y estadı́sticos          TeRGraph


                                                           grandes procesos.
                                                               El primer proceso Extracción Automática de
                                                           Términos Biomédicos, terminado y siendo val-
                                                           orizado en varias publicaciones citadas anterior-
                                                           mente. En este proceso demostramos que las me-
                                                           didas propuestas mejoran la precisión de la ex-
                                                           tracción automática de términos en comparación
                                                           a las medidas más populares de extracción de
                                                           términos.
                                                               El segundo proceso Enriquecimiento de On-
                                                           tologı́as, a la vez dividio en 3 etapas, es nues-
                                                           tra tarea actual, solo la primera etapa ha sido fi-
                                                           nalizada. En este proceso buscamos encontrar la
Figure 1: Comparación de la precisión de LIDF-           mejor posición de un término en una ontologı́a.
value con las mejores medidas de base                          Como trabajo futuro, pensamos acabar el se-
                                                           gundo proceso. Además, planeamos probar es-
3.2.2   Resultados basados en grafos                       tos enfoques generales sobre otros dominios,
                                                           tales como ecologı́a y agronomı́a. Finalmente,
                                                           planeamos aplicar estos enfoques con corpus en
                                                           español.

                                                           Agradecimientos
                                                           Este proyecto es apoyado en parte por la Agencia
                                                           Nacional de Investigación de Francia bajo el pro-
                                                           grama JCJC, ANR-12-JS02-01001, ası́ como por
                                                           la Universidad de Montpellier 2, el CNRS y el pro-
                                                           grama de becas FINCyT, Perú.


                                                           References
Figure 2: Comparación de la precisión de TeR-
Graph y LIDF-value                                         Barrón-Cedeno, A., Sierra, G., Drouin, P., Ananiadou,
                                                             S. 2009. An improved automatic term recognition
                                                             method for Spanish. Computational Linguistics, In-
3.2.3   Resultados basados en la web                         telligent Text Processing, pp. 125-136. Springer.
4   Trabajo Futuro                                         Frantzi K., Ananiadou S., Mima, H. 2000. Automatic
                                                             recognition of multiword terms: the C-value/NC-
Este artı́culo presenta la metodologı́a propuesta            value Method. International Journal on Digital Li-
para el proyecto SIFR. Este proyecto consta de dos           braries, (3):115-130.


                                                      60
Ji, L., Sum, M., Lu, Q., Li, W., Chen, Y. 2007. Chinese        Van Eck, N.J., Waltman, L., Noyons, E.CM., Buter,
    Terminology Extraction Using Window-Based Con-               R.K. 2010. Automatic term identification for bib-
    textual Information. Proceedings of the 8th Inter-           liometric mapping. Scientometrics, vol. 82, pp. 581-
    national Conference on Computational Linguistics,            596.
    Intelligent Text Processing (CICLing07), pp. 62-74.
    Springer-Verlag, Mexico City, Mexico.                      Zhang, Z., Iria, J., Brewster, C., Ciravegna, F. 2008.
                                                                 A Comparative Evaluation of Term Recognition Al-
Krauthammer, M., Nenadic, G. 2004. Term Iden-                    gorithms. Proceedings of the Sixth International
  tification in the Biomedical Literature. Journal of            Conference on Language Resources, Evaluation
  Biomedical Informatics, vol. 37, pp. 512-526. Else-            (LREC08). Marrakech, Morocco.
  vier Science, San Diego, USA.
Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teis-
  seire M. 2014. B IOT EX: A system for Biomed-
  ical Terminology Extraction, Ranking, and Valida-
  tion. Proceedings of the 13th International Seman-
  tic Web Conference (ISWC’14). Trento, Italy.
Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teisseire
  M. 2014. Integration of linguistic and Web infor-
  mation to improve biomedical terminology ranking.
  Proceedings of the 18th International Database En-
  gineering and Applications Symposium (IDEAS’14),
  ACM. Porto, Portugal.
Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teis-
  seire M. 2014. Yet another ranking function to
  automatic multi-word term extraction. Proceed-
  ings of the 9th International Conference on Natural
  Language Processing (PolTAL’14), Springer LNAI.
  Warsaw, Poland.
Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teis-
  seire M. 2014. Biomedical Terminology Ex-
  traction: A new combination of Statistical, Web
  Mining Approaches.      Proceedings of Journées
  internationales d’Analyse statistique des Données
  Textuelles (JADT2014). Paris, France.
Lossio-Ventura, J.A., Jonquet, C., Roche, M., Teisseire
  M. 2013. Combining C-value, Keyword Extraction
  Methods for Biomedical Terms Extraction. Pro-
  ceedings of the Fifth International Symposium on
  Languages in Biology, Medicine (LBM13), pp. 45-
  49, Tokyo, Japan.
Neveol, A., Grosjean, J., Darmoni, S., Zweigenbaum,
  P. 2014. Language Resources for French in the
  Biomedical Domain. Proceedings of the 9th In-
  ternational Conference on Language Resources and
  Evaluation (LREC’14). Reykjavik, Iceland
Noh, TG., Park, SB., Yoon, HG., Lee, SJ., Park,
  SY. 2009. An Automatic Translation of Tags for
  Multimedia Contents Using Folksonomy Networks.
  Proceedings of the 32nd International ACM SIGIR
  Conference on Research, Development in Informa-
  tion Retrieval SIGIR ’09, pp. 492-499. Boston, MA,
  USA, ACM.
Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf,
 M., Griffith, N., Jonquet, C., Rubin, D. L., Storey,
 M., Chute, C.G., Musen, M. A. 2009. BioPortal:
 ontologies and integrated data resources at the click
 of a mouse. Nucleic acids research, vol. 37(suppl
 2), pp 170–173.


                                                          61
     Mathematical modeling of the performance of a computer system.

                                    Félix Armando Fermín Pérez
                             Universidad Nacional Mayor de San Marcos
                             Facultad de Ingeniería de Sistemas e Informática
                                               Lima, Perú
                                    fferminp@unmsm.edu.pe


                                                             intervención humana, que Fox y Patterson (2003)
                     Abstract                                también han identificado como parte principal del
                                                             problema, debido al uso de procedimientos ad hoc
Generally the management of computer systems is
                                                             donde la gestión de recursos aún depende
manual; autonomic computing by self-management
                                                             fuertemente del control y administración manual.
tries to minimize human intervention using
                                                             Sobre la computación autonómica, Kephart y
autonomic controllers; for this, first the
                                                             Chess (2003) mencionan que la idea es que un
mathematical modeling of the system under study
                                                             sistema informático funcione igual que el sistema
is performed, then is designed a controller that
                                                             nervioso autonómico humano cuando regula
governs the behavior of the system. In this case,
                                                             nuestra temperatura, respiración, ritmo cardíaco y
the determination of the mathematical model of a
                                                             otros sin que uno se halle siempre consciente de
web server is based on a black box model by
                                                             ello, esto es, promueve la menor intervención
system identification, using data collected and
                                                             humana en la administración de la performance de
stored in its own log during operation of the
                                                             los sistemas informáticos tendiendo hacia la auto-
computer system under study.
                                                             administración de los mismos. Tal auto-
                                                             administración se caracteriza por las propiedades
Keywords:     autonomic      computing,         self-
                                                             de auto-configuración (self-configuration), auto-
management, system identification.
                                                             curación (self-healing), auto-optimización (self-
1    Introducción                                            optimization) y auto-protección (self-protection).

La tecnología influye en cada aspecto de la vida             En la visión de la computación autonómica los
cotidiana; las aplicaciones informáticas, por                administradores humanos simplemente especifican
ejemplo, son cada vez más complejas,                         los objetivos de alto nivel del negocio, los que
heterogéneas y dinámicas, pero también la                    sirven como guía para los procesos autonómicos
infraestructura de información, como la Internet             subyacentes. Así los administradores humanos se
que incorpora grandes cantidades de recursos                 concentran más fácilmente en definir las políticas
informáticos y de comunicación, almacenamiento               del negocio, a alto nivel, y se liberan de tratar
de datos y redes de sensores, con el riesgo de que           permanentemente con los detalles técnicos de bajo
se tornen frágiles, inmanejables e inseguras. Así, se        nivel, necesarios para alcanzar los objetivos, ya
hace necesario contar con administradores de                 que estas tareas son ahora realizadas por el sistema
servicios    y     de    servidores    informáticos,         autonómico mediante un controlador autonómico
experimentados y dedicados, además de                        en lazo cerrado, que monitorea el sistema
herramientas software de monitoreo y supervisión,            permanentemente, utiliza los datos recolectados del
para asegurar los niveles de calidad de servicio             propio sistema en funcionamiento, compara estas
pactados (Fermín, 2012).                                     métricas con las propuestas por el administrador
                                                             humano, y controla la performance del sistema, por
Diao et al. (2005) promueven la utilización de la            ejemplo, manteniendo el tiempo de respuesta del
computación autonómica mediante sistemas de                  sistema dentro de niveles prefijados.
control automático en lazo cerrado, reduciendo la


                                                        62
De acuerdo con la teoría de control, el diseño de un        estudio, con herramientas de análisis utilizando
controlador depende de un buen modelo                       técnicas de análisis estadísticas, principalmente.
matemático. En el caso de un sistema informático,
primero debe tenerse un modelo matemático para              Entre las métricas de performance inicialmente se
luego diseñar un controlador en lazo cerrado o              encontraba la velocidad de procesamiento, pero al
realimentado, pero sucede que los sistemas                  agregarse más componentes a la infraestructura
informáticos son bastante complicados de modelar            informática, surgieron nuevas métricas, siendo las
ya que su comportamiento es altamente                       principales las que proporcionan una idea del
estocástico. Según Hellerstein et al (2004) se ha           rendimiento o trabajo realizado en un periodo de
utilizado la teoría de colas para modelarlos,               tiempo, la utilización de un componente, o el
tratándolos como redes de colas y de servidores,            tiempo en realizar una tarea en particular como por
bastante bien pero principalmente en el modelado            ejemplo el tiempo de respuesta. Lalanda et al.
del comportamiento estacionario y no cuando se              (2013) mencionan que las métricas de performance
trata de modelar el comportamiento muchas veces             más populares son las siguientes:
altamente dinámico de la respuesta temporal de un           - Número de operaciones en punto flotante por
sistema informático en la zona transitoria, donde la            segundo (FLOPS), representa una idea del
tarea se complica.                                              rendimiento del procesamiento, realiza
                                                                comparaciones entre máquinas que procesan
De manera que en el presente artículo se trata el               complejos algoritmos matemáticos con punto
modelado matemático de un sistema informático                   flotante en aplicaciones científicas.
mediante la identificación de sistemas, enfoque             - Tiempo de respuesta, representa la duración en
empírico donde según Lung (1987) deben                          tiempo que le toma a un sistema llevar a cabo
identificarse los parámetros de entrada y salida del            una unidad de procesamiento funcional. Se le
sistema en estudio, basándose en los datos                      considera como una medición de la duración
recolectados      del     mismo       sistema     en            en tiempo de la reacción a una entrada
funcionamiento, para luego construir un modelo                  determinada y es utilizada principalmente en
paramétrico, como el ARX por ejemplo, con las                   sistemas interactivos. La sensibilidad es
técnicas estadísticas de autoregresión. La sección 2            también una métrica utilizada especialmente en
describe alguna teoría básica sobre las métricas                la medición de sistemas en tiempo real,
para el monitoreo de la performance de un sistema               consiste en el tiempo transcurrido entre el
informático. En la sección 3 se trata el modelado               inicio y fin de la ejecución de una tarea o hilo.
matemático, y en la sección 4 se describe el                - Latencia, medida del retardo experimentado en
experimento realizado; finalmente en la sección 5               un sistema, generalmente se le utiliza en la
se describen las conclusiones y trabajos futuros.               descripción de los elementos de comunicación
                                                                de datos, para tener una idea de la performance
2   Monitoreo de la performance.                                de la red. Toma en cuenta no solo el tiempo de
                                                                procesamiento de la CPU sino también los
En la computación autonómica los datos obtenidos
                                                                retardos de las colas durante el transporte de
del monitoreo de la performance del sistema en
                                                                un paquete de datos, por ejemplo.
estudio contribuye fundamentalmente en la
                                                            - Utilización y carga, métricas entrelazadas y
representación del estado o del comportamiento del
                                                                utilizadas para comprender la función de
sistema, esto es, en el modelo matemático del
                                                                administración de recursos y proporcionan una
mismo. Según Lalanda et al. (2013), conocer el
                                                                medida de cuan bien se están utilizando los
estado del sistema desde las perspectivas
                                                                componentes de un sistema y se describe como
funcionales y no funcionales es vital para llevar a
                                                                un porcentaje de utilidad. La carga mide el
cabo las operaciones necesarias que permitan
                                                                trabajo realizado por el sistema y usualmente
lograr los objetivos en el nivel deseado y el
                                                                es representado como una carga promedio en
monitoreo de la performance permite saber cuan
                                                                un periodo de tiempo.
bien lo está logrando. Generalmente los datos de la
performance se consiguen vía el log del sistema en
                                                            Existen muchas otras métricas de performance:
                                                            - número de transacciones por unidad de costo.


                                                       63
- función de confiabilidad, tiempo en el que un
sistema ha estado funcionando sin fallar.
- función de disponibilidad, indica que el sistema
está listo para ser utilizado cuando sea necesario.
- tamaño o peso del sistema, indica la portabilidad.
- performance por vatio, representa la tasa de
cómputo por vatio consumido.
- calor generado por los componentes ya que en
sistemas grandes es costoso un sistema de
refrigeración.

Todas ellas entre otras más permiten conocer
mejor el estado no funcional de un sistema o
proporcionar un medio para detectar un evento que
ha ocurrido y ha modificado el comportamiento no
funcional. Así, en el presente caso se ha elegido
inicialmente como métrica al tiempo de respuesta,
ya que el sistema informático en estudio es un              Figura N° 1. Modelado matemático basado en la
servidor web, por esencia de comportamiento                 teoría de control. (Adaptado de Hellerstein, 2004).
interactivo, de manera que lo que se mide es la
duración de la reacción a una entrada determinada.          En contraste, según Ljung (1987) la identificación
                                                            de sistemas es un enfoque empírico donde debe
                                                            identificarse los parámetros de entrada y salida del
3   Modelado matemático.                                    sistema en estudio, para luego construir un modelo
                                                            paramétrico, como el ARX por ejemplo, mediante
En la computación autonómica, basado en la teoría           técnicas estadísticas de autoregresión. Este modelo
de control realimentado, para diseñar un                    o ecuación paramétrica relaciona los parámetros de
controlador autonómico primero debe hallarse el             entrada y de salida del sistema de acuerdo a la
modelo matemático del sistema en estudio, tal               siguiente ecuación:
como se muestra en la Figura N° 1. El modelo
matemático de un servidor informático se puede                      y(k+1) = Ay(k) + Bu(k)                  (1)
hallar mediante dos enfoques: uno basado en las
leyes básicas, y otro en un enfoque empírico                donde y(k)      : variable de salida
denominado identificación de sistemas. Parekh et                  u(k)      : variable de entrada
al. (2002) indican que en trabajos previos se ha                  A, B      : parámetros de autoregresión
tratado de utilizar principios, axiomas, postulados,                k       : muestra k-ésima.
leyes o teorías básicas para determinar el modelo
matemático de sistemas informáticos, pero sin               Este enfoque empírico trata al sistema en estudio
éxito ya que es difícil construir un modelo debido a        como una caja negra, de manera que no afecta la
su naturaleza compleja y estocástica; además es             complejidad del sistema o la falta de conocimiento
necesario tener un conocimiento detallado del               experto, incluso cuando se actualicen las versiones
sistema en estudio, más aún, cuando cada cierto             del software bastaría con estimar nuevamente los
tiempo se va actualizando las versiones del                 parámetros del modelo. Así, para un servidor web
software, y finalmente que en este enfoque no se            Apache, la ecuación paramétrica relaciona el
considera la validación del modelo.                         parámetro entrada, Max Clients (MC), un
                                                            parámetro de configuración del servidor web
                                                            Apache que determina el número máximo de
                                                            conexiones simultáneas de clientes que pueden ser
                                                            servidos; y el parámetro de salida Tiempo de
                                                            respuesta (TR), que indica lo rápido que se


                                                       64
responde a las solicitudes de servicio de los               recoge los tiempos de inicio y fin de cada http que
clientes del servidor, ver Figura N° 2.                     ingresa al servidor web, datos que se almacenan en
                                                            el log del mismo servidor, y luego se realiza el
                                                            cálculo del tiempo de respuesta de cada http
                                                            completado en una unidad de tiempo y el tiempo
                                                            de respuesta promedio del servidor. El servidor
                                                            web por sí mismo no hace nada, por ello, la
                                                            computadora personal PC2 contiene un generador
                                                            de carga de trabajo, que simula la actividad de los
Figura N° 2. Entrada y salida del sistema a                 usuarios que desean acceder al servidor web; en
modelar. (Elaboración propia).                              este caso se utilizó el JMeter una aplicación
                                                            generadora de carga de trabajo y que forma parte
En (Hellerstein et al, 2004) se propone realizar la         del proyecto Apache.
identificación de sistemas informáticos, como los
servidores web, de la siguiente manera:                     La operación del servidor web, no solo depende de
1.- Especificar el alcance de lo que se va a modelar        la actividad de los usuarios, sino también de la
en base a las entradas y salidas consideradas.              señal de entrada MaxClients (MC), que toma
2.- Diseñar experimentos y recopilar datos que              forma de una sinusoide discreta variable que excita
sean suficientes para estimar los parámetros de la          al servidor web junto con las solicitudes http del
ecuación diferencia lineal del orden deseado.               generador de carga de trabajo (PC2). De manera
3.- Estimar los parámetros del modelo utilizando            que con la actividad del servidor web almacenada
las técnicas de mínimos cuadrados.                          en su log, un sensor software calcula los valores de
4.- Evaluar la calidad de ajuste del modelo y si la         la señal de salida Tiempo de Respuesta (TR),
calidad del modelo debe mejorarse, entonces debe            mostrados en la Figura N° 4.
revisarse uno o más de los pasos anteriores.


4    Experimento.
La arquitectura del experimento implementado, se
muestra en la Figura Nº 3.


                                                            Figura N° 4. Señal de entrada MaxClients MC y
                                                            señal de salida Tiempo de respuesta TR.

                                                            Con los datos de MC y TR obtenidos se estiman
                                                            los parámetros de regresión A y B de la ecuación
                                                            paramétrica ARX, haciendo uso de las técnicas de
Figura N° 3. Arquitectura para la identificación del        mínimos cuadrados, implementadas en el ToolBox
modelo de un servidor web (Elaboración propia).             Identificación de Sistemas del Matlab, logrando la
                                                            siguiente ecuación paramétrica de primer orden:
La computadora personal PC1 es el servidor web
Apache, asimismo contiene el sensor software que             TR(k+1) = 0.06545TR(k) + 4.984MC(k+1)


                                                       65
Se puede observar que el Tiempo de respuesta                ARX, aunque más adelante se planteará el diseño
actual depende del Tiempo de respuesta anterior y           de controladores autonómicos no lineales para
del parámetro de entrada MaxClients. El modelo              sistemas informáticos en general, ya que el
hallado es evaluado utilizando la métrica r2,               comportamiento temporal no lineal, hace adecuada
calidad de ajuste, del ToolBox utilizado, que indica        la utilización de técnicas de inteligencia artificial
el porcentaje de variación respecto a la señal              como la lógica difusa y las redes neuronales.
original. En el caso de estudio, el modelo hallado
tiene una calidad de ajuste del 87%, lo que se
puede considerar como un modelo aceptable. En la            6    Referencias bibliográficas.
Figura N° 5 puede observarse la gran similitud
                                                            Diao, Y.; Hellerstein, J.; Parekh, S.; Griffith, R.;
entre la señal de salida medida y la señal de salida
                                                            Kaiser, G.; Phung, D. (2005) A Control Theory
estimada.
                                                            Foundation for Self-Managing Computing
                                                            Systems, IEEE Journal on Selected Areas in
                                                            Communications, Vol. 23, Issue 12, pp. 2213 –
                                                            2222. IEEE. DOI: 10.1109/JSAC.2005.857206.

                                                            Fermín F. (2012). La Teoría de Control y la
                                                            Gestión Autónoma de Servidores Web. Memorias
                                                            del IV Congreso Internacional de Computación y
                                                            Telecomunicaciones. ISBN 978-612-4050-57-2.

                                                            Fox, A.; Patterson, D. (2003). Self-repairing
     Figura N° 5. TR medida y TR estimada.                  computers, Scientific American, Jun2003, Vol. 288
                                                            Issue 6, pp. 54-61.

5   Conclusiones.                                           Hellerstein, J.; Diao, Y.; Parekh, S.; Tilbury, D.
                                                            (2004). Feedback Control of Computing Systems.
La identificación del comportamiento de un                  Hoboken, New Jersey: John Wiley & Sons, Inc.
sistema informático tratado como una caja negra es          ISBN 0-471-26637-X
posible, para ello debe simularse el funcionamiento
del mismo con hardware y herramientas software              Kephart, J.; Chess, D. (2003). The vision of
como el Jmeter y Matlab, por ejemplo.                       autonomic computing. Computer, Jan2003, Vol.
                                                            36, Issue 1, pp. 41-50. IEEE. DOI:
El modelo matemático determinado en base a los              10.1109/MC.2003.1160055
datos recopilados del log del sistema en estudio se
aproxima bastante al modelo real, en este caso se           Lalanda, P.; McCann, J.; Diaconescu, A. (2013).
obtuvo un 87% de calidad de ajuste.                         Autonomic Computing. Principles, Design and
                                                            Implementation. London: Springer-Verlag. ISBN
El sensor software implementado ha permitido                978-1-4471-5006-0
calcular el tiempo de respuesta en base a los datos
almacenados en el log del mismo servidor.                   Ljung L. (1987). System Identification: Theory for the
                                                            User. Englewood Cliffs, New Jersey: Prentice Hall.
De los datos tomados se observa que los sistemas
informáticos poseen un comportamiento cercano al            Parekh, S.; Gandhi, N.; Hellerstein, J.; Tilbury, D.;
lineal solo en tramos por lo que se sugiere                 Jayram, T.; Bigus, J. (2002). Using control theory
experimentar con modelos matemáticos no lineales            to achieve service level objectives in performance
para comparar la calidad de ajuste de ambos.                management. Real-Time Systems. Jul2002, Vol.
                                                            23, Issue 1/2, pp. 127-141. Norwell,
Como trabajos futuros se plantea diseñar un                 Massachusetts: Kluwer Academic Publishers. DOI:
controlador autonómico basado en el modelo lineal           10.1023/A:1015350520175.


                                                       66
  Clasificadores supervisados para el análisis predictivo de muerte y
                         sobrevida materna
                                      Pilar Vanessa Hidalgo León

                     Universidad Andina del Cusco/San Jerónimo, Cusco-Perú
                                   phidalgo@uandina.edu.pe

                                                          cuando este problema requiere prontitud por su
                     Resumen                              naturaleza, los clasificadores transforman los
                                                          datos en conocimiento y aportan aplicaciones
El presente trabajo se basa en el análisis de los
clasificadores supervisados que puedan generar            exitosas.
resultados aceptables para la predicción de la muerte     En el caso de factores de riesgo para la salud
y sobrevida materna, según características de             materna, existen estudios estadísticos y
pacientes complicadas durante su gestación                aplicaciones salubristas para determinarla, mas
determinada por los expertos salubristas. Se describe
                                                          no integrados simultáneamente como parte de
la metodología del desarrollo, las particularidades de
la muestra, además los instrumentos utilizados para el    una probabilidad clasificatoria como modelo.
procesamiento de los datos. Los resultados de la          Por ello, este estudio determinará, el clasificador
investigación luego de la evaluación de cada              supervisado más eficiente en tiempo y resultado
clasificador y entre ellos el que mejores resultados      que establezca la diferencia de clases entre
arroja. Los histogramas acerca de cada atributo de las
                                                          pacientes gestantes complicadas durante su
pacientes, y la inclusión en la muestra. Los
parámetros determinantes para su correcta                 embarazo que pueden llegar a presentar síntomas
clasificación. La comparación de cada resultado entre     fatales y las que no, así apoyar al personal de
cada tipo de clasificador dentro de la familia a la que   salud a tomar la decisión más optima y a
pertenece para después de identificado, implementar,      prevenir futuras alzas en el índice de mortalidad
el algoritmo de Naive-Bayes con estimador de              de su comunidad.
Núcleo activado (KERNEL=TRUE), en un software
que contribuya a la toma de decisiones certera y          Los problemas que generan el alza de este
respaldada para los profesionales de la salud. En         indicador ya son conocidos y puestos en valor en
conclusión se encontró un clasificador supervisado        esta investigación:
que responde positivamente a dar cambio y mejora de
                                                          “Hoy en día existe suficiente evidencia que
la problemática que abarca a la sobrevida materna a
pesar de sus complicaciones.
                                                          demuestra que las principales causas de la
                                                          muerte materna son la hemorragia posparto, la
Palabras Clave: Mortalidad materna,                       preclampsia o la sepsis y los problemas
clasificadores supervisados, Redes bayesianas,            relacionados con la presentación del feto.
Aprendizaje supervisado.                                  Asimismo, sabemos cuáles son las medidas más
                                                          eficaces y seguras para tratar estas emergencias
    1    Introducción                                     obstétricas. Para poder aplicarlas, es necesario
                                                          que la gestante acceda a un establecimiento de
El objetivo en el uso de clasificadores                   salud    con     capacidad     resolutiva,    pero
supervisados (Araujo, 2006), es construir                 lamentablemente muchas mujeres indígenas no
modelos que optimicen un criterio de                      acuden a este servicio por diversas razones, tanto
rendimiento, utilizando datos o experiencia               relacionadas con las características geográficas,
previa. En ausencia de la experiencia humana,             económicas, sociales y culturales de sus grupos
para resolver una disyuntiva que requiere                 poblacionales, como por las deficiencias del
explicación precisa, los sistemas implementados           propio sistema de salud.
por modelos clasificadores han sido parte
importante en la toma de decisiones. A parte,             En los últimos años se han hecho muchos


                                                   67
esfuerzos para revertir esta situación, tanto            ya que el estado del registro de las historias
mediante proyectos promovidos por el estado              clínicas correspondientes a los casos de
como      ejecutados    por    organismos    no          fallecimiento no son legibles, ni están
gubernamentales de desarrollo. Estos esfuerzos           conservadas en las mejores condiciones en los
han tenido, sin embargo, resultados desiguales           archivos de la Dirección Regional de Salud
debido principalmente a la poca adecuación de            Cusco, esto hace que la muestra no pueda ser
los proyectos al contexto geográfico y de                nutrida con mayor diversidad de datos.
infraestructura en el que vive gran parte de la          (DIRESA, 2007)
población indígena, a sus dificultades
económicas para acceder al servicio, su cultura,         Existe poca investigación acerca del tema
sus propios conceptos de salud y enfermedad, y           relacionado con el uso de clasificadores
su sistema de salud.”(Cordero, 2010)                     supervisados y otras correspondientes a
                                                         diagnósticos médicos que tienen similitud con la
2 Contenido                                              muestra pero ninguna que se relacione
                                                         directamente.
Problema
                                                         La relevancia de los datos se limito a los
¿Cuáles son los clasificadores supervisados que
                                                         antecedentes sobre estudios en mortalidad
predicen la muerte o la sobrevida materna con
                                                         materna (edad, estado civil, analfabeta,
mayor efectividad?
                                                         ocupación, procedencia, anticoncepción, entorno
Entonces para determinar adecuadamente la                (estrato social), controles pre-natales, ubicación
efectividad   de   cada   algoritmos  nos                domiciliaria, tiempo de demora en atención,
cuestionamos:                                            atención profesional, antecedentes familiares,
                                                         espacio intergenésico (en años), paridad (número
    •       ¿Cuál     es la especificidad, la            de hijos), complicaciones no tratadas,
            clasificación correcta y el error absoluto   fallecimiento), conservando el anonimato de
            y     sensibilidad    del     clasificador   cada paciente.
            supervisado en relación a los datos de
            mortalidad materna?                          Objetivos

        -      Redes neuronales                               •   Determinar el clasificador supervisado
                                                                  que brinde mejores resultados para el
        -      Redes Bayesianas                                   análisis predictivo     de muerte y
                                                                  sobrevida materna.
        -      Regresión Logística
                                                                  Luego para lograr este objetivo se debe:
        -      Arboles de Decisión
                                                              •   Determinar       la   especificidad,    la
        -      Algoritmos Basados en Distancias
                                                                  clasificación correcta, el error absoluto,
Mediante la herramienta Weka, se determinó la                     la     sensibilidad    del    clasificador
sensibilidad, la certeza más cercana de cada uno                  supervisado en relación a los datos de
de estos algoritmos, y cuya conclusión sugerirá                   muerte y sobrevida materna.
el más eficiente.
                                                                      -   Redes neuronales
Actualmente la problemática en mortalidad                             -   Redes Bayesianas
materna es un indicador determinante de
desarrollo en los países Latinoamericanos.                            -   Regresión Logística
Siendo no solo un indicador de pobreza y
desigualdad sino de vulnerabilidad de los                             -   Arboles de Decisión
derechos de la mujer. (OMS, 2008)
                                                                      -   Algoritmos Basados en
Limitaciones de la Investigación                                          Distancias

Los datos recolectados para este estudio con
respecto a pacientes fallecidas que tuvieron
complicaciones durante el embarazo fueron 48,


                                                         68
Hipótesis General.                                          control sanitario de mortalidad materna de la
                                                            Región Cusco.

Hi: Existen clasificadores supervisados que                 El número es limitado, pues las historias desde
predicen la muerte o la sobrevida materna con               1992 al 2011, no han sido redactadas ni
efectividad                                                 conservadas en el mejor estado haciendo difícil
                                                            la tarea de interpretar los datos suficientes para
H. nula: No existen clasificadores supervisados             ser analizados.
que predicen la muerte o la sobrevida materna
con efectividad                                             Estas pacientes no fueron necesariamente
                                                            atendidas desde el inicio en estos
                                                            establecimientos, sino que debido a sus
                                                            complicaciones durante el parto y e embarazo
Definición de Variables:                                    fueron derivadas a las capitales y luego a los
                                                            establecimientos      de    mayor    capacidad
Variable principal:                                         resolutiva para su atención.
       Clasificadores supervisados                          Los datos de las historias clínicas que
                                                            incluyeran:
Variables Implicadas:
                                                                       •   Edad
                                                                       •   Estado civil
   Variable          Clasificadores supervisados
                                                                       •   Analfabeta
  Dimensión       Las Redes Neuronales, Algoritmos
                                                                       •   Ocupación
                        supervisados, Las Redes
                   Bayesianas, Arboles de decisión,                    •   Procedencia
                   Regresión Logística, Algoritmos                     •   Anticoncepción
                          basados en instancias                        •   Entorno (estrato social)
 Indicador/       Clasificación correcta,
 Criterios de     Clasificación incorrecta,                            •   Controles pre-natales
  Medición        Sensibilidad,Especificidad, Tiempo                   •   Ubicación domiciliaria
                  de ejecución                                         •   Tiempo de demora en atención
                  Mean absolute error
                  Kappa statistic
                                                                       •   Atención profesional
                  Root mean squared erro, Relative                     •   Antecedentes familiares
                  absolute error                                       •   Espacio intergenésico (en años)
                  Root relative squared error                          •   Paridad (#de hijos)
    Clase                  Numérica discreta
                                                                       •   Complicaciones no tratadas
 Instrumento            Weka 3.5.7, Explored
                                                                       •   Fallecimiento

            Tabla I: Variables implicadas                   Entre 52 sobrevivientes y 48 fallecidas, ambos
                                                            grupos con similares características, siendo
                                                            factores determinantes: (Ramírez, 2009)
Metodología de investigación
                                                              Ubicación domiciliaria /tiempo demora en
Cuasi Experimental, Aplicada, Inductiva                                      atención
                                                                     Controles > 2: n (49.0/2.0)
Descripción de la muestra y método de                           PRODECENCIA = rural: s (6.0/1.0)
recolección
                                                            Técnica e instrumentos de investigación
Los datos recolectados en todas 100 historias               Se utilizó la herramienta Weka Explorer para
clínicas (HC) de casos de sobrevida y de casos              la interpretación de los datos.
de muerte materna.                                          Las opciones de clasificación supervisada y los
                                                            algoritmos que propone esta herramienta.
 Las características de las gestantes son muy
                                                            (Corso, 2009)
 similares entre si y corresponden a la población
 de la ciudad del Cusco, del archivo en la Red
 Sur de la Dirección Regional de Salud sobre el


                                                       69
 Se evaluaron los siguientes clasificadores                          o    Identificación de las fuentes de
 supervisados:                                                            información externas e internas y
                                                                          selección del subconjunto de
      • Las Redes Neuronales                                              datos necesario.
          - MultilayerPerceptron                          •       Pre procesamiento: estudio de la calidad
          - RBFNetwork                                            de los datos y determinación de las
    • Las Redes Bayesianas                                        operaciones de minería que se pueden
          - BayesNet                                              realizar.
          - Bayes simple estimator                        •       Transformación de datos: conversión de
          - BMA bayes                                             datos en un modelo analítico.
          - Naive-Bayes                                   •       Análisis de datos interpretación de los
          - BayesNet Kernel                                       resultados obtenidos en la etapa anterior,
          - Naive-Bayes                                           generalmente con la ayuda de una
          - Discretizacion Supervisada                            técnica de visualización.
    • Arboles de decisión                                 •       Asimilación         de      conocimiento
          - J48                                                   descubierto(Calderón, 2006)
          - DecisionTable                                 •       Minería      de     datos:    tratamiento
    • Regresión Logística                                         automatizado de los datos seleccionados
          - MultiClassClassifier                                  con una combinación apropiada de
          - Logistic                                              algoritmos. (Ramirez, 2011).
     • Algoritmos basados en Distancias                            NEGATIVO         POSITIVO       VALORES
          - IBK                                   Edad             Menor a 19 y     Entre 19-35    14-48
          - LWL                                                    mayor 35
          - KStar                                 Estado civil     Soltera          Pareja         Soltera-pareja
Estos resultados fueron comparados con las        Analfabeta       Analfabeta       Primaria       Analfabeta-
reglas de clasificación que nos proporcionará                                                      primaria-
                                                                                                   secundaria-
los algoritmos de predicción inmediata:                                                            superior
                                                  Ocupación        No               Remunerada     Remunerada-no
              -   OneR                                             remunerada                      remunerada
                                                  Procedencia      Rural            Urbana         Rural-urbana
              -   ZeroR
                                                  Anticoncepc      No               Si             Si-no
                                                  ión
 Procedimiento de recolección de datos            Entorno          Bajo             Medio-alto     Baja-alta
                                                  (estrato
                                                  social)
 Las Historias Clínicas se insertaron en un       Controles        De 1 a 5         6 a mas        0-12
 fichero CSV (delimitado por comas) cuya
                                                  Ubicación        Más de       2   Menos 3 del    Menos de 1 hora,
 cabecera contiene las etiquetas de cada          domiciliaria/    horas            ee.ss          1-2,3-5,6 a mas
 atributo, y la última columna se refiere a la    tiempo
 clase a la pertenecen.                           demora en
                                                  atención
 Con respecto a los indicadores particulares en   Personal de      No               Si             No-si
 cada uno de los atributos de los sujetos de la   atención
 clase tenemos los siguientes valores: Tabla II   profesional
                                                  Antecedente      No               Si             No-si
                                                  s familiares
  •       La calidad de la estructuración         Espacio          Menos de 2       Entre 2-4      Primera gesta,1-
  •       Comparar todas mediciones en cada       intergenésic
                                                  o (en años)
                                                                   mayor a 4                       3,4-6, menos 1

          clasificador por algoritmo.             Paridad (#de     Primípara o      Entre 2-4      0-10
  •       Evaluar los resultados.                 hijos)           mas de 4
                                                  Complicacio      Complicacion     Sin            No-si
  •       Interpretar los resultados.             nes        no    es antes y       complicacion
                                                  tratadas         durante          es
 2.9. Plan de análisis de la información          Fallecida        Si               No             No-si


      •    Determinación de objetivos                Tabla II: Rango de valores de los atributos en las
      •    Preparación de datos                                   pacientes de la muestra
      •    Selección de datos:


                                                     70
Histogramas:


Cada Atributo es evaluado visualmente por los          •     Fracción de verdaderos negativos (FVN).
histogramas que arroja Weka (en total 15), por               Demuestra la cantidad de pacientes que
ejemplo con respecto a la EDAD de las pacientes              realmente pertenecen a la clase
de la muestra.                                               Sobreviviente. Quiere decir que si el
                                                             algoritmo estudiado tiene alto porcentaje
                                                             de especificidad determina con gran éxito
                                                             la probabilidad de sobrevida en pacientes
                                                             complicadas durante su embarazo según
                                                             los datos proporcionados en la ficha de
                                                             antecedentes.

                                                           • Clasificación correcta: de la totalidad de
                                                             datos, entre los que 52 que pertenecen a la
                                                             clase Sobreviviente, y los 48 que pertenece
                                                             a la clase Fallecida, determina dentro de
                                                             cada clase cuantas instancias luego de la
                                                             construcción del clasificador cuantas si
                                                             pertenecen a la clase determinada.
Gráfico 1: Histograma de la Edad de las
pacientes de la muestra
                                                       •     En el caso de pertenecer a la clase
Este histograma nos muestra el intervalo de edad             sobreviviente o a la fallecida de las 100
de las pacientes de la muestra, la diferencia de             instancias cuantas fueron clasificadas
colores determinan la clase a la que pertenece               correctamente.
cada intervalo. Podemos observar lo siguiente:
                                                       •     Clasificación incorrecta: del mismo
      •    Las pacientes del intervalo 14-20 años            modo la cantidad de instancias que no
           pertenecen en su mayoría a la clase               fueron clasificadas de manera correcta,
           “sobreviviente”                                   son las que de manera supervisada se sabe
                                                             que pertenecen a una u otra clase y fueron
      •    Las pacientes del intervalo 21-31 años            incluidas dentro de la cual no eran. Si el
           tienen un mayor porcentaje de sobrevida,          indicador emite un número mayor al 50%
           coincide con el promedio de edad                  de la cantidad total de instancias, no se
           adecuado y de la muestra.                         debe considerar como eficiente.
      •    El intervalo de paciente entre 32-40 años   •     Sensibilidad: es la capacidad del
           tiene mayor porcentaje de muerte.                 algoritmo de clasificar a las pacientes
                                                             complicadas dentro de la clase Fallecidas.
      •    Las pacientes mayores a 40 años
                                                             Es decir que si el clasificador tiene un alto
           pertenecen a la clase “fallecida” en gran
                                                             porcentaje tiene mejor curva de corte y
           porcentaje.
                                                             discernimiento entre los sujetos que
Resultados por algoritmo testeado:                           pertenecen o no a la clase fallecida, es así
                                                             que si la cifra de sensibilidad es del 90%,
Los algoritmos usados para evaluar la base de                existe entonces esa probabilidad de que la
datos en mortalidad materna dieron como                      paciente fallezca.
resultado cifras continuas indicando los
siguientes sucesos: Tabla 2.
                                                       •     Mean absolute error: Se define error
  •       Especificidad: es la probabilidad de que           absoluto de una medida la diferencia entre
          pacientes complicadas y de riesgo                  el valor medio obtenido y el hallado en esa
          pertenezcan a la clase Sobreviviente. Es           medida todo en valor absoluto.
          decir los verdaderos negativos.


                                                 71
•   Entonces el promedio de error absoluto, es        •      Relative absolute error: es el error
    la suma de los errores absolutos de                      relativo a cada característica de la clase,
    clasificación en cada uno de los sujetos                 por ejemplo el error relativo de tener de
    llevados a promedio. El clasificador que                 Espacio Intergenésico 0.5 años y
    arroje mayor cifra (mayor a 0.1) define un               pertenecer o no a la clase fallecida, la
    error de clasificación alto, por lo cual no              clasificación indicaría que si pertenece,
    se debe considerar por sobre los que                     por ser el valor indicado para aquellas
    arrojen una cifra menor.                                 pacientes que están en peligro.
                                                             En este caso el valor positivo para
•   Tiempo de ejecución: medido en                           pertenecer al clase sobreviviente es de
    segundos es la cantidad de tiempo que                    entre 2-4 años o primera gesta: 0.5 años
    demora en construir la arquitectura del                  incluido en la clase fallecido si el error
    clasificador y en arrojar resultados.                    entre los valores determinados por la clase
•   Puede que un clasificador se defina como                 y el valor ingresado es menor a 1.
    eficiente si el tiempo que emplea en emitir
    resultados es menor a 5 segundos, aun así         •      Root relative squared error: La raíz
    depende de los demás indicadores para                    relativa E de error al cuadrado i de un
    valerse de esta característica.                          programa individual i es evaluado por la
                                                             ecuación:
•   Kappa statistic: el Kappa statistic es la
    concordancia de comparación que tienen
    los observadores de clasificación. Quiere
    decir en una matriz de clasificación, el
    índice esperado entre el diagonal principal
    esperada (Xii elemento clasificado en la
    misma clase por ambos observadores) y el                 donde P (ij) es el valor predicho por el
    índice real luego de la clasificación                  programa para el individuo i j muestra de
    efectuada por la arquitectura seleccionada            casos (de los casos de la muestra n), T j es el
    (sea regresión lineal, backpropagation,                  valor objetivo para la muestra j caso,
    Naive-bayes, etc.), es la diferencia en                         y     está dada por la fórmula:
    porcentaje de su lejanía a este valor.

•   Si por ejemplo, la matriz esperada clasifica
    el valor en 25.00 y el resultado de la
    arquitectura es 26.7, la diferencia seria, 1.7
    equivale al 90.32%. Entonces cuanto mas               Para un ajuste perfecto, el numerador es
    grande sea el porcentaje, estará más cerca            igual a 0 y E i = 0. Así, el E i índice varía de
    de ser considerado eficiente.                         0 a infinito, con el ideal que corresponde a
                                                          0.
•   Root mean squared error: error
    cuadrático medio, es una medida de uso
                                                                                   Redes      Redes
    frecuente de las diferencias entre los                                Reglas   Bayesianas Neuronales
    valores pronosticados por un modelo o un                                       NAÏVE
    estimador y los valores realmente                                              BAYES
    observados. RMSD es una buena medida             INDICADORES OneR              KERNEL RBFNetwork
    de precisión, pero sólo para comparar                                                     0.9
                                                     Especificidad        0.836    0.91
    diferentes errores de predicción dentro de       Clasificación                            90
    un conjunto de datos y no entre los              correcta             84       91
    diferentes, ya que es dependiente de la una      Clasificación                           10
    escala       muestra. Estas     diferencias      incorrecta           16       9
    individuales           también           se                                              0.9
                                                     Sensibilidad         0.868    0.914
    denominan residuos, y la RMSD sirve para                                                 0.1468
    agregarlos en una sola medida de la              Mean      absolute
                                                     error                0.16     0.1142
    capacidad de predicción.


                                                     72
  Tiempo         de                          0.26                   • Relative absolute error <25%, <1 ideal
  ejecución            0.2      0.01                                • Root relative squared <50%, 0 ideal
                                             0.7997
  Kappa statistic      0.6759   0.819                          El clasificador que cumpla con estas
  Root        mean                           0.3032            especificaciones, se considera como optimo
  squared error        0.4      0.2737
  Relative absolute                          29.39%
                                                               para la integración en un sistema que evaluará
  error                32.03% 22.86%                           la base de datos que contengan los datos de las
  Root      relative                         60.64%            pacientes a comparar con un registro nuevo de
  squared error        80.01% 54.74%                           entrada.

          Tabla III: Algoritmos de clasificación              Etapa de Evaluación:
         supervisada con mejores resultados por
                                                              Cada uno de los clasificadores estudiados, que
                         familia
                                                              analizaron la muestra de pacientes, arrojaron los
                                                              indicadores que muestra la tabla, en nivel de
                     Arboles    Algoritmos                    rendimiento y buenos resultados, aceptables para
                     de         de            Regresión       el estudio, la arquitectura que lleva los
                     Decisión   Distancia     Logística       porcentajes mas óptimos dentro de los
INDICADORES J48                 LWL           Logistic        parámetros, es el algoritmo de Naive bayes con
                                0.889         0.82            estimador de núcleo.
Especificidad        0.9
Clasificación                   89            82
correcta             90                                       Descartando así el resto de arquitecturas como
Clasificación                   11            18              apropiadas para este tipo de muestras. Además
incorrecta           10                                       de visualizar de manera más clara y correcta los
Sensibilidad    0.9             0.902         0.82            resultados comunes entre cada una de las
Mean absolute                   0.168         0.1753          arquitecturas dejando atrás aquel paradigma que
error           0.1174                                        incluye a la regresión logística como la más
Tiempo       de                 0             0.81            adecuada a la hora de realizar diagnósticos
ejecución       0.02
                                0.6388        0.6388
                                                              preventivos en salud.
Kappa statistic      0.7994
Root        mean                0.3019        0.4157
squared error        0.299                                    Kernel estimator, al ser un método de
Relative absolute               33.64%        35.10%          construcción no paramétrico, es mas flexible que
error                23.75%                                   los clasificadores que incluyen parámetros.
Root      relative              60.39%        83.15%          Dividiendo la muestra en 40 grupos en la
squared error        59.80%                                   variable edad, paridad 15, controles 17, para la
                                                              exploración descubre nuevas alternativas de
  Tabla IV: Algoritmos de clasificación                       clasificación en los atributos de la clase,
  supervisada con mejores resultados por familia              mostrándolos como relevantes.

  Descripción de la metodología propuesta                     El problema más resaltante es que requiere
                                                              mayor tamaño de muestra para mantener estos
  Se propone entonces que evaluando cada uno                  resultados, ya que utilizado el agrupamiento para
  de los indicadores más importantes en la                    la clasificación (clustering) que podría ser una
  construcción del clasificador, se descarten                 desventaja si de clasificadores supervisados
  aquellos que no cumplen las siguientes                      hablamos. Esto se denomina the curse of
  características:                                            dimensionality, pues dependen de la elección de
                                                              un parámetro suavizado, en este caso los valores
        • Especificidad > 90%                                 la edad, número de controles, y número de hijos,
        • Clasificación correcta > 90 instancias              transformándola en no objetiva.
        • Clasificación incorrecta < 10 instancias
        • Sensibilidad >90%                                   Cuando Naive-Bayes actúa con la estimación no
        • Mean absolute error < 0.1 ideal                     paramétrica, las estructuras que se construyen a
        • Kappa statistic >0.79, >0.9 ideal                   partir de esta arquitectura se obtienen a partir de
        • Root mean squared error < 0.3, <1 ideal             un árbol formado con variables potencialmente
                                                              predictoras multidimensionales. (Serrano, 2010)


                                                         73
3 Conclusiones
Con respecto a la respuesta de las hipótesis
podemos afirmar lo siguiente:

   •   Encontramos que las arquitecturas
       propuestas por esta memoria, totas
       conservan una efectividad bastante
       aceptable e cada uno de sus indicadores
       que en conjunto hacen un 80% de
       eficacia. Siendo el más optimo el de
       Naive-Bayer Kernel.

   •   Descarta estas     hipótesis luego del
       trabajo realizado.

   •   Las Redes Bayesianas brindan al estudio
                                                           Gráfico 2: Indicadores estadísticos de
       de los datos en mortalidad y sobrevida
                                                           todos los algoritmos de clasificación
       materna una especificidad         91%,
                                                           supervisada
       clasificación correcta 91%., error
       absoluto 0.1142, y sensibilidad 0.914      4 Recomendaciones
       recomendada.
                                                       •   Se recomienda usar datos discretizados,
   •   Las Redes Neuronales brindan al estudio
                                                           si se desea usar Naive-Bayes para su
       de los datos en mortalidad y sobrevida
                                                           evaluación.
       materna una especificidad         90%,
       clasificación correcta 90%, error               •   Agrupar los datos de la manera
       absoluto 0.1468 y sensibilidad 90%                  propuesta en la Tabla 1, así podremos
       recomendada.                                        respetar los parámetros acerca de
                                                           mortalidad materna que establecen los
   •   Los Arboles de Decisión brindan al                  expertos salubristas.
       estudio de los datos en mortalidad y            •   Es importante comparar los resultados
       sobrevida materna una especificidad                 con la regla de decisión OneR para
       90%, clasificación correcta 90%, error              próximos experimentos, pues nos da la
       absoluto 0.1174 y sensibilidad 90%                  mejor noción de veracidad y efectividad
       recomendada.                                        a la hora de analizar la información.
                                                       •   Se pretende implementar un sistema en
   •   Los Algoritmos basados en Distancias                R Project, para ingresar nuevos registros
       brindan al estudio de los datos en                  y que lleve en la memoria la base de
       mortalidad y sobrevida materna una                  datos recolectada a través de este trabajo.
       especificidad    88.9%,   clasificación         •   Para ayudar a la muestra numeraria que
       correcta 89%, error absoluto 0.168 y                requiere Naive- Bayes Kernel, es
       sensibilidad 90.2% recomendada.                     necesario ingresar los registros de las
                                                           muertes maternas y complicaciones
   •   La Regresión Logística brinda al estudio            diarias de las gestantes a nivel Nacional
       de los datos en mortalidad y sobrevida              en la base de datos.
       materna una especificidad          82%,
       clasificación correcta 82%, error
       absoluto 0.1753 y sensibilidad 82%
       recomendada.


                                                  74
                                                   María García Jiménez, & Aránzazu Álvarez
                                                    Sierra. (2010) Análisis de Datos en WEKA
   Agradecimientos
                                                    – Pruebas de Selecitividad. Articulo.
   Agradezco al Dr. Basilio Araujo y al Dr.
   Yosu Yuramendi, por su apoyo y                  María N. Moreno García* L. A. (2005)
   perseverancia para la culminación de este        Obtención Y Validación De Modelos De
   proyecto de fin de master.                       Estimación De software Mediante Técnicas
                                                    De Minería De Datos. Revista colombiana
   Al personal de salud que proporciono la          de computación, 3[1], 53-71.
   información clínica que se usó en la
   investigación.                                  Msp Mynor Gudiel M., 1. E. (2001-2002)
5 Síntesis curricular de la autora:                 Modelo Predictor De Mortalidad Materna.
                                                    Mortalidad Materna, 22-29.
   Pilar Hidalgo León, Ingeniera de Sistemas de
   la Universidad Andina del Cusco en Perú,        Organización Mundial De La Salud.
   Docente en la Facultad de Ingeniería,            Mortalidad Materna En 2005, (2008) :
   Sustentación de proyecto de máster en la         Estimaciones Elaboradas Por La Oms, El
   Universidad      del       País       Vasco.     Unicef, El Unfpa Y El Banco. Ginebra 27:
   Contacto:phidalgo@uandina.edu.pe,                Ediciones de la OMS.
   Universidad Andina de Cusco, San Jerónimo,
   Cusco, Perú.                                    Porcada, V. R. (2003) Clasificación
                                                    Supervisada Basada En Redes Bayesianas.
6 Referencias bibliográficas                        Aplicación En Biología Computacional.
   Clasificadores supervisados: el objetivo es      Madrid: Tesis Doctoral.
    obtener un modelo clasificatorio valido
    para permitir tratar casos futuros.( 2006)     Ramírez Ramírez, R., & Reyes Moyano, J. ,
    Araujo, B. S. Aprendizaje Automatico:           (2011). Indicadores De Resultado
    conceptos básicos y avanzados. Madrid,          Identificados En Los Programas
    España: Pearson Prentice Hall,                  Estratégicos. Lima: Encuesta Demográfica
                                                    Y De Salud Familiar – Endes
   Cordero Muñoz, L., Luna Flórez, A., &           Ramirez, C. J. (2009)Caracterización De
    Vattuone Ramírez, M. (2010) Salud de la          Algunas Técnicas Algoritmicas De La
    mujer indígena : intervenciones para reducir     Inteligencia Artificial Para El
    la muerte materna. © Banco Interamericano        Descubrimiento De Asociaciones Entre
    de Desarrollo.                                   Variables Y Su Aplicación En Un Caso De
   Antonio Serrano, E. S., (2010) Redes              Investigación Específico. Medellin: Tesis
    Neuronales Artificiales. Universidad de          Magistral.
    Valencia.                                      Reproductive Health Matters, (2009)
   Calderón Saldaña, J., & Alzamora de los           Mortalidad Y Morbilidad
    Godos Urcia, L. (2009) Regresión Logística       Materna:Gestación Más Segura Para Las
    Aplicada A La Epidemiología. Revista             Mujeres. Lima: © Reproductive Health
    Salud, Sexualidad y Sociedad, 1[4].              Matters.

   Calderón, S. G, (2006) Una Metodología          Vilca, C. P. (2009) Clasificación De Tumores
    Unificada para la Evaluación de Algoritmos        De Mama Usando Métodos. Lima: Tesis.
    de Clasificación tanto Supervisados como       Dirección Regional De Salud Cusco, (2007)
    No-Supervisados. México D. F.: resumen de         Análisis De Situación De La Mortalidad
    tesis doctoral.                                   Materna Y Perinatal, Región Cusco.
   Corso, C. L. (2009) Aplicación de algoritmos
    de clasificación supervisada usando Weka.      Ministerio De Salud, Dirección General De
    Córdoba: Universidad Tecnológica                 Epidemiología Situación De Muerte
    Nacional, Facultad Regional Córdoba.             Materna, (2010-2011), Perú.


                                            75

</pre>