=Paper=
{{Paper
|id=Vol-2306/paper10
|storemode=property
|title=Gender Violence’s Models and Discrimination-aware Data Mining
|pdfUrl=https://ceur-ws.org/Vol-2306/paper10.pdf
|volume=Vol-2306
|authors=Pilar Hidalgo
|dblpUrl=https://dblp.org/rec/conf/ekaw/Hidalgo-Leon18
}}
==Gender Violence’s Models and Discrimination-aware Data Mining==
<pdf width="1500px">https://ceur-ws.org/Vol-2306/paper10.pdf</pdf>
<pre>
                Gender violence’s models and
              discrimination-aware data mining*

                      Pilar Hidalgo-León [0000-0002-5427-5240]

     Pontificia Universidad Católica del Perú, Av. Universitaria 1801, San Miguel
                                       Lima-Perú


        Abstract. Violence against women is rooted in the inequity of
        opportunities in every social stratum. According to national surveys,
        women and girls are considered a vulnerable population because of
        inequalities in access to essential services such as education,
        economic independence and technology. Governments currently show
        interest in addressing this problem and in providing parity of
        opportunities that contributes to the development of society. Several
        social studies have therefore analysed the situation of women and its
        impact on welfare indicators of population development. In this
        context, some techniques propose Data Mining solutions to measure and
        recognise possible discrimination and violence. This thesis work
        intends to develop data mining models to identify discrimination,
        mainly towards women and girls, taking into account environmental
        factors at the individual, community, social and institutional levels.

        Keywords: Discrimination-aware · Association rules · Spatial data
        mining · Gender violence.


1     Introduction
The World Health Organisation defines gender violence as behaviour towards a
person that causes physical, psychological or sexual harm. Around the world,
one out of every three women over 15 years old is a victim of physical or verbal
aggression [10]. Gender violence is considered a global problem, categorised as
a hate crime under the legislation of some countries [8] and as part of a
series of consequences related to social stability.
    The relationship between discrimination and violence is described in [18]:
the feeling of disagreement, whether on the part of the oppressor or of the
relegated group, often generates a violent response.
    However, most social studies look for global differences rather than
analysing the gender attribute in isolation. Gender discrimination can also
appear only in specific contexts. As a result, records of discrimination events
are scarce and come from different sources, which complicates the analysis.
    In contrast, the use of social networks and digital devices (mobile phones,
banking transactions, etc.) leaves a trace that records the behaviour of people
[11], which makes it possible to describe the context of interaction through
automated collection of larger volumes of data.

* Supported by Universidad del Pacífico-Early Stage

    For instance, the notion of ubiquity shows that inequity can be compared
through the level of mobility of men and women. It is not possible to affirm
that the role of a housewife is rooted in a particular geographic space, but
studies of trajectories show that the distances women travel are considerably
smaller than those travelled by men [21]. Shopping habits show a similar
difference [19]: women register higher expenses than men in supermarkets and
grocery stores.
    This context leads us to pose the following question: how can gender
discrimination in society be measured through digital records?
    This article is organised as follows. The state of the art gathers research
antecedents from the social sciences and the data sciences to identify phenomena
and factors related to violence. The proposed approach positions the thesis with
respect to the state of the art, and the methodology describes how the
hypothesis will be validated. The results obtained so far provide information
about environmental factors. Finally, the discussion addresses the difficulties
that this research faces.


2   State-of-the-art

Our research draws on two main groups of work: the social sciences and the data
sciences. In the social sciences, international institutions are interested in
addressing this problem in detail. They develop indicators of environmental
factors that may or may not contribute to the persistence of violence in
emerging societies [5, 6]. These studies describe a large gap in access to the
basic services of housing, health, food and education [3].
     Another line of work reports that lifestyle influences whether men report
having perpetrated physical violence towards their partner at some point in
their lives [14]. For instance, a multivariate analysis of community factors
found that men who witnessed family violence from father to mother are more
likely to engage in gender violence, which reinforces the theory of the
intergenerational transmission of violence [20]. In an attempt to predict gender
gaps in children, another study included the participation of parents and their
view of gender roles, and found that parents who hold gender-role paradigms
have children with the same stereotypes [4].
     It has been concluded that it is necessary to understand how inequity
affects the development of societies, the difficulties in facing these problems
and the factors that predispose to disparity [16].

    Although most of these studies propose indicators to address inequity [7],
they do not agree with each other because they use different variables, which is
valid and not a problem in itself. However, they address neither the interaction
of these factors nor their influence on discrimination, which could emphasise a
particular element and help establish priorities. The proposal addresses this
problem by incorporating data mining techniques to improve the granularity of
the rules that support the decision-making process.

In contrast, Data Mining techniques support a variety of tasks (managing data,
measuring and predicting social phenomena), as long as they rely on information
from the real context (variables and labels) to build models.
In general, measures to determine discrimination have not been fully developed
by the data sciences. Discrimination-Aware Data Mining (DADM) [23] addresses
this issue and has become a growing field: it discovers discrimination hidden in
volumes of historical decision-making records, and recognises direct¹ and
indirect² discrimination according to the social context.
    Because indirect discrimination is not explicit in digital records,
inferring discrimination requires knowledge of the external and individual
factors of the study sample (sex, race, pregnancy status, age, ethnicity,
religion, marital status). Some research papers [2, 21] find a relationship of
inequity between social factors, such as illiteracy, child malnutrition and
access to contraceptives, and external factors (climate, geography, etc.) in
geographic spaces. Methodological works have proposed other measures of
discrimination [9, 12, 15, 23] that consider social and individual factors
related to the causes of direct segregation of people. Some variations of the
data transformation process suggest using evenly distributed information to
improve the representativeness of all groups in the datasets [22].


3    Proposed Approach

The thesis work proposes two goals within the framework of the state of the art.
The first is to develop models that measure discrimination based on digital
records. The guide is DADM, which builds classification rules that recognise
direct and indirect discrimination as long as two elements are available: the
class that defines a discriminatory rule, and the context needed to validate
that the elements of the rule are discriminatory. As shown in Figure 1 [9], it
is necessary to know the background to unveil discrimination.
     Because discrimination records are scarce, we include historical
information from two different periods and various geographical spaces.
According to the state of the art, rules generated from databases with a known
class type provide enough information to classify the rules into potentially
discriminatory (PD) and potentially non-discriminatory (PND). However, this is
not the case for our data, so it is necessary to develop new strategies to
determine the class. These rules are relationships such as co-occurrences; they
add quality to the results because they are likely to be related to the social
context. The second goal is to validate the models produced in the first stage
against the socio-demographic context. It will be necessary either to
demonstrate that these variables are sufficient or to experiment with feature
engineering.
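
The rule classification described in this section can be illustrated with a
small sketch. It assumes the extended lift (elift) measure from the DADM
literature [9, 23], which compares the confidence of a rule with and without
its potentially discriminatory items. The attribute names, the toy records and
the function names are hypothetical and are not taken from the thesis data.

```python
from typing import Dict, List, Optional, Set, Tuple

Record = Dict[str, str]
Item = Tuple[str, str]  # (attribute, value); all names here are hypothetical


def matches(record: Record, items: List[Item]) -> bool:
    """True when the record carries every (attribute, value) pair."""
    return all(record.get(attr) == value for attr, value in items)


def confidence(records: List[Record], premise: List[Item], outcome: Item) -> float:
    """conf(premise -> outcome): share of records covered by the premise
    that also show the outcome."""
    covered = [r for r in records if matches(r, premise)]
    if not covered:
        return 0.0
    return sum(matches(r, [outcome]) for r in covered) / len(covered)


def is_potentially_discriminatory(premise: List[Item], pd_attributes: Set[str]) -> bool:
    """A rule is PD when its premise uses at least one sensitive attribute,
    and PND otherwise."""
    return any(attr in pd_attributes for attr, _ in premise)


def extended_lift(records: List[Record], pd_items: List[Item],
                  context: List[Item], outcome: Item) -> Optional[float]:
    """elift = conf(pd_items + context -> outcome) / conf(context -> outcome).
    Returns None when the context alone never leads to the outcome."""
    base = confidence(records, context, outcome)
    if base == 0.0:
        return None
    return confidence(records, pd_items + context, outcome) / base


# Toy records, invented for illustration only.
records = [
    {"sex": "female", "zone": "rural", "credit": "denied"},
    {"sex": "female", "zone": "rural", "credit": "denied"},
    {"sex": "female", "zone": "rural", "credit": "granted"},
    {"sex": "male", "zone": "rural", "credit": "denied"},
    {"sex": "male", "zone": "rural", "credit": "granted"},
    {"sex": "male", "zone": "rural", "credit": "granted"},
    {"sex": "male", "zone": "urban", "credit": "granted"},
    {"sex": "female", "zone": "urban", "credit": "granted"},
]
pd_items = [("sex", "female")]
context = [("zone", "rural")]
outcome = ("credit", "denied")

elift = extended_lift(records, pd_items, context, outcome)
print(is_potentially_discriminatory(pd_items + context, {"sex"}), elift)
```

A value of elift above 1 means that adding the sensitive item to the context
raises the confidence of the unfavourable outcome. The threshold above which a
rule is treated as discriminatory is a modelling choice that this sketch does
not fix.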
¹ Direct discrimination: explicit; imposes barriers on a group. Ex. Ethnicity:
  Indigenous, History: Good → Credit: Poor.
² Indirect discrimination: not explicit, but imposes barriers on a group
  (consciously or unconsciously). Ex. Provenance: Rural, History: Good → Credit:
  Bad, if it is known that the rural population has a high percentage of
  indigenous people.


Fig. 1. Modelling the process of direct (left) and indirect (right) discrimination
analysis, by Dino Pedreschi, Salvatore Ruggieri and Franco Turini. Published 2008 in KDD.


4      Methodology
The KDD (Knowledge Discovery in Databases) process of extracting patterns from
databases is composed of four stages [13]. The process begins with obtaining the
data and ends with the validation of the results (patterns). These stages serve
as the guiding thread of the research project.
    – Construction of the database: a heterogeneous dataset about discrimination
      cases and environmental factors. The datasets used for this research come
      from four sources:
        • ENDES: The Demographic and Family Health Survey in Peru, with
           structured records of socio-demographic reality (health, education, sex-
           uality and domestic violence in 2016 and 2017).
        • Census of police stations: structured data about the crimes registered
           in 2016 and 2017.
        • Banking transactions: structured data recording the banking
           transactions (expenses and purchases) of customers in Peru (2016 to
           2017).
        • News about violence against women: a repository of unstructured news
           data in Peru concerning gender crimes in 2016 and 2017.
      As shown in the information above, it is possible to identify the characteris-
      tics of the environment through the data collected for this investigation.
    – Data pre-processing: to select the attributes or events related to gender
      discrimination, the data must be organised so that it is easy to describe
      and interpret, using data cleaning techniques and feature engineering
      (extrapolation, prediction of classes, operations between fields) while
      preserving the integrity of the information.
    – Pattern extraction: the attributes selected in the previous stage are trans-
      formed to be used as input and output in a pattern extraction algorithm.
    – Restitution, visualisation and validation of patterns: the results are
      compared with the real context. At this stage it is essential to combine
      data mining methods with visualisation methods.
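
The pattern extraction stage can be sketched as follows. The source does not
name the extraction algorithm, so this is only a brute-force illustration of
association rules filtered by support and confidence. It is not Apriori and not
the thesis implementation, and the item labels, thresholds and function name
are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

Rule = Tuple[str, str, float, float]  # (antecedent, consequent, support, confidence)


def mine_pair_rules(transactions: List[List[str]],
                    min_support: float,
                    min_confidence: float) -> List[Rule]:
    """Enumerate one-to-one rules A -> B that meet both thresholds."""
    n = len(transactions)
    counts: Counter = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Count every single item and every unordered pair of items.
        for size in (1, 2):
            for combo in combinations(items, size):
                counts[combo] += 1

    rules: List[Rule] = []
    for itemset, count in counts.items():
        if len(itemset) != 2 or count / n < min_support:
            continue
        # Try both directions of the pair as antecedent -> consequent.
        for lhs, rhs in (itemset, itemset[::-1]):
            conf = count / counts[(lhs,)]
            if conf >= min_confidence:
                rules.append((lhs, rhs, count / n, conf))
    return sorted(rules)


# Toy records encoded as attribute=value items, invented for illustration.
transactions = [
    ["sex=female", "zone=rural", "violence=yes"],
    ["sex=female", "zone=rural", "violence=yes"],
    ["sex=female", "zone=urban", "violence=no"],
    ["sex=male", "zone=rural", "violence=no"],
]

for rule in mine_pair_rules(transactions, min_support=0.5, min_confidence=0.9):
    print(rule)
```

Encoding each record as a list of attribute=value items keeps the structured
sources in one common shape before any rule is generated. Longer antecedents
would need a proper frequent-itemset algorithm.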

5     Results

The first approach to combining the social sciences and the data sciences in
this work was the preprocessing of qualitative information extracted from
interviews with female university students [1]. In that work, relevant
testimonies of the interviewees were extracted through a process based on
TF-IDF (term frequency - inverse document frequency).
    That research aimed to identify which agents were involved in episodes of
discrimination within the university environment. To this end, a survey of open
questions about discrimination events was developed, for example: "Did you ever
hear anyone in the university environment say that women are different with
regard to their professional performance? Mention what you heard and the person
who said it". The survey was formulated by social experts who handle the
variables according to the state of the art mentioned in [1]. In sum, by
extracting the relevant words from each testimony, it is possible to know which
testimonies represent others in a set of documents. Our methodology is
available on GitHub³.
    In that work, the results of the proposed method, named A0, are more likely
to contain sets of different words related to the topic. In contrast, topic
modelling methods are memory-expensive compared with methods based on TF-IDF.
We conclude that A0 is efficient at extracting relevant words and optimises
information retrieval for qualitative research on simple and complex data.
However, the efficiency of A0 depends on improving the data pre-processing
techniques.
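
The TF-IDF weighting on which that process is based can be sketched as follows.
This is a generic textbook formulation, not the A0 method itself, whose details
are given in [1]. The tokenisation by whitespace, the function names and the toy
testimonies are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[str]) -> List[Dict[str, float]]:
    """Score each term of each document as tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    scores: List[Dict[str, float]] = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(scores: List[Dict[str, float]], k: int) -> List[List[str]]:
    """Return the k highest-scoring terms per document (ties broken alphabetically)."""
    return [sorted(s, key=lambda term: (-s[term], term))[:k] for s in scores]


# Toy testimonies, invented for illustration only.
testimonies = [
    "the professor said women cannot code",
    "the professor praised the women",
    "a classmate said engineering is for men",
]

print(top_terms(tf_idf(testimonies), k=2))
```

Terms that occur in every document receive a score of zero, so the ranking
favours words that distinguish one testimony from the others.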
    The sample for that work is small (214 records). Although the extracted
testimonies may represent the discrimination suffered by the students of that
school, it is not possible to generalise what happens with all the women in the
same space. Due to this, the decision was made to collect structured information
available in different resources and to unify it to be processed later.
    For this article, databases related to the individual and environmental
factors that influence gender violence were compiled. As seen in Tables 1, 2 and
3, each factor mentioned by the studies described in Section 2 corresponds to a
series of files that contain information describing the variables.
    Describing this data has been crucial for understanding how the individual,
community, social and institutional factors are related, and for recognising
variables shared between the automatically collected and the socio-demographic
information.
    For instance, suppose that the amount of monthly expenditure reflects the
purchasing power of people and is determined by the type of work they have. A
well-paid job requires specialised skills and a high degree of training or
education. However, it is not possible to confirm a relationship between
educational level and the amount of monthly expenses in our data. Figure 2 shows
the inequality in educational level between men and women from 0 to 89 years of
age (y axis) in Peru in 2016. The green bars represent women, who show a
numerical difference with respect to their male peers (blue bars) at each level
of the x axis. In contrast, Figure 3 shows the population of the same age range
(y axis) by socio-economic class (x axis).

³ Source code: https://github.com/bitmapup/violencePatterns

Factor      Violence factor  Name of datasets  Columns  Records  File weight  Data types
Individual  Individual       a. rech05               7    35320  1.9 MB       int64(6), object(1)
            Woman            b. rec011               5    34002  1.3 MB       int64(4), object(1)
            Concatenated file: weight 2.1 MB, 35084 records, 8 columns
Home        Community        a. recv84              72    33168  18.2 MB      float64(71), object(1)
                             b. rec0111              5    34002  1.3 MB       int64(4), object(1)
            Concatenated file: weight 5.8 MB, 11543 records, 66 columns
Table 1. Characteristics of the ENDES datasets associated with gender violence factors

Violence factor  Name        Columns  Records  Weight of the file  Data types
Institutional    Chapter100       37     1177  340.4 MB            int64(28), object(3)
Individual       Chapter200       43     2103  706.6 MB            int64(8), object(5)
                 Concatenated file: weight 285.1 MB, 1177 records, 31 columns
Table 2. Characteristics of the files that make up the Complaints dataset

Factor      Violence factor     Name                Columns   Records  File weight  Data types
Individual  Individual
            Person              Transactions             10  84591756  9.0 GB       timestamp(1), float(3), int64(2), object(4)
            Purchasing power    Transactions class        9   1806836  124.1 MB     int64(2), object(7)
            Financial behavior
            Concatenated file: weight 87.5 MB, 603512 records, 18 columns
Table 3. Characteristics of bank databases

With respect to the "feature engineering" process mentioned in the methodology
section, the "class" field was determined by Formula 1:
                    \mathit{class} = \sum_{u=1} P_u / n                    (1)

where n is the total number of transactions and P_u is the accumulated average
monthly expenditure per user; class indicates the social class associated with
that type of behaviour and monthly expense amount. This research does not
specifically deal with the verification of this formula, but [19] gives
parameters to follow in order to find the social class variable. As Figure 3
shows, the differences in socio-economic class between men (blue bars) and
women (green bars) are remarkable with respect to the amount of expenditure
(y axis). The error shown is caused by a few records with outlier values.
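
A literal reading of Formula 1 can be sketched as follows. The source does not
say how P_u is accumulated or how the resulting value is mapped to a class
label, so this sketch assumes that P_u is the mean of a user's monthly totals
and stops at the numeric value. The record layout (user, month, amount) and the
function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Transaction = Tuple[str, str, float]  # (user id, month, amount); hypothetical layout


def average_monthly_expenditure(transactions: List[Transaction]) -> Dict[str, float]:
    """P_u: mean of the monthly totals spent by each user (an assumed reading)."""
    monthly: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for user, month, amount in transactions:
        monthly[user][month] += amount
    return {user: sum(months.values()) / len(months)
            for user, months in monthly.items()}


def formula_1(transactions: List[Transaction]) -> float:
    """Sum of P_u over all users divided by n, the total number of transactions."""
    n = len(transactions)
    if n == 0:
        raise ValueError("at least one transaction is required")
    p = average_monthly_expenditure(transactions)
    return sum(p.values()) / n


# Toy transactions, invented for illustration only.
transactions = [
    ("u1", "2016-01", 100.0),
    ("u1", "2016-01", 50.0),
    ("u1", "2016-02", 250.0),
    ("u2", "2016-01", 80.0),
]

print(formula_1(transactions))
```

Turning this value into a socio-economic class would require cut-off points,
which the text attributes to the parameters discussed in [19].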


6       Discussion

In contrast to DADM studies so far, the information used in the thesis proposal
is varied and voluminous, and comes from different sources with known space-time
contexts, which to some extent allows the continuity of particular phenomena to
be known (prediction of events). The discrimination metrics proposed by the
social sciences were formulated from multivariate analysis and classical
statistics using some aspects and factors. In this research, the maximum number
of possible variables that may or may not appear according to the proposed model
will be used simultaneously, forming strong association rules automatically and
objectively. It is therefore advisable to begin by defining discrimination
measures over vast and varied information, which contributes to social research
in the search for explanations of social events such as gender violence.

              Fig. 2. Education level for women and men in Peru (2016)

      Fig. 3. Socio-economic class and age in Peruvian women and men (2016)

References
1. H. Alatrista-Salas, P. Hidalgo-Leon, Nuñez-del-Prado. Documents Retrieval
   for Qualitative Research: Gender Discrimination Analysis. LA-CCI IEEE
   conference (2018).
2. Bosco, Claudio, et al. ”Exploring the high-resolution mapping of gender-
   disaggregated development indicators.” Journal of The Royal Society Interface
   14.129 (2017): 20160825.
3. Braveman, Paula, and Eleuther Tarimo. ”Social inequalities in health within coun-
   tries: not only an issue for affluent nations.” Social Science & Medicine 54.11
   (2002): 1621-1635.
4. Croft, Alyssa, et al. ”The second shift reflected in the second generation: Do parents’
   gender roles at home predict children’s aspirations?.” Psychological Science 25.7
   (2014): 1418-1428.
5. Centro de Investigación y Desarrollo (CIDE), ”Factores asociados a la presencia
   de violencia hacia la mujer” Talleres de la Oficina Técnica de Administración del
   INEI,000 -OTA-INEI (2002).
6. Centro internacional para la prevención de la criminalidad (CIPC), ”Informe Inter-
   nacional sobre la prevención de la criminalidad y la seguridad cotidiana: Tendencias
   y perspectiva”, (2015)
7. Chant, Sylvia. ”Re-thinking the “feminization of poverty” in relation to aggregate
   gender indices.” Journal of human development 7.2 (2006): 201-220.
8. Choy, Olivia, et al. ”Explaining the gender gap in crime: The role of heart rate.”
   Criminology 55.2 (2017): 465-487.
9. D. Pedreschi, S. Ruggieri, and F. Turini, “Integrating Induction and Deduction
   for Finding Evidence of Discrimination,” Proc. 12th ACM Int’l Conf. Artificial
   Intelligence and Law (ICAIL ’09), pp. 157-166, (2009)
10. Devries, Karen M., et al. ”The global prevalence of intimate partner violence
   against women.” Science 340.6140 (2013): 1527-1528.
11. Di Clemente, Riccardo, et al. ”Sequence of purchases in credit card data reveal life
   styles in urban populations.” arXiv preprint arXiv:1703.00409 (2017).
12. F. Kamiran and T. Calders, “Classification with no Discrimination by Preferen-
   tial Sampling,” Proc.19th Machine Learning Conf. Belgium and The Netherlands,
   (2010).
13. Fayyad, Usama M., et al. ”The KDD process for extracting useful knowledge from
   volumes of data.” (1996).
14. Fleming, Paul J., et al. ”Risk factors for men’s lifetime perpetration of physical
   violence against intimate partners: results from the international men and gender
   equality survey (IMAGES) in eight countries.” PloS one 10.3 (2015): e0118639.
15. Hajian, Sara, and Josep Domingo-Ferrer. ”A methodology for direct and indirect
   discrimination prevention in data mining.” IEEE transactions on knowledge and
   data engineering 25.7 (2013): 1445-1459.
16. Jayachandran, Seema. ”The roots of gender inequality in developing countries.”
   Annual Review of Economics 7.1 (2015): 63-88.
17. Johnson, Wendi L., et al. ”The age–IPV curve: Changes in the perpetration of
   intimate partner violence during adolescence and young adulthood.” Journal of
   youth and adolescence 44.3 (2015): 708-726.
18. Karlsen, Saffron, and James Y. Nazroo. ”Relation between racial discrimination,
   social class, and health among ethnic minority groups.” American journal of public
   health 92.4 (2002): 624-631.
19. Leo, Yannick, et al. ”Correlations of consumption patterns in social-economic net-
   works.” Proceedings of the 2016 IEEE/ACM International Conference on Advances
   in Social Networks Analysis and Mining. IEEE Press, (2016).
20. Muller, Robert T., John E. Hunter, and Gary Stollak. ”The intergenerational trans-
   mission of corporal punishment: A comparison of social learning and temperament
   models.” Child Abuse & Neglect 19.11 (1995): 1323-1335.
21. Pappalardo, Luca, et al. ”Using big data to study the link between human mobility
   and socio-economic development.” Big Data (Big Data), 2015 IEEE International
   Conference on. IEEE,(2015).
22. Romei, Andrea, Salvatore Ruggieri, and Franco Turini. ”Discovering gender dis-
   crimination in project funding.” Data Mining Workshops (ICDMW), 2012 IEEE
   12th International Conference on. IEEE, 2012.
23. Ruggieri, Salvatore, Dino Pedreschi, and Franco Turini. ”Data mining for discrim-
   ination discovery.” ACM Transactions on Knowledge Discovery from Data (TKDD)
   4.2 (2010): 9.

</pre>