Right to Information Query Modelling via Graded Response Model

Nayantara Kotoky1 and Vijaya V Saradhi2

1 Indian Institute of Technology Guwahati, nayantara@iitg.ernet.in
2 Indian Institute of Technology Guwahati, saradhi@iitg.ernet.in

Abstract. The Right to Information (RTI) Act, 2005 empowers citizens of India to access information from any governmental organization. Using this Act, citizens can ask questions (through RTI applications/queries) of government offices and obtain answers. In this work we attempt to model RTI queries. The objective of modelling is to understand latent patterns in the RTI query-reply process, such as transparency and the effectiveness of the RTI Act's implementation, which are suggestive of possible amendments to the Indian Constitution. We employ the Graded Response Model (GRM, a variant of Item Response Theory) to obtain these latent patterns. A synthetic dataset corresponding to central and state educational institutions is constructed whose characteristics closely match the collected RTI query dataset. From the GRM we infer that certain institutes are highly transparent in replying to citizens' questions across various categories. We also infer that the RTI Act's implementation is not uniform across diverse categories within a transparent institution.

Keywords: Item Response Theory, Graded Response Model, Right to Information

1 Introduction

The Right to Information (RTI) Act, 2005 empowers citizens of India to access information from any public institution (an institution funded by the government). The RTI Act came into force on October 12, 2005. Through this Act, citizens can inspect official documents, contracts, press releases, records, notes and certified copies by filing an RTI application/query. Each institution appoints a Public Information Officer (PIO) to implement the RTI Act and reply to the questions posed by citizens. Citizens submit a hard copy of their questions to the PIO. Every RTI application costs ten Indian rupees (Rs. 10). The PIO is responsible for replying to the query within a fixed time period (typically 30 days).

RTI queries form a source of information in which one can witness citizens' interaction with government establishments. Such a rich source of information, when analyzed, can throw light on the sensitivities of citizens and on weaknesses in the implementation of laws. Certain acts have been amended on the basis of RTI statistics. In particular, the RTI Act itself has been amended, as the following two examples show:

1. Inclusion of Indian Postal Orders: For fee payment with an RTI application, the acceptable modes of payment were banker's cheque, demand draft or cash. All three modes had their own additional burdens: both demand draft and banker's cheque carry service charges, and payment by cash requires visiting the public institution in person. The Indian Postal Order (IPO) is a more convenient mode of paying fees, with a nominal charge of 10%, i.e., Re. 1 for the fee of Rs. 10. However, IPOs were not acceptable as a mode of payment, because of which many RTI applications that were perfectly good in content were rejected. This problem was widespread enough to catch the government's eye, and the inclusion of IPOs as a mode of payment was discussed. Ultimately the RTI Act's scope was changed by adding IPOs as a mode of payment [3].

2. Exemption of political parties from being a public authority: Asking for the source of funding of political parties is not uncommon.
With the advent of RTI, citizens seeking to understand the inner workings of these organizations filed multiple applications asking for their financial details. The parties argued that they are not directly funded by the central or state government and hence are not liable to divulge such information. Such queries were repeatedly rejected, and a notice was issued stating that political parties are not public authorities. The exemption was finally included in the RTI (Amendment) Bill, 2013 [4].

From the above two examples it is observed that "repeated rejections" of RTI queries served as feedback for introducing amendments into the existing RTI Act. This leads us to believe that the latent patterns in the RTI query log provide potential pointers for predicting future amendments. The objective of this work is to collect RTI queries and the associated responses (whether the institute replied to the query, rejected it, or referred it to a third party) from institutions across India, model the collected text data, and identify latent patterns in the RTI query database.

We propose to model the RTI query text database as a two-dimensional matrix whose rows correspond to institutions and whose columns correspond to the topics on which questions were posed to individual institutions. The entry (i, j) of this matrix corresponds to the percentage of queries on topic j that institute i has replied to. This matrix is given as input to the Graded Response Model (GRM) to identify latent patterns in the RTI query-reply process. After running the GRM on our RTI data, each institution is assigned a 'transparency' value that indicates how effective the institution is at replying to RTI queries, and reveals a difference between the central and state educational institutions. The model also identifies differences in the query-reply process across query topics.

Contributions:

1. This is the first attempt at collecting RTI query-reply data across India.
2. A two-dimensional query-reply matrix is constructed from the RTI query-reply text database instead of using conventional text modelling methods such as the vector space model, latent semantic indexing, LDA, etc.
3. We employ psychometric models for the first time in RTI query text document analysis.

2 Related work

2.1 Modelling the Political Domain

Attempts to model legislative structure and outlook have appeared in the literature. Now and again, researchers have sought to apply mathematical models to represent affairs in the political domain. Such work opens up scope for understanding political issues in depth. Gerrish and Blei [5] developed a probabilistic model of legislative data to identify voting patterns on specific political issues. They used the text of bills to identify the specific topics to which the bills relate, and attempted to identify lawmakers' stances with respect to bills on different topics (issues). They argued that a lawmaker's attitude cannot be captured accurately by a broad political position, since lawmakers do not exhibit enough regularity in their voting patterns. It is assumed that a lawmaker has an overall (general) political stand but different positions on the specific issues that the bills address. The paper introduces an issue-adjusted model that identifies each lawmaker's position on individual topics, called the 'issue-adjusted ideal point model'.
The adjusted model identifies lawmakers' political stands more realistically, and for each issue individually. Poole and Rosenthal [6] analysed a variant of voting patterns, namely roll call data for legislators' votes. They took US voting data in which the choosers are representatives or senators, and the choices are binary, that is, yes or no. They developed a unidimensional probabilistic model of roll call voting whose methods can be applied to the analysis of voting in popular elections and other forms of political choice behavior.

2.2 Forms of Queries

From classrooms to commercial platforms and entertainment, queries are found everywhere and in all forms. Examples include e-commerce queries, customer-service queries, product review queries, tourism queries, personal and rhetorical queries (natural language), queries in an issue tracking system, queries in medical diagnosis and, of course, RTI queries. Each of these query types has its own models of analysis. Some of the ways of modelling are:

1. Web search engines:
Web queries (queries put to search engines for web search) are analysed to improve user experience and search engine performance. Research has been done to find user goals from queries [7], and the temporal dynamics of query patterns have been studied [8]. [9] proposes methods for clustering similar queries together, which helps us understand how frequent and how diverse web queries are. Traditional information retrieval mostly depended on simple term matching between queries and documents. However, it has been observed over time that understanding the meaning of the query is important for improving the precision of a search result; for example, certain keywords have more relevance in a given query, and synonyms need to be identified. Attempts to find such hidden semantics in queries have been made by [10].

2. Question-answer (Q/A) systems:
Q/A systems do not retrieve documents but give brief, relevant answers in short text. This requires time and processing power, as well as an understanding of the semantics of the query. To overcome the bottlenecks of natural language understanding, an amalgamation of statistical and representation-based methods is required. Semantic information in the classification of questions and answers is studied in [11]. [12] designs a paraphrase component in a natural language question-answer system, whereas [13] presents a new typology to support the construction of question-answer systems.

3. Examination sets (questionnaires/test questions):
Questions are used to determine the qualification of individuals or the behaviour of events. Typical examples are survey questions in a social or business context, tests for students, diagnosis of illness, etc. Applications include attempts to model response behaviour and to find an optimal set of questions for judgement. Examples are equating tests [14] and understanding family relationships [15].

3 Item Response Theory

3.1 Description

Item Response Theory (IRT) is a method for psychometric analysis. It uses statistics to analyse how people (test takers) respond to different questions and items. The data are modelled as a function that balances two criteria:

– the person's abilities, perspective or personality traits, and
– the item (question) difficulty.

The premise behind IRT is that the probability of a correct response to an item is a mathematical function of person and item parameters.
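To make this concrete, here is a minimal sketch in base R of a two-parameter logistic item response function; the slope and difficulty values are hypothetical, chosen only for illustration.

```r
## 2PL item response function: probability of a correct response as a
## function of person ability (theta), item slope (a) and difficulty (b).
irf <- function(theta, a, b) plogis(a * (theta - b))

theta <- seq(-3, 3, by = 1)     # a range of person abilities
irf(theta, a = 1.2, b = 0)      # an item of average difficulty
irf(theta, a = 1.2, b = 1.5)    # a harder item: the curve shifts right
```

For a fixed item, the probability of success rises monotonically with ability; increasing the difficulty b shifts the whole curve to the right, so a person of the same ability has a lower chance of answering correctly.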
IRT treats the difficulty of each item as information to be incorporated in scaling items. The person parameter is interpreted as a single latent trait; examples of person parameters include intelligence, attitude, etc. Likewise, there are item parameters: the difficulty of the item; the discrimination (slope, or correlation), representing how sharply the rate of success of persons varies with their ability; and a guessing parameter, which characterises items on which even low-ability persons can obtain a correct response by guessing.

IRT makes a few assumptions. The first is that all items are independent of each other; hence each item is modelled separately with its own set of parameters (discussed next). The second is that the response of a person to an item can be modelled by a mathematical Item Response Function (IRF). Also, a latent trait theta (θ) is assigned to each person, giving the person's ability on a unidimensional scale. The main advantage of IRT is that the ability parameter (θ) and the item difficulty parameter are modelled on the same scale. We can imagine ability (the intelligence of the student) and difficulty (of the questions) as two opposing parameters, both contributing to the probability of keying the correct response.

Measurement items with multiple response options also exist. In the case of polytomous models, each category function must be modelled explicitly. We can imagine the different response categories as separated by boundaries: responding in a particular category means responding between the two boundaries of that category. This gives rise to two types of conditional probabilities:

– the probability of responding in a given category, and
– the probability of responding positively rather than negatively at a given boundary between two categories.

For polytomous items with multiple responses, to identify the probability of responding in a particular category we need to identify the probabilities at both boundaries. A positive response at a category boundary does not imply a response in the adjacent category; it means the probability mass lies in the categories above that boundary, not necessarily the adjacent one. Hence the probability of responding in a particular category entails positivity at the lower category boundary and negativity at the upper category boundary. This idea is exploited in the model we use for our experiments.

3.2 Graded Response Model

The Graded Response Model (GRM) is a polytomous IRT model for ordinal response categories. It belongs to the class of Thurstone/Samejima models and is an extension of the two-parameter logistic (2PL) model. Let θ be the latent ability underlying the responses to the test items. The probability of a candidate with ability θ responding to item i in a particular category c is

\[ P_{ic}(\theta) = P^{*}_{ic}(\theta) - P^{*}_{i,c+1}(\theta) \]

where

\[ P^{*}_{ic}(\theta) = \frac{1}{1 + \exp(-\alpha_i(\theta - \beta_{ic}))} . \]

Here α_i is the item slope parameter (one per item), β_ic are the category threshold parameters, and P*_ic is the Category Boundary Response Function (CBRF) for item i and category c. There is one ordered set β_i1, ..., β_im for each item, where m + 1 is the number of categories [16]. The psychological idea behind this is that in a dataset with polytomous response categories, each response category of an item exerts a level of attraction on the persons taking the test.
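As a numerical companion to these formulas, the following base-R sketch computes the GRM category probabilities; the parameter values used are those reported later for the Finance item in Table 4, and the ability value θ = 0.5 is arbitrary.

```r
## Category boundary response function (CBRF):
## P*_ic(theta) = 1 / (1 + exp(-alpha_i * (theta - beta_ic)))
cbrf <- function(theta, alpha, beta) plogis(alpha * (theta - beta))

## Category probabilities P_ic = P*_ic - P*_i,c+1, padding the boundary
## probabilities with 1 (below the lowest) and 0 (above the highest).
grm_probs <- function(theta, alpha, betas) {
  pstar <- c(1, cbrf(theta, alpha, betas), 0)
  -diff(pstar)    # one probability per category; they sum to 1
}

## Finance item parameters (Table 4), evaluated at theta = 0.5:
grm_probs(0.5, alpha = 4.004, betas = c(-1.451, -0.218, 0.602, 1.325))
```

With four ordered thresholds, the sketch returns five category probabilities, matching the five response buckets used in the experiments below.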
In the context of an entire item, being attracted to a category must take all prior category attractions into account. In other words, the probability of responding in any given category is a combination of being attracted through all previous categories up to the given category, but no further. In the case of ordered categories, this means that to respond in a particular category, a person must have passed through all preceding categories. Let P_ig be the probability of responding in a particular category g of item i. If P*_ig represents a CBRF in the Thurstone/Samejima models (both conditional on θ), then

\[ P_{ig} = P^{*}_{ig} - P^{*}_{i,g+1} . \]

The probability of responding in a particular category is equal to the probability of responding above (on the positive side of) the lower boundary of the category (ig) minus the probability of responding above the category's upper boundary (i,g+1).

3.3 Parameter Estimation

There are two types of parameters in IRT: item parameters and the person (ability) parameter. Since IRT is a trade-off between the two, both are estimated iteratively to arrive at the best fit. For polytomous data, the data are modelled by multiple dichotomizations at the category boundaries, and all this information is finally combined to reach the final parameter estimates. For dichotomous data, estimation is done differently for the different parameters.

Estimating the ability parameter with known item parameters: To estimate an examinee's unknown ability parameter, it is assumed that the numerical values of the parameters of the test items are known. The process is iterative and begins with some known values of the item parameters. The probability of a correct response to each item is computed, and the ability estimate is then slightly adjusted so that the predicted values more closely match the observed ones. The process is repeated until the adjustment becomes small enough that the change in the estimated ability is negligible:

\[ \Theta_{s+1} = \Theta_s + \frac{\sum_i a_i \left[ u_i - P_i(\Theta_s) \right]}{\sum_i a_i^2 \, P_i(\Theta_s) \, Q_i(\Theta_s)} \]

where Θ_s is the estimated ability of the examinee at iteration s, a_i is the discrimination parameter of item i, u_i is the response given by the examinee to item i, P_i(Θ_s) is the probability of a correct response to item i at ability Θ_s, and Q_i(Θ_s) = 1 − P_i(Θ_s) is the probability of an incorrect response to item i at ability Θ_s.

Bayesian estimation is used to estimate ability parameters given the item parameters. From Bayes' theorem we have

\[ f(\Theta \mid u) = \frac{f(u \mid \Theta) \, f(\Theta)}{f(u)} \]

which can also be written as

\[ f(\Theta \mid u) \propto L(u \mid \Theta) \, f(\Theta) . \]

Taking logarithms of both sides,

\[ \ln f(\Theta \mid u) = \ln L(u \mid \Theta) + \ln f(\Theta) + \text{const.} \]

The posterior is directly proportional to the likelihood multiplied by the prior, where f(Θ | u) is the posterior, L(u | Θ) is the likelihood, and f(Θ) is the prior distribution. For every Θ we can evaluate the likelihood function, and we also have the prior; hence we can compute the posterior distribution f(Θ | u). The prior distribution is bell-shaped, so the right-hand side of the equation has a point of slope zero, which yields the posterior mode.

Estimating item parameters from response data: Let us divide the examinees into J groups along the Θ scale so that all examinees within a given group have the same ability level Θ_j, where j = 1, 2, ..., J. If r_j is the number of examinees in group j who give a correct response, then at ability level Θ_j the observed proportion of correct responses is p(Θ_j) = r_j / m_j, where m_j is the total number of examinees in the group.
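A short sketch of this grouping step, using simulated dichotomous responses (all values are simulated purely for illustration):

```r
## Group examinees along the theta scale and compute the observed
## proportion of correct responses p(theta_j) = r_j / m_j per group.
set.seed(1)
theta <- rnorm(500)                           # simulated abilities
u     <- rbinom(500, 1, plogis(1.2 * theta))  # simulated 0/1 responses
grp   <- cut(theta, breaks = c(-Inf, -2, -1, 0, 1, 2, Inf))

r_j <- tapply(u, grp, sum)      # correct responses per ability group
m_j <- tapply(u, grp, length)   # examinees per ability group
p_j <- r_j / m_j                # observed proportions p(theta_j)
p_j
```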
With r_j observed, p(Θ_j) can be computed at each of the J ability levels established along the ability scale. The main task is then to find an Item Characteristic Curve (ICC) that best fits the observed proportions of correct responses. For the estimation, initial values of the item parameters are established and used to compute p(Θ_j) with the logistic equation. The item parameters are then adjusted iteratively to find values that better reflect the observed data. This process of adjusting the estimates continues until the adjustments become so small that little improvement in the agreement is possible. At this point the estimation procedure terminates, and the current values of the item parameters (the discrimination a and difficulty b) are taken as the estimates.

The method used to calculate item parameters from response data is called marginal maximum likelihood. Given the joint distribution f(x_1, x_2), we can calculate the marginal distribution of x_1 as

\[ f(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2) \, dx_2 . \]

Let y_i be the response vector for person i, with y_ij the response given by person i to item j. Let J be the total number of items, Θ_i the ability of person i, and Φ the matrix of true item parameters. Then

\[ f(y_i \mid \Theta, \Phi) = \prod_{j=1}^{J} P_{y_{ij}}(\Theta_i) . \]

Hence the marginal distribution with respect to the item parameters is

\[ f(y_i \mid \Phi) = \int f(y_i \mid \Theta, \Phi) \, g(\Theta) \, d\Theta . \]

Let Y be the response matrix over all persons, with n persons in total. Then

\[ f(Y \mid \Phi) = \prod_{i=1}^{n} f(y_i \mid \Phi) \]

and, taking logarithms for the likelihood,

\[ \log L(Y \mid \Phi) = \sum_{i=1}^{n} \log f(y_i \mid \Phi) . \]

The value of Φ at which the likelihood function is maximised is found via Bayesian estimation as described above.

4 Dataset

4.1 Data Collection

For the purpose of our study, we have decided to create an 'RTI database' as part of our research. Our dataset consists of the RTI applications that have been posted to all public educational institutions by the citizens of India. The data collected consist of RTI applications (which include the RTI queries), the date of reply to each query, and the rejected queries with their grounds for rejection. Collection is ongoing and the database is not yet complete.

Data collection formally started on 01.01.2015. RTI data are not available online but have to be collected from each individual institution. Hence we resorted to filing an RTI application of our own asking for the required data, namely all the RTI applications received by the institution, the date of reply to each query, and the rejected queries with their grounds for rejection. There is no facility for filing RTI applications online, so we had to post our application to each institution. We started with the educational boards at the high school and higher secondary levels and moved on to universities. We shall collect RTI data from a total of 1053 educational institutions across India. To date, we have filed RTI applications with a total of 360 institutions and have received a variety of replies to the same application from different institutions, both positive and negative. Of the institutions that received our RTI application requesting the RTI data, 56.38% rejected our application, citing various reasons. So far we have collected data from a total of 44 institutions, and 113 additional institutions have agreed to give us the data (on payment of an extra fee, or if we collect the data by visiting their office). The average time to receive a reply to our application is 53.2 days.
For the institutions from which we have collected data, it has taken an average of 73.9 days to finally receive the data. This has yielded around 35,000 RTI applications and reply statistics. Each RTI application contains multiple queries (or sometimes just a single query). India is a multilingual country, and the queries are mostly written in the local language of the area to which the institution belongs. The data have not been processed yet, so a precise count of the total number of queries is unavailable.

4.2 Data modelling

A citizen of India can ask an RTI query on any topic that is relevant to the institution with which the application is filed. There is also a provision for transferring the RTI application to the appropriate department if the reply or sought document is not held by the office that received the application. As a result, we find a variety of query types belonging to different topics. Upon closer inspection of the data we received, we observed that the queries can more or less be divided into a fixed number of topics. Some topics are queried more, hence are popular among the masses, whereas others receive fewer queries. Areas of educational institutions such as academics (marks), research and infrastructure are targeted more, since people are more interested in the workings of these departments. Hence analysing the RTI query-reply patterns of these specific topics is of paramount importance. For our experiments using the Graded Response Model, analysis can be done on the reply, rejection and appeal statistics. This will indicate transparency across institutions and categories, the probability of a query in a particular category being accepted or rejected, and so on. Specifically, we:

– create matrices based on queries asked, queries replied to, queries rejected and queries appealed; and
– analyse the behaviour patterns of institutions in answering or rejecting queries and identify the most frequently asked topics.

The GRM models items with polytomous response categories. The model takes as input a matrix with items (questions) on one dimension and person parameters on the other; the entries are the response categories chosen by each person for each item. Modelling consists of finding the optimal values of the model parameters that best describe the given data. For our RTI data, we can create matrices of reply statistics, rejection statistics and queries asked, filled respectively with the percentages of replied queries, rejected queries and queries asked. To draw the analogy between the two settings: the persons in GRM data are represented by the institutions in our RTI data, the items by the topics to which the queries belong, and the response categories by percentages (0-100). Since the entries in the matrix are percentages and are ordinal in nature, the use of the GRM is appropriate.

The utility of the GRM on our RTI data lies in the latent patterns it helps identify. With respect to our data, the ability of persons represents the 'transparency' of institutions (with respect to answering or rejecting RTI queries), and the item difficulty denotes the implementation of the Act across institutions for each query topic. This parallelism of latent patterns between the typical GRM dataset (multiple-choice questions) and the RTI data (query-reply statistics) is what makes this an interesting approach.
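To make the analogy concrete, the following sketch builds such a reply matrix in R for the first three institutions of the synthetic data introduced in Section 5 (Table 1); with real data the entries would instead be computed from the collected reply statistics.

```r
## Institution-by-topic reply matrix: rows are institutions (the
## 'persons' of the GRM), columns are query topics (the 'items'), and
## entry (i, j) is the percentage of topic-j queries replied to by
## institution i.
topics <- c("Finance", "Academic", "Employment", "Alumni", "Medical")
reply_pct <- matrix(c(75, 10, 26, 28, 45,   # institution 1
                      35, 49, 70, 15, 11,   # institution 2
                      62, 89,  6, 38, 50),  # institution 3
                    nrow = 3, byrow = TRUE,
                    dimnames = list(paste("Inst", 1:3), topics))
reply_pct
```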
Modelling topic-wise statistics gives a more in-depth picture of the dynamics of the RTI query-reply process and captures intrinsic details hidden under the envelope of a public institution's overall performance. It is often observed that certain sections of a public body are more efficient while others are lethargic. With a targeted analysis of RTI queries divided into topics, we aim to discover specific issues, or excellence, in the different divisions of the same institution.

5 Experiment and Results

5.1 Constructing the dataset

Our dataset consists of matrices constructed from the RTI database we created. An RTI application can contain multiple queries. A survey of the collected data has shown that queries can more or less be classified into a fixed number of categories or topics, each independent of the others. A few examples of such categories are Administration, Library, Exams, Courses, Results, Academics, Admissions, Research and Tenders. Each category has its own characteristics with respect to reply and rejection statistics. In order to dissect the properties of the RTI process and understand its hidden traits, analysing category-wise and institution-wise trends equips us with more information on the implementation of the RTI Act.

The RTI data collection is still ongoing, and only a fraction of the data is in our hands. Additional tasks, such as translating various local languages into English and digitizing the data received in the form of photocopies, are yet to be undertaken. For the experiment, we have therefore constructed a synthetic matrix of reply statistics that resembles our RTI dataset (the RTI applications collected so far). We created the matrix with query topics on one dimension (items) and institutions on the other (person parameters). The matrix consists of ten institutions and five topics: institutes 1 to 6 are assumed to be central educational institutes and institutes 7 to 10 state institutes. Institutions are arranged in rows, and the columns represent the five query topics. The matrix is filled with the percentage of queries replied to by each institution for each topic. The matrix with the initial values is shown in Table 1.

Table 1. Synthetic data containing response percentages of ten institutes and five items

Inst. No.  Finance  Academic  Employment  Alumni  Medical
 1            75       10         26        28       45
 2            35       49         70        15       11
 3            62       89          6        38       50
 4            48       78         52        95       71
 5            51       64         53        30       74
 6            84       70         94        69       97
 7            52       49         47        45       55
 8            24       29         27        22       34
 9             2       32         28         8       49
10            30       57         65         7       86

5.2 Transforming the dataset

Table 1 gives the raw values of our dataset. In order to fit these data with the GRM, the matrix needs to be modified. We have divided the percentages into five buckets, as shown in Table 2. The buckets are created so that each response category (each bucket) receives a minimum number of institutes' responses; this reduces the sparsity of the data by clubbing a percentage range together into a single group. The response categories follow a Likert scale, with 1 the lowest and 5 the highest rating. This is done because the GRM expects data in the form of ordinal response options; a sketch of the transformation follows.
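A minimal sketch of this transformation, assuming the bucket boundaries of Table 2 below:

```r
## Map reply percentages to the five ordinal categories of Table 2:
## [0,20] -> 1, (20,40] -> 2, (40,60] -> 3, (60,80] -> 4, (80,100] -> 5.
to_bucket <- function(pct)
  as.integer(cut(pct, breaks = c(0, 20, 40, 60, 80, 100),
                 labels = 1:5, include.lowest = TRUE))

to_bucket(c(75, 10, 26, 28, 45))   # first row of Table 1 -> 4 1 2 2 3
```

Applied column-wise to the full percentage matrix, e.g. apply(reply_pct, 2, to_bucket), this yields the ordinal matrix of Table 3.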
Here the five buckets represent five response options, and each institution 'responds' with the option corresponding to its reply percentage. Substituting the percentages with the categories of Table 2 results in the matrix shown in Table 3.

Table 2. Percentage range of categories

Category    1      2      3      4      5
% Range   0-20  21-40  41-60  61-80  81-100

5.3 Results

To run the GRM we chose the open-source platform R, which has several packages for IRT; we used the 'ltm' package. The parameters obtained by running the GRM on our synthetic data are shown in Table 4.

Table 3. Matrix created after substituting the percentages by assigned values

Inst. No.  Finance  Academic  Employment  Alumni  Medical
 1            4        1          2         2        3
 2            2        3          4         1        1
 3            4        5          1         2        3
 4            3        4          3         5        4
 5            3        4          3         2        4
 6            5        4          5         4        5
 7            3        3          3         3        3
 8            2        2          2         2        2
 9            1        2          2         1        3
10            2        3          4         1        5

Table 4. Item parameters after running the Graded Response Model on our data

Item         β_i1    β_i2    β_i3    β_i4    α_i
Finance     -1.451  -0.218   0.602   1.325  4.004
Academic    -2.559  -1.191   0.410   2.449  1.106
Employment  -5.197  -1.059   1.830   4.950  0.446
Alumni      -0.693   0.785   1.261   1.849  1.935
Medical     -2.424  -1.556   0.555   1.646  1.047

For each item, a graph is drawn of the probability of responding in a particular category against ability (the latent trait). Such curves, called Item Response Category Characteristic Curves, are shown for each of the five items in Figures 1 to 5. We used the Bayesian estimation procedure to calculate the ability parameter (θ) of each institute; θ gives us the transparency of an institution. The transparency of each institution is shown in Table 5.

Table 5. Transparency parameters after running the Graded Response Model on our data

Inst. No.  Transparency (θ)
 1              0.426
 2             -0.906
 3              0.690
 4              0.589
 5              0.276
 6              1.623
 7              0.260
 8             -0.720
 9             -1.590
10             -0.542

Fig. 1. Item Response Category Characteristic Curve for the item Finance
Fig. 2. Item Response Category Characteristic Curve for the item Academic
Fig. 3. Item Response Category Characteristic Curve for the item Employment
Fig. 4. Item Response Category Characteristic Curve for the item Alumni
Fig. 5. Item Response Category Characteristic Curve for the item Medical

5.4 Discussion

The GRM has assigned an ability parameter to each of the ten institutions based on the reply statistics. In the context of our dataset, the ability parameter represents the transparency of an institute: the higher the ability, the greater the percentage of replies given to RTI queries, and hence the more transparent the institution. From Table 5 it is seen that institute 6, with scores (5, 4, 5, 4, 5), has the highest ability (1.623), and institute 9, with scores (1, 2, 2, 1, 3), has the lowest (-1.590). Arranging the institutions by transparency value shows that all central institutions except institution 2 have higher transparency than the state institutions.

Each β_ic is the θ-value of the transition between adjacent response categories. It is the boundary at which the probability of the response falling in the previous category (to the left) drops below 50% and the probability of the response falling in the subsequent categories (to the right) rises above 50%. These threshold values differ across items, indicating that each item is modelled separately: the response thresholds are not uniform across items but depend on the data distribution of each item. Each item also has a discrimination parameter, discussed below.
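For reference, the computation behind Tables 4 and 5 can be reproduced with a short script; grm() and factor.scores() are functions of the 'ltm' package, and the object ratings is assumed to hold the 10 x 5 ordinal matrix of Table 3.

```r
## Fit the graded response model to the ordinal reply matrix and
## extract the quantities reported above.
library(ltm)

fit <- grm(ratings)                # ratings: ordinal matrix of Table 3
coef(fit)                          # item thresholds and slopes (Table 4)
factor.scores(fit, method = "EB")  # ability/transparency estimates (Table 5)
plot(fit)                          # item response category characteristic curves
```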
An item with a high discrimination parameter discriminates well between institutes of high and low ability. In our results, Finance has the highest discrimination parameter and Employment the lowest. For an item with a low discrimination parameter there is little distinction between the reply patterns of high-ability and low-ability institutions; hence observing the reply statistics of the Employment item alone is not enough to judge the transparency of an institute. This provides a kind of quality assessment of each item with respect to judging RTI characteristics across institutions.

The GRM also models the probabilities of how each institution responds to the different items, that is, the query topics. It can be observed from the results that certain institutions (for example, institution 6 with θ = 1.623) are very good at responding to Finance-category questions (Figure 1) but not as good at responding to Employment-category questions (Figure 3). In other words, a highly transparent institution that replies efficiently to finance-related queries does not reply as efficiently to employment-related queries. This reveals an inconsistency in RTI replies across departments of the same institution and leads us to ask why such inconsistencies are present.

6 Conclusion

In this paper we have modelled the RTI query-reply process via Item Response Theory (IRT). We created a synthetic dataset that resembles our collected RTI data in its characteristics and modelled it as input to an IRT model. We selected the GRM as the preferred model and ran it successfully, with promising results. The novelty of our approach lies in two main points.

Firstly, such an analysis of RTI data has never been undertaken. We are collecting RTI data at the level of individual applications from public educational institutions spanning multiple locations across India. Most RTI studies are limited to specific regions or specific issues, in that their surveys are designed to explore a fixed set of problems. Our present work of applying learning algorithms to uncover hidden traits in the RTI query-reply process is the first of its kind. Moreover, the application of the GRM has so far been limited largely to the examination domain; this work extends its scope of application.

Secondly, the implications of the outcomes of this experiment are far-reaching. With this attempt, we have assigned each institution a transparency value with respect to its reply patterns. Our experiment with the synthetic data reveals that the central institutions are more transparent in replying to citizens' queries than the state institutions. A closer look at Tables 4 and 5 helps us extract further information: for example, certain institutions (such as institution 6) are very good at responding to Finance-category questions (Figure 1) but not as good at responding to Employment-category questions (Figure 3). This reveals an inconsistency in RTI replies across departments of the same institution and leads us to ask why such inconsistencies are present. It indicates that the same law is being implemented in different ways across institutions, as well as across departments within the same institution. A solution may be to bring changes to the ordinances of the institution.
Hence, this work of analysing RTI queries and reply statistics will also give us a strong basis for proposing amendments to the rules that govern an institution. Once data collection is complete, we shall apply this model to our actual RTI dataset, and the conclusions drawn from the results will give a clear picture of the laws and policies that govern our public institutions.

References

1. The Constitution of India, http://lawmin.nic.in/coi/coiason29july08.pdf
2. What is the Procedure of Amendment of the Constitution of India?, http://www.preservearticles.com/201012251615/procedure-of-amendment-of-the-constitution-of-india.html
3. http://ccis.nic.in/WriteReadData/CircularPortal/D2/D02rti/10_9_2008-IR26042011.pdf
4. The Right to Information (Amendment) Bill, 2013, http://www.prsindia.org/uploads/media/RTI%20%28A%29/RTI%20%28A%29%20Bill,%202013.pdf
5. Gerrish, S., Blei, D. M.: How they vote: Issue-adjusted models of legislative behavior. Advances in Neural Information Processing Systems, 2753-2761, 2012
6. Poole, K. T., Rosenthal, H.: A spatial model for legislative roll call analysis. American Journal of Political Science, 357-384, 1985
7. Lucchese, C., Orlando, S., Perego, R., Silvestri, F., Tolomei, G.: Discovering tasks from search engine query logs. ACM Transactions on Information Systems (TOIS), vol. 31, no. 3, 2013
8. Beitzel, S. M.: On understanding and classifying web queries. Citeseer, 2006
9. Salton, G., Wong, A., Yang, C.: A vector space model for automatic indexing. Communications of the ACM, vol. 18, no. 11, 613-620, 1975
10. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. 41, no. 6, 1990
11. Moschitti, A., Quarteroni, S., Basili, R., Manandhar, S.: Exploiting syntactic and shallow semantic kernels for question answer classification. Annual Meeting of the Association for Computational Linguistics, vol. 45, no. 1, 2007
12. McKeown, K. R.: Paraphrasing using given and new information in a question-answer system. Proceedings of the 17th Annual Meeting of the Association for Computational Linguistics, 67-72, 1979
13. Hovy, E., Hermjakob, U., Ravichandran, D.: A question/answer typology with surface text patterns. Proceedings of the Second International Conference on Human Language Technology Research, 247-251, 2002
14. Baker, F. B.: Equating tests under the graded response model. Applied Psychological Measurement, vol. 16, no. 1, 87-96, 1992
15. Preston, K. S. J., Parral, S. N., Gottfried, A. W., Oliver, P. H., Gottfried, A. E., Ibrahim, S. M., Delany, D.: Applying the Nominal Response Model within a longitudinal framework to construct the Positive Family Relationships Scale. Educational and Psychological Measurement, 2015
16. Samejima, F.: Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 1969