=Paper=
{{Paper
|id=Vol-2936/paper-217
|storemode=property
|title=A Search Engine System for Touché Argument Retrieval task to answer Controversial Questions
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-217.pdf
|volume=Vol-2936
|authors=Edoardo Raimondi,Marco Alessio,Nicola Levorato
|dblpUrl=https://dblp.org/rec/conf/clef/RaimondiAL21
}}
==A Search Engine System for Touché Argument Retrieval task to answer Controversial Questions==
A Search Engine System for Touché Argument Retrieval task to answer Controversial Questions

Edoardo Raimondi, Marco Alessio and Nicola Levorato
University of Padua, Italy

"Search Engines", course at the master degree in "Computer Engineering", Department of Information Engineering, University of Padua, Italy. Academic Year 2020/21.
edoardo.raimondi@studenti.unipd.it (E. Raimondi); marco.alessio.1@studenti.unipd.it (M. Alessio); nicola.levorato.2@studenti.unipd.it (N. Levorato)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We present a search engine system capable of retrieving relevant arguments from a forum of controversial questions. Our focus is on processing the corpus, indexing the collection and searching for each query, while also addressing the document quality problem through a machine learning approach. Furthermore, we provide a brief evaluation of our retrieval results based on the 2020 topics of the Touché Argument Retrieval task.

Keywords
Touché 2021, Argument Retrieval, Search Engines

1. Introduction

This report describes the project developed for the Search Engines course 2020/21 of the University of Padua, guiding the reader through our system along a path of increasing specificity. The task addressed is the Touché Argument Retrieval for Controversial Questions. Its goal is to support users who search for arguments to be used in conversations (e.g., getting an overview of pros and cons or just looking for arguments in line with a user's stance). Given a question on a controversial topic, the task is to retrieve relevant arguments from a focused crawl of online debate portals.

The paper is organized as follows: Section 2 introduces related work; Section 3 describes our approach; Section 4 explains our experimental setup; Section 5 discusses our main findings and reports a statistical analysis on the 2020 edition data; finally, Section 6 draws some conclusions and outlines future work.

2. Related Work

The baseline that the literature provides for this Controversial Questions retrieval task consists of BM25, TF-IDF and DPH. Successful works from the 2020 edition went further, also considering and managing:

1. query expansion (from WordNet synonyms/antonyms to GPT-2 generation);
2. document representation using transformers;
3. re-ranking based on argument quality prediction;
4. re-ranking based on sentiment (neutral sentiment);
5. pseudo-relevance feedback.

Even if all of these points are essential to provide a noteworthy system, in this work we focus on query expansion and re-ranking based on argument quality prediction. A statistical and performance analysis of the chosen approaches follows.

3. Methodology

The general structure of our search engine system can be divided into 6 parts, each corresponding to a different package in our source code. In particular, each component has a different aim:

1. parsing corpus files and extracting documents;
2. analyzing input documents to extract tokens;
3. building the index;
4. searching for each provided topic;
5. providing a linear classification of the retrieved documents;
6. providing different classes to execute the main functions of the project.

The details of each of them are reported separately in the following sections.
3.1. Parsing

The parsing part of our system is implemented in the package named "parse". This package contains the ToucheParser class, used to parse every JSON corpus file inside the corpus directory of the project. Parsing is done with the help of the Gson library, which deserializes the args.me corpus in a streaming fashion. This is mandatory since the files are too large to fit in main memory during parsing. When an instance of the ToucheParser class is created, the constructor takes the path to the corpus file, which is deserialized using a JsonReader, the Gson streaming JSON parser. The class then implements some basic methods, such as the iterator, hasNext and next functions, used to iterate over the parsed documents.

While parsing the corpus files, the ToucheParser extracts each document from the JSON structure and converts it into a ToucheDocument object. This class contains the most relevant fields (such as the id, the text and the conclusion) that are common to every document of the corpus, plus a method that converts the object into an instance of the ParsedDocument class, which is the representation of a document used in all other sections. The ParsedDocument class represents a parsed document to be indexed; it contains two main fields:

1. the id: the unique identifier contained in each document of the corpus;
2. the body: composed by appending the other relevant fields of the document (such as the conclusion and the text in the premise) into a single string, if they are not null.

The following figure shows an example of a document in JSON format inside the corpus file "parliamentary.json". The three highlighted fields are the ones extracted during parsing.

Figure 1: Example of a document in JSON format

3.2. Analyze

The analysis part of our system is implemented in the package named "analyze". This package contains the analyzer used to process each document of the corpus and produce the corresponding stream of tokens. The first phase of our analyzer tries to filter out all junk data from the document text. In the final version of the project, we found that the best compromise between quality of results and time spent was to simply discard all links through a regular-expression matching filter. Then tokenization takes place, which splits the stream of text into tokens at white spaces and the most common punctuation characters. Next we remove all possessive forms ('s) from the end of each token and transform the tokens to lowercase. The second-to-last phase uses a stop list to discard the most frequent English words from each document, which would be of no use later during indexing and searching. At the end we apply a discard filter that removes all tokens containing non-letter characters.

The processing done by our analyzer is very basic. During the development phase, a more advanced analyzer was deployed; it aimed to group into a single token multiple words that together form a unit with a specific meaning different from that of the single words (e.g. "get" and "up" are fused together into "get up"). Because of the lack of improvements in the search results and the very tight project deadline, we ran out of time to develop it further and were forced to drop it.
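To make the pipeline above more concrete, the following is a minimal sketch of such an analyzer written against the Lucene API. It is not the project's actual class: the class name, the use of StandardTokenizer as a stand-in for the whitespace/punctuation splitting, the URL pattern and the use of Lucene's default English stop set in place of the project's own stop list are all illustrative assumptions.

```java
// Sketch of an analyzer with the stages described above: link removal,
// tokenization, possessive stripping, lowercasing, stop-word removal and a
// final filter that drops tokens containing non-letter characters.
import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ToucheAnalyzerSketch extends Analyzer {

    // Rough pattern for links, applied before tokenization.
    private static final Pattern LINKS = Pattern.compile("https?://\\S+|www\\.\\S+");

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Discard links by replacing them with a space.
        return new PatternReplaceCharFilter(LINKS, " ", reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new EnglishPossessiveFilter(source);        // drop trailing 's
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, EnglishAnalyzer.getDefaultStopSet());
        stream = new LettersOnlyFilter(stream);                          // drop tokens with non-letters
        return new TokenStreamComponents(source, stream);
    }

    // Keeps only tokens made exclusively of letters.
    private static final class LettersOnlyFilter extends FilteringTokenFilter {
        private final CharTermAttribute term = addAttribute(CharTermAttribute.class);

        LettersOnlyFilter(TokenStream in) {
            super(in);
        }

        @Override
        protected boolean accept() {
            for (int i = 0; i < term.length(); i++) {
                if (!Character.isLetter(term.charAt(i))) {
                    return false;
                }
            }
            return true;
        }
    }
}
```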
3.3. Indexing

The indexing part of our system is implemented in the package named "index". It does exactly what a standard Lucene indexer does. It is based on two classes:

1. ToucheIndexer class: it builds the index inside the experiment/index folder. It uses the ToucheParser class described before to extract documents from the corpus files and removes any duplicates with a HashSet. Since some short documents in the collection are not relevant, all documents whose body is shorter than a fixed number of characters (in our case, 10 characters) are discarded. The ramBuffer size used is 256 MB. Among the few similarity functions tested, BM25 turned out to be the best one for our case;
2. BodyField class: it represents a Lucene field containing the body of a document. It is also used to set or enable some options, such as the term vectors.

It is important to mention that the analyzer used in this section, whose details are described in the previous paragraph, is the same one used for searching.

3.4. Searching

The searching part of our system is implemented in the package named "search". The class ToucheSearcher performs all tasks related to search. For each topic, we use our analyzer to obtain the corresponding stream of tokens. With them, we can build the query that Lucene uses to retrieve a list of documents relevant to the topic. In order to get better results we perform query expansion: for each token, we add to the query:

• similar forms of the word (e.g. conjugations of verbs, plurals of nouns, ...), with a total relative weight of about 50%;
• synonyms of the word, with a total relative weight of about 50%;
• antonyms of the word, with a total relative weight of about 20%.

By total relative weight we mean the sum of the individual weights of the added words; note that in our current implementation these individual weights are all equal. The dictionary used in this phase is an ad-hoc modified version of the freely-available WordNet database (available at https://wordnet.princeton.edu/; its "WordNet 3.0" license allows to freely use, copy, modify and distribute it without fee or royalty), obtained by discarding all synsets formed by more than a single word. A sketch of this kind of weighted expansion is shown below.

In order to parse the topic XML files that contain all the queries to be searched, we developed a specific class, XmlTopicsParser. It extracts each query from a topic XML file using the DOM API, which creates an in-memory representation of the XML, and returns a QualityQuery array containing all the parsed queries.
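The sketch below illustrates the weighted expansion described in Section 3.4, using Lucene's BooleanQuery and BoostQuery. The WordNetDictionary interface, the field name "body" and the method names are hypothetical placeholders; only the weighting scheme (expansion groups worth about 50%, 50% and 20% of the original term, with equal individual weights) follows the description above.

```java
// Illustrative query expansion: the original term keeps full weight, each
// expansion group shares a fixed total weight split evenly among its words.
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExpansionSketch {

    private static final String FIELD = "body"; // hypothetical field name

    public static Query expand(List<String> tokens, WordNetDictionary dict) {
        BooleanQuery.Builder topic = new BooleanQuery.Builder();
        for (String token : tokens) {
            // Original term, with full weight.
            topic.add(new TermQuery(new Term(FIELD, token)), BooleanClause.Occur.SHOULD);
            // Expansion groups with their total relative weights.
            addGroup(topic, dict.similarForms(token), 0.5f); // conjugations, plurals, ...
            addGroup(topic, dict.synonyms(token), 0.5f);
            addGroup(topic, dict.antonyms(token), 0.2f);
        }
        return topic.build();
    }

    private static void addGroup(BooleanQuery.Builder builder, List<String> words, float totalWeight) {
        if (words.isEmpty()) {
            return;
        }
        float perWord = totalWeight / words.size(); // equal individual weights
        for (String word : words) {
            builder.add(new BoostQuery(new TermQuery(new Term(FIELD, word)), perWord),
                    BooleanClause.Occur.SHOULD);
        }
    }

    /** Hypothetical lookup interface over the single-word WordNet subset. */
    public interface WordNetDictionary {
        List<String> similarForms(String token);
        List<String> synonyms(String token);
        List<String> antonyms(String token);
    }
}
```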
Once the desired number of documents has been retrieved by our search function, we apply a machine learning subroutine to all of them. (At the beginning we considered applying machine learning only to the top n retrieved documents, but that decision would rest on the assumption that the first n documents are all relevant with respect to the current query; this assumption cannot be made in our case, since our system is not good enough to guarantee it.) Given this premise, and since we are not restricted to a binary relevance judgment, several aspects need to be considered and evaluated. Among all these possible aspects, this subroutine takes care of the general writing quality. Basically, we are not confident in saying that the first n documents are relevant, and it becomes challenging to understand how relevant, in terms of quality, a document is. In order to reach an acceptable bound in these terms, a linear regression model has been trained. The training part has been the object of interesting considerations that are worth describing separately (see Section 3.5).

Suppose we now have our run with all the k documents. We let our pre-trained model classify each of them and save the corresponding predicted quality in a dictionary data structure. The next step is to produce a final score that mixes the BM25 search score and the machine learning score. There are several ways to do it, but our choice (made upon measure evaluation, see Section 5) fell on the weighted average of the two scores. Once done, the run is rewritten as output according to the new scores (i.e. the weighted averages).

3.5. Linear Regression

The linear regression is implemented inside the package called "linear regression". For this purpose we relied on the machine learning library Weka. The Touché organizers provide a dataset of arguments with labels, including one named "Combined Quality". This feature describes the combination of rhetorical, logical and dialectical quality of a specific argument, and we use it as the label for training. The general pipeline is: train a linear regression model, classify all the retrieved documents and reorder them by mixing the quality score predicted by our model with the BM25 score (as described in Section 3.4). The training of the model has been performed as follows:

1. document vectorization (both for the training set and for fresh data);
2. fitting;
3. evaluation.

Since the last two steps are quite standard, we focus our explanation on the first operation. The application of the model within our system has already been described in the searching part.

3.5.1. Document vectorization

To represent complex sentences as a vector of numbers, the literature provides a large amount of techniques. Both for lack of time and of experience in this field, the chosen workflow is one of the simplest, but it works properly. First of all, documents pass through an ad-hoc customized Lucene analyzer, which applies stop-word removal (with the same stop list used for the corpora to ensure more compatibility), stemming and filtering based on word lengths. We collect the k most frequent words of a given domain (e.g. the entire training set, or the set of retrieved documents, i.e. the fresh samples to classify) in a vector mostFrequentWords. Then we need to encode each document; in this specific case a multi-hot encoding is performed. To do so, we construct a vector v of size k+1 and set cell i to 1 if the document contains the i-th most frequent word, 0 otherwise (e.g. v[0] = 1 if the document contains the most frequent word, 0 otherwise, and so on). After this step only the first k cells of the vector v are filled. Finally, in the last position we set the "Combined Quality" feature in order to use it as a label. (Note: even in the classification part this (k+1)-th element needs to be present, although it is obviously not considered; the reason why Weka requires it remains a mystery to us.) All the vectors (i.e. the documents) are stored in CSV files. This facilitates the communication between our code and the Weka functions, since they load data via files of this type.

3.5.2. Fitting and Evaluation

In order to fit and evaluate our model, we took our pre-vectorized training set and split it into a training and a test set. Once the model was fitted on the training set, we evaluated it using the mean squared error as loss function. The evaluation helps us to tune the hyperparameters of the document vectorization in the best possible way. The most important parameters were: which stop list to use, how many most frequent words to consider, and the usage of the absolute frequency instead of other possible metrics. All this work is done by different classes, as good modular programming requires, whose names are self-descriptive: CSVCleaner, RegressionAnalyzer, CSVWriter, Encoder, ToucheRegression.
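As a reference, this is a minimal sketch of the fitting and evaluation step with Weka, assuming the multi-hot vectors (with the "Combined Quality" label in the last column) have already been written to a CSV file as described in Section 3.5.1. The file name and the 80/20 split are illustrative, not necessarily the project's actual settings.

```java
// Minimal Weka fit/evaluate sketch over pre-vectorized documents.
import java.io.File;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class RegressionSketch {

    public static void main(String[] args) throws Exception {
        // Load the pre-vectorized documents; the last attribute is the label.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("train.csv"));
        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Simple random 80/20 split into training and test set.
        data.randomize(new Random(42));
        int trainSize = (int) Math.round(data.numInstances() * 0.8);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Fit the linear regression model on the training set.
        LinearRegression model = new LinearRegression();
        model.buildClassifier(train);

        // Evaluate on the held-out test set with squared-error based metrics.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        System.out.println("RMSE on test set: " + eval.rootMeanSquaredError());

        // Predicted "Combined Quality" for a fresh document vector.
        double predictedQuality = model.classifyInstance(test.instance(0));
        System.out.println("Predicted quality: " + predictedQuality);
    }
}
```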
3.6. Execution

The package "exec" has been designed with more than the pure information retrieval purpose in mind; it is rather the child of a software engineering point of view. This package contains all the executable classes of the project and is used mainly for debugging purposes. For example, the Indexing class is used to build the index in the experiment directory, the Parsing class is used to have a look at the documents extracted from a chosen corpus file, and the Searching class is used to build the runs. All the corpus files should be placed inside a corpus directory, the experiment directory contains the index and the run files, and all the topic XML files should be inside a topics directory.

4. Experimental Setup

Our experimental setup is made up of:

• as document collection, the args.me corpus (version 2020-04-01) available at https://zenodo.org/record/3734893#.YIxOLbUzaUk;
• as evaluation data, the 2020 topics available at https://webis.de/events/touche-21/shared-task-1.html#data;
• as evaluation measure, we mostly used NDCG (because it is the one used in the 2020 Touché results), without forgetting precision, recall, average precision and all the other measures provided by the trec_eval program;
• the URL of our git repository: https://bitbucket.org/upd-dei-stud-prj/seupd2021-mr/src/master/.

5. Results and Discussion

5.1. Results of our system

The score provided by Lucene and the BM25 similarity judges only the relevance of a document, without considering its overall quality, such as the writing style and the length. On the other hand, the score provided by machine learning alone takes care only of the general writing quality. Therefore we decided that combining the two scores could yield better results. In order to do so, we first have to make both scores comparable, by normalizing them to the [0.0, 1.0] range using this formula:

score_normalized = (score - min_score) / (max_score - min_score)    (1)

Then we proceed by applying a linear combination of the two. We use a factor alpha in the range [0.0, 1.0] to weight the score given by the BM25 similarity and 1 - alpha to weight the score given by the machine learning approach. The formula used to combine the two scores is the following:

score_mixed = alpha * normScore_BM25 + (1 - alpha) * normScore_ML    (2)

If the value of alpha is within the range [0.5, 0.8], the quality of the run increases w.r.t. the two pure approaches (pure BM25 at alpha = 1.0, pure ML at alpha = 0.0), reaching the best results at about 0.6 (we chose NDCG@5 as the metric of choice because it is the one used in the Touché 2020 results), as shown in Figure 2. This means that the BM25 score should be weighted higher than the ML score, that is, the relevance judgement is more important than the writing quality of the documents.

Figure 2: Scaling of the NDCG@5 measure w.r.t. changes of alpha.

From our tests we saw that all the major measure values of the run increase using this mixed approach: by keeping the top 30 of the 1000 documents initially retrieved, we obtained the table in Figure 3.

Figure 3: All major performance measures for the three approaches: pure BM25 score, pure ML score, and mixed (alpha = 0.6).
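The following is a minimal sketch of formulas (1) and (2), i.e. the min-max normalization of each score list and the weighted average used for re-ranking. Map keys, method names and the guard for a degenerate score range are illustrative choices, not the project's actual code.

```java
// Sketch of the min-max normalization (1) and linear combination (2) used to
// mix the BM25 score with the predicted quality score, keyed by document id.
import java.util.HashMap;
import java.util.Map;

public class ScoreMixerSketch {

    /** Min-max normalization of a set of scores to the [0.0, 1.0] range, formula (1). */
    static Map<String, Double> normalize(Map<String, Double> scores) {
        double min = scores.values().stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
        double max = scores.values().stream().mapToDouble(Double::doubleValue).max().orElse(1.0);
        double range = (max - min) == 0.0 ? 1.0 : (max - min); // avoid division by zero
        Map<String, Double> normalized = new HashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            normalized.put(e.getKey(), (e.getValue() - min) / range);
        }
        return normalized;
    }

    /** Weighted average of the normalized BM25 and ML scores, formula (2). */
    static Map<String, Double> mix(Map<String, Double> bm25, Map<String, Double> ml, double alpha) {
        Map<String, Double> normBm25 = normalize(bm25);
        Map<String, Double> normMl = normalize(ml);
        Map<String, Double> mixed = new HashMap<>();
        for (String docId : normBm25.keySet()) {
            double mlScore = normMl.getOrDefault(docId, 0.0);
            mixed.put(docId, alpha * normBm25.get(docId) + (1 - alpha) * mlScore);
        }
        return mixed; // documents are then re-ranked by this mixed score
    }
}
```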
For all our tests we used vectors of size 250 (i.e. the maximum number of most frequent words) to train the linear regression model. Our tests show that changes in the number of most frequent words considered have almost no impact on the NDCG@5 measure, as shown in Figure 4. In the final version of our code, we decided to stick with the top 250, because the impact on runtime is still negligible and it should provide a more robust solution compared to using a lower number of words.

Figure 4: Scaling of the NDCG@5 measure w.r.t. the number of words used in the ML phase.

5.2. Results of our system compared to the Touché 2020 best system

For this comparison, we chose from the Touché 2020 runs (available at https://webis.de/events/touche-21/runs-task-1-2020.zip, dataset v2) the one with the highest average NDCG@5 measure. The chosen run is "DirichletLM" by team "Swordman", compared against a run obtained from our system using alpha = 0.6. Since both runs retrieve 1000 documents for each topic, the first metric we decided to analyze is the recall for each topic. As we can see in Figure 5, our system performs significantly better, as shown by the average recall: 0.8637 versus 0.7693.

Figure 5: Recall of all 49 topics of both runs.

Despite the higher recall, Figure 6 shows that our system performs significantly worse than the "Swordman" run when we consider the NDCG@5 metric (average NDCG@5: 0.6184 versus 0.8266). From these data, we can conclude that our system retrieves more of the relevant documents but ranks them poorly: in order to achieve higher results, efforts should be focused on this aspect.

Figure 6: NDCG@5 of all 49 topics of both runs.

5.3. Statistical analysis

In order to better understand the behavior and the meaning of our results, a proper statistical analysis is needed. Our main goal is to ensure, with as much confidence as possible, the improvement of our approach with respect to a basic BM25 approach. We computed 5 different runs: 1 using only BM25 and 4 that also add machine learning while changing some parameters, according to what was already said in Section 5.1. Each run is represented by an array v of length 49 (since we have 49 topics), and each element of this array is a measure score for that topic (i.e. v[1] = AP of topic 1). Specifically, we consider in our analysis the average precision and NDCG@5. We first look at our data in a more informal way, and then continue our investigation via hypothesis testing.

5.4. Data visualization

First of all we start with average precision. A first overview is essential to give us a general informal taste of our data.

Figure 7: Statistical summary of the data on average precision.

Before making any consideration, let's clarify the notation. The 3 central columns represent runs with different score weights, as already explained in Section 5.1 (see Figure 2). Jolly instead is a run with slightly different searching parameters; this is done to check the variance of our machine learning model (i.e. we expect it to be similar to the other machine learning runs). Finally, Baseline BM25 is self-descriptive. Now consider a box plot representation.

Figure 8: BoxPlot of the 5 runs on average precision.

This first plot gives us a preliminary intuition about the scenario. It is indeed visible that the machine learning runs behave better. Another useful representation can be obtained using stripcharts.

Figure 9: StripChart of the 5 runs on average precision.

Here the visualization of the outliers is more evident and lets us make some considerations.
Even if Baseline BM25 reaches higher AP scores, it reaches them in isolation, while the Mixed and Jolly runs provide a denser cloud of higher scores. Also, the points with most density for BM25 lie in a very low score area. This suggests that the two main systems (baseline and machine learning) are different and, in particular, that the latter provides better results. Exactly the same considerations can be made looking at NDCG@5 as scoring measure instead of average precision.

Figure 10: Statistical summary of the data on NDCG@5.

Figure 11: BoxPlot of the 5 runs on NDCG@5.

Figure 12: StripChart of the 5 runs on NDCG@5.

This informally gives us more confidence about our principal intuition (i.e. there is an improvement when adding our machine learning subroutine to the baseline BM25). We need to formalize it better via hypothesis testing.

5.5. Hypothesis testing

Let us now validate our assumptions using more sophisticated statistical analysis techniques, namely hypothesis testing and the t-test. As stated before, we intend to verify the actual differences between our systems (i.e. runs). In order to do that, it is necessary to correctly formalize the null and the alternative hypothesis and to define a suitable significance level alpha. The hypotheses are:

1. H0: the two runs are not different (i.e. they are sampled from the same distribution);
2. H1: the two runs are different (i.e. they are sampled from different distributions).

The significance level is alpha = 0.05. A t-test has been performed between every possible combination of our 5 runs. We expect baseline BM25 to be significantly different from the machine learning runs (i.e. we reject the null hypothesis), while it is reasonable to expect the machine learning runs to be equal to each other (i.e. we fail to reject the null hypothesis). Let us now take a look at the p-value results of our t-tests, calculated both on average precision and on NDCG@5. They are represented by a matrix t.

Figure 13: p-value t-test matrix on average precision.

Figure 14: p-value t-test matrix on NDCG@5.

The runs are, respectively, Baseline BM25, Mixed 0.57, Mixed 0.6, Mixed 0.62 and Jolly. Of course the matrix t is populated only in the upper triangular area, since t-test(X, Y) = t-test(Y, X). Particular attention has to be paid to the first row: it represents the tests between the baseline and all the other systems. All the p-values in t[1, :] are significantly below our significance level (i.e. t[1, :] < alpha). This validates our preliminary assumptions and tells us that adding a machine learning subroutine provides different (and, looking at the data, better) results. It also suggests that we could decrease our significance level considerably without affecting the final statement. So we reject the null hypothesis. Another important confirmation comes from the analysis of rows 2, 3 and 4. They contain the p-values across the systems using machine learning but with different hyperparameters. In particular, these p-values are above alpha, so we fail to reject the null hypothesis in this case. In terms of our system, this means that, no matter which hyperparameters one chooses, the subroutine will provide results sampled from a single distribution. Finally, the fact that all these results hold for both measures (i.e. average precision and NDCG@5) gives us even more certainty about our beliefs.
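For reference, the following sketch shows how such a pairwise comparison over the 49 per-topic scores could be computed with Apache Commons Math. Since the text does not state which t-test variant was used, the paired version is assumed here (all runs score the same topics); run names and the dummy score loader are placeholders.

```java
// Sketch of the pairwise significance tests over per-topic measure scores.
import java.util.Random;

import org.apache.commons.math3.stat.inference.TTest;

public class RunComparisonSketch {

    public static void main(String[] args) {
        // One score (e.g. AP or NDCG@5) per topic, 49 entries per run.
        double[][] runs = loadPerTopicScores(); // placeholder loader
        String[] names = {"Baseline BM25", "Mixed 0.57", "Mixed 0.6", "Mixed 0.62", "Jolly"};
        double alpha = 0.05;

        TTest tTest = new TTest();
        // Fill only the upper triangle, since the test is symmetric.
        for (int i = 0; i < runs.length; i++) {
            for (int j = i + 1; j < runs.length; j++) {
                double p = tTest.pairedTTest(runs[i], runs[j]);
                String verdict = (p < alpha) ? "reject H0" : "fail to reject H0";
                System.out.printf("%s vs %s: p = %.4f (%s)%n", names[i], names[j], p, verdict);
            }
        }
    }

    // Dummy random scores standing in for the real 5 x 49 per-topic AP / NDCG@5 values.
    private static double[][] loadPerTopicScores() {
        Random rnd = new Random(42);
        double[][] scores = new double[5][49];
        for (double[] run : scores) {
            for (int t = 0; t < run.length; t++) {
                run[t] = rnd.nextDouble();
            }
        }
        return scores;
    }
}
```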
5.6. Failure analysis

A little bit of failure analysis has been performed. Both for lack of time and of workforce, we considered only one topic, the worst-case one (i.e. the topic that reaches 0.000 on NDCG@5). Let's have a look at it:

Topic 12
Title: Should birth control pills be available over the counter?
Description: The easy access to birth control pills may have repercussions on people's everyday behavior in taking precautions, while disregarding negative side effects. A user wonders whether birth control pills should be prescription drugs, so that doctors have a chance of explaining them to their users.
Narrative: Highly relevant arguments argue for or against the availability of birth control pills without prescription. Relevant arguments argue with regard to birth control pills and their side effects only. Arguments only arguing for or against birth control are irrelevant.

Faced with this topic, our system mainly returned general discussions arguing for or against birth control, which is exactly what we did not want to retrieve (as specified in the narrative). This could be due to the fact that, in order to fully understand the topic, it is absolutely necessary to give maximal importance to the description field. Our system instead unfairly favors the words in the title, giving them much greater weights with respect to the words in the description field. In order to solve this situation, a dynamic weight assignment could be developed. However, looking at the runs of the previous year, it turned out that this topic was really tough for everybody. This surely does not solve our problem, but, as we say in Italy, "mal comune, mezzo gaudio" (a trouble shared is a trouble halved).

6. Conclusions and Future Work

To conclude, our work has highlighted different directions to pursue in order to deal with human-made documents in a forum scenario. This scenario is indeed very challenging, since the document corpus presents a lot of both intra- and inter-document noise (e.g. entirely useless documents, or document bodies full of badly written arguments). Given all the technical results already provided in the previous sections, we can claim that most of the easily achievable improvements lie in developing a better document ranking system. The aspects to consider when trying to improve it are:

• for the ML phase, instead of just using a binary model, different parameters could be used to fill our mostFrequentWords vector, such as Inverse Document Frequency (IDF), relative frequency or other state-of-the-art metrics;
• trying other machine learning models instead of linear regression, such as Support Vector Machines (SVM) or Neural Networks (NN).

The study of the document collection is of fundamental importance to construct a good retrieval system: it is difficult to provide universal assumptions regarding collections, and they must be treated case by case to get the best results. Although our system recalls a very high percentage of the relevant documents (about 86%, as shown in Section 5), there is still room for improvement in this regard:

• develop a more sophisticated analyzer, in particular by using shingles and more advanced natural language processing techniques;
• try different similarity functions;
• dynamic query weight assignment.