
CoLe and UTAI at BioASQ 2015: experiments with similarity based descriptor assignment

Francisco J. Ribadas

Luis M. de Campos

Víctor M. Darriba

darribag@uvigo.es 2

Alfonso E. Romero

aeromero@cs.rhul.ac.uk 0

0 Centre for Systems and Synthetic Biology, and Department of Computer Science, Royal Holloway, University of London, Egham, TW20 0EX, United Kingdom
1 Departamento de Ciencias de la Computación e Inteligencia Artificial, Universidad de Granada, E.T.S.I. Informática y de Telecomunicación, Daniel Saucedo Aranda, s/n, 18071 Granada, Spain
2 Departamento de Informática, Universidade de Vigo, E.S. Enxeñería Informática, Edificio Politécnico, Campus As Lagoas, s/n, 32004 Ourense, Spain


In this paper we describe our participation in the third edition of the BioASQ biomedical semantic indexing challenge. Unlike our participation in previous editions, we have chosen to follow an approach based solely on conventional information retrieval tools. We have evaluated various alternatives for creating textual representations of MEDLINE articles to be stored in an Apache Lucene textual index. Those indexed representations are queried using the contents of the article to be annotated, and a ranked list of candidate descriptors is created from the retrieved similar articles. Several strategies to post-process those lists of candidate descriptors were evaluated. Performance in the official runs was far from that of the most competitive systems, but taking into account that our approach in the performed runs did not employ any external knowledge sources, we think that the proposed method could benefit from richer representations for MEDLINE contents.

This article describes the joint participation of a group from the University of Vigo and another group from the University of Granada in the biomedical semantic indexing task of the 2015 BioASQ challenge. Participants in this task are asked to classify new MEDLINE articles, labeling those documents with descriptors taken from the MeSH hierarchy.

Both groups (CoLe 4 from University of Vigo and UTAI 5 from University of Granada) have participated in the previous BioASQ editions. Our previous participations assessed the use of two different machine learning based techniques: a top-down arrangement of local classifiers and a Bayesian network induced by the thesaurus structure. Both approaches modelled the task of assigning descriptors from the MeSH hierarchy to MEDLINE documents as a hierarchical multilabel classification problem.

4 Compiler and Languages group, http://www.grupocole.org/
5 Uncertainty Treatment in Artificial Intelligence group, http://decsai.ugr.es/utai/

In this year's participation we have changed the basic approach of our systems, following a similarity based strategy, where the final list of MeSH descriptors assigned to a given article is created from the set of most similar MEDLINE articles stored in a textual index created from the training dataset. This neighbor based strategy was partially explored in our previous participations in the BioASQ challenge, where a sort of k nearest neighbor method was employed as a guide in the top-down traversal of the local classifiers approach and also in the selection of submodels (one per MeSH subhierarchy) in the Bayesian network based method. This k nearest neighbor filtering was adopted mainly for performance and scalability reasons, but it also had some positive effects on overall annotation quality. For the third BioASQ challenge we have concentrated our efforts on testing the suitability of this similarity based approach and on evaluating several strategies to improve the final ranked list of descriptors.

The rest of the paper is organized as follows. Section 2 briefly describes the main ideas behind the proposed similarity based approach for MEDLINE article annotation and also describes the text processing being applied. Section 3 gives details about the strategies for improving the final list of ranked descriptors by means of several post-processing methods. Finally, section 4 discusses our official runs in the BioASQ challenge and details the most relevant conclusions of our participation.

2 Similarity based descriptor selection

Approaches based on k nearest neighbors (k-NN) have been widely used in the context of large scale multilabel categorization, even with MEDLINE documents [ 1 ]. k-NN based methods are usually chosen for their scalability, their minimal parameter tuning requirements and, despite their simplicity, their ability to deliver acceptable results when large amounts of examples are available. The approach we have followed in our BioASQ challenge participation is essentially a large k-NN classifier, backed by an Apache Lucene 6 index, with some optimizations derived from MeSH usage recommendations on MEDLINE article annotation. Annotating MEDLINE with MeSH descriptors is a complex problem, with more than 25,000 possible classes arranged in a directed acyclic graph (DAG), but the availability of a huge training set labeled by human experts makes it an a priori favorable scenario for labeling estimates based on k-NN.

In our case we have tried to take advantage of certain aspects of the semantic indexing process with the MeSH thesaurus to improve the labeling process based on similarity. Following MeSH annotation guidelines [ 5 ] we propose a differentiated treatment for Check Tags. According to MeSH guidelines, Check Tags are widely used descriptors, shown in Figure 1, which describe some of the broader aspects of the MEDLINE articles. MeSH annotators can assign an arbitrary number of these Check Tags without any restriction regarding their location in the thesaurus hierarchy.

6 https://lucene.apache.org/

Figure 1. Check Tags and their corresponding MeSH descriptors.

Code  Check Tag         MeSH descriptor
U     Animal            D000818  Animals
V     Human             D006801  Humans
W     Male              D008297  Male
X     Female            D005260  Female
Y     In Vitro (PT)     D066298  In Vitro Techniques
b     Comp Study (PT)   D003160  Comparative Study
J     Cats              D002415  Cats
K     Cattle            D002417  Cattle
L     Chick Embryo      D002642  Chick Embryo
M     Dogs              D004285  Dogs
O     Guinea Pigs       D006168  Guinea Pigs
P     Hamsters          D006224  Cricetinae
Q     Mice              D051379  Mice
S     Rabbits           D011817  Rabbits
T     Rats              D051381  Rats
c     Ancient           D049690  History, Ancient
d     Medieval          D049691  History, Medieval
f     15th Cent         D049668  History, 15th Century
g     16th Cent         D049669  History, 16th Century
h     17th Cent         D049670  History, 17th Century
i     18th Cent         D049671  History, 18th Century
j     19th Cent         D049672  History, 19th Century
k     20th Cent         D049673  History, 20th Century
o     21st Cent         D049674  History, 21st Century

To try to exploit this singularity, our system separates the processing of Check Tags from the processing of regular MeSH descriptors. Our annotation scheme starts by indexing the contents of the MEDLINE training articles. For each new article to annotate, that index is queried using its contents as query terms. The list of similar articles returned by the indexing engine and their corresponding similarity measures are exploited to determine the following results:

- predicted number of Check Tags to be assigned
- predicted number of regular descriptors to be assigned
- ranked list of predicted Check Tags
- ranked list of predicted regular descriptors

The first two aspects constitute a regression problem, which aims to predict the number of Check Tags and descriptors to be included in the final list, depending on the number of Check Tags and descriptors assigned to the most similar articles identified by the indexing engine and on their respective scores. The other two tasks are multilabel classification problems, which aim to predict a Check Tags list and a regular descriptors list based on the descriptors and Check Tags manually assigned to the most similar MEDLINE articles. In both cases, regression and multilabel classification based on k-NN, similarity scores calculated by the indexing engine are exploited. These scores are computed during the query processing phase. Query terms employed to retrieve the similar articles are extracted from the original article contents and linked using a global OR operator to form the final query sent to the indexing engine.

In our case, the scores provided by the indexing engine are similarity measures resulting from the engine's internal computations and the weighting scheme being employed, which do not have a uniform and predictable upper bound. In order to make those similarity scores behave like a real distance metric we have applied the following normalization procedure:

1. Articles to be annotated are preprocessed in the same way as the training articles and are indexed by the Lucene engine.
2. At classification time, all of the relevant index terms from the article being annotated are joined by an OR operator to create the search query.
3. In the ranking of similar articles returned by the indexing engine the top result will be the same article used to query the index; this result is discarded but its score value ($score_{max}$) is recorded for future normalization.
4. For each of the remaining articles, the number of Check Tags and the number of regular descriptors are recorded, together with the list of real Check Tags and the list of real descriptors. Each of them is assigned an estimated distance to the article being annotated, equal to $1 - \frac{score}{score_{max}}$, which will be employed in the weighted voting scheme of the k-NN classification.

With this information the number of Check Tags and the number of regular descriptors to be assigned to the article being annotated are predicted using a weighted average scheme, where the weight of each similar article is the inverse of the square of its estimated distance to the article being annotated, that is, $\frac{1}{\left(1 - \frac{score}{score_{max}}\right)^2}$.
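The normalization and the weighted average described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the code used in the official runs: the function names, the `(score, count)` input layout and the small `eps` guard against a zero distance are all introduced here for the example.

```python
from typing import List, Tuple


def estimated_distance(score: float, score_max: float) -> float:
    """Estimated distance 1 - score/score_max, where score_max is the score
    of the query article against itself (the discarded top result)."""
    return 1.0 - score / score_max


def predict_count(neighbors: List[Tuple[float, int]],
                  score_max: float,
                  eps: float = 1e-9) -> float:
    """Weighted average of the descriptor counts of the similar articles.

    neighbors: (engine score, number of descriptors) for each similar article.
    eps: hypothetical guard so that a neighbor scoring exactly score_max
         does not cause a division by zero.
    """
    numerator = 0.0
    denominator = 0.0
    for score, count in neighbors:
        distance = max(estimated_distance(score, score_max), eps)
        weight = 1.0 / distance ** 2  # inverse of the squared distance
        numerator += weight * count
        denominator += weight
    return numerator / denominator


# Two similar articles with 10 and 14 descriptors; the closer one dominates.
predicted = predict_count([(5.0, 10), (2.5, 14)], score_max=10.0)
print(round(predicted, 4))
```

In practice the predicted value would be rounded to an integer before being used as the length of the returned list; the rounding rule is not specified in the text.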

To create the ranked list of Check Tags and the ranked list of regular descriptors a distance weighted voting scheme is employed, associating the same weight values (the inverse of the squared estimated distances) with the respective similar articles. Since this is actually a multilabel categorization task, there are as many voting tasks as candidate Check Tags or candidate regular descriptors extracted from the articles retrieved by the indexing engine. For each candidate, positive votes come from similar articles annotated with it and negative votes come from articles not including it.
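One possible reading of this voting scheme is sketched below in Python. The net score of a candidate is taken to be the total weight of the similar articles annotated with it minus the total weight of those that are not; this combination rule, the function name and the `eps` guard are assumptions made for the example, since the text does not give the exact formula.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def rank_candidates(neighbors: List[Tuple[float, Set[str]]],
                    score_max: float,
                    eps: float = 1e-9) -> List[Tuple[str, float]]:
    """Rank candidate descriptors by distance weighted vote.

    neighbors: (engine score, set of descriptors) for each similar article.
    Returns (descriptor, net vote) pairs, best first.
    """
    weights = []
    for score, _ in neighbors:
        distance = max(1.0 - score / score_max, eps)
        weights.append(1.0 / distance ** 2)
    total = sum(weights)

    positive: Dict[str, float] = defaultdict(float)
    for weight, (_, labels) in zip(weights, neighbors):
        for label in labels:
            positive[label] += weight

    # negative votes = total - positive, so net = positive - negative
    net = {label: 2.0 * pos - total for label, pos in positive.items()}
    return sorted(net.items(), key=lambda item: (-item[1], item[0]))


similar = [(5.0, {"A", "B"}), (2.5, {"A"}), (2.5, {"C"})]
for label, vote in rank_candidates(similar, score_max=10.0):
    print(label, round(vote, 3))
```

The same routine would be run twice per article, once over Check Tags and once over regular descriptors, and each ranked list would then be cut at the count predicted by the weighted average.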

2.1 Evaluation of article representations

In our preliminary experiments we have tested several approaches to extract the set of index terms to represent MEDLINE articles in the indexing process. We have also evaluated the effects on annotation performance of the different weighting schemes available in the Apache Lucene indexing engine.

Regarding article representation, we have employed three index term extraction approaches. In this experiment and also in the official BioASQ runs we have worked only with MEDLINE articles from year 2000 onwards, indexing a total of 6,697,747 articles. Index terms which occurred in 5 or fewer articles were discarded, and terms which were present in more than 50 % of training documents were also removed.
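The document frequency filter just described can be sketched as follows. This is a plain Python illustration that does not use Lucene; the function name and parameters are hypothetical, with `min_df=6` standing for "more than 5 articles" and `max_df_ratio=0.5` for the 50 % ceiling.

```python
from collections import Counter
from typing import List


def filter_index_terms(docs: List[List[str]],
                       min_df: int = 6,
                       max_df_ratio: float = 0.5) -> List[List[str]]:
    """Drop terms occurring in fewer than min_df documents or in more than
    max_df_ratio of all documents."""
    document_frequency = Counter()
    for terms in docs:
        document_frequency.update(set(terms))  # count each term once per doc

    upper_limit = max_df_ratio * len(docs)
    kept = {term for term, df in document_frequency.items()
            if min_df <= df <= upper_limit}
    return [[term for term in terms if term in kept] for terms in docs]


# Tiny example with a lowered threshold so the effect is visible.
toy_docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]
print(filter_index_terms(toy_docs, min_df=2))
```

In the toy call, `a` is removed for appearing in every document and `c` and `d` for appearing only once, so only `b` survives.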

We have delegated the linguistic processing tasks to the tools provided by the ClearNLP project 8. The ClearNLP project offers a set of state-of-the-art components written in the Java programming language, together with a collection of pre-trained models, ready to be used in typical natural language processing tasks, like dependency parsing, semantic role labeling, PoS tagging and morphological analysis.

In our case we have employed the PoS tagger [ 4 ] from the ClearNLP project to tokenize and assign PoS tags to the contents of MEDLINE articles. We employed the biomedical tagging models available in the ClearNLP repository to feed this PoS tagger, since those pre-trained resources offered fairly good results with no need of additional training.

In order to filter the content words from the processed MEDLINE abstracts, we have applied a simple selection criterion based on the PoS tags that are considered to carry the sentence meaning. Only tokens tagged as nouns, verbs, adjectives or unknown words are taken into account to constitute the final article representation. In case of ambiguous PoS tag assignment, if the second most probable PoS tag is included in the list of acceptable tags, that token is also taken into account.

After PoS filtering, the ClearNLP lemmatizer is applied to the surviving tokens in order to extract the canonical form of those words. This way we have a method to normalize the considered word forms that is slightly more consistent than simple stemming. As in the previous case, we have customized the lemmatization process using the biomedical dictionary model available at the ClearNLP project repositories.

Noun phrases based representation. In order to evaluate the contribution of more powerful Natural Language Processing tools, we have employed a surface parsing approach to identify syntactically motivated noun phrases from which meaningful multi-word index terms could be extracted.

We have employed a chunker from the Genia Tagger project 9 to process MEDLINE abstracts and to identify chunks of words tagged as noun phrases. Genia Tagger employs a maximum entropy cyclic dependency network [ 6 ] to model the PoS tagging process and its PoS tagger is specifically trained and tuned for biomedical text such as MEDLINE abstracts. Once the input text has been tokenized and PoS tagged by Genia Tagger, a simple surface parser searches for specific PoS patterns in order to detect the boundaries of the different chunks which can constitute a syntactical unit of interest (nominal phrases, prepositional phrases, verbal phrases and others).

In our processing of MEDLINE articles, from each noun phrase chunk identified in the Genia Tagger output we extract the set of word unigrams (lemmas) and all possible overlapping word bigrams and word trigrams, which will constitute the final list of index terms that will represent the given MEDLINE article in the generated Lucene index.
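As a small illustration of this extraction step, the following Python sketch generates the unigrams and the overlapping bigrams and trigrams of one noun phrase chunk. It assumes the chunk is already available as a list of lemmas; the function name and the choice of joining words with a single space are hypothetical details of the example.

```python
from typing import List


def chunk_index_terms(lemmas: List[str], max_n: int = 3) -> List[str]:
    """Return all word n-grams (n = 1..max_n) of a noun phrase chunk,
    including overlapping ones, as space-joined index terms."""
    terms = []
    for n in range(1, max_n + 1):
        for start in range(len(lemmas) - n + 1):
            terms.append(" ".join(lemmas[start:start + n]))
    return terms


chunk = ["acute", "myeloid", "leukemia", "cell"]
for term in chunk_index_terms(chunk):
    print(term)
```

A four-word chunk therefore yields four unigrams, three bigrams and two trigrams, nine index terms in total.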

The reason to limit this multi-word index term extraction process to only word bigrams and trigrams was to try to strike a balance between representation power and flexibility and generalization capabilities. The chunks identified by Genia Tagger are usually fairly correct and consistent, even when detecting large noun phrases, but employing the chunker output as index terms without some kind of generalization could lead to poor results during the search phase of the k-NN based annotation. With no generalization this approach could degenerate into finding similar articles only when an exact match occurs in large multi-word terms.

8 Available at http://www.clearnlp.com/
9 Available at http://www.nactem.ac.uk/tsujii/GENIA/tagger/

All these representation methods shared a common preprocessing phase, where local abbreviations and acronyms were identified and expanded employing a slightly adapted version of the local abbreviation identification method described in [ 3 ]. This method 10 scans the input texts searching for <short-form, long-form> pair candidates, using several heuristics to identify the correct long forms in the ambiguous cases.

Table 1 summarizes the results obtained in our preliminary tests. To get the performance measures of the different configurations we have employed the BioASQ Project Oracle, and as evaluation data we used the MEDLINE articles included in test set number 2 of the second batch of the 2014 edition of the BioASQ challenge, which were removed from the training collection from which the three Lucene indexes were built.

We have evaluated the three index term generation methods using different values for k, the number of similar articles to be used (1) in the estimation of the number of Check Tags and regular descriptors to be assigned and (2) in the set of voting procedures that will construct the final list of Check Tags and descriptors to attach to a given article. We have also evaluated the effect of two index term weighting methods available in version 4.10 of Apache Lucene: a classical tf-idf weighting scheme [ 9 ] and a more complex one inspired by the Okapi BM25 family of scoring formulae [ 8 ]. These weighting schemes are employed by the Lucene engine to compute the similarity scores used to create the ranking of documents relevant to a given query. In our case, the query terms are all of the index terms extracted from the article to be annotated using one of the methods described before.

As can be seen in table 1 and also in the results of our official BioASQ runs, the best results are obtained with stemming and lemmatization, with very similar performance values in both cases. There was a marginal gain in flat measures in favor of the stemming based representation and in the hierarchical measures in the case of lemmatization. The representation using multi-word terms extracted from noun phrase chunks had poor performance, probably because of the use of overlapping word trigrams. capabilities of our k-NN method and also in the scoring functions of Lucene engine. Very infrequent index terms can have the undesired effect of boosting internal scores in schemes where inverse document frequencies are taken into account.

10 Source code provided by the original authors is available at http://biotext.berkeley.edu/software.html

Finally, regarding the effect of taking into account different numbers of nearest neighbors, the best results are obtained when using values of k around 20, which was the default value in our official runs in the BioASQ challenge.

3 Candidate descriptors post-processing

In order to improve the results obtained by the Lucene based k-NN approach described in previous sections, we have evaluated several alternatives to try to get better annotation performance. We have followed two different lines of work to improve the prediction accuracy of our k-NN based system.

The first weak point in the proposed k-NN based method is related to the fairly simple local decisions performed by our k-NN annotator, given that the performed generalization is just a weighted average and an inverse distance weighted vote. We have tested a couple of approaches employing more sophisticated decision making. In both cases a two-step procedure is applied.

In a first step an expanded list with a larger number of candidate Check Tags and candidate regular descriptors is created. Those expanded sets of descriptors will be filtered and refined during the second step. In order to add diversity to these expanded candidate lists, the size of both lists (expanded candidate Check Tags and expanded candidate regular descriptors) is twice the size previously predicted by the weighted average procedure described in section 2. Two methods were tested to perform the filtering step:

Training a per-article multilabel classifier. In this approach, after creating the expanded list of candidate Check Tags and the expanded list of regular descriptors for the MEDLINE article being annotated, two multilabel classifiers, one per expanded list, are trained. The label sets for these classifiers are the two lists of expanded candidates, and the training instances comprise up to the 1000 most similar articles extracted by the indexing engine. Once the training of both classifiers is completed, the contents of the article being annotated are used as input to those models in order to extract the final ranked list of Check Tags and the final list of regular descriptors, using the cut-off limits identified by the weighted average estimator.

In our preliminary evaluation we have employed as multilabel categorization strategy a custom implementation of Classifier Chains [ 11 ], using as base classifiers instances of Support Vector Machines trained using the LibSVM project [ 2 ] tools. This evaluation was done with a reduced test set and the obtained results were slightly better than the basic k-NN, but still far from the most competitive teams in the BioASQ challenge.

Unfortunately, we were unable to use this method effectively in our official runs of the BioASQ challenge. Due to the time restrictions imposed in the challenge and the large training times required by this approach, we were unable to finish any submission on time.

Iterative k-NN vote. Instead of employing a multilabel classifier to support the second step we tested the use of another k-NN method backed by the same Lucene index to post-process the expanded lists of candidates.

For each candidate (either Check Tag or regular descriptor) in each expanded list a new query is sent to the index engine. Our index is queried using the representation of the article being annotated in order to get the list of similar articles which have, among their respective extended candidate lists, the candidate descriptor being evaluated at this moment.

This new list of similar articles, with their normalized distances, is employed in a second voting process. In this case, similar articles where the candidate descriptor was actually assigned as a relevant descriptor are considered as positive votes, whereas similar articles where the candidate descriptor would have been a wrong assignment are treated as negative votes.

What this second step does with the extended candidate lists can be seen as a sort of "learning to discard" procedure. We are evaluating the actual usage of every candidate descriptor in a similar document which also had it as one of its own extended candidates. So, extended candidates that have not been considered as relevant descriptors in the weighted majority of similar documents retrieved during this second phase are discarded.

Although this approach makes very heavy use of the Lucene index and implies large disk reading loads, we were able to make it fast enough to fulfill the BioASQ challenge time restrictions.

Another weak point of our basic k-NN method when applied in the context of MeSH annotation is that it does not exploit the hierarchical information carried by the thesaurus structure, whose usage is explicitly described in the official MeSH annotation guidelines. To try to overcome this limitation we evaluated the use of semantic similarity measures among MeSH descriptors as a method to expand and rearrange the ranked list of regular descriptors assigned by the basic k-NN method described in previous sections.

Descriptor expansion with hierarchical similarity measures. We have employed D. Lin's semantic similarity measure [ 7 ], a well-known semantic measure suitable to capture and summarize in a number between 0 and 1 the proximity of two concepts belonging to a common concept taxonomy.

$$sim(s_i, s_j) = \frac{2 \cdot \log P(LCA(s_i, s_j))}{\log P(s_i) + \log P(s_j)} \qquad (1)$$

We have followed the original formula (1), where $s_i$ and $s_j$ are concepts in a taxonomy, $LCA(s_i, s_j)$ represents the lowest common ancestor of both concepts and $P(s_k)$ is an estimation of the probability assigned to concept $s_k$. In our case this probability is computed as the ratio between the number of MeSH descriptors belonging to the subtree rooted at descriptor $s_k$ and the total number of descriptors in the MeSH thesaurus.
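Formula (1) with the subtree based probability estimate can be illustrated on a toy taxonomy. The Python sketch below assumes a simple tree stored as a child-to-parent dictionary (every node is a key and the root maps to `None`), which is a simplification: MeSH descriptors may occupy several tree positions, and how that case was handled is not described here. All names in the example are hypothetical.

```python
import math
from typing import Dict, List, Optional


def ancestors(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Path from node up to the root, node itself included."""
    chain = [node]
    while parent.get(node) is not None:
        node = parent[node]
        chain.append(node)
    return chain


def subtree_sizes(parent: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Number of descriptors in the subtree rooted at each node."""
    sizes = {node: 0 for node in parent}
    for node in parent:
        for ancestor in ancestors(node, parent):
            sizes[ancestor] += 1
    return sizes


def lin_similarity(a: str, b: str,
                   parent: Dict[str, Optional[str]],
                   sizes: Dict[str, int]) -> float:
    """Lin's measure with P(s) = subtree size / total number of nodes."""
    total = len(parent)
    ancestors_of_a = set(ancestors(a, parent))
    lca = next(n for n in ancestors(b, parent) if n in ancestors_of_a)
    denominator = math.log(sizes[a] / total) + math.log(sizes[b] / total)
    if denominator == 0.0:  # both concepts are the root
        return 1.0
    return 2.0 * math.log(sizes[lca] / total) / denominator


toy_parent = {"root": None, "A": "root", "B": "root", "A1": "A", "A2": "A"}
toy_sizes = subtree_sizes(toy_parent)
print(round(lin_similarity("A1", "A2", toy_parent, toy_sizes), 4))
```

Siblings under a small subtree obtain an intermediate similarity, while descriptors whose only common ancestor is the root obtain 0 because $\log P(root) = 0$.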

In our preliminary tests we applied Lin's measure in a very simple fashion. The ranked list of candidate regular descriptors returned by the basic k-NN based method is expanded by adding all MeSH descriptors within a radius of 3 hops, according to the thesaurus hierarchical relationships. The score of each newly added descriptor is computed by multiplying the score of the original candidate descriptor by the value of Lin's similarity between it and the added descriptor. For a given descriptor (original or expanded), combined scores coming from the expansion process of different initial candidate descriptors are accumulated.
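A minimal Python sketch of this expansion step is given below. It assumes the hierarchy is available as an undirected adjacency dictionary and that the similarity measure is passed in as a function; it also assumes that each original candidate keeps its own score in addition to what it receives from other candidates, which is one reading of the accumulation rule. The names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List, Set


def within_hops(start: str, links: Dict[str, List[str]],
                max_hops: int = 3) -> Set[str]:
    """Descriptors reachable from start in at most max_hops hierarchy links."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue
        for neighbor in links.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth) - {start}


def expand_candidates(candidates: Dict[str, float],
                      links: Dict[str, List[str]],
                      similarity: Callable[[str, str], float],
                      max_hops: int = 3) -> Dict[str, float]:
    """Accumulate, for every descriptor, the scores propagated from each
    original candidate: score(candidate) * similarity(candidate, descriptor)."""
    scores: Dict[str, float] = defaultdict(float)
    for descriptor, score in candidates.items():
        scores[descriptor] += score  # assumption: originals keep their score
        for other in within_hops(descriptor, links, max_hops):
            scores[other] += score * similarity(descriptor, other)
    return dict(scores)


# Toy chain a - b - c - d - e with a constant similarity of 0.5.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(expand_candidates({"a": 1.0, "c": 2.0}, chain, lambda x, y: 0.5))
```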

Once the expanded list of descriptors is created and ranked according to the new scores, two simple heuristics derived from MeSH annotation guidelines [ 5 ] are employed to remove redundant annotations. These removal heuristics are applied iteratively and limited to a window of the top-most n + 3 descriptors, where n is the number of regular descriptors predicted by our k-NN based scheme.

- when three or more siblings appear in the descriptor window, all of them are replaced by their common parent
- more specific descriptors (descendants) are preferred over more general ones (ancestors) occurring inside the considered window, and replace them

The surviving descriptors are cut off at the number of descriptors predicted by the weighted average predictor, using the combined scores to rank the list.

A priori this approach seemed to be a promising and effective way to add hierarchical information from the MeSH thesaurus to the k-NN prediction. However, the results we obtained were very disappointing, even worse than the vanilla k-NN approach, and led us not to submit the results obtained with this method in our official runs.

4 Official BioASQ runs and discussion

Even though we have tested several alternatives to try to improve the results obtained by the basic Lucene based k-NN method, only the simplest ones have been submitted to the official batches of the BioASQ challenge. Our original objective was to try to approach the performance values obtained by the two NLM Medical Text Indexer (MTI) [ 10 ] baselines ("Default MTI" and "MTI First Line Indexer"), since this is the reference tool employed by MEDLINE indexers.

In table 2 the official performance measures obtained by our runs in the Test Batch number 3 are shown. The name of our runs ("iria") originally stood for Information Retrieval based Iterative Annotator, since the initial aim of this participation in the BioASQ challenge was to evaluate different approaches to improve the initial ranked list of candidate descriptors retrieved by the indexing engine. The official runs sent by our group during our participation in the Test Batch number 3 were created using the following configurations.

iria-1. Representation of MEDLINE articles using unigrams, bigrams and trigrams extracted from noun phrase chunks identified by means of Genia Tagger. As described at the end of section 2.1 only articles from year 2000 onwards were indexed, discarding terms appearing in 5 or fewer abstracts and terms used in more than 50% of total documents. The predicted number of Check Tags and regular descriptors to be returned is increased by 10% in order to ensure slightly better values in recall related measures.

iria-2. Representation of MEDLINE articles using terms extracted using standard English stop-words removal and stemming. All other parameters are identical to iria-1.

iria-3. Representation of MEDLINE articles using lemmas extracted with ClearNLP tools after PoS tag filtering. All other parameters are identical to iria-1.

iria-4. Using the Lucene index created for iria-2, this set of runs employs the Iterative k-NN vote approach described in section 3, using a two step k-NN method.

iria-mix. This was a "control" set of runs employed to measure how close our methods were to the MTI baselines.

In test sets 1, 2, 3 and 4 iria-mix was simply a weighted mix of our results in the iria-2 run with the MTI-DEF and MTI-FLI results distributed by the BioASQ organization each week. The weight assigned to each one of these three lists was the respective official MiF value obtained in the previous week. Every descriptor in iria-2, MTI-DEF and MTI-FLI accumulates the weight of the descriptor lists where it was included. The final list of descriptors is ranked according to these accumulated scores and the n top-most descriptors are returned as candidates, where n is the number of Check Tags and regular descriptors originally predicted by the iria-2 run.
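The mixing rule can be sketched in a few lines of Python. The descriptor identifiers and the weights in the example call are illustrative placeholders, not values taken from the official results, and the function name and alphabetical tie-breaking are assumptions of the sketch.

```python
from typing import Dict, List


def mix_runs(runs: Dict[str, List[str]],
             weights: Dict[str, float],
             n: int) -> List[str]:
    """Each descriptor accumulates the weight of every list that contains it;
    the n highest scoring descriptors are returned."""
    scores: Dict[str, float] = {}
    for run_name, descriptors in runs.items():
        for descriptor in set(descriptors):
            scores[descriptor] = scores.get(descriptor, 0.0) + weights[run_name]
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:n]


example_runs = {
    "iria-2": ["D1", "D2", "D3"],
    "MTI-DEF": ["D2", "D4"],
    "MTI-FLI": ["D2", "D3"],
}
# Hypothetical previous-week MiF values used as list weights.
example_weights = {"iria-2": 0.49, "MTI-DEF": 0.58, "MTI-FLI": 0.57}
print(mix_runs(example_runs, example_weights, n=3))
```

With these placeholder weights a descriptor proposed only by iria-2 is outranked by any descriptor proposed by an MTI baseline, which shows why this run behaves as a control rather than as an independent system.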

In test set 5, iria-mix used the Lucene index created for iria-2 to test a different k-NN search. In this case, a more complex type of query to find similar documents was evaluated. This query consisted of the index terms extracted from the abstract to be annotated, as in the iria-2 case, but it also included the descriptors assigned in the MTI-DEF results distributed by the BioASQ organization that week. That is, in this case the similarity query searches for articles sharing index terms with the abstract being annotated and also real MeSH descriptors with the MTI-DEF prediction.

The results of our participation in the third edition of the BioASQ biomedical semantic indexing challenge are far from the results of the most competitive teams, and our particular objective, to reach performance levels similar to the MTI baselines, was not achieved. As positive aspects of our participation, we have shown that k-NN methods backed by conventional textual indexers like Lucene are a viable alternative for this kind of large scale problem, with minimal computational requirements and acceptable results. We have also performed an exhaustive evaluation of the performance of several alternatives for index term extraction, ranging from simple ones, based on stemming rules, to more complex ones where natural language processing is required.

Our a priori main contribution, the proposed methods to improve initial k-NN predictions, has not yielded real performance improvements, except in the case of training a per-article multilabel classifier. More work needs to be done in this case and also in the use of taxonomy based similarity measures, like Lin's measure, since we still think that it is a promising alternative to include hierarchical information in flat categorization approaches.

Acknowledgements

Research reported in this paper has been partially funded by "Ministerio de Economía y Competitividad" and FEDER (under projects FFI2014-51978-C2-1 and TIN2013-42741-P) and by the Autonomous Government of Galicia (under projects R2014/029 and R2014/034).

Table 2. Official performance measures obtained by our runs in the Test Batch number 3.

week 1, labeled documents: 2530/3902

          flat                                                                                   hier.
system    rank   MiF     EBP     EBR     EBF     MaP     MaR     MaF     MiP     MiR     Acc.    rank   LCA-F   HiP     HiR     HiF     LCA-P   LCA-R
best      1/35   0.6320  0.6910  0.6041  0.6247  0.6430  0.5025  0.5000  0.6909  0.5824  0.4693  1/35   0.5181  0.8091  0.7081  0.7316  0.5773  0.4978
def. MTI  13/35  0.5805  0.6002  0.5836  0.5732  0.5536  0.5292  0.4962  0.5957  0.5661  0.4164  13/35  0.4916  0.7546  0.7107  0.7098  0.5265  0.4891
iria-2    19/35  0.4869  0.4275  0.5756  0.4780  0.3961  0.4346  0.3853  0.4311  0.5593  0.3260  19/35  0.4306  0.6033  0.7301  0.6430  0.4031  0.4896
iria-3    20/35  0.4868  0.4256  0.5770  0.4773  0.3926  0.4302  0.3796  0.4295  0.5618  0.3253  20/35  0.4297  0.6002  0.7343  0.6428  0.4007  0.4919
iria-1    21/35  0.4727  0.5024  0.4695  0.4673  0.4113  0.3096  0.3014  0.5024  0.4463  0.3184  21/35  0.4149  0.6814  0.6042  0.6150  0.4612  0.4045
iria-4    23/35  0.4164  0.3730  0.5038  0.4117  0.2738  0.4065  0.3435  0.3617  0.4905  0.2699  22/35  0.3887  0.5460  0.7075  0.5942  0.3574  0.4611
iria-mix  -      -       -       -       -       -       -       -       -       -       -       -      -       -       -       -       -       -

week 2, labeled documents: 2256/4027

          flat                                                                                   hier.
system    rank   MiF     EBP     EBR     EBF     MaP     MaR     MaF     MiP     MiR     Acc.    rank   LCA-F   HiP     HiR     HiF     LCA-P   LCA-R
best      1/39   0.6397  0.6847  0.6222  0.6331  0.6284  0.5144  0.5060  0.6820  0.6023  0.4783  1/29   0.5250  0.7960  0.7172  0.7318  0.5745  0.5127
def. MTI  18/39  0.5822  0.6056  0.5842  0.5743  0.5452  0.5128  0.4792  0.6002  0.5653  0.4184  18/39  0.4914  0.7464  0.7039  0.6997  0.5288  0.4895
iria-mix  20/39  0.5730  0.5527  0.6057  0.5636  0.5125  0.5315  0.4854  0.5617  0.5847  0.4061  19/39  0.4862  0.6968  0.7392  0.6977  0.4919  0.5076
iria-2    25/39  0.4922  0.4442  0.5636  0.4833  0.4056  0.4070  0.3693  0.4490  0.5446  0.3310  25/39  0.4330  0.6136  0.7100  0.6381  0.4145  0.4812
iria-3    26/39  0.4871  0.4256  0.5788  0.4776  0.3855  0.4199  0.3723  0.4301  0.5614  0.3257  26/39  0.4296  0.5948  0.7282  0.6353  0.4000  0.4923
iria-4    27/39  0.4700  0.5675  0.4235  0.4635  0.4271  0.3147  0.3089  0.5588  0.4056  0.3167  27/39  0.3988  0.7053  0.5484  0.5853  0.4814  0.3681
iria-1    -      -       -       -       -       -       -       -       -       -       -       -      -       -       -       -       -       -

week 3, labeled documents: 1519/3162

          flat                                                                                   hier.
system    rank   MiF     EBP     EBR     EBF     MaP     MaR     MaF     MiP     MiR     Acc.    rank   LCA-F   HiP     HiR     HiF     LCA-P   LCA-R
best      1/42   0.6496  0.6919  0.6313  0.6420  0.6429  0.5293  0.5228  0.6892  0.6144  0.4875  1/42   0.5363  0.8082  0.7266  0.7439  0.5850  0.5235
def. MTI  17/42  0.5970  0.6202  0.5994  0.5897  0.5644  0.5346  0.5049  0.6123  0.5824  0.4329  15/42  0.5039  0.7651  0.7249  0.7202  0.5407  0.5029
iria-mix  20/42  0.5826  0.5609  0.6151  0.5727  0.5264  0.5466  0.5049  0.5679  0.5981  0.4147  17/42  0.4966  0.7098  0.7529  0.7115  0.4995  0.5205
iria-2    24/42  0.5011  0.4524  0.5726  0.4927  0.4163  0.4122  0.3771  0.4557  0.5566  0.3394  24/42  0.4396  0.6229  0.7226  0.6501  0.4218  0.4861
iria-3    27/42  0.4894  0.4277  0.5814  0.4806  0.3965  0.4214  0.3779  0.4309  0.5662  0.3283  27/42  0.4331  0.5965  0.7355  0.6402  0.4029  0.4964
iria-4    28/42  0.4868  0.7394  0.3754  0.4771  0.6789  0.2560  0.2733  0.7408  0.3625  0.3285  30/42  0.3874  0.8561  0.4581  0.5674  0.5832  0.3095
iria-1    29/42  0.4811  0.4359  0.5455  0.4721  0.3978  0.3817  0.3515  0.4402  0.5304  0.3217  28/42  0.4242  0.6094  0.6978  0.6314  0.4095  0.4667

week 4, labeled documents: 1097/3621

          flat                                                                                   hier.
system    rank   MiF     EBP     EBR     EBF     MaP     MaR     MaF     MiP     MiR     Acc.    rank   LCA-F   HiP     HiR     HiF     LCA-P   LCA-R
best      1/40   0.6190  0.6758  0.5961  0.6139  0.6272  0.5108  0.5024  0.6716  0.5739  0.4577  1/40   0.5128  0.8045  0.6998  0.7259  0.5657  0.4963
def. MTI  17/40  0.5662  0.5959  0.5674  0.5612  0.5422  0.5129  0.4830  0.5875  0.5464  0.4049  16/40  0.4854  0.7586  0.6947  0.7024  0.5247  0.4807
iria-mix  19/40  0.5577  0.5487  0.5828  0.5509  0.5169  0.5190  0.4823  0.5543  0.5610  0.3940  18/40  0.4817  0.7149  0.7262  0.7019  0.4940  0.4956
iria-3    23/40  0.4837  0.4390  0.5468  0.4745  0.4065  0.4146  0.3772  0.4425  0.5334  0.3232  24/40  0.4304  0.6254  0.7044  0.6442  0.4154  0.4725
iria-2    24/40  0.4831  0.4397  0.5461  0.4746  0.4065  0.4122  0.3760  0.4433  0.5308  0.3232  23/40  0.4305  0.6303  0.7044  0.6472  0.4158  0.4715
iria-1    25/40  0.4647  0.4263  0.5201  0.4559  0.3942  0.3893  0.3582  0.4297  0.5059  0.3075  25/40  0.4170  0.6186  0.6797  0.6282  0.4073  0.4511
iria-4    26/40  0.4453  0.4757  0.4468  0.4401  0.3477  0.3476  0.3258  0.4625  0.4293  0.2952  26/40  0.3954  0.6440  0.6124  0.6006  0.4229  0.4022

References

1. D. Trieschnigg, P. Pezik, V. Lee, F. De Jong, W. Kraaij, D. Rebholz-Schuhmann. MeSH Up: effective MeSH text classification for improved document retrieval. Bioinformatics 25(11), 1412-1418, 2009.

2. C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.

3. A.S. Schwartz, M.A. Hearst. Algorithm for Identifying Abbreviation Definitions in Biomedical Text. Pacific Symposium on Biocomputing 8:451-462, 2003.

4. Jinho Choi, Martha Palmer. Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL'12), 363-367, Jeju, Korea, 2012.

5. U.S. National Library of Medicine. MEDLINE Indexing Online Training Course. http://www.nlm.nih.gov/bsd/indexing/training (online, 5th June, 2015)

6. Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii. Developing a Robust Part-of-Speech Tagger for Biomedical Text. Advances in Informatics, 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.

7. Dekang Lin. An Information-Theoretic Definition of Similarity. Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), Madison, Wisconsin, USA, July 24-27, 1998.

8. Stephen Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.

9. K. Sparck Jones. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation 28:11-21, 1972.

10. J.G. Mork, A. Jimeno Yepes, A.R. Aronson. The NLM Medical Text Indexer System for Indexing Biomedical Literature. 2013. http://ii.nlm.nih.gov/Publications/Papers/MTI_System_Description_Expanded_2013_Accessible.pdf (online, 5th June, 2015)

11. Jesse Read, Bernhard Pfahringer, Geoff Holmes and Eibe Frank. Classifier Chains for Multi-label Classification. Machine Learning Journal. Vol. 85(3), pp. 333-359, 2011.