Evaluation of Information Access Technologies at the NTCIR Workshop

Noriko Kando
National Institute of Informatics (NII), Tokyo
kando@nii.ac.jp

Abstract: This paper introduces the NTCIR Workshops, a series of evaluation workshops designed to enhance research in information access technologies, such as information retrieval, text summarization, question answering, and text mining, by providing an infrastructure for large-scale evaluation. A brief history, the test collections, and recent progress since the previous CLEF Workshop are described, highlighting the differences from CLEF. To conclude, some thoughts on future directions are suggested.

1 Introduction

The NTCIR Workshops [1] are a series of evaluation workshops designed to enhance research in information access (IA) technologies, including information retrieval (IR), cross-lingual information retrieval (CLIR), automatic text summarization, question answering, and text mining. (NTCIR-3 and -4 are sponsored by the National Institute of Informatics (NII) and a Japanese MEXT Grant-in-Aid for Scientific Research on Informatics (#13224087) in and after FY2001; the Patent task is organized in collaboration with the Japan Intellectual Property Right Association and NII, and the CLIR task in collaboration with National Taiwan University and the Korean Institute for Scientific and Technological Information (KISTI).) The aims of the NTCIR project are:
1. to encourage research in information access technologies by providing large-scale test collections that are reusable for experiments,
2. to provide a forum for research groups interested in cross-system comparison and in exchanging research ideas in an informal atmosphere, and
3. to investigate methodologies and metrics for the evaluation of information access technologies and methods for constructing large-scale reusable test collections.
That is to say, the main goal of the NTCIR project is to provide an infrastructure for large-scale evaluation. The importance of such an infrastructure in IA research has been widely recognized. Fundamental text-processing procedures for IA, such as stemming and indexing, are language dependent. In particular, processing texts written in Japanese or other East Asian languages such as Chinese is quite different from processing English, French, or other European languages, because there are no explicit boundaries (i.e., no spaces) between words in a sentence. The NTCIR project therefore started in late 1997 with an emphasis on, but not limited to, Japanese and other East Asian languages, and its series of workshops has attracted international participation.

1.1 Information Access

The term "information access" (IA) covers the whole process of making the information in documents usable for a user who has a problem or an information need. A traditional IR system returns a ranked list of retrieved documents that are likely to contain information relevant to the user's need. This is one of the most fundamental and core processes of IA, but it is not the end of the story for the user. After obtaining a ranked list of retrieved documents, the user skims the documents, performs relevance judgments, locates the relevant information, reads, analyses, compares the contents with other documents, integrates and summarizes them, and performs information-based work such as decision making, problem solving, or writing, based on the information obtained from the retrieved documents.
We have looked at IA technologies that help users utilize the information in large-scale document collections. IR, summarization, question answering, and so on form a "family" of technologies aimed at the same target, even though each has been investigated by a different community with little interaction between them. (In addition, how to define the user's question before retrieval is also within the scope of IA, although it has not been explicitly investigated in NTCIR.)

1.2 Focus of NTCIR

As shown in Figure 1, we have looked at both traditional laboratory-type IR system testing and the evaluation of more challenging technologies. For the laboratory-type testing, we placed emphasis on IR and CLIR with Japanese and other Asian languages and on testing over various document genres. For the challenging issues, the targets are the shift from document retrieval to technologies that utilize the "information" in documents, and the investigation of methodologies and metrics for more realistic and reliable evaluation. For the latter, we have paid attention to users' information-seeking tasks in the experiment design. These two directions have been supported by a forum of researchers and the discussion among them. From the beginning, CLIR has been one of the central interests of NTCIR, because CLIR between English and local languages is critical for international information transfer in Asian countries, and because it is challenging to perform CLIR between languages with completely different structures and origins, such as English and Chinese or English and Japanese.

Figure 1. Focus of NTCIR: laboratory-type IR testing (Asian-language and cross-language IR, a variety of genres, parallel/comparable corpora) and new challenges (the intersection of IR and NLP, realistic evaluation and user tasks), with the aim of making the information in documents more usable for users, supported by a forum of researchers for idea exchange and for discussion of evaluation methods and metrics.

In the following, the next section provides a brief history of NTCIR. Section 3 describes the NTCIR test collections, Section 4 reports recent progress since our reports at the previous CLEF workshops [2-4], and Section 5 outlines the features of the coming NTCIR Workshop, NTCIR-4. Section 6 is a summary.

2 NTCIR

2.1 History of NTCIR

An NTCIR workshop is held about once every one and a half years. Since we value the interaction between participants, we call the whole process from document release to the final meeting a "workshop". Each workshop selects several research areas, called "tasks", or "challenges" for the more challenging ones. Each task has been organized by researchers of the domain, and a task may consist of more than one subtask. Figure 2 shows the evolution of the tasks in the NTCIR Workshops, and Table 1 lists the subtasks and the test collections used in the tasks [5-7]. As shown in Table 1, the fourth NTCIR Workshop hosts five tasks and their subtasks: CLIR, the Patent Retrieval Task (PATENT), the Question Answering Challenge (QAC), the Text Summarization Challenge (TSC), and the Web Task (WEB).

Figure 2. Tasks (research areas) of the NTCIR Workshops: Japanese IR, cross-lingual IR, patent retrieval, web retrieval, term extraction/role analysis, question answering, and text summarization across the first to fourth workshops (Nov. 1998 to Apr. 2003), held about once every one and a half years.
Table 1. History of the NTCIR Workshops (tasks, subtasks, and test collections)

NTCIR-1 (Nov. 1998 - Sept. 1999)
- Ad hoc IR: J-JE (test collection: NTCIR-1)
- CLIR: J-E (NTCIR-1)
- Term extraction: term extraction / role analysis (NTCIR-1)

NTCIR-2 (June 2000 - March 2001)
- Chinese text retrieval: Chinese IR C-C; CLIR E-C (CIRB010)
- Japanese & English IR: monolingual IR J-J, E-E; CLIR J-E, E-J, J-JE, E-JE (NTCIR-1, -2)
- Text summarization: intrinsic (extraction / free generation); extrinsic (IR-task-based) (NTCIR-2 SUMM)

NTCIR-3 (Oct. 2001 - Oct. 2002)
- CLIR: single-language IR C-C, K-K, J-J; bilingual CLIR x-J, x-C, x-K; multilingual CLIR x-CJE (NTCIR-3 CLIR)
- Patent: cross-genre retrieval with or without CLIR, CCKE-J; [optional] alignment, RST analysis of claims (NTCIR-3 PATENT)
- Question answering: subtask 1, five possible answers; subtask 2, one set of all the answers; subtask 3, series of questions (NTCIR-3 QA)
- Text summarization: single-document and multi-document summarization (NTCIR-3 SUMM)
- Web retrieval: survey retrieval; target retrieval; [optional] speech-driven retrieval (NTCIR-3 WEB)

NTCIR-4 (Apr. 2003 - June 2004)
- CLIR: single-language IR C-C, K-K, J-J; bilingual CLIR x-J, x-C, x-K; pivoted bilingual CLIR; multilingual CLIR x-CKJE (NTCIR-4 CLIR)
- Patent: "invalidity search" (search patents by a patent); [feasibility] automatic patent-map creation (NTCIR-4 PATENT)
- Question answering: subtask 1, five possible answers; subtask 2, one set of all the answers; subtask 3, series of questions (NTCIR-4 QA)
- Text summarization: multi-document summarization (NTCIR-4 SUMM)
- Web retrieval: informational retrieval; navigational retrieval; [pilot] geographical information; [pilot] topical classification of search results (NTCIR-4 WEB)

Notation: n-m means topics in language n against documents in language(s) m; J = Japanese, E = English, C = Chinese, K = Korean, x = any of these.

2.2 Participants

As shown in Figures 3 and 4, the number of participants has been increasing gradually. Different tasks attract different research groups, although many groups overlap or have changed the tasks they participate in across workshops. Many international participants have enrolled in CLIR. The Patent Retrieval task attracted many participants from company research laboratories as well as "veteran" NTCIR participants. The WEB task has participants from various research communities, such as machine learning and DBMS. The number of teams collaborating across different organizations has been increasing in recent NTCIRs.

Figure 3. Number of participating groups and countries at each workshop (for the 4th workshop, the number of registrations). Figure 4. Participating groups per task at each workshop.

3 Test Collections

The test collections constructed for the NTCIR Workshops are listed in Table 2. In the NTCIR project the term "test collection" is used for any kind of data set usable for system testing and experiments, although it often means an IR test collection used in search experiments. One of our interests is to prepare a realistic evaluation infrastructure, and these efforts include scaling up the document collections, document genres, languages, topic structure, and relevance judgments.

3.1 Documents

Documents were collected from various domains and genres.
The format of the documents is basically the same as in TREC or CLEF: plain text with SGML-like tags. Each of the specialized document-genre collections contains fields characteristic of the genre: the Web collection contains HTML tags, hyperlinks, the URL of each document, and so on; the patent collection has tags indicating the document structure of a patent; and both the patent and the scientific-document collections include parallel corpora of English and Japanese abstracts. The task (experiment) design and the relevance judgment criteria were set according to the nature of each document collection and the user community that uses that type of document in their everyday tasks.

3.2 Topics

A sample topic record is shown in Figure 5. Topics are defined as statements of "user requests" rather than "queries", i.e., the strings actually submitted to the system, because we wish to allow both manual and automatic query construction from the topics. The emphasis has shifted towards a topic structure that enables more realistic experiments and makes it possible to see the effect of the background information of a topic. The characteristics are summarized as follows.

Topic structure: The topic structure has changed slightly at each NTCIR. A topic basically consists of a title <TITLE>, a description <DESC>, and a detailed narrative <NARR> of the search request, similar to those used in CLEF and TREC. It may contain additional fields, as shown in Table 3. Most NTCIR collections contain a list of concepts <CONC>, but these are not heavily used by participants.

Table 2. Test collections constructed through NTCIR (IR and QA tasks)

NTCIR-1 (IR; scientific abstracts, 1988-1997): ntc1-je (JE, 339,483 docs, 577MB), ntc1-j (J, 332,918 docs, 312MB), ntc1-e (E, 187,080 docs, 218MB); 83 topics (J), 60 (E); 3-grade relevance judgments.
NTCIR-1 (term extraction / role analysis; scientific abstracts): ntc1-tmrc (J, 2,000 docs).
CIRB010 (IR; news, 1998-1999): CIRB010 (Ct, 132,220 docs, 132MB); 50 topics (Ct, E); 4-grade relevance judgments.
NTCIR-2 (IR; scientific abstracts, 1986-1999**): ntc2-j (J, 400,248 docs, 600MB), ntc2-e (E, 134,978 docs, 200MB); 49 topics (J, E); 4-grade relevance judgments.
NTCIR-3 CLIR (IR; news): CIRB011 (Ct, 132,173 docs), CIRB020 (Ct, 249,508 docs), Mainichi 1998-1999 (J, 220,078 docs), EIRB010 (E, 10,204 docs), Mainichi Daily (E, 12,723 docs), about 870MB in total, with 50 topics (Ct, K, J, E); KEIB010 (K, 1994, 66,146 docs, 74MB) with 30 topics (Ct, K, J, E); 4-grade relevance judgments.
NTCIR-3 PATENT (IR; patents): kkh*3 (full text, J, 1998-1999, 697,262 docs, 18GB), jsh*3 (abstracts, J, 1995-1999, 1,706,154 docs, 1,883MB), paj*3 (abstracts, E, 1995-1999, 1,701,339 docs, 2,711MB); 31 topics (Ct, Cs, K, J, E); 3-grade relevance judgments.
NTCIR-3 QA (QA; news): Mainichi 1998-1999 (J, 220,078 docs, 282MB); 1,200 questions (J*); exact-answer judgments.
NTCIR-3 WEB (IR; multiple genres, html/text*4, crawled in 2001): NW100G-01 (11,038,720 docs, 100GB), NW10G-01 (10GB); 47 topics (J*); 4-grade plus relative relevance judgments.
NTCIR-4 CLIR (IR; news, 1998-1999, ca. 2.7GB in total): Chinese: CIRB011 (Ct, 132,173 docs), CIRB020 (Ct, 249,508 docs); Korean: Hankookilbo+, Chosenilbo+; Japanese: Mainichi (these three sub-collections contain 220,078, 149,498, and 105,517 docs); English: EIRB010 (10,204 docs), Mainichi Daily (12,723 docs), Korea Times+ (21,377 docs), Hong Kong Standard+ (96,856 docs), Xinhua (AQUAINT)+ (208,168 docs); 60 topics (Ct, K, J, E); 4-grade relevance judgments.
NTCIR-4 PATENT (IR; patents, 1993-2002): Publications of unexamined patent applications+ (full text, J, ca. 3,500,000 docs, ca. 45GB), Patent Abstracts of Japan (PAJ)+ (abstracts, E, ca. 3,500,000 docs, ca. 10GB); topics in Ct, Cs, K, J.
NTCIR-4 QA (QA; news, 1998-1999): Mainichi (220,078 docs) and Yomiuri+ (ca. 340,000 docs), J, ca. 776MB in total; questions in J*.
NTCIR-4 WEB (IR; multiple genres, html/text*4, crawled in 2001): NW100G-01 (11,038,720 docs, 100GB); topics in J*.

Notes: J = Japanese, E = English, C = Chinese (Ct = traditional Chinese, Cs = simplified Chinese), K = Korean. "+" indicates a document collection newly added for NTCIR-4. * An English translation is available. ** gakkai subfiles: 1997-1999; kaken subfiles: 1986-1997. *3 kkh: Publications of unexamined patent applications; jsh: Japanese abstracts; paj: English translations of jsh. *4 Mostly Japanese or English (some documents in other languages).

NTCIR text summarization collections:
NTCIR-2 SUMM (single-document; news, Mainichi 1994, 1995, 1998): 180 docs, 7 summary types, 3 analysts, 3,780 summaries in total.
NTCIR-2 TAO (single-document; news, Mainichi 1998): 1,000 docs, 2 summary types, 1 analyst, 2,000 summaries in total.
NTCIR-3 SUMM (news, Mainichi 1998-1999): single-document, 60 docs, 7 summary types, 3 analysts, 1,260 summaries; multi-document, 50 document sets, 2 summary types, 3 analysts, 300 summaries.
NTCIR-4 SUMM (multi-document; news, Mainichi and Yomiuri 1998-1999).

<TOPIC>
<NUM>0010</NUM>
<TITLE CASE="b">Aurora, conditions, observation</TITLE>
<DESC>I want to know the conditions that give rise to an aurora for observation purposes</DESC>
<NARR>
<BACK>I want to observe an aurora, so I want to know the conditions necessary for its occurrence and the mechanism behind it.</BACK>
<RELE>Aurora observation records, etc. list the place and time, so only documents that provide additional information such as the weather and temperature at the time of occurrence are relevant.</RELE>
</NARR>
<CONC>Aurora, occurrence, conditions, observation, mechanism</CONC>
<RDOC>NW003201843, NW001129327, NW002699585</RDOC>
<USER>1st-year Master's student, female, 2.5 years of search experience</USER>
</TOPIC>

Figure 5. Sample topic (NTCIR-3 WEB): <DESC> is a written statement of the user's needs, <BACK> gives the purpose/background, <RELE> gives the relevance judgment criteria, <RDOC> lists known relevant documents, and <USER> describes the user's attributes.

Table 3. Topic fields of the NTCIR IR test collections (NTCIR-1, NTCIR-2, CIRB010, NTCIR-3 CLIR, NTCIR-3 PATENT, NTCIR-3 WEB, NTCIR-4 CLIR, NTCIR-4 PATENT, NTCIR-4 WEB)

Task: ad hoc IR and/or CLIR for NTCIR-1, NTCIR-2, CIRB010, and the CLIR collections; cross-genre CLIR for NTCIR-3 PATENT; invalidity search for NTCIR-4 PATENT; ad hoc and other tasks for the WEB collections.
Mandatory runs*: NTCIR-1, D-only; NTCIR-2, D-only; CIRB010, N/A; NTCIR-3 CLIR, D-only; NTCIR-3 PATENT, A+S; NTCIR-3 WEB, T-only and D-only; NTCIR-4 CLIR, T-only and D-only; NTCIR-4 PATENT, CLAIM-only; NTCIR-4 WEB, T-only and D-only.
Topic fields:
- <TITLE>**: all collections; a very short description of the request except in NTCIR-3 WEB, NTCIR-4 CLIR, and NTCIR-4 WEB, where it is a query.
- <DESC>: all collections.
- <NARR> (unstructured): NTCIR-1, NTCIR-2, CIRB010, NTCIR-3 CLIR, NTCIR-3 PATENT, NTCIR-4 PATENT.
- <NARR> (structured, with <BACK>, <RELE>, and <TERM>*10): NTCIR-3 WEB, NTCIR-4 CLIR, NTCIR-4 WEB.
- <PURPOSE>*7: NTCIR-4 PATENT.
- <CONC>: most collections (seven of the nine).
- <FIELDS>: NTCIR-1, NTCIR-2.
- <TLANG>/<LANG> and <SLANG>*3: NTCIR-3 CLIR, NTCIR-4 CLIR.
- <RDOC>*4 and <USER>*5: NTCIR-3 WEB, NTCIR-4 WEB.
- <PI>*4, <ARTICLE>*6, <SUPPLEMENT>*6: NTCIR-3 PATENT.
- <DOC>*9, <CLAIM>*8, <COMP>*8, <COMP.CNUM>*8: NTCIR-4 PATENT.

*: D-only = <DESC> only; T-only = <TITLE> only; A+S = runs using <ARTICLE> and <SUPPLEMENT> only.
**: "Very short" = a very short description of the search request; "query" = a comma-separated term list.
*3: TLANG/LANG = target language, i.e., the language of the topic; SLANG = source language, i.e., the language in which the topic was originally constructed.
*4: RDOC = known relevant documents; PI = the patent for the invention mentioned in the news article.
*5: USER = the user's attributes.
*6: ARTICLE = a news article reporting an invention; SUPPLEMENT = a memorandum focusing on the issues in the article that are relevant to the user's needs; a knowledgeable human searcher who reads the ARTICLE and SUPPLEMENT can understand the user's search request.
*7: PURPOSE = the purpose of the search (only "invalidity search" for NTCIR-4 PATENT).
*8: CLAIM = the target claim in the query patent; it is used as the query for the search and may consist of multiple components. COMP = a component of a claim; CNUM = the claim-component ID.
*9: DOC = the query patent full text (the full text of the patent used as the query for the search).
*10: BACK = background knowledge / purpose of the search; RELE = relevance judgment criteria; TERM = term definitions.
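Because the topic records are plain text with SGML-like tags, a few lines of code are enough to split them into fields for manual or automatic query construction. The following is a minimal sketch in Python; it is not part of the NTCIR distribution, and the function name and the regular-expression approach are illustrative assumptions, so a real parser would need to handle the exact tag inventory of each collection.

import re
from typing import Dict

def parse_topic(record: str) -> Dict[str, str]:
    """Split an SGML-like NTCIR-style topic record into a field -> text mapping.

    Illustrative sketch only: it assumes each field is written as
    <TAG ...>text followed by the next (opening or closing) tag, as in the
    sample topic of Figure 5.
    """
    fields = {}
    # Capture "<TAG attrs>content"; the content runs until the next tag.
    for tag, _attrs, text in re.findall(
            r"<([A-Z.]+)([^>]*)>(.*?)(?=</?[A-Z.]+[^>]*>|\Z)", record, re.S):
        text = text.strip()
        if text:                      # skip wrapper tags such as <TOPIC>
            fields[tag] = text
    return fields

if __name__ == "__main__":
    sample = """<TOPIC><NUM>0010</NUM>
    <TITLE CASE="b">Aurora, conditions, observation</TITLE>
    <DESC>I want to know the conditions that give rise to an aurora</DESC>
    <CONC>Aurora, occurrence, conditions, observation, mechanism</CONC></TOPIC>"""
    topic = parse_topic(sample)
    # A "T-only" query could then be built from the comma-separated TITLE terms.
    title_terms = [t.strip() for t in topic["TITLE"].split(",")]
    print(topic["NUM"], title_terms, topic["DESC"])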
Query: A title was originally defined as a very short description, or "nickname", of the topic. Since NTCIR-3 WEB it has been redefined as a "query", i.e., a string that a user would put into a search engine, given as a comma-separated list of up to three terms. (Topic authors are instructed to sort the terms in <TITLE> in descending order of their importance for expressing the search request, resembling the way in which users input terms as queries; the relation between the terms is specified as an attribute of <TITLE> in the WEB task.)

Structured <NARR>: Originally the narrative <NARR> was unstructured, and topic authors were instructed that it may contain background knowledge, the purpose of the search, a detailed explanation of the topic, criteria for relevance judgment, term definitions, and so on. Since NTCIR-3 WEB, such information categories in <NARR> have been explicitly marked by tags such as <BACK> and <RELE>, as in Figure 5. The purpose of this change is to examine explicitly the effect of this additional information on search effectiveness.

Mandatory runs: Any combination of topic fields may be used in experiments for research purposes. In the workshop, however, mandatory runs are defined for each task, and every participant must submit at least one mandatory run using only the specified topic field. The purpose is to enhance cross-system comparison under a common condition and to show the effect of additional information over that baseline. Mandatory runs were originally "<DESC> only", and have gradually shifted to "<TITLE> only as well as <DESC> only".

3.3 Relevance Judgments

Relevance judgments are done by pooling, and the format and methods are basically the same as in other evaluation projects, including CLEF and TREC. The differences can be summarized as follows:
1. The pooling strategies differ slightly from task to task (a minimal pooling sketch is given below):
・ additional interactive recall-oriented searches to improve exhaustiveness (NTCIR-1, -2) [8];
・ additional interactive recall-oriented searches by professional patent intermediaries (PATENT) [9];
・ a "one-click distance model", in which assessors are allowed to see hyperlinked documents (WEB) [10];
・ cross-lingual pooling for parallel or quasi-parallel documents (NTCIR-1, -2) [8];
・ graded-depth pooling: pools are created from ranks 1-10, 11-20, 21-30, 31-40, and so on (PATENT) [9].
2. Multi-grade and relative relevance judgments are used: highly relevant, relevant, partially relevant, irrelevant [5-7]; and best relevant, second best, third best, and so on [10].
3. The judgments include extracted passages showing why the assessor judged a document "relevant".
4. The pooled document lists to be judged are sorted in descending order of the likelihood of relevance (not in document-ID order).
5. Relevance judgment files may be prepared separately for each target-language document sub-collection in CLIR.

Regarding point 4, this helps assessors judge consistently over a long list of pooled documents (typically 2,000 to 3,000 documents). Relevance judgments may change across assessors and over time. If relevant documents appear intensively in the first part of the list, it is easier for non-professional assessors to set and confirm their criteria for relevance judgment, and they can then refer back to those documents to re-confirm their own criteria as they move down to the lower-ranked documents. We understand that the judgments may suffer from an "order effect" caused by the ranking of the pooled documents, but we have intentionally adopted this strategy as the most practical and effective one in our environment, based on comparative tests and interviews with assessors.
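To make points 1 and 4 concrete, the sketch below shows one way of building a pool from several submitted runs and ordering it by an estimated likelihood of relevance before handing it to an assessor. This is a minimal illustration, not the actual NTCIR pooling software; the pool depth, the run format, and the simple vote-and-rank scoring are assumptions made for the example.

from collections import defaultdict
from typing import Dict, List

def build_pool(runs: List[Dict[str, List[str]]], depth: int = 100) -> Dict[str, List[str]]:
    """Merge the top `depth` documents of each run into one pool per topic,
    then sort the pooled documents by a crude likelihood-of-relevance score:
    how many runs retrieved the document, breaking ties by its best rank."""
    pools: Dict[str, List[str]] = {}
    topics = {topic for run in runs for topic in run}
    for topic in topics:
        votes = defaultdict(int)                    # how many runs returned the document
        best_rank = defaultdict(lambda: depth + 1)  # best (smallest) rank seen so far
        for run in runs:
            for rank, doc in enumerate(run.get(topic, [])[:depth], start=1):
                votes[doc] += 1
                best_rank[doc] = min(best_rank[doc], rank)
        # Higher vote count first, then better (smaller) best rank.
        pools[topic] = sorted(votes, key=lambda d: (-votes[d], best_rank[d]))
    return pools

if __name__ == "__main__":
    run_a = {"0010": ["NW0032", "NW0011", "NW0026"]}
    run_b = {"0010": ["NW0011", "NW0099"]}
    print(build_pool([run_a, run_b], depth=100)["0010"])
    # -> ['NW0011', 'NW0032', 'NW0099', 'NW0026'] (NW0011 appears in both runs)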
Regarding point 5, in multilingual CLIR a topic cannot always obtain a sufficient number of relevant documents in every language sub-collection; this is the natural situation in multilingual CLIR. As a result, some topics cannot be used for experiments on documents in a specific language. We have not found a way to solve this issue; the only strategy we could take in NTCIR-4 CLIR was to increase the number of topics, so that a larger number of topics can be used in common across the document sub-collections and the stability of the evaluation is thereby improved.

Assessors are users of the document genre, and the judgments are done by the topic author, except for CLIR in NTCIR-3 and -4, where the topics are created in cooperation among multiple countries and then translated into each language and their usability tested on each language sub-collection. Judging other users' topics is sometimes hard for assessors and takes longer. The first two NTCIRs used two assessors per topic and tested inter-assessor consistency; we found that inconsistency among multiple assessors on a topic does not affect the stability of the evaluation when a sufficient number of topics is used. Based on this, a single assessor per topic has been used in and after NTCIR-3.

3.4 Evaluation

For IR experiments, the trec-eval program [11] is used with two thresholds on the relevance levels: "rigid relevance" counts documents judged "relevant" or higher, and "relaxed relevance" counts documents judged "partially relevant" or higher. As additional metrics for multi-grade relevance judgments, weighted mean average precision (wMAP) and weighted mean reciprocal rank (wMRR, for the WEB task) have been proposed, and discounted cumulated gain (DCG) has been used [12-13]. For question answering, MRR is used for subtask 1 (return five possible answers, with no penalty for wrong answers), and the F-measure is used for subtask 2 (return one set containing all the answers, with a penalty for wrong answers) and subtask 3 (a series of questions). For text summarization, content-based and readability-based intrinsic evaluation was carried out at NTCIR-3 for both single-document and multi-document summarization, and a new evaluation methodology was proposed based on the revisions (edit distance) made to system summaries by the professional analysts who created the model summaries.

4 Further Analysis of NTCIR-3

After our previous reports at CLEF [2-4] and the overview papers in the NTCIR-3 proceedings [7], several additional analyses were carried out on the NTCIR-3 results and collections. For the PATENT retrieval task, although a new strategy for cross-genre retrieval called "term distillation" was proposed by the Ricoh group and worked well on the collection, many research questions regarding patent retrieval remained unsolved at NTCIR-3, for example:
1. Is there any particular IR model (or weighting scheme) that is specifically effective on patents?
2. What is the influence of the wide variation in document length (from 100 to 30,000 word tokens per document)?
3. Indexing: character bi-grams vs. word-based indexing (a minimal sketch of the two strategies follows this list).
4. Target document collections: full text vs. abstracts (many commercial systems use abstracts only).
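As background for question 3, the sketch below contrasts the two indexing units for Japanese text. It is an illustration only: the word-based variant assumes some external segmenter (here a stand-in function), since Japanese has no spaces between words, whereas the character bi-gram variant needs no such resource.

from typing import List

def char_bigrams(text: str) -> List[str]:
    """Character bi-gram index terms: overlapping pairs of adjacent characters.
    No dictionary or segmenter is needed, which is why bi-grams are a common
    baseline for indexing Japanese and Chinese text."""
    text = text.replace(" ", "")
    return [text[i:i + 2] for i in range(len(text) - 1)]

def word_terms(text: str, segment) -> List[str]:
    """Word-based index terms, relying on an external morphological analyzer.
    `segment` is a stand-in for a real segmenter and must return a list of words."""
    return segment(text)

if __name__ == "__main__":
    # A toy "segmenter" for the example only: splits on a pre-inserted marker.
    toy_segment = lambda t: t.split("|")
    doc = "情報検索|システム"        # "information retrieval | system"
    print(char_bigrams(doc.replace("|", "")))   # ['情報', '報検', '検索', ...]
    print(word_terms(doc, toy_segment))         # ['情報検索', 'システム']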
Regarding question 1, it was reported at the SIGIR 2000 Workshop on Patent Retrieval that tf is not effective on patents, but we could not find a concrete answer to the question through NTCIR-3. To answer these questions, the NTCIR-3 Patent task organizers conducted additional experiments on the patent collection and a newspaper collection: on a single system, they tested eight different weighting schemes, covering both vector-space and probabilistic models, on six different document collections, using four different indexing strategies (character bi-grams, words, compound terms, and a hybrid of character bi-grams and words) and three different topic lengths. The results will be reported in [14].

For WEB, one participating group, formed as a collaboration between research groups with strong backgrounds in content-based text retrieval and in web-link analysis, performed well at NTCIR-3 WEB. Further analysis of the effect of links on the WEB collection showed that link-based approaches generally work well, especially on short queries such as those using the TITLE field only, or more specifically the first term of the TITLE, i.e., the term most important to the users (the topic authors) [15].

5 Challenges at NTCIR-4

As shown in Table 1, the fourth NTCIR Workshop hosts five tasks, CLIR, PATENT, QAC, TSC, and WEB, together with their subtasks. The evaluation schedule varies from task to task:
April 2003: document release
June - September 2003: dry run
October - December 2003: formal run
20 February 2004: release of evaluation results
2-5 June 2004: workshop meeting at NII, Tokyo, Japan
For further information, including late registration for task participation, please consult the NTCIR web sites at http://research.nii.ac.jp/ntcir and http://research.nii.ac.jp/ntcir/ntc-ws4, or contact the author.

5.1 NTCIR-4 CLIR

Since this is the second multilingual CLIR at NTCIR, the task design of the previous round is continued. Minor revisions were made only to address the major problems raised in the assessment of NTCIR-3:
・ the English and Korean document collections were enlarged to be comparable to the Chinese and Japanese ones (2.7GB in total);
・ a new subtask of pivot-language bilingual CLIR was added;
・ the pairs of topic and document languages were restricted, so that comparisons can be made more fruitfully;
・ a T-only run was made mandatory in addition to the D-only run;
・ question types: topics were categorized according to the nature and types of their answers, in order to balance the topic set.
The new subtask, pivot-language CLIR, uses English as the pivot language and tests the effectiveness of transitive CLIR. This is one of the practical approaches to multilingual CLIR in an environment where direct translation resources between the languages are scarce but resources between each of the languages and English are rich. A minimal sketch of such transitive query translation follows.
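This is a toy illustration under stated assumptions rather than the method of any NTCIR-4 participant: each query term is translated into English and then from English into the target language through two small bilingual dictionaries (both invented here), and the unweighted union of candidate translations is kept. Real systems typically add disambiguation or probabilistic weighting of the translation candidates at each step.

from typing import Dict, List, Set

def pivot_translate(terms: List[str],
                    src_to_en: Dict[str, List[str]],
                    en_to_tgt: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Transitive (pivot) translation: source term -> English -> target language.
    Every target-language candidate reachable through any English pivot term is
    kept, so ambiguity grows at each step; this is the main weakness of the
    approach compared with direct bilingual translation."""
    result: Dict[str, Set[str]] = {}
    for term in terms:
        candidates: Set[str] = set()
        for en in src_to_en.get(term, []):
            candidates.update(en_to_tgt.get(en, []))
        result[term] = candidates
    return result

if __name__ == "__main__":
    # Toy Japanese->English and English->Korean dictionaries for the example.
    ja_en = {"情報": ["information"], "検索": ["retrieval", "search"]}
    en_ko = {"information": ["정보"], "retrieval": ["검색"], "search": ["검색", "탐색"]}
    print(pivot_translate(["情報", "検索"], ja_en, en_ko))
    # -> {'情報': {'정보'}, '検索': {'검색', '탐색'}} (set order may vary)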
5.2 NTCIR-4 Question Answering (QAC) and Text Summarization (TSC)

QAC plans three subtasks, as in the previous round at NTCIR-3. Subtasks 1 and 2 will be run without major changes; the only exceptions are that different question sets will be used for subtasks 1 and 2, and that the number of questions with multiple answers will be increased. The latter was decided in order to avoid overestimating groups that ignore the possibility of multiple answers and return only their first-priority answer to every question in subtask 2. QAC subtask 3, answering a series of questions, is one of the major foci of NTCIR-4 QAC. We plan to increase the number of question sequences and to design the task to tackle problems resembling a real-world "report writing" task based on a set of relevant documents. The task design is also related to TSC, in which the content-based evaluation of multi-document summarization will be done using sets of questions. More fundamentally, the issue is what aspects of an event or topic users want to know about: some of the questions may be more appropriate for current factoid-oriented QA, while others may be covered by summarization, and IR covers both. These foci of QAC and TSC also intersect with the interest of CLIR in categorizing question types.

5.3 Specialized Genre-Related Tasks at NTCIR-4: PATENT and WEB

Both PATENT and WEB plan (1) main task(s) and (2) feasibility or pilot studies for more challenging tasks, as follows.
PATENT, main task ("invalidity search"): search for patents that invalidate the query patents. The claims of the query patents are used as queries; they are segmented into the components of the invention or the technologies constituting it, and related patents are then retrieved. A patent may be invalidated by one patent or by a combination of multiple patents. Systems return document IDs as well as relevant passages.
PATENT, feasibility study (a long-term research plan over NTCIR-4 and -5): automatic patent-map creation, a kind of text mining. The aim is to detect the sets of technologies used in a set of patents, extract them, and build a table showing the relationships between technologies and patents and the evolution or trends among them.
WEB, main tasks: informational search, and navigation-oriented search, in which the system finds the most informative and reliable page.
WEB, pilot studies: geographically oriented search and topical classification of search results.
For details, please visit the web site of each task, linked from the main NTCIR web site.

6 Summary

A brief history of NTCIR and the progress made since NTCIR-3 have been reported in this paper. One of the characteristic features of NTCIR is that it targets "information access" technologies: we are interested in the whole process by which users obtain and utilize the information in documents, we look at the intersections between all the related technologies, including IR, summarization, QA, and text mining, and we treat them as a "family". Another characteristic is the attention paid to the users' information task behind the laboratory-type testing. We are now in the fourth iteration of the series. Evaluation must change as technologies evolve and social needs change, and we have been, and still are, striving for this. Collaboration and any leads and advice are always more than welcome.

References

1. NTCIR Project: http://research.nii.ac.jp/ntcir/.
2. Kando, N. "NTCIR Workshop: Japanese- and Chinese-English cross-lingual information retrieval and multi-grade relevance judgments". In Proceedings of the First Cross-Language Evaluation Forum (CLEF 2000), Lisbon, Portugal, Sept. 17-22, 2000. Springer, 2001, pp. 24-33 (Lecture Notes in Computer Science; 2069).
3. Kando, N. "CLIR system evaluation at the second NTCIR Workshop". In Proceedings of the Second Cross-Language Evaluation Forum (CLEF 2001), Darmstadt, Germany, Sept. 3-4, 2001. Springer, 2002, pp. 371-388 (Lecture Notes in Computer Science; 2406).
4. Kando, N. "CLIR at NTCIR Workshop 3: Cross-language and cross-genre retrieval". In Proceedings of the Third Cross-Language Evaluation Forum (CLEF 2002), Rome, Italy, Sept. 19-20, 2002. Springer (Lecture Notes in Computer Science) (to appear).
5. NTCIR Workshop 1: Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition, Tokyo, Japan, 30 Aug.-1 Sept. 1999. ISBN 4-924600-77-6. (http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings/).
6. NTCIR Workshop 2: Proceedings of the Second NTCIR Workshop on Research in Chinese and Japanese Text Retrieval and Text Summarization, Tokyo, Japan, June 2000 - March 2001. ISBN 4-924600-96-2.
7. NTCIR Workshop 3: Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Question Answering and Summarization, Tokyo, Japan, Oct. 2001 - Oct. 2002. ISBN 4-86049-016-9.
8. Kuriyama, K., Yoshioka, M., Kando, N. "Construction of a large-scale test collection NTCIR-2: The effect of additional interactive search and cross-lingual pooling". IPSJ Transactions on Databases, Vol. 43, No. SIG2 (TOD13), pp. 48-59, March 2002 (in Japanese).
9. Iwayama, M., Fujii, A., Kando, N., Takano, A. "Overview of patent retrieval task at NTCIR-3". In Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Question Answering and Summarization, Tokyo, Japan, Oct. 2001 - Oct. 2002 (to appear).
10. Eguchi, K., Oyama, K., Ishida, E., Kando, N., Kuriyama, K. "Overview of web retrieval task at the third NTCIR Workshop". In Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Question Answering and Summarization, Tokyo, Japan, Oct. 2001 - Oct. 2002 (to appear).
11. Buckley, C. trec-eval IR evaluation package. Available from ftp://ftp.cs.cornell.edu/pub/smart.
12. Järvelin, K., Kekäläinen, J. "IR evaluation methods for retrieving highly relevant documents". In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, July 2000, pp. 41-48.
13. Voorhees, E.M. "Evaluation by highly relevant documents". In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), New Orleans, Sept. 2001, pp. 74-82.
14. Iwayama, M., Fujii, A., Kando, N., Marukawa, K. "An empirical study on retrieval models for different document genres: Patents and newspaper articles". In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2003), Toronto, Canada, July 2003 (to appear).
15. Eguchi, K., Oyama, K., Ishida, E., Kando, N., Kuriyama, K. "Evaluation methods for web retrieval tasks considering hyperlink structure". IEICE Transactions on Information and Systems (Sept. 2003, in Japanese, to appear).