=Paper=
{{Paper
|id=Vol-2079/paper1
|storemode=property
|title=A Plan for Ancillary Copyright: Original Snippets
|pdfUrl=https://ceur-ws.org/Vol-2079/paper1.pdf
|volume=Vol-2079
|authors= Martin Potthast,Wei-Fan Chen,Matthias Hagen,Benno Stein
|dblpUrl=https://dblp.org/rec/conf/ecir/PotthastCHS18
}}
==A Plan for Ancillary Copyright: Original Snippets==
Martin Potthast<sup>1</sup>, Wei-Fan Chen<sup>2</sup>, Matthias Hagen<sup>3</sup>, Benno Stein<sup>2</sup>

<sup>1</sup> Leipzig University, <sup>2</sup> Bauhaus-Universität Weimar, <sup>3</sup> Halle University

martin.potthast@uni-leipzig.de, . @uni-weimar.de, matthias.hagen@informatik.uni-halle.de

Copyright © 2018 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR’18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org

===Abstract===
The snippets that web search engines generate for their result presentation are extracted from the retrieved web pages, reusing pieces of text that match a user’s query. Copyright owners of the retrieved web pages are typically not asked for usage rights. This long-time practice now faces increasing backlash from news publishers, legal action, and even new legislation in Germany and Spain: the so-called ancillary copyright for news publishers. This copyright law restricts the fair use of intellectual property of news publishers, allowing them to raise claims for monetary compensation when their text is reused, even within snippets. If passed at the EU level, ancillary copyright could severely impact future information system development. This paper promotes a “technological remedy”, namely, to synthesize true original snippets without text reuse.

===1 Introduction===
An organic search result for a keyword query on a web search engine is typically displayed as title and URL along with a brief excerpt of the respective page, showing selected pieces of text that contain keywords from the query: the snippet. Snippets guide users in deciding which of the pages on a search results page to visit, if any. Since snippets are extracted from the found web pages, they form a kind of text reuse. Reusing a third party’s text is governed by copyright laws and typically requires written consent. The operators of web search engines have been exempt from this regulation under fair use laws. These exemptions are currently being reconsidered.

In recent years, news publishers have raised claims for compensation from search engine companies for snippets generated from their articles. Their argument is as follows: search engines and news aggregators earn money based on the publishers’ intellectual property, and, since snippets are informative, they may prevent users from visiting the related news article, depriving the publishers of ad revenue. While no one forces the publishers to have their articles indexed, they also claim to be left with no alternative to the de facto monopolist on most search markets, Google. The fact that search engines nowadays aim at answering certain queries directly on search results pages, often based on content lifted from third-party web pages, does not serve to de-escalate the dispute: every query answered directly by a search engine takes away traffic from the web pages it indexes, undermining the ad revenue model which funded the creation of apparently useful pieces of information in the first place. Following this line of argumentation, publishers successfully lobbied for political support: the so-called ancillary copyright for news publishers has been passed into law in Germany and Spain. Despite the German version still exempting individual words or “smallest text snippets,”<sup>1</sup> Google instantly demanded free-of-charge usage rights from all major German publishers, delisting those who did not agree, whereas the Spanish law<sup>2</sup> caused the shutdown of Google News in Spain.<sup>3</sup> While the European Union, amidst a fierce public debate among stakeholders both in favor and opposed, deliberates an ancillary copyright for all of its members and all kinds of information systems (not only search engines), Google News has recently been redesigned worldwide: the new version does not show snippets anymore.<sup>4</sup> Figures 1 and 2 contrast the new with the old layout.

Figure 1: New Google News layout without snippets.

Figure 2: Google News as it used to be, obtained via the “News” facet of the main search engine.

Based on our comprehensive literature survey (Section 2), we are unaware of any evidence that the usability of a search engine is improved by dropping snippets. However, despite recent experiments showing that users may prefer longer snippets over shorter ones [MAM17], not a single experiment has quantified the impact of dropping snippets. Therefore, Google must be given the benefit of the doubt, since extensive A/B tests may have revealed that snippets are unimportant for Google News. Meanwhile, Google recently “reintroduced” featured snippets to the main search engine, where the search result that best answers a question query is highlighted by showing it in a box and above the blue link and the green URL instead of below. Google claims that despite “concerns that they might cause publishers to lose traffic”, “it quickly became clear that featured snippets do indeed drive traffic.”<sup>5</sup>

Similarly, we are also unaware of any evidence that snippets are useful only if they reuse text from the web page described. This thought gave us a subversive idea: What if a snippet were an original explanation of how a web page relates to a query? This would resolve the quandary to some extent, since search engines would no longer need to rely on the intellectual property of others to present their search results, but could resort to technology for snippet synthesis instead. With deep learning-based text generation on the rise, this no longer appears impossible, albeit very difficult.

<sup>1</sup> https://www.gesetze-im-internet.de/urhg/__87f.html (German)

<sup>2</sup> https://www.boe.es/boe/dias/2014/11/05/pdfs/BOE-A-2014-11404.pdf (Spanish)

<sup>3</sup> https://europe.googleblog.com/2014/12/an-update-on-google-news-in-spain.html

<sup>4</sup> https://www.blog.google/topics/journalism-news/redesigning-google-news-everyone

<sup>5</sup> https://www.blog.google/products/search/reintroduction-googles-featured-snippets

===2 Background and Related Work===
Snippet generation is a variant of extractive summarization, where the summaries are biased toward the queries. Extractive summarization and information retrieval have common ancestry, with Luhn, the inventor of term frequency weighting, being one of the earliest contributors [Bax58, Luh58]. Current research on snippet generation for search engines focuses on extractive summarization: Tombros and Sanderson [TS98] established the importance of snippets relating to a user’s query, while Brin and Page [BP98] implemented query-biased snippets for the first version of Google. White et al. [WRJ02a, WRJ02b] found that snippets should be re-generated based on implicit relevance feedback, selecting different sentences when a user returns to a search results page. To speed up snippet generation, Turpin et al. [TTHW07] evaluate software architectures based on compressed data structures and RAM caching. Bando et al. [BST10] ask humans to manually create reuse snippets, comparing the results to machine-generated reuse snippets. They observe that humans select the same pieces of text as machines in around 73% of cases. Savenkov et al. [SBL11] survey approaches regarding the evaluation of snippet generation, suggesting automated evaluation approaches and A/B testing, both of which can only be trained (used) if a search engine with a reasonably large user base is available. Thomaidou et al. [TLKV13] consider the special case of snippets generated for ads shown on search results pages to allow users to understand how the ads relate to their queries. Further research has studied how the length of snippets affects perceived search result quality on desktops [MAM17, KHL08] and mobile devices, where screen space is limited [KTS+17]. Eye-tracking studies have been conducted to determine to what parts of a results page users pay most attention [GJG04, CG07]; unsurprisingly, snippets play a major role. Finally, reuse snippets are also generated in XML retrieval [HLC08] and semantic web search [PWTY08].

The companion task to extractive summarization is abstractive summarization, where summaries are synthesized without text reuse. Generating abstractive summaries has been a long-standing task in the natural language generation community [GG17], yet it has not been applied to snippet generation. In their user study, Bando et al. [BST10] come close, using manually written, original snippets as a gold standard to evaluate snippets that were generated automatically and manually by extracting text from a web page. It was shown that humans pay attention to the same parts of a document when composing an original snippet as when selecting sentences for a snippet. Machines sometimes select different sentences to generate reuse snippets, leaving room for improvement. Recently, neural network models have made great progress toward the task of generating abstractive summaries [CAR16, NZN+16, RCW15, SLM17], which renders snippet synthesis feasible if the lack of large-scale training data can be overcome.

===3 Discussion and Future Work===
All things considered, the proponents of ancillary copyright have a point: an information economy whose information sources are funded by displaying ads to information consumers cannot withstand information intermediaries that take the information from the sources and share it directly with the consumers for their own benefit. If the “plight” of news publishers does not convince, perhaps that of Wikipedia does: the ongoing decline of its editors since 2007 [SCCP09] has been attributed, among other things, to Google’s oneboxes [MJH17], which were introduced around that time. But the opposition has a point, too: information intermediaries offer high-quality services to both sources and consumers of information free of charge; their share of ad revenue is well-deserved. Moreover, major publishers are misusing the intermediaries’ platforms to spread significant amounts of clickbait [PKSH16]. Publishers might not mind laws that restrict information systems to only referring users instead of informing them. This, however, would not be in the best interest of the information society, which desperately needs strong(er) retrieval technology.

Given the recent significant advances in text generation, we believe that future information systems will no longer present information as provided by its sources, but tailor it to a user’s information need. Regulating verbatim reuse is hence short-sighted: the true societal challenge ahead is the question whether automatically generated paraphrases are copyright protected, especially when the training data used does not include the to-be-paraphrased subject. We are currently taking the first steps towards a proof-of-concept for non-reuse snippet generation technology to demonstrate its viability. Key to our approach is the crowdsourcing of large-scale training data composed of topics, search results, and original snippets. Out of curiosity, we ask our workers about their snippet reading habits, with (un)surprising results; see Table 1.

{| class="wikitable"
|+ Table 1: Survey: How often do you read snippets?
! Always !! Often !! Sometimes !! Seldom !! Never !! Σ
|-
| 1782 || 2652 || 1470 || 87 || 9 || 6000
|-
| 29.7% || 44.2% || 24.5% || 1.4% || 0.2% || 100%
|}

===References===
* [Bax58] P. B. Baxendale. Machine-Made Index for Technical Literature - An Experiment. IBM Journal of Research and Development, 2(4):354–361, 1958.
* [BP98] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks, 30(1-7):107–117, 1998.
* [BST10] L. L. Bando, F. Scholer, and A. Turpin. Constructing Query-biased Summaries: A Comparison of Human and System Generated Snippets. In Proc. of IICS, p. 195–204, 2010.
* [CAR16] S. Chopra, M. Auli, and A. M. Rush. Abstractive Sentence Summarization with Attentive Recurrent Neural Networks. In Proc. of NAACL/HLT, 2016.
* [CG07] E. Cutrell and Z. Guan. What are you Looking for?: An Eye-tracking Study of Information Usage in Web Search. In Proc. of CHI, p. 407–416, 2007.
* [GG17] M. Gambhir and V. Gupta. Recent Automatic Text Summarization Techniques: A Survey. Artificial Intelligence Review, 47(1):1–66, 2017.
* [GJG04] L. A. Granka, T. Joachims, and G. Gay. Eye-tracking Analysis of User Behavior in WWW Search. In Proc. of SIGIR, p. 478–479, 2004.
* [HLC08] Y. Huang, Z. Liu, and Y. Chen. Query biased Snippet Generation in XML Search. In Proc. of SIGMOD, p. 315–326, 2008.
* [KHL08] M. Kaisser, M.A. Hearst, and J.B. Lowe. Improving Search Results Quality by Customizing Summary Lengths. In Proc. of ACL, p. 701–709, 2008.
* [KTS+17] J. Kim, P. Thomas, R. Sankaranarayana, T. Gedeon, and H.-J. Yoon. What Snippet Size is Needed in Mobile Web Search? In Proc. of CHIIR 2017, 2017.
* [Luh58] H. P. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.
* [MAM17] D. Maxwell, L. Azzopardi, and Y. Moshfeghi. A Study of Snippet Length and Informativeness: Behaviour, Performance and User Experience. In Proc. of SIGIR, p. 135–144, 2017.
* [MJH17] C. McMahon, I. Johnson, and B. Hecht. The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies. In Proc. of ICWSM, 2017.
* [NZN+16] R. Nallapati, B. Zhou, C. Nogueira dos Santos, Ç. Gülçehre, and B. Xiang. Abstractive Text Summarization using Sequence-to-Sequence RNNs and Beyond. In Proc. of CoNLL, 2016.
* [PKSH16] M. Potthast, S. Köpsel, B. Stein, and M. Hagen. Clickbait Detection. In Proc. of ECIR, 2016.
* [PWTY08] T. Penin, H. Wang, T. Tran, and Y. Yu. Snippet Generation for Semantic Web Search Engines. In Proc. of ASWC, p. 493–507, 2008.
* [RCW15] A.M. Rush, S. Chopra, and J. Weston. A Neural Attention Model for Abstractive Sentence Summarization. In Proc. of EMNLP, 2015.
* [SBL11] D. Savenkov, P. Braslavski, and M. Lebedev. Search Snippet Evaluation at Yandex: Lessons Learned and Future Directions. In Proc. of CLEF, 2011.
* [SCCP09] B. Suh, G. Convertino, E.H. Chi, and P. Pirolli. The singularity is not near: slowing growth of Wikipedia. In Proc. of WikiSym, 2009.
* [SLM17] A. See, P.J. Liu, and C.D. Manning. Get To The Point: Summarization with Pointer-Generator Networks. In Proc. of ACL, 2017.
* [TLKV13] S. Thomaidou, I. Lourentzou, P. Katsivelis-Perakis, and M. Vazirgiannis. Automated Snippet Generation for Online Advertising. In Proc. of CIKM, 2013.
* [TS98] A. Tombros and M. Sanderson. Advantages of Query Biased Summaries in Information Retrieval. In Proc. of SIGIR, p. 2–10, 1998.
* [TTHW07] A. Turpin, Y. Tsegay, D. Hawking, and H.E. Williams. Fast Generation of Result Snippets in Web Search. In Proc. of SIGIR, p. 127–134, 2007.
* [WRJ02a] R. White, I. Ruthven, and J.M. Jose. Finding Relevant Documents Using Top Ranking Sentences: An Evaluation of Two Alternative Schemes. In Proc. of SIGIR, p. 57–64, 2002.
* [WRJ02b] R. White, I. Ruthven, and J.M. Jose. The Use of Implicit Evidence for Relevance Feedback in Web Retrieval. In Proc. of ECIR, p. 93–109, 2002.
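
To illustrate what the paper means by a query-biased "reuse snippet" (Section 2), the following is a minimal sketch of extractive snippet selection. It is not the method of any cited work: the function names (<code>split_sentences</code>, <code>tokenize</code>, <code>reuse_snippet</code>), the naive sentence splitter, and the plain term-overlap score are all assumptions made for illustration. The point is only that every word of the output is copied verbatim from the page, which is exactly the kind of reuse that ancillary copyright targets.

```python
import re


def split_sentences(text):
    # Hypothetical helper: naive split after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Hypothetical helper: lowercase alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def reuse_snippet(page_text, query, max_sentences=2):
    """Pick the sentences sharing the most distinct terms with the query.

    Assumed scoring: number of distinct query terms in the sentence.
    Ties are broken in favor of sentences that appear earlier on the page.
    """
    query_terms = set(tokenize(query))
    scored = []
    for position, sentence in enumerate(split_sentences(page_text)):
        overlap = len(query_terms & set(tokenize(sentence)))
        if overlap > 0:
            scored.append((overlap, position, sentence))
    # Best-scoring sentences first, then restore page order for readability.
    scored.sort(key=lambda item: (-item[0], item[1]))
    chosen = sorted(scored[:max_sentences], key=lambda item: item[1])
    return " ... ".join(sentence for _, _, sentence in chosen)


if __name__ == "__main__":
    page = ("Google News was redesigned. The new version shows no snippets. "
            "Publishers in Spain objected. Weather was fine.")
    print(reuse_snippet(page, "google news snippets"))
```

An original snippet in the sense proposed here would instead have to be generated text that describes how the page relates to the query without copying its sentences; that step is not sketched, since the paper leaves the synthesis model open.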