=Paper= {{Paper |id=Vol-2079/paper1 |storemode=property |title=A Plan for Ancillary Copyright: Original Snippets |pdfUrl=https://ceur-ws.org/Vol-2079/paper1.pdf |volume=Vol-2079 |authors= Martin Potthast,Wei-Fan Chen,Matthias Hagen,Benno Stein |dblpUrl=https://dblp.org/rec/conf/ecir/PotthastCHS18 }} ==A Plan for Ancillary Copyright: Original Snippets== https://ceur-ws.org/Vol-2079/paper1.pdf
A Plan for Ancillary Copyright: Original Snippets

Martin Potthast¹, Wei-Fan Chen², Matthias Hagen³, Benno Stein²
¹ Leipzig University, martin.potthast@uni-leipzig.de
² Bauhaus-Universität Weimar, .@uni-weimar.de
³ Halle University, matthias.hagen@informatik.uni-halle.de

Abstract

The snippets that web search engines generate for their result presentation are extracted from the retrieved web pages, reusing pieces of text that match a user’s query. Copyright owners of the retrieved web pages are typically not asked for usage rights. This long-time practice now faces increasing backlash from news publishers, legal action, and even new legislation in Germany and Spain: the so-called ancillary copyright for news publishers. This copyright law restricts the fair use of intellectual property of news publishers, allowing them to raise claims for monetary compensation when their text is reused, even within snippets. If passed at the EU level, ancillary copyright could severely impact future information system development. This paper promotes a “technological remedy”, namely, to synthesize true original snippets without text reuse.

1    Introduction

An organic search result for a keyword query on a web search engine is typically displayed as title and URL along with a brief excerpt of the respective page, showing selected pieces of text that contain keywords from the query, the snippet. Snippets guide users in deciding which of the pages on a search results page to visit, if any. Since snippets are extracted from the found web pages, they form a kind of text reuse. Reusing a third party’s text is governed by copyright laws and typically requires written consent. The operators of web search engines have been exempt from this regulation under fair use laws. These exemptions are currently being reconsidered.

In recent years, news publishers have raised claims for compensation from search engine companies for snippets generated from their articles. Their argument is as follows: search engines and news aggregators earn money based on the publishers’ intellectual property, and, since snippets are informative, they may prevent users from visiting the related news article, depriving them of ad revenue. While no one forces the publishers to have their articles indexed, they also claim to be left with no alternative to the de facto monopolist on most search markets, Google. The fact that search engines nowadays aim at answering certain queries directly on search results pages, often based on content lifted from third party web pages, does not serve to deescalate the dispute: every query answered directly by a search engine takes away traffic from the web pages it indexes, undermining the ad revenue model which funded the creation of apparently useful pieces of information in the first place. Following this line of argumentation, publishers successfully lobbied for political support: the so-called ancillary copyright for news publishers has been passed into law in Germany and Spain. Despite the German version still exempting individual words or “smallest text snippets,”¹ Google instantly demanded free-of-charge usage rights from all major German publishers, delisting those who did not agree, whereas the Spanish law² caused the shutdown of Google News in Spain.³ While the European Union, amidst a fierce public debate among stakeholders both in favor as well as opposed, deliberates an ancillary copyright for all of its members and all kinds of information systems (not only search engines), Google News has recently been redesigned worldwide: the new version does not show snippets anymore.⁴ Figures 1 and 2 contrast the new with the old layout.

¹ https://www.gesetze-im-internet.de/urhg/__87f.html (German)
² https://www.boe.es/boe/dias/2014/11/05/pdfs/BOE-A-2014-11404.pdf (Spanish)
³ https://europe.googleblog.com/2014/12/an-update-on-google-news-in-spain.html
⁴ https://www.blog.google/topics/journalism-news/redesigning-google-news-everyone

Copyright © 2018 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR’18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org
Figure 1: New Google News layout without snippets.

Figure 2: Google News as it used to be, obtained via the “News” facet of the main search engine.

Based on our comprehensive literature survey (Section 2), we are unaware of any evidence that the usability of a search engine is improved by dropping snippets. However, despite recent experiments showing that users may prefer longer snippets over shorter ones [MAM17], not a single experiment has quantified the impact of dropping snippets. Therefore, Google must be given the benefit of the doubt, since extensive A/B tests may have revealed that snippets are unimportant for Google News. Meanwhile, Google recently “reintroduced” featured snippets to the main search engine, where the search result that best answers a question query is highlighted by showing it in a box and above the blue link and the green URL instead of below. Google claims that despite “concerns that they might cause publishers to lose traffic”, “it quickly became clear that featured snippets do indeed drive traffic.”⁵

Similarly, we are also unaware of any evidence that snippets are useful only if they reuse text from the web page described. This thought gave us a subversive idea: What if a snippet was an original explanation of how a web page relates to a query? This would resolve the quandary to some extent, since search engines would no longer need to rely on the intellectual property of others to present their search results, but could resort to technology for snippet synthesis instead. With deep learning-based text generation on the rise, this no longer appears impossible, albeit very difficult.

⁵ https://www.blog.google/products/search/reintroduction-googles-featured-snippets

2    Background and Related Work

Snippet generation is a variant of extractive summarization, where the summaries are biased toward the queries. Extractive summarization and information retrieval have common ancestry, with Luhn, the inventor of term frequency weighting, being one of the earliest contributors [Bax58, Luh58]. Current research on snippet generation for search engines focuses on extractive summarization: Tombros and Sanderson [TS98] ascertained the importance that snippets relate to a user’s query, while Brin and Page [BP98] implemented query-biased snippets for the first version of Google. White et al. [WRJ02a, WRJ02b] found that snippets should be re-generated based on implicit relevance feedback, selecting different sentences when a user returns to a search results page. To speed up snippet generation, Turpin et al. [TTHW07] evaluate software architectures based on compressed data structures and RAM caching. Bando et al. [BST10] ask humans to manually create reuse snippets, comparing the results to machine-generated reuse snippets. They observe that humans select the same pieces of text as machines in around 73% of cases. Savenkov et al. [SBL11] survey approaches regarding the evaluation of snippet generation, suggesting automated evaluation approaches and A/B testing, both of which can only be trained (or used) if a search engine with a reasonably large user base is available. Thomaidou et al. [TLKV13] consider the special case of snippets generated for ads shown on search results pages to allow users to understand how the ads relate to their queries. Further research has been invested into studying how the length of snippets affects perceived search result quality on desktops [MAM17, KHL08] and mobile devices, where screen space is limited [KTS+17]. Eye-tracking studies have been conducted to determine to what parts of a results page users pay most attention [GJG04, CG07]; unsurprisingly, snippets play a major role. Finally, reuse snippets are also generated in XML retrieval [HLC08] and semantic web search [PWTY08].

The companion task to extractive summarization is abstractive summarization, where summaries are synthesized without text reuse. Generating abstractive summaries has been a long-standing task in the natural language generation community [GG17], yet it has not been applied to snippet generation. In their user study, Bando et al. [BST10] come close, using manually written, original snippets as a gold standard to evaluate snippets that were generated automatically and manually by extracting text from a web page. It was shown that humans pay attention to the same
parts of a document when composing an original snippet compared to when selecting sentences for a snippet. Machines sometimes select different sentences to generate reuse snippets, leaving room for improvement. Recently, neural network models have made great progress toward the task of generating abstractive summaries [CAR16, NZN+16, RCW15, SLM17], which renders snippet synthesis feasible if the lack of large-scale training data can be overcome.

3    Discussion and Future Work

All things considered, the proponents of ancillary copyright have a point: an information economy whose information sources are funded by displaying ads to information consumers cannot withstand information intermediaries that take the information from the sources and share it directly with the consumers for their own benefit. If the “plight” of news publishers does not convince, perhaps that of Wikipedia does: its ongoing decline in editors since 2007 [SCCP09] has been attributed, among other things, to Google’s oneboxes [MJH17], which were introduced around that time. But the opposition has a point, too: information intermediaries offer high-quality services to both sources and consumers of information free of charge; their share of ad revenue is well-deserved. Moreover, major publishers are misusing the intermediaries’ platforms to spread significant amounts of clickbait [PKSH16]. Publishers might not mind laws that restrict information systems to only referring users instead of informing them. This, however, would not be in the best interest of the information society, which desperately needs strong(er) retrieval technology.

Given the recent significant advances in text generation, we believe that future information systems will no longer present information as provided by its sources, but tailor it to a user’s information need. Regulating verbatim reuse is hence short-sighted: the true societal challenge ahead is the question of whether automatically generated paraphrases are copyright protected, especially when the training data used does not include the to-be-paraphrased subject. We are currently taking the first steps towards a proof-of-concept for non-reuse snippet generation technology to demonstrate its viability. Key to our approach is the crowdsourcing of large-scale training data composed of topics, search results, and original snippets. Out of curiosity, we ask our workers about their snippet reading habits, with (un)surprising results; see Table 1.

Table 1: Survey: How often do you read snippets?

Always    Often    Sometimes    Seldom    Never       Σ
  1782     2652         1470        87        9    6000
 29.7%    44.2%        24.5%      1.4%     0.2%    100%

References

[Bax58]   P. B. Baxendale. Machine-Made Index for Technical Literature - An Experiment. IBM Journal of Research and Development, 2(4):354–361, 1958.
[BP98]    S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks, 30(1-7):107–117, 1998.
[BST10]   L. L. Bando, F. Scholer, and A. Turpin. Constructing Query-biased Summaries: A Comparison of Human and System Generated Snippets. In Proc. of IICS, p. 195–204, 2010.
[CAR16]   S. Chopra, M. Auli, and A. M. Rush. Abstractive Sentence Summarization with Attentive Recurrent Neural Networks. In Proc. of NAACL/HLT, 2016.
[CG07]    E. Cutrell and Z. Guan. What are you Looking for?: An Eye-tracking Study of Information Usage in Web Search. In Proc. of CHI, p. 407–416, 2007.
[GG17]    M. Gambhir and V. Gupta. Recent Automatic Text Summarization Techniques: A Survey. Artificial Intelligence Review, 47(1):1–66, 2017.
[GJG04]   L. A. Granka, T. Joachims, and G. Gay. Eye-tracking Analysis of User Behavior in WWW Search. In Proc. of SIGIR, p. 478–479, 2004.
[HLC08]   Y. Huang, Z. Liu, and Y. Chen. Query biased Snippet Generation in XML Search. In Proc. of SIGMOD, p. 315–326, 2008.
[KHL08]   M. Kaisser, M. A. Hearst, and J. B. Lowe. Improving Search Results Quality by Customizing Summary Lengths. In Proc. of ACL, p. 701–709, 2008.
[KTS+17]  J. Kim, P. Thomas, R. Sankaranarayana, T. Gedeon, and H.-J. Yoon. What Snippet Size is Needed in Mobile Web Search? In Proc. of CHIIR 2017, 2017.
[Luh58]   H. P. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159–165, 1958.
[MAM17]   D. Maxwell, L. Azzopardi, and Y. Moshfeghi. A Study of Snippet Length and Informativeness: Behaviour, Performance and User Experience. In Proc. of SIGIR, p. 135–144, 2017.
[MJH17]   C. McMahon, I. Johnson, and B. Hecht. The Substantial Interdependence of Wikipedia and Google: A Case Study on the Relationship Between Peer Production Communities and Information Technologies. In Proc. of ICWSM, 2017.
[NZN+16]  R. Nallapati, B. Zhou, C. Nogueira dos Santos, Ç. Gülçehre, and B. Xiang. Abstractive Text Summarization using Sequence-to-Sequence RNNs and Beyond. In Proc. of CoNLL, 2016.
[PKSH16]  M. Potthast, S. Köpsel, B. Stein, and M. Hagen. Clickbait Detection. In Proc. of ECIR, 2016.
[PWTY08]  T. Penin, H. Wang, T. Tran, and Y. Yu. Snippet Generation for Semantic Web Search Engines. In Proc. of ASWC, p. 493–507, 2008.
[RCW15]   A. M. Rush, S. Chopra, and J. Weston. A Neural Attention Model for Abstractive Sentence Summarization. In Proc. of EMNLP, 2015.
[SBL11]   D. Savenkov, P. Braslavski, and M. Lebedev. Search Snippet Evaluation at Yandex: Lessons Learned and Future Directions. In Proc. of CLEF, 2011.
[SCCP09]  B. Suh, G. Convertino, E. H. Chi, and P. Pirolli. The singularity is not near: slowing growth of Wikipedia. In Proc. of WikiSym, 2009.
[SLM17]   A. See, P. J. Liu, and C. D. Manning. Get To The Point: Summarization with Pointer-Generator Networks. In Proc. of ACL, 2017.
[TLKV13]  S. Thomaidou, I. Lourentzou, P. Katsivelis-Perakis, and M. Vazirgiannis. Automated Snippet Generation for Online Advertising. In Proc. of CIKM, 2013.
[TS98]    A. Tombros and M. Sanderson. Advantages of Query Biased Summaries in Information Retrieval. In Proc. of SIGIR, p. 2–10, 1998.
[TTHW07]  A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams. Fast Generation of Result Snippets in Web Search. In Proc. of SIGIR, p. 127–134, 2007.
[WRJ02a]  R. White, I. Ruthven, and J. M. Jose. Finding Relevant Documents Using Top Ranking Sentences: An Evaluation of Two Alternative Schemes. In Proc. of SIGIR, p. 57–64, 2002.
[WRJ02b]  R. White, I. Ruthven, and J. M. Jose. The Use of Implicit Evidence for Relevance Feedback in Web Retrieval. In Proc. of ECIR, p. 93–109, 2002.
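As a rough illustration of the query-biased extractive ("reuse") snippets surveyed in Section 2, the following Python sketch selects the sentences of a page that share the most distinct terms with the query. It is a minimal sketch under our own assumptions, not the method of any cited system: the function names `tokenize`, `split_sentences`, and `reuse_snippet`, the term-overlap scoring, the naive sentence splitter, and the `max_sentences` parameter are all hypothetical choices made for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and return its alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(document):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


def reuse_snippet(document, query, max_sentences=2):
    """Build a query-biased extractive snippet.

    Each sentence is scored by the number of distinct query terms it
    contains (a hypothetical scoring choice). The best-scoring sentences
    are copied verbatim and joined in their original document order.
    """
    query_terms = set(tokenize(query))
    sentences = split_sentences(document)

    scored = []
    for position, sentence in enumerate(sentences):
        overlap = len(query_terms & set(tokenize(sentence)))
        if overlap > 0:
            scored.append((overlap, position))

    # Highest overlap first; ties are broken in favor of earlier sentences.
    scored.sort(key=lambda item: (-item[0], item[1]))
    chosen = sorted(position for _, position in scored[:max_sentences])
    return " ... ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    page = (
        "Ancillary copyright was passed in Germany. "
        "The weather was nice. "
        "Publishers demand compensation for snippets. "
        "Google News closed in Spain."
    )
    print(reuse_snippet(page, "ancillary copyright snippets"))
```

Because every sentence in the output is copied verbatim from the page, such a snippet is text reuse by construction; an original snippet in the sense proposed here would instead have to be synthesized, for example by an abstractive summarization model.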