<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>First International Workshop on Scholarly Information Access (SCOLIA), April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>of query systems for temporal n-gram corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabian Richter</string-name>
          <email>fabian.richter@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Schäfer</string-name>
          <email>benjamin.schaefer@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klemens Böhm</string-name>
          <email>klemens.boehm@kit.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Google Books Ngram Corpus, Query System, Query Algebra</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Karlsruhe Institute of Technology</institution>
          ,
          <addr-line>Kaiserstraße 12, 76131 Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Related work</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>10</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Natural languages evolve over time, and with increasing digitalization this evolution is studied quantitatively in the humanities and social sciences. One important observable is the frequency of individual words, as well as of word tuples (n-grams), over time. Different tools exist to analyze these changing frequencies in large text corpora, with different levels of complexity and efficiency. However, a systematic overview and evaluation of the expressiveness and practical usability of these tools is missing. In this article, we present a structured approach to such an evaluation by defining a query algebra and a set of information needs expressed therein, followed by a comparison of 12 different query systems. Overall, we identify several systems as similar to the Google Books Ngram Viewer (GBNV) or as specific to a subcorpus, and find that the theoretically most powerful and flexible systems lack a practical implementation, which points to further research needs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>https://fr2501.github.io/ (F. Richter); https://www.benjaminschaefer.org/ (B. Schäfer)</p>
      <p>ceur-ws.org, ISSN 1613-0073</p>
      <p>
and they focus more on the analysis of full texts. Metadata search engines and query systems like
SchenQL [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] and LexiDB [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ] are also not included in our review. Even though some of the
TNC query operators defined in the following rely on metadata, metadata is not the core concept of TNCs.
      </p>
      <p>
Paper outline. In Section 2, we define a formal model of TNCs and a query algebra for them. Section
3 presents the results of our query system review, with a focus on expressive power in Section 3.3 and
practical usability concerns in Section 3.4. Section 4 discusses limitations of our review and gives an
outlook on future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Fundamentals</title>
      <sec id="sec-2-1">
        <title>2.1. Data model</title>
        <p>In this section, we first describe a formal data model for TNCs. We then define a query algebra operating
on that data model, which will later be used to express common information needs across different
query systems.</p>
        <p>
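          In this section, we construct a formal model of temporal n-gram corpora, starting from diachronic
document collections and leading up to an extension of relational algebra [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. This model allows us to
precisely express abstract information needs and query types across different concrete systems.
        </p>
        <p>
Definition 1 (Temporal text corpus). A temporal text corpus <inline-formula><tex-math>C</tex-math></inline-formula> consists of documents <inline-formula><tex-math>d_i = (x_i, m_i, t_i)</tex-math></inline-formula>.
Document <inline-formula><tex-math>d_i</tex-math></inline-formula> comprises the text <inline-formula><tex-math>x_i</tex-math></inline-formula>, its metadata as a (partial) key-value mapping <inline-formula><tex-math>m_i</tex-math></inline-formula>, and a specific
metadata entry <inline-formula><tex-math>t_i</tex-math></inline-formula>, the timestamp. The set of all timestamps in <inline-formula><tex-math>C</tex-math></inline-formula> is called <inline-formula><tex-math>T</tex-math></inline-formula> and must admit a natural
ordering. Furthermore, we use <inline-formula><tex-math>C_t := \{d_i \in C \mid t_i = t\}</tex-math></inline-formula> to denote the set of all documents with the specific
timestamp <inline-formula><tex-math>t \in T</tex-math></inline-formula>.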
        </p>
        <p>Typical keys for the metadata map <inline-formula><tex-math>m_i</tex-math></inline-formula> are, e.g., author or title, and the timestamp usually represents
the publication date. The metadata of a document is modeled as a partial key-value mapping because
not every metadata field needs to be populated for every document. We do not specify a granularity
of the timestamps in <inline-formula><tex-math>T</tex-math></inline-formula>. For the GBNC, the timestamps correspond to years, while other corpora and
systems might use finer-grained timestamps or coarser aggregates.</p>
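        <p>Definition 2 (n-gram). Given a set <inline-formula><tex-math>\Sigma</tex-math></inline-formula>, <inline-formula><tex-math>g \in \Sigma^n</tex-math></inline-formula> is a tuple of length <inline-formula><tex-math>n</tex-math></inline-formula> over <inline-formula><tex-math>\Sigma</tex-math></inline-formula>, an n-gram. To denote the set
of all n-grams over <inline-formula><tex-math>\Sigma</tex-math></inline-formula>, without fixing <inline-formula><tex-math>n</tex-math></inline-formula>, we write <inline-formula><tex-math>\Sigma^*</tex-math></inline-formula>. Given some document <inline-formula><tex-math>d</tex-math></inline-formula>, we use <inline-formula><tex-math>f_d(\cdot) : \Sigma^* \to \mathbb{N}_0</tex-math></inline-formula>
to obtain the number of occurrences of an n-gram <inline-formula><tex-math>g</tex-math></inline-formula> in the document’s text.</p>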
        <p>
          In the context of natural language processing, the set Σ in Definition 2 is usually the vocabulary of a
natural language, possibly annotated with linguistic metadata, e.g., part-of-speech tags [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. An n-gram
is then an ordered sequence of <inline-formula><tex-math>n</tex-math></inline-formula> words from that vocabulary.
        </p>
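        <p>Definition 3 (Temporal n-gram corpus (TNC)). Consider a temporal text corpus <inline-formula><tex-math>C</tex-math></inline-formula> and the set <inline-formula><tex-math>T</tex-math></inline-formula> of its
timestamps. Let <inline-formula><tex-math>\Sigma</tex-math></inline-formula> be the set of all words that occur in the texts of <inline-formula><tex-math>C</tex-math></inline-formula>. We then construct
<disp-formula><tex-math>f_C : \Sigma^* \times T \to \mathbb{N}_0, \quad (g, t) \mapsto \sum_{d \in C_t} f_d(g)</tex-math></disp-formula>
to capture the number of occurrences of an n-gram <inline-formula><tex-math>g</tex-math></inline-formula> in all documents <inline-formula><tex-math>d</tex-math></inline-formula> with timestamp <inline-formula><tex-math>t</tex-math></inline-formula>. Fixing <inline-formula><tex-math>C</tex-math></inline-formula> and
an n-gram <inline-formula><tex-math>g</tex-math></inline-formula>, <inline-formula><tex-math>f_g := f_C(g, \cdot)</tex-math></inline-formula> becomes a function over <inline-formula><tex-math>T</tex-math></inline-formula>, a time series. We say that <inline-formula><tex-math>g \in C</tex-math></inline-formula> if and only if
there is some <inline-formula><tex-math>t \in T</tex-math></inline-formula> such that <inline-formula><tex-math>f_g(t) \neq 0</tex-math></inline-formula>. Letting <inline-formula><tex-math>G := \{g \in \Sigma^* \mid g \in C\}</tex-math></inline-formula> and <inline-formula><tex-math>F := \{f_g \mid g \in G\}</tex-math></inline-formula>, <inline-formula><tex-math>C' := (G, F, T)</tex-math></inline-formula>
is the temporal n-gram corpus over <inline-formula><tex-math>C</tex-math></inline-formula>.</p>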
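        <p>The construction in Definition 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokenizer, the lowercasing, the toy corpus and the names ngrams and build_tnc are hypothetical and not part of any reviewed system.</p>
        <preformat><![CDATA[
```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples of length n."""
    return list(zip(*(tokens[i:] for i in range(n))))


def build_tnc(corpus, max_n=2):
    """Map each n-gram to its time series {timestamp: count}.

    `corpus` is an iterable of (text, metadata, timestamp) triples.
    """
    series = defaultdict(Counter)
    for text, _metadata, timestamp in corpus:
        tokens = text.lower().split()  # assumption: naive whitespace tokenizer
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                series[gram][timestamp] += 1
    # Only n-grams with at least one nonzero count are ever created.
    return {gram: dict(counts) for gram, counts in series.items()}


# Toy corpus (invented for illustration).
corpus = [
    ("machine learning is fun", {"author": "A"}, 2000),
    ("machine learning and machine translation", {}, 2001),
]
tnc = build_tnc(corpus)
```
]]></preformat>
        <p>Here, tnc[("machine", "learning")] plays the role of one time series, and the original texts are no longer needed once the dictionary has been built.</p>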
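        <p>Constructing a TNC <inline-formula><tex-math>C'</tex-math></inline-formula> out of a temporal text corpus <inline-formula><tex-math>C</tex-math></inline-formula> may seem redundant, but it greatly simplifies
notation and analysis in the following sections. Furthermore, in practice, <inline-formula><tex-math>C</tex-math></inline-formula> may not always be available,
even if <inline-formula><tex-math>C'</tex-math></inline-formula> is. This is the case for the GBNC.</p>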
        <p>
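          Finally, we define a relation <inline-formula><tex-math>R_{C'}</tex-math></inline-formula> in the sense of [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and using notation from [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] that represents a
TNC <inline-formula><tex-math>C' = (G, F, T)</tex-math></inline-formula>, as follows: <inline-formula><tex-math>R_{C'}(g, f) = \{(g_i, f_{g_i})\} \subseteq G \times F</tex-math></inline-formula>. Neither of the two attributes of <inline-formula><tex-math>R_{C'}</tex-math></inline-formula> is
atomic: the first one represents an n-gram, i.e., a tuple of words, and the second one a frequency map
from <inline-formula><tex-math>T</tex-math></inline-formula> to <inline-formula><tex-math>\mathbb{N}_0</tex-math></inline-formula>. In both cases, we use square brackets to denote member access, either via the 1-based
tuple index or the map key, i.e., for some n-gram <inline-formula><tex-math>g</tex-math></inline-formula>, <inline-formula><tex-math>g[2]</tex-math></inline-formula> shall represent the second word in <inline-formula><tex-math>g</tex-math></inline-formula>, and
<inline-formula><tex-math>f_g[2025]</tex-math></inline-formula> its frequency at timestamp 2025.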
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Query types</title>
        <p>After defining our data model for TNCs, we now present different abstract query types. These are
derived from the capabilities of existing query system implementations. First, we focus on filtering
TNCs, using different types of constraints. Then, we focus on analytic queries. Finally, we describe
how to incorporate external knowledge into queries and how to construct TNCs for specific purposes.
Even though we introduce the different kinds of queries and operators independently, the underlying
algebra allows for flexible combinations.</p>
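        <p>In the following, a TNC <inline-formula><tex-math>C'</tex-math></inline-formula> over a text corpus <inline-formula><tex-math>C</tex-math></inline-formula> is always given through its corresponding relation
<inline-formula><tex-math>R_{C'}(g, f)</tex-math></inline-formula>. We use <inline-formula><tex-math>(g, f_g)</tex-math></inline-formula> to refer to specific elements of <inline-formula><tex-math>R_{C'}(g, f)</tex-math></inline-formula>.</p>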
        <sec id="sec-2-2-1">
          <title>2.2.1. Filtering</title>
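          <p>A user is not necessarily interested in the full TNC <inline-formula><tex-math>C'</tex-math></inline-formula>, but rather in a subset of it or even a single
time series. We express this subcorpus filtering as a selection <inline-formula><tex-math>\sigma_p(R_{C'})</tex-math></inline-formula>, where <inline-formula><tex-math>p</tex-math></inline-formula> is an arbitrary Boolean
predicate on n-grams and their frequencies. The general semantics of this operation is as follows:
<disp-formula><tex-math>\sigma_p(R_{C'}) = \{(g, f_g) \mid p(g, f_g) = \mathrm{true}\}.</tex-math></disp-formula></p>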
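          <p>The structure of <inline-formula><tex-math>p</tex-math></inline-formula> can be used to further categorize filtering operations. For example, some <inline-formula><tex-math>p</tex-math></inline-formula> may only
consider <inline-formula><tex-math>g</tex-math></inline-formula> and be independent of <inline-formula><tex-math>f_g</tex-math></inline-formula>, or vice versa. This is an important distinction, as many of the
existing query systems only allow for the first, but not the second kind of predicate. In our abstract
algebra, this strict distinction does not exist and predicates can be freely mixed.</p>
          <p>n-gram filtering. First, we focus on filter predicates <inline-formula><tex-math>p</tex-math></inline-formula> that are independent of <inline-formula><tex-math>f_g</tex-math></inline-formula> and only depend on
<inline-formula><tex-math>g</tex-math></inline-formula> itself. The simplest kind is strict equality with a given n-gram <inline-formula><tex-math>g_{\mathrm{ref}}</tex-math></inline-formula>:</p>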
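          <p><disp-formula><tex-math>\sigma_{g = g_{\mathrm{ref}}}(R_{C'}) = \{(g_{\mathrm{ref}}, f_{g_{\mathrm{ref}}})\},</tex-math></disp-formula>
which yields the singleton set of <inline-formula><tex-math>g_{\mathrm{ref}}</tex-math></inline-formula> and its frequency series. In a similar way, one can define <inline-formula><tex-math>p</tex-math></inline-formula> as
strict set membership for some set <inline-formula><tex-math>G_{\mathrm{ref}}</tex-math></inline-formula> to filter for several n-grams at once. An example of this is
shown in Figure 1, which plots the frequencies of the three 2-grams information retrieval, digital humanities,
and machine learning.</p>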
          <p>A more complex type of filtering employs external or internal wildcards. When employing the
external wildcard operator <inline-formula><tex-math>*</tex-math></inline-formula>, two n-grams do not need to be exactly equal, but only equal in all non-wildcard
positions. For an n-gram <inline-formula><tex-math>g_{\mathrm{ref}} = (w_1, \dots, *, \dots, w_n)</tex-math></inline-formula>, the semantics of equality up to wildcards (or matching)
is as follows:</p>
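          <p><disp-formula><tex-math>g \simeq g_{\mathrm{ref}} \equiv \forall\, 0 &lt; i \leq n : g_{\mathrm{ref}}[i] = * \lor g_{\mathrm{ref}}[i] = g[i],</tex-math></disp-formula>
with <inline-formula><tex-math>\sigma_{g \simeq g_{\mathrm{ref}}}(R_{C'})</tex-math></inline-formula> as above. Another type of wildcard, which we call internal wildcard and denote as <inline-formula><tex-math>\hat{*}</tex-math></inline-formula>,
does not operate on n-grams themselves, but on the words therein: For example, ‘<inline-formula><tex-math>\hat{*}</tex-math></inline-formula>ing’ matches all
words that end with ‘ing’; consequently, (<inline-formula><tex-math>*</tex-math></inline-formula>, <inline-formula><tex-math>\hat{*}</tex-math></inline-formula>ing) matches all 2-grams starting with an arbitrary
word and ending with a word that ends in ‘ing’.</p>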
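          <p>As a minimal sketch of selection with wildcards, the following Python fragment represents the relation as a dictionary from n-gram tuples to frequency maps. It is an assumption of this sketch that both wildcard types are written as an asterisk and delegated to the standard fnmatch module, which additionally interprets question marks and bracket expressions. The function names and the toy counts are hypothetical.</p>
          <preformat><![CDATA[
```python
from fnmatch import fnmatchcase


def matches(gram, ref):
    """Check equality up to wildcards, position by position."""
    if len(gram) != len(ref):
        return False
    # "*" alone plays the role of the external wildcard; "*" inside a
    # pattern such as "*ing" plays the role of the internal wildcard.
    return all(fnmatchcase(word, pattern) for word, pattern in zip(gram, ref))


def select(relation, predicate):
    """Selection: keep the (n-gram, series) pairs satisfying the predicate."""
    return {g: f for g, f in relation.items() if predicate(g, f)}


# Toy relation (invented counts).
relation = {
    ("machine", "learning"): {2000: 3},
    ("machine", "translation"): {2000: 2},
    ("deep", "learning"): {2001: 5},
}
result = select(relation, lambda g, f: matches(g, ("*", "*ing")))
```
]]></preformat>
          <p>The predicate passed to select ignores the frequency map, which corresponds to the pure n-gram filtering described above.</p>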
          <p>
            While strict equality and wildcards are the most straightforward methods, n-gram filtering is not
limited to these. Aleksandrov and Strapparava [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] include linguistic information from WordNet [
            <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
            ],
allowing for queries about, for example, synonym or antonym relationships between words. Willkomm
et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
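            ] apply sentiment analysis to identify and select n-grams carrying positive or negative sentiments.
          </p>
          <p>
Frequency filtering. Second, we introduce predicates operating on the n-grams’ frequencies. A
simple example of a frequency-based predicate is a pointwise threshold: Setting <inline-formula><tex-math>p = (f_g[2000] \geq 100)</tex-math></inline-formula>,
for example, selects only those n-grams that occurred at least 100 times at timestamp 2000. Another
option is a global threshold, i.e., selecting n-grams <inline-formula><tex-math>g</tex-math></inline-formula> where <inline-formula><tex-math>\sum_{t \in T} f_g(t) &gt; \theta</tex-math></inline-formula> for some fixed value <inline-formula><tex-math>\theta</tex-math></inline-formula>.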
          </p>
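          <p>A more complex predicate uses information about all of <inline-formula><tex-math>C'</tex-math></inline-formula> to select the <inline-formula><tex-math>k</tex-math></inline-formula> most frequent n-grams in
the corpus:
<disp-formula><tex-math>\mathrm{top}(g, k) \equiv \sum_{t \in T} f_g(t) &lt; \sum_{t \in T} f_{g'}(t) \text{ for fewer than } k \text{ n-grams } g'.</tex-math></disp-formula></p>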
          <p>
            Frequency filtering can become almost arbitrarily complex: Willkomm et al. [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] introduce a kNN
operator that selects, for a given reference n-gram and some constant <inline-formula><tex-math>k</tex-math></inline-formula>, the <inline-formula><tex-math>k</tex-math></inline-formula> most similar n-grams
according to the Euclidean distance of their frequencies. Frequency filtering can also be combined with
different operators, for example those presented in the next section.
          </p>
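          <p>The three frequency-based predicates can be sketched as follows. The dictionary representation, the helper names and the toy counts are assumptions of this sketch. The top-k function follows the definition literally, so ties may yield more than k n-grams, and its quadratic running time would not scale to large corpora.</p>
          <preformat><![CDATA[
```python
def total(series):
    """Sum of a frequency series over all timestamps."""
    return sum(series.values())


def select(relation, predicate):
    """Selection: keep the (n-gram, series) pairs satisfying the predicate."""
    return {g: f for g, f in relation.items() if predicate(g, f)}


def top_k(relation, k):
    """Keep n-grams for which fewer than k n-grams have a strictly larger total."""
    totals = {g: total(f) for g, f in relation.items()}
    return {
        g: f
        for g, f in relation.items()
        if sum(1 for other in totals.values() if other > totals[g]) < k
    }


# Toy relation (invented counts).
relation = {
    ("a",): {2000: 5, 2001: 5},
    ("b",): {2000: 1},
    ("c",): {2000: 7, 2001: 1},
}
pointwise = select(relation, lambda g, f: f.get(2000, 0) >= 5)
above_total = select(relation, lambda g, f: total(f) > 8)
most_frequent = top_k(relation, 2)
```
]]></preformat>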
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Data analysis and computation</title>
          <p>While the frequency of a single n-gram can be interesting in some research applications, being able to
combine different n-grams’ frequencies into aggregate time series or even single statistical values opens
up many new possibilities. For example, it becomes possible to not only analyze specific words, but
rather concepts, by summing the frequencies of synonyms. Another possibility is to identify n-grams
with similar frequency patterns by computing pairwise correlations.</p>
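          <p>Arithmetic operations. The easiest way to combine the frequencies of two n-grams is applying
pointwise arithmetic operations, as follows:</p>
          <p><disp-formula><tex-math>(g_1, f_{g_1}) \circ (g_2, f_{g_2}) := (g_1 \circ g_2,\; f_{g_1 \circ g_2}[t] = f_{g_1}[t] \circ f_{g_2}[t],\; t \in T), \quad \circ \in \{+, -, \cdot, /\}.</tex-math></disp-formula></p>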
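          <p>Here, <inline-formula><tex-math>g_1 \circ g_2</tex-math></inline-formula> denotes a symbolic identifier for the combination of <inline-formula><tex-math>g_1</tex-math></inline-formula> and <inline-formula><tex-math>g_2</tex-math></inline-formula> using <inline-formula><tex-math>\circ</tex-math></inline-formula>, which is not an
n-gram in itself, but ensures compatibility with other frequency-based operators of our algebra. An
example of this can be seen in Figure 2, showing the ratio of the frequencies of machine learning and
artificial intelligence, which was close to 0 in 1980, but has rapidly increased since then, showing a
significant increase in popularity for the term machine learning over artificial intelligence. Replacing
<inline-formula><tex-math>(g_2, f_{g_2})</tex-math></inline-formula> with a constant <inline-formula><tex-math>c</tex-math></inline-formula> and using multiplication as the operation defines a constant scaling operator.</p>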
          <p>Since the result of an arithmetic operation in this sense is structurally similar to its operands (albeit
with a symbolic identifier instead of an n-gram), arithmetic operations can be nested and combined
following the usual rules of arithmetic. The definition of other operations to use as <inline-formula><tex-math>\circ</tex-math></inline-formula> is also possible,
as long as their signature fits the same pattern.</p>
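          <p>A pointwise arithmetic operator with a symbolic identifier can be sketched as below. The counts are invented and do not reproduce Figure 2. Treating missing timestamps as zero and marking undefined ratios as None are design choices of this sketch, not part of the algebra.</p>
          <preformat><![CDATA[
```python
import operator


def name(identifier):
    """Render an n-gram tuple or an already symbolic identifier as text."""
    return identifier if isinstance(identifier, str) else " ".join(identifier)


def combine(left, right, op, symbol):
    """Apply `op` pointwise to two (identifier, series) pairs."""
    (g1, f1), (g2, f2) = left, right
    label = f"({name(g1)} {symbol} {name(g2)})"
    series = {}
    for t in sorted(set(f1) | set(f2)):
        try:
            series[t] = op(f1.get(t, 0), f2.get(t, 0))
        except ZeroDivisionError:
            series[t] = None  # assumption: undefined ratios are marked as None
    return label, series


# Invented counts for two 2-grams.
ml = (("machine", "learning"), {1980: 1, 2000: 30})
ai = (("artificial", "intelligence"), {1980: 10, 2000: 20})
ratio = combine(ml, ai, operator.truediv, "/")
doubled = combine(ratio, ratio, operator.add, "+")
```
]]></preformat>
          <p>Because the result is again a pair of an identifier and a series, it can be passed back into combine, which mirrors the nesting property stated above.</p>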
        </sec>
        <sec id="sec-2-2-3">
          <title>Statistical analysis</title>
          <p>
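            The general goal of statistical analysis is to extract knowledge from data. This
can take many different forms; we want to focus on numerical knowledge: Given one or several
n-gram frequency series, compute a number representing an ‘interesting’ property <inline-formula><tex-math>P</tex-math></inline-formula>; or, more formally,
<inline-formula><tex-math>s_P : R_{C'}^k \to \mathbb{R}</tex-math></inline-formula> is a <inline-formula><tex-math>k</tex-math></inline-formula>-ary function mapping <inline-formula><tex-math>k</tex-math></inline-formula> n-grams to one real number quantifying the property <inline-formula><tex-math>P</tex-math></inline-formula>.
Examples for <inline-formula><tex-math>s_P</tex-math></inline-formula> include unary (<inline-formula><tex-math>k = 1</tex-math></inline-formula>) functions like min, max and mean, as well as binary (<inline-formula><tex-math>k = 2</tex-math></inline-formula>)
functions like <inline-formula><tex-math>\rho</tex-math></inline-formula> (Pearson’s correlation coefficient) or
mutual information [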
            <xref ref-type="bibr" rid="ref21">21</xref>
            ].
          </p>
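          <p>As an example of a binary statistical function, the following sketch computes Pearson’s correlation coefficient of two frequency series over a shared list of timestamps. Filling missing timestamps with zero and returning None for constant series are assumptions of this sketch, and the series values are invented.</p>
          <preformat><![CDATA[
```python
from math import sqrt


def pearson(f1, f2, timestamps):
    """Pearson correlation of two frequency maps over the given timestamps."""
    xs = [f1.get(t, 0) for t in timestamps]
    ys = [f2.get(t, 0) for t in timestamps]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


timestamps = [2000, 2001, 2002]
rising = {2000: 1, 2001: 2, 2002: 3}
rising_twice = {2000: 2, 2001: 4, 2002: 6}
falling = {2000: 3, 2001: 2, 2002: 1}
```
]]></preformat>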
        </sec>
        <sec id="sec-2-2-4">
          <title>2.2.3. External knowledge integration</title>
          <p>TNCs are based on textual data, usually written in natural languages. There exists a wealth of linguistic
knowledge that one might want to include in queries on TNCs, for example by grouping synonyms
together or matching different spellings of the same word.</p>
          <p>
            External knowledge can take many forms, so it is difficult to give a general formalization. The
relational data model [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] has, however, proven to be flexible enough to represent almost any kind of
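data. It is therefore reasonable to assume that the external knowledge is available as some relation
<inline-formula><tex-math>E(a_1, \dots, a_m)</tex-math></inline-formula>. Then, combining <inline-formula><tex-math>R_{C'}</tex-math></inline-formula> and <inline-formula><tex-math>E</tex-math></inline-formula> can be expressed as
          </p>
          <p><disp-formula><tex-math>R_{C'} \bowtie_p E := \sigma_p(R_{C'} \times E),</tex-math></disp-formula>
an (inner) join over some predicate <inline-formula><tex-math>p</tex-math></inline-formula> or, equivalently, a Cartesian product followed by a selection.</p>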
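          <p>The join can be sketched as a Cartesian product followed by a selection. In this hypothetical example, the external relation is a small list of (word, concept) rows standing in for a synonym resource, and the joined frequency series are summed per concept. All names and counts are invented.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def join(relation, external, predicate):
    """Inner join: Cartesian product of both relations, then selection."""
    return [
        ((g, f), row)
        for g, f in relation.items()
        for row in external
        if predicate((g, f), row)
    ]


# Toy TNC relation and a hypothetical external synonym relation.
relation = {
    ("car",): {2000: 3, 2001: 4},
    ("automobile",): {2000: 1},
    ("tree",): {2000: 9},
}
synonyms = [("car", "vehicle"), ("automobile", "vehicle")]

joined = join(relation, synonyms, lambda entry, row: entry[0] == (row[0],))

# Synonym grouping: sum the series of all n-grams mapped to one concept.
grouped = {}
for (g, f), (_word, concept) in joined:
    grouped.setdefault(concept, Counter()).update(f)
```
]]></preformat>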
        </sec>
        <sec id="sec-2-2-5">
          <title>2.2.4. Metadata and corpus construction</title>
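          <p>Until now, <inline-formula><tex-math>C</tex-math></inline-formula> and <inline-formula><tex-math>C'</tex-math></inline-formula> have been fixed. In some cases, users might want to restrict a large text collection
<inline-formula><tex-math>C</tex-math></inline-formula> to some smaller text collection <inline-formula><tex-math>\hat{C}</tex-math></inline-formula> and then consider the TNC <inline-formula><tex-math>\hat{C}'</tex-math></inline-formula> separately. This is fundamentally
different from the previously described operators, as the filtering here does not operate on n-grams, but
on entire documents and texts. On a purely conceptual level, however, the algebraic operators are
similar: Instead of operating on <inline-formula><tex-math>R_{C'}</tex-math></inline-formula>, we now consider <inline-formula><tex-math>R_C(x, m)</tex-math></inline-formula>, where <inline-formula><tex-math>x</tex-math></inline-formula> contains the texts and <inline-formula><tex-math>m</tex-math></inline-formula> the
metadata map (including the timestamp <inline-formula><tex-math>t_i</tex-math></inline-formula>) of the documents <inline-formula><tex-math>d_i</tex-math></inline-formula>. A selection on <inline-formula><tex-math>R_C</tex-math></inline-formula> then yields a smaller
relation <inline-formula><tex-math>R_{\hat{C}}</tex-math></inline-formula>, which in turn defines the smaller TNC <inline-formula><tex-math>\hat{C}'</tex-math></inline-formula>.</p>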
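          <p>A document-level selection on metadata can be sketched as follows. The triple representation of documents and the metadata keys are assumptions of this sketch. Because the metadata map is partial, documents that lack the requested key never match.</p>
          <preformat><![CDATA[
```python
def select_documents(corpus, key, value):
    """Keep (text, metadata, timestamp) triples whose metadata maps key to value."""
    return [doc for doc in corpus if key in doc[1] and doc[1][key] == value]


# Toy corpus with partially populated metadata (invented).
corpus = [
    ("first speech", {"speaker": "A", "party": "X"}, 2019),
    ("second speech", {"speaker": "B"}, 2019),
    ("third speech", {"speaker": "C", "party": "X"}, 2020),
]
subcorpus = select_documents(corpus, "party", "X")
```
]]></preformat>
          <p>The resulting subcorpus can then be turned into its own, smaller TNC with the same construction that is used for the full corpus.</p>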
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Query system review</title>
      <p>Having defined our data model and a formal framework for query formulation, we can now proceed to
review 12 existing query systems for TNCs. We begin by describing our literature research process and
justifying our selection of the 12 query systems. We then consider different properties of these systems:
the corpora they are meant to analyze, the expressive power of their query languages, and finally their
practical usability.</p>
      <sec id="sec-3-1">
        <title>3.1. Literature selection</title>
        <p>
          For selecting the literature relevant for our review, we attempted to perform targeted keyword
searches in Google Scholar. This, however, proved problematic: The GBNV is significantly more popular
than any competitor, and thus heavily dominates the result sets. Excluding terms like ‘Google’ was
not useful to restrict the result, as competing query systems almost necessarily mention the GBNV.
The most useful query we found was ["ngram viewer" -intitle:google], yielding NB N-Gram by
Birkenes et al. [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and Slash/A by Todorova et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Following bi-directional citation links from
there, we further identified the Icelandic Gigaword n-Gram Viewer (Steingrímsson et al. [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]) and the
Meta Trend Viewer (Indig et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]).
        </p>
        <p>[Figure 3: citation graph of the reviewed query systems (GBNV, GB-Adv, CHQL, (an:a)-lyzer, PoliticalMashup Ngramviewer, ParlaMint Ngram Viewer, Ngram Search Engine, NgramQuery, Slash/A, NB N-gram, Icelandic Gigaword n-Gram Viewer, Meta Trend Viewer), arranged along a time axis from 2010 to 2025.]</p>
        <p>
          This motivated the following selection procedure: Starting from 8 seed articles (including the two
mentioned above, [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]) we were aware of from previous research, we followed bi-directional
citation links to other articles describing query systems. We did not include the GBNV itself in the
seed set, as its corresponding papers have been cited several thousand times, but most of these citations
do not describe query systems. We intended to continue this procedure transitively—but the only
further query system we found was the PoliticalMashup Ngramviewer [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. This may be a consequence
of the generally sparse citation graph between the considered articles: Figure 3 illustrates that many of
the query system articles do not cite others, except for the GBNV. This fragmentation makes it difficult
to systematically identify and evaluate relevant literature, finally leading us to settle on our selection of
12 systems.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Underlying text corpora</title>
        <p>One of the first decisions to make for any n-gram frequency analysis is which corpus is suitable for
the concrete research questions. This already restricts the choice of available query systems, as most of
the existing ones operate on one, predefined corpus and cannot easily be applied to other corpora, as
summarized in Table 2. If the table lists no TNC, it is the one constructed from the corresponding text
corpus as described in Section 2.1. While, in theory, all systems could be applied to any TNC, most
implementations do not allow changing the corpus and might not scale well.</p>
        <p>
          The GBNC is the most well-known and by far the biggest TNC. It is based on numerous library
collections, digitized as part of the Google Books project. It spans more than 500 years and contains at
least around 6% of all books ever published [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. For copyright reasons, neither the full texts nor
any metadata of the included documents are available, making it impossible to reconstruct the GBNC’s
exact composition. Thus, the GBNC is a case where the TNC <inline-formula><tex-math>C'_{\mathrm{GBNC}}</tex-math></inline-formula> is available, even though the text
corpus <inline-formula><tex-math>C_{\mathrm{GBNC}}</tex-math></inline-formula> is not. Nonetheless, it has attracted widespread attention from researchers.
        </p>
        <p>
          A similar, but older and smaller, corpus is the Google Web 1T 5-gram corpus [
          <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
          ]. Released in 2006, it
contains 1 trillion tokens, extracted from crawled websites. Lacking a temporal dimension, it is not
strictly a TNC in our sense, but rather a snapshot of a single point in time. Again, the underlying
documents are unavailable.
        </p>
        <p>Most of the other query systems operate on smaller, domain-specific corpora: a corpus of
European parliament proceedings from several countries for the ParlaMint Ngram Viewer; specifically Dutch parliament
proceedings for the PoliticalMashup Ngramviewer; Norwegian and Icelandic texts for NB N-gram and
the Icelandic Gigaword n-Gram Viewer, respectively; and the English Wikipedia for the Ngram Search
Engine.</p>
        <p>
          Slash/A and the Meta Trend Viewer are special cases. While their demos contain predefined
corpora (letters between Elizabeth Barrett and Robert Browning for Slash/A, and Hungarian news
articles for the Meta Trend Viewer), both are explicitly designed for different corpora as well. Slash/A
uses the TCF format<sup>2</sup> for its corpora; the Meta Trend Viewer stores them in a relational database and
can import a format also used by the Sketch Engine [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Four systems have source code available,
so local instances of these systems can potentially also handle custom corpora (more detail is found
in Section 3.4, under Extensibility). NgramQuery and CHQL are another special case: no
implementations of them are available at all, so Table 2 lists the corpora the systems were designed
for.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Expressive power of query languages</title>
        <p>The expressive power of a query language defines what kinds of queries and information needs can be
formulated within it. In this section, we investigate the languages of the different query systems and
give insights into their expressive powers.</p>
        <p>First, we define a set of information needs we want to evaluate. This set is shown in Table 3. It covers
the types of information needs we defined in Section 2 and sufficiently differentiates the query systems.</p>
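        <table-wrap>
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>Information need</th><th>Algebraic expression</th></tr>
            </thead>
            <tbody>
              <tr><td>keyword search</td><td><inline-formula><tex-math>\sigma_{g = g_{\mathrm{ref}}}(R_{C'})</tex-math></inline-formula></td></tr>
              <tr><td>single wildcard</td><td><inline-formula><tex-math>\sigma_{g \simeq g_{\mathrm{ref}}}(R_{C'})</tex-math></inline-formula>, one <inline-formula><tex-math>*</tex-math></inline-formula> in <inline-formula><tex-math>g_{\mathrm{ref}}</tex-math></inline-formula></td></tr>
              <tr><td>multiple wildcards</td><td><inline-formula><tex-math>\sigma_{g \simeq g_{\mathrm{ref}}}(R_{C'})</tex-math></inline-formula>, multiple <inline-formula><tex-math>*</tex-math></inline-formula> in <inline-formula><tex-math>g_{\mathrm{ref}}</tex-math></inline-formula></td></tr>
              <tr><td>internal wildcard</td><td><inline-formula><tex-math>\sigma_{g \simeq g_{\mathrm{ref}}}(R_{C'})</tex-math></inline-formula>, where some <inline-formula><tex-math>w_i</tex-math></inline-formula> in <inline-formula><tex-math>g_{\mathrm{ref}}</tex-math></inline-formula> contains <inline-formula><tex-math>\hat{*}</tex-math></inline-formula></td></tr>
              <tr><td>edit distance</td><td><inline-formula><tex-math>s_{\mathrm{edit}}(g_1, g_2)</tex-math></inline-formula></td></tr>
              <tr><td>frequency threshold</td><td><inline-formula><tex-math>\sigma_{\sum_{t \in T} f_g(t) &gt; \theta}(R_{C'})</tex-math></inline-formula>, for some constant <inline-formula><tex-math>\theta</tex-math></inline-formula></td></tr>
              <tr><td>most frequent</td><td><inline-formula><tex-math>\sigma_{\mathrm{top}(g, k)}(R_{C'})</tex-math></inline-formula>, for some constant <inline-formula><tex-math>k</tex-math></inline-formula></td></tr>
              <tr><td>similarity search</td><td><inline-formula><tex-math>s_{\mathrm{sim}}(f_{g_1}, f_{g_2})</tex-math></inline-formula></td></tr>
              <tr><td>correlation</td><td><inline-formula><tex-math>\rho(f_{g_1}, f_{g_2})</tex-math></inline-formula></td></tr>
              <tr><td>arithmetic</td><td><inline-formula><tex-math>(g_1, f_{g_1}) \circ (g_2, f_{g_2})</tex-math></inline-formula></td></tr>
              <tr><td>synonym grouping</td><td><inline-formula><tex-math>R_{C'}(g, f) \bowtie_{R_{C'}.g = W.g} W(g, \mathrm{Syn})</tex-math></inline-formula>, where Syn is a list of synonyms of <inline-formula><tex-math>g</tex-math></inline-formula></td></tr>
              <tr><td>dictionaries</td><td><inline-formula><tex-math>\sigma_{g \in D}(R_{C'})</tex-math></inline-formula>, for some dictionary <inline-formula><tex-math>D</tex-math></inline-formula></td></tr>
              <tr><td>sentiment</td><td><inline-formula><tex-math>R_{C'}(g, f) \bowtie_{R_{C'}.g = S.g} S(g, s)</tex-math></inline-formula>, where <inline-formula><tex-math>S</tex-math></inline-formula> is a sentiment dictionary</td></tr>
              <tr><td>metadata filtering</td><td><inline-formula><tex-math>\sigma_{m(\kappa) = v}(R_C)</tex-math></inline-formula>, for some metadata key <inline-formula><tex-math>\kappa</tex-math></inline-formula> and value <inline-formula><tex-math>v</tex-math></inline-formula></td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p><sup>2</sup> https://weblicht.sfs.uni-tuebingen.de/englisch/tutorials/html/index.html</p>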
        <p>The results of our review regarding expressive power are summarized in Table 4. First, we want to
highlight common patterns across different systems, and then go into further detail on some of the
systems that do not fit those patterns.</p>
        <p>The first group of systems we identified are the GBNV-like systems. This group comprises the GBNV
itself, the Ngram Search Engine, GB-Adv, Slash/A, NB N-gram and the Icelandic Gigaword n-Gram
Viewer. While there are some differences within the group, they share common characteristics: Search
and filtering operations are (mostly) limited to n-grams and cannot take frequencies into account. Most
of the systems support simple arithmetic operations to aggregate a small number of frequency series
into a combined one. Their output is usually a visualization of frequencies over time, most often as
a line plot. Within this group, the main difference between the systems is their flexibility regarding
wildcards.</p>
        <p>The second group are the subcorpus-oriented systems. This group includes the PoliticalMashup
Ngramviewer, the Meta Trend Viewer and the ParlaMint Ngram Viewer. Their unifying characteristic
is their focus on subcorpora: They offer the possibility to construct subcorpora according to documents’
metadata, and operators to visualize and compare frequencies across these subcorpora. Their user
interfaces are similar, offering exact keyword searches and separate fields for metadata-based restrictions.</p>
        <p>The three remaining systems are NgramQuery, the (an:a)-lyzer and CHQL. The (an:a)-lyzer is a
special case. Its focus is the investigation of one specific linguistic phenomenon, the pronunciation
of the letter h at the beginning of English words. It is not a general-purpose query system, so it stands
out compared to all the others, not even offering keyword searches on its corpus. NgramQuery focuses
mainly on combining TNCs with WordNet. It offers a rich selection of operators mapping relations in
WordNet, e.g., synonym and antonym relationships between words. Finally, CHQL is a very flexible
query language and offers many unique operators. An actual system implementing this language is,
however, not available, only the formal definitions of these operators. A query system based on CHQL
would be vastly more expressive than any of the other systems we investigated.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Practical usability</title>
        <p>We consider four dimensions of practical usability: availability, extensibility, scalability, and query
language complexity. As already mentioned in Section 3.2, implementations of NgramQuery and CHQL
are not available, so in the following we assume these systems would behave in the way they are
described in their corresponding articles.</p>
        <p>Availability. A query system can only be used if some implementation of it is available to researchers.
This usually comes in one of two forms: as a hosted web application or as a repository of source code.<sup>3</sup>
Both options have their own advantages: A hosted web application requires no user setup and entry
barriers are very low, but it is usually less flexible. A source code repository is more difficult to set
up and requires the user to have sufficiently powerful hardware, but can be more easily adapted for
specific use cases. The ideal case is therefore a combination of both: a web demonstrator of the query
system, together with source code allowing for modifications.</p>
        <p>NgramQuery and CHQL are not available in any form. At the time of writing, hosted web applications
are accessible for the GBNV, GB-Adv, NB N-gram, the Icelandic Gigaword n-Gram Viewer and the Meta
Trend Viewer. The web demonstrators of the Ngram Search Engine, PoliticalMashup Ngramviewer,
(an:a)-lyzer and the ParlaMint Ngram Viewer exist or existed in the past, but could not be accessed.
Source code is available for Slash/A, the Icelandic Gigaword n-Gram Viewer, the Meta Trend Viewer
and the ParlaMint Ngram Viewer. Web demonstrators or source code archives, respectively, are linked
in Table 1.</p>
        <p>Extensibility. No query system developer is able to anticipate all potential information needs of their
users. Therefore, extensibility of such systems is useful in adapting them for a wider range of research
questions. We consider three kinds of extensibility: (1) the possibility to analyze custom corpora, (2)
extensions of the query language itself, and (3) the export of data to enable processing in different tools.</p>
        <p>
          (1) None of the web demonstrators support the upload of custom corpora, so we focus on the four
systems for which source code is available. Slash/A and the Meta Trend Viewer address user-defined
corpora explicitly and the required file format is documented in the respective papers [
          <xref ref-type="bibr" rid="ref25 ref30">25, 30</xref>
          ]. The
Icelandic Gigaword n-Gram Viewer is based on Korp [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and the Stuttgart Corpus WorkBench (CWB)
[
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]. It might be possible to adapt the viewer to custom corpora stored within these tools, but this is
not intended and the required effort is unclear. The ParlaMint Ngram Viewer is based on XML files and
ElasticSearch [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]—a potentially flexible technology, but the exact process to use the viewer for custom
corpora is not described.
        </p>
        <p>(2) None of the systems we reviewed offers any defined mechanisms for query language extensions.
For pure web applications, this makes any extensions impossible, as it would require the execution of
custom code on foreign servers, which is usually not allowed. The systems available as source code
might offer the possibility to be extended, but not in a documented and intended way, so a deeper
technical understanding of their implementations is required, which is a significant effort.</p>
        <p>(3) Being able to export query results from a query system to a common format like CSV enables
further analyses which are not possible within the system itself. Of the systems we investigated, two offer
this possibility: NB N-gram and the Icelandic Gigaword n-Gram Viewer. Slash/A, the (an:a)-lyzer and
the Meta Trend Viewer do not export the raw data, but at least allow plots to be exported. For other systems,
workarounds to extract numerical data sometimes exist,<sup>4</sup> but these are not an official part of the query
system and are potentially outdated.</p>
        <p>Scalability. Experimentally evaluating the scalability of different query systems would require access
to comparable implementations of the systems in question. Since these are not available, we take a more
theoretical approach towards evaluating scalability, based on the size of the corpora the systems were
designed for. These sizes are shown in Table 5, in terms of number of distinct n-grams or number of
documents, depending on what was reported by the respective authors. We want to highlight two
main points about the table: (a) the corpus sizes differ by several orders of magnitude, so it is to be
expected that not all query systems can efficiently handle the bigger corpora; and (b) there is no clear
trend towards larger corpora over time (recall from Table 1 that the rows are ordered by year of first
publication).</p>
        <p><sup>3</sup> A third option would be the distribution of compiled binary files without disclosure of the source code, but none of the systems
we investigated follows that model.</p>
        <p><sup>4</sup> https://github.com/econpy/google-ngrams, https://github.com/prasa-dd-vp/google_ngram_api</p>
        <p>Query language complexity. Measuring the complexity of a query language is not trivial. We will
not define a formal model for it, but rather give a plausible intuition regarding the different query
systems. By ‘complexity’, we do not mean the computational complexity, but rather complexity in the
process of query formulation.</p>
        <p>On one end of the spectrum are simple keyword queries within a graphical interface: Users only have
to supply an n-gram they are interested in to retrieve a plot of its frequency. Most of the GBNV-like
systems follow this pattern. The syntax of wildcard queries is equally simple within these systems,
replacing words or parts of words with the respective wildcard characters.</p>
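        <p>To illustrate the simple end of this spectrum, the following minimal Python sketch runs a keyword query and a wildcard query against a toy corpus. The corpus contents, the function names keyword_query and wildcard_query, and the use of '*' as the wildcard character are assumptions made for illustration; they do not reproduce the implementation or the exact syntax of any reviewed system.</p>
        <preformat>
```python
from fnmatch import fnmatchcase

# Hypothetical toy corpus: n-gram -> {year: relative frequency}.
# All values are invented for illustration only.
TOY_TNC = {
    "query system": {1990: 0.10, 2000: 0.25},
    "query language": {1990: 0.30, 2000: 0.20},
    "data model": {1990: 0.05, 2000: 0.15},
}


def keyword_query(corpus, ngram):
    """Return the (year, frequency) series of one exact n-gram, sorted by year."""
    return sorted(corpus.get(ngram, {}).items())


def wildcard_query(corpus, pattern):
    """Return the series of every n-gram matching a shell-style pattern.

    Assumption: '*' stands for any run of characters, as in fnmatch.
    """
    return {
        ngram: sorted(series.items())
        for ngram, series in corpus.items()
        if fnmatchcase(ngram, pattern)
    }
```
        </preformat>
        <p>Under these assumptions, wildcard_query(TOY_TNC, "query *") returns the series for 'query system' and 'query language'. Note that fnmatchcase lets '*' span several words, whereas an actual viewer may restrict a wildcard to a single token.</p>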
        <p>
          Two systems with significantly more complex query languages are NgramQuery and CHQL.
NgramQuery contains many operators relating to WordNet and semantic similarity, for example the query
‘~#food#n’ “retrieves the hyponyms of the noun food for all of its senses” [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] and ‘food#L#15#0.6’
“retrieves the similar terms of food with cosine value at least 0.6 among the first 15 most similar terms”
[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. A summary of all the operators defined in the language is given in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. CHQL, as already
mentioned in Section 3.3, defines a query algebra instead of a query language implementation. A query
language representing that algebra would be of similar complexity to a relational query language like
SQL, and thus significantly more complex than the languages of the other systems.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Summary of review findings</title>
        <p>We have reviewed 12 different query systems for TNCs with regard to their expressive power, scalability
and practical usability. We identified two groups of systems, namely the GBNV-like and the
subcorpus-oriented systems. We found that most systems support keyword and wildcard queries, with different
levels of flexibility. Of the 12 systems we investigated, 5 were accessible as web demonstrators at
the time of writing, and 4 more offered web demonstrators in the past that can no longer be
accessed. Source code is available for 4 of the systems. None of the systems offers defined mechanisms for
extension, but some let users analyze custom corpora and export data for use in different systems. The
size of the underlying corpora varies significantly between the different systems, with no clear trend
towards larger corpora over time. Most of the systems offer simple query languages through graphical
user interfaces, making them attractive to users without a strong technical background.</p>
        <p>When conducting a study on TNCs, the choice of a suitable tool is mainly limited by three factors: (1)
the corpus of interest, as most systems are specifically tailored to one corpus; (2) the complexity of the
research question itself, as not every tool’s query language is expressive enough for every information
need; and (3) the complexity of expressing the specific information needs within a system, including the
willingness and ability of researchers to do so. Furthermore, an implementation of the chosen system
needs to be available.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>We conclude by giving a brief summary of our work, discussing its limitations and giving an outlook
towards future research.</p>
      <p>Summary We have defined a data model and query algebra for TNCs. We used this query algebra
to express abstract information needs that can be satisfied using specific query systems. A literature
review of these systems allowed us to categorize them and evaluate their capabilities, emphasizing
different aspects: the corpora the systems can analyze, their expressiveness, and their practical usability.</p>
      <p>
        While the GBNV is by far the most popular tool to analyze TNCs, similar tools exist for smaller
corpora. Furthermore, there are query systems which are significantly more flexible in what information
needs they can express. Our review is the first to systematically evaluate these systems and bring them
together in a concise way, summarizing the state of the art for TNC query systems.</p>
      <p>Limitations While we are confident that we have given a proper overview of the field of TNC query
systems, we are aware of limitations of our approach:</p>
      <p>• The literature survey we conducted was not systematic. For reasons we detailed in Section 3.1,
a systematic study was not feasible. We therefore cannot guarantee that we did not miss
potentially interesting query systems. Our selection of query systems is rooted in an unsystematic,
but thorough literature research. The systems are diverse and implement very different query
languages, so we are confident that our review covers the most important classes, even if we may
have missed certain systems.</p>
      <p>• We only investigated existing query systems, so some interesting ideas for potential query systems
could not be included. For example, to the best of our knowledge, there has been no investigation
into whether pure SQL can be used to query TNCs efficiently and flexibly. While relational
databases are the backend for several systems [
        <xref ref-type="bibr" rid="ref24 ref30">24, 30</xref>
        ], none of these systems allow free query
formulation. Another example arises from the recent development of Large Language Models
and their impressive performance on query synthesis tasks [
        <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
        ]: no existing query system
leverages these new capabilities, but being able to translate natural language queries into actual
database queries might lower the user-facing complexity of such tasks considerably.</p>
      <p>• We evaluated practical usability in a rather abstract manner instead of conducting user studies.
We consider our criteria to be well-defined and objective enough to still yield valuable insights,
especially because a user study to compare 12 different systems would be a very substantial effort.</p>
      <p>Future work We plan to make use of our results by implementing a query system based on the
algebra we defined in Section 2. Such a system would combine the strengths of the existing query
systems into one unified interface and, if implemented efficiently, enable more complex TNC analyses
than previously possible.</p>
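      <p>One of the limitations noted above is that querying TNCs with pure SQL has not been investigated. As a rough impression of what free query formulation over a relational backend could look like, the following Python sketch uses the standard sqlite3 module. The table ngram_counts, its columns, the toy data and the function names are hypothetical and chosen only for illustration; the sketch says nothing about efficiency at realistic corpus sizes and does not reflect the schema of any reviewed system.</p>
      <preformat>
```python
import sqlite3


def build_toy_tnc():
    """Create an in-memory database with a hypothetical TNC table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ngram_counts ("
        "ngram TEXT, year INTEGER, match_count INTEGER, "
        "PRIMARY KEY (ngram, year))"
    )
    # Invented counts, for illustration only.
    conn.executemany(
        "INSERT INTO ngram_counts VALUES (?, ?, ?)",
        [
            ("query system", 1990, 10), ("query system", 2000, 40),
            ("query language", 1990, 30), ("query language", 2000, 20),
            ("data model", 1990, 5), ("data model", 2000, 15),
        ],
    )
    return conn


def growth_of_matching_ngrams(conn, like_pattern, first_year, last_year):
    """Rank n-grams matching a LIKE pattern by their change in count
    between two years, using a self-join on the n-gram."""
    sql = """
        SELECT a.ngram, b.match_count - a.match_count AS growth
        FROM ngram_counts AS a
        JOIN ngram_counts AS b ON a.ngram = b.ngram
        WHERE a.year = ? AND b.year = ? AND a.ngram LIKE ?
        ORDER BY growth DESC
    """
    return conn.execute(sql, (first_year, last_year, like_pattern)).fetchall()
```
      </preformat>
      <p>With the toy data, growth_of_matching_ngrams(build_toy_tnc(), "query %", 1990, 2000) combines a wildcard match with a comparison across two points in time in a single statement, which is the kind of free query formulation meant here.</p>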
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the pilot program Core-Informatics of the Helmholtz Association (HGF)
and the Initiative and Networking Fund through Helmholtz AI.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlüter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vetter</surname>
          </string-name>
          ,
          <article-title>An interactive visualization of Google Books Ngrams with R and Shiny: Exploring a(n) historical increase in onset strength in a(n) huge database</article-title>
          ,
          <source>Journal of Data Mining &amp; Digital Humanities</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Breit</surname>
          </string-name>
          ,
          <article-title>The Distribution of English Isograms in Google Ngrams and the British National Corpus</article-title>
          ,
          <source>Opticon1826</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Younes</surname>
          </string-name>
          , U.-D. Reips,
          <article-title>The changing psychology of culture in German-speaking countries: A Google Ngram study</article-title>
          ,
          <source>International Journal of Psychology</source>
          <volume>53</volume>
          (
          <year>2018</year>
          )
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>O'Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Duffy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <article-title>Culturomics and the history of psychiatry: testing the Google Ngram method</article-title>
          ,
          <source>Irish Journal of Psychological Medicine</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>23</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Teepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Glase</surname>
          </string-name>
          , U.-D. Reips,
          <article-title>Increasing digitalization is associated with anxiety and depression: A Google Ngram analysis</article-title>
          ,
          <source>PLOS One</source>
          <volume>18</volume>
          (
          <year>2023</year>
          )
          <elocation-id>e0284091</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Willkomm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmidt-Petri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schäler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schefczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Böhm</surname>
          </string-name>
          ,
          <article-title>A query algebra for temporal text corpora</article-title>
          ,
          <source>in: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>183</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>J.-B. Michel</surname>
            ,
            <given-names>Y. K.</given-names>
          </string-name>
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          <string-name>
            <surname>Aiden</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Veres</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>G. B.</given-names>
          </string-name>
          <string-name>
            <surname>Team</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          <string-name>
            <surname>Pickett</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hoiberg</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Clancy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Norvig</surname>
          </string-name>
          , et al.,
          <article-title>Quantitative analysis of culture using millions of digitized books</article-title>
          ,
          <source>Science</source>
          <volume>331</volume>
          (
          <year>2011</year>
          )
          <fpage>176</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-B. Michel</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          <string-name>
            <surname>Lieberman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Orwant</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Brockman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Petrov</surname>
          </string-name>
          ,
          <article-title>Syntactic annotations for the Google Books Ngram Corpus</article-title>
          ,
          <source>in: Proceedings of 50th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L.
          <string-name>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Petrov</surname>
          </string-name>
          ,
          <article-title>Enhanced search with wildcards and morphological inflections in the Google Books Ngram Viewer</article-title>
          ,
          <source>in: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Baisa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bušta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakubíček</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovář</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michelfeit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rychlý</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Suchomel</surname>
          </string-name>
          ,
          <article-title>The Sketch Engine: ten years on</article-title>
          ,
          <source>Lexicography</source>
          <volume>1</volume>
          (
          <year>2014</year>
          )
          <fpage>7</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Borin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forsberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Roxendal</surname>
          </string-name>
          ,
          <article-title>Korp-the corpus infrastructure of Språkbanken</article-title>
          , in: LREC, volume
          <year>2012</year>
          ,
          <year>2012</year>
          , pp.
          <fpage>474</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>C. K. Kreutz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wolz</surname>
          </string-name>
          , R. Schenkel,
          <article-title>SchenQL: A concept of a domain-specific query language on bibliographic metadata</article-title>
          ,
          <source>in: 21st International Conference on Asia-Pacific Digital Libraries, ICADL</source>
          <year>2019</year>
          ,
          <string-name>
            <given-names>Kuala</given-names>
            <surname>Lumpur</surname>
          </string-name>
          , Malaysia, November 4-
          <issue>7</issue>
          ,
          <year>2019</year>
          , Proceedings 21, Springer,
          <year>2019</year>
          , pp.
          <fpage>239</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>C. K. Kreutz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wolz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Knack</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Weyers</surname>
          </string-name>
          , R. Schenkel,
          <article-title>SchenQL: in-depth analysis of a query language for bibliographic metadata</article-title>
          ,
          <source>International Journal on Digital Libraries</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>113</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Coole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rayson</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Mariani,</surname>
          </string-name>
          <article-title>LexiDB: A scalable corpus database management system</article-title>
          ,
          <source>in: 2016 IEEE International Conference on Big Data (Big Data)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>3880</fpage>
          -
          <lpage>3884</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Coole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rayson</surname>
          </string-name>
          , J. Mariani, LexiDB: Patterns &amp;
          <article-title>methods for corpus linguistic database management</article-title>
          ,
          <source>in: Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3128</fpage>
          -
          <lpage>3135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Codd</surname>
          </string-name>
          ,
          <article-title>A relational model of data for large shared data banks</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>13</volume>
          (
          <year>1970</year>
          )
          <fpage>377</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          ,
          <article-title>Principles of database and knowledge-base systems</article-title>
          , Volume I,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Aleksandrov</surname>
          </string-name>
          , C. Strapparava,
          <article-title>NgramQuery-Smart Information Extraction from Google N-gram using External Resources</article-title>
          , in: LREC,
          <year>2012</year>
          , pp.
          <fpage>563</fpage>
          -
          <lpage>568</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>WordNet: a lexical database for English</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <article-title>WordNet: An electronic lexical database</article-title>
          ,
          <source>MIT Press google schola 2</source>
          (
          <year>1998</year>
          )
          <fpage>678</fpage>
          -
          <lpage>686</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Duncan</surname>
          </string-name>
          ,
          <article-title>On the calculation of mutual information</article-title>
          ,
          <source>SIAM Journal on Applied Mathematics</source>
          <volume>19</volume>
          (
          <year>1970</year>
          )
          <fpage>215</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dalwani</surname>
          </string-name>
          ,
          <article-title>Ngram search engine with patterns combining token, POS, chunk and NE information</article-title>
          ,
          <source>in: LREC</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>B. de Goede</surname>
            , J. van Wees,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Marx</surname>
          </string-name>
          , R. Reinanda, PoliticalMashup Ngramviewer:
          <article-title>Tracking who said what and when in parliament</article-title>
          ,
          <source>in: Research and Advanced Technology for Digital Libraries: International Conference on Theory and Practice of Digital Libraries, TPDL</source>
          <year>2013</year>
          , Valletta, Malta,
          <source>September 22-26</source>
          ,
          <year>2013</year>
          . Proceedings 3, Springer,
          <year>2013</year>
          , pp.
          <fpage>446</fpage>
          -
          <lpage>449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <article-title>Making Google Books n-grams useful for a wide range of research on language change</article-title>
          ,
          <source>International Journal of Corpus Linguistics</source>
          <volume>19</volume>
          (
          <year>2014</year>
          )
          <fpage>401</fpage>
          -
          <lpage>416</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1075/ijcl.19.3.04dav</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>V.</given-names>
            <surname>Todorova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chinkina</surname>
          </string-name>
          , R. de Haan,
          <article-title>Slash/A n-gram tendency viewer-visual exploration of n-gram frequencies in correspondence corpora</article-title>
          ,
          <source>in: Proc. of the ESSLLI</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>M. B. Birkenes</surname>
            ,
            <given-names>L. G.</given-names>
          </string-name>
          <string-name>
            <surname>Johnsen</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Lindstad</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ostad</surname>
          </string-name>
          ,
          <article-title>From digital library to n-grams: NB N-gram</article-title>
          ,
          <source>in: Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA</source>
          <year>2015</year>
          ),
          <year>2015</year>
          , pp.
          <fpage>293</fpage>
          -
          <lpage>295</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmidt-Petri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schäler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schefczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Böhm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Willkomm</surname>
          </string-name>
          ,
          <article-title>The CHQL Query Language for Conceptual History Using Google Books Ngrams</article-title>
          , in: Data for History,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Willkomm</surname>
          </string-name>
          , Querying and Efficiently Searching Large, Temporal Text Corpora,
          <source>Ph.D. thesis, Dissertation</source>
          , Karlsruhe,
          <source>Karlsruher Institut für Technologie (KIT)</source>
          ,
          <year>2021</year>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Steingrímsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barkarson</surname>
          </string-name>
          , G. T. Örnólfsson,
          <article-title>Facilitating corpus usage: Making Icelandic corpora more accessible for researchers and language users</article-title>
          ,
          <source>in: Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3399</fpage>
          -
          <lpage>3405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>B.</given-names>
            <surname>Indig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sárközi-Lindner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nagy</surname>
          </string-name>
          ,
          <article-title>Use the metadata, Luke! - an experimental joint metadata search and n-gram trend viewer for personal web archives</article-title>
          ,
          <source>in: Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>de Jong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kuzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larooij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <article-title>ParlaMint Ngram Viewer: Multilingual comparative diachronic search across 26 parliaments</article-title>
          ,
          <source>in: Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>110</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S.</given-names>
            <surname>Steingrímsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Helgadóttir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rögnvaldsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barkarson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guðnason</surname>
          </string-name>
          ,
          <article-title>Risamálheild: A very large Icelandic text corpus</article-title>
          ,
          <source>in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barkarson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steingrímsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hafsteinsdóttir</surname>
          </string-name>
          ,
          <article-title>Evolving large text corpora: Four versions of the Icelandic Gigaword Corpus</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2371</fpage>
          -
          <lpage>2381</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ogrodniczuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Osenova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fišer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wissik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schopper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kirnbauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          , et al.,
          <source>Multilingual comparable corpora of parliamentary debates ParlaMint 3.0</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brants</surname>
          </string-name>
          ,
          <source>Web 1T 5-Gram Version 1</source>
          , Linguistic Data Consortium, Philadelphia (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brants</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Franz</surname>
          </string-name>
          ,
          <source>The Google Web 1T 5-Gram corpus version 1.1</source>
          ,
          <issue>LDC2006T13</issue>
          17 (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>S.</given-names>
            <surname>Evert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hardie</surname>
          </string-name>
          ,
          <article-title>Twenty-first century corpus workbench: Updating a query architecture for the new millennium</article-title>
          ,
          <source>in: Proceedings of the Corpus Linguistics 2011 conference</source>
          , Citeseer,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gormley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <source>Elasticsearch: the definitive guide: a distributed real-time search and analytics engine</source>
          ,
          <publisher-name>O'Reilly Media, Inc.</publisher-name>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Next-generation database interfaces: A survey of LLM-based text-to-SQL</article-title>
          ,
          <source>arXiv preprint arXiv:2406.08426</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>L.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A survey on employing large language models for text-to-SQL tasks</article-title>
          ,
          <source>arXiv preprint arXiv:2407.15186</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>