
Studying Linked Data Accessibility Healthiness for the Long Tail of the Data Web

Johannes Frey

0 1

Marvin Hofer

Sebastian Hellmann

1

0 KMI Competence Center @ Institute for Applied Informatics, Leipzig University, Germany, https://kmi-leipzig.de

1 Knowledge Integration and Linked Data Technologies (KILT/AKSW) / DBpedia Association @ Institute for Applied Informatics, Leipzig, Germany, https://aksw.org/Groups/KILT

2023

In this paper, we explore the accessibility healthiness of Linked Data within the context of the Data Web, focusing on the long tail of data sources. Unlike the traditional web, Linked Data lacks a driving infrastructure to enhance accessibility, leading to negative impacts on data consumers, adoption, and the creation of large-scale infrastructures. We investigate challenges posed by issues such as link rot, unparseable content, downtime, and timeouts that hinder effective access to Linked Data. The study involves a novel Linked Data client that logs debugging information, providing insights into the efficiency and effectiveness of accessing Linked Data. The research also includes discussions on the methods and approach taken, IRI identity mismatch handling, crawling results, and Linked Data parsing statistics. Through extensive analysis of HTTP response status codes and accessibility issues, the paper not only quantifies common problems but also proposes methods for enhancing Linked Data accessibility in order to retrieve consistent sub-graphs from the Data Web.

Keywords: Linked Data, Accessibility Issues, Web Crawling, Long Tail, Data Quality

1. Introduction

Linked Data was proposed as a way of creating a giant global graph (GGG) of interconnected data on the Web. The idea was that by using shared vocabularies and standard protocols, disparate datasets could be connected to form a single, unified resource that could be navigated and explored in a way similar to the WWW with web browsers.

The evolution of the data itself, but also of the hosting environment and circumstances, leads to accessibility issues such as link rot, unparseable content, downtime, or timeouts when trying to access it. In the traditional web, Google's search engine offers specific browsing entry points based on detailed information needs and enhances accessibility by caching sites and incentivizing proper syntax and standards (e.g. schema.org). Such a driving infrastructure is missing for the Web of Data. Accessibility issues negatively affect consumers and adoption, but also hinder the creation of large-scale infrastructures for a better usability of the Data Web, impacting areas such as data management (e.g. entity indexing, sameAs link clustering) and preservation (archiving, crawling) of Linked Data. While Linked Data crawls (which also explore accessibility aspects) have already been subject to previous work, we exclusively focused on Linked Data according to Tim Berners-Lee's original Design Issues¹, thus excluding SPARQL endpoints and RDF dataset dumps, but allowed anything that follows the rules, in particular embedded JSON-LD. Specifically, we collected crawling seed IRIs from several million domains to assess the long tail, which deserves better exploration, especially given the context of establishing the aforementioned usability infrastructure. Furthermore, we implemented a novel Linked Data client which extensively logs and stores debugging information as a necessary first step to study the efficiency and effectiveness of accessing Linked Data.

2. Related Work

Linked Data crawls have been the subject of several research efforts typically aimed at building large collections of Linked Open Data resources from the Web. In this section, we summarize, to the best of our knowledge, the most notable Linked Data and RDF crawling efforts with public access.

The most recent iteration of the Billion Triple Challenge, the BTC-2019 dataset [1], involved crawling over 2.6 million documents from 394 pay-level domains of the Web. The authors retrieved more than 2.1 billion unique quads and 256 million distinct triples. The crawl was performed using LD-Spider 1.3 in a breadth-first manner based on 442 URLs from DyLDO.

The Dynamic Linked Data Observatory (DyLDO) project [2] has performed weekly crawls of Linked Open Data since 2012. Based on a fixed seed list (containing 95,737 URIs from 652 domains), it dereferences RDF data in a first round. Subsequently, all discovered IRIs are used to perform another crawl, persisting the retrieved RDF data, HTTP headers, and redirects. The crawler applies breadth-first search and performs 2-5 more rounds, where each round dereferences all unseen IRIs of the next hop based on a frontier list from the previous round. DyLDO was developed to assess the temporal stability of Linked Data resources; the availability and functioning of the Linked Data mechanisms as well as data evolution for particular RDF resources can be analyzed over time. Although the authors used techniques aiming at covering a wide cross-section of domains in the initial seed, the design is focused on completing the crawl given limited time and resource constraints, and the seed list captures a state of the LOD cloud that is over 12 years old.

LOD Laundromat [3] is a tool that crawls and cleans RDF dumps. The seed was created using a combination of manual and automated methods. The authors added dump URLs using the CKAN API, e.g. from Datahub, but also added several datasets where they knew the location of the dumps. The dump files are retrieved with a custom and fault-tolerant crawler/parser, and VoID triples in the dumps are used to (recursively) discover new datasets and their dump file locations. Moreover, users can submit URLs pointing to either an RDF dump or a VoID description of a dataset. However, as of August 2023, the service URL http://lodlaundromat.org did not host any data or project-related information anymore, and the GitHub page states that it has been closed for maintenance since July 2021. Fortunately, a subset of the data (650K RDF documents summing up to 524 GB of compressed HDT [4] data and over 3.3 billion triples) is available in LOD-a-lot [5], which has been used as a basis for this work.

Web Data Commons [6] is a large-scale project that aims to extract structured data from the Common Crawl. Since 2012, the project has released several structured data dumps based on semantic annotations in the crawled HTML files, including Microdata, Microformat, RDFa, as well as Schema.org from embedded JSON-LD [7]. Out of 1.5 billion URLs with semantic markup (of over 3 billion URLs from the October 2022 crawl) that were hosted on 14 million domains, over 19 billion typed entities and 86 billion triples were extracted².

DBpedia Archivo [8] is an augmented ontology archive that automatically crawls, discovers, versions, and archives ontologies. In order to discover OWL and SKOS ontologies, it performs follow-your-nose Linked Data on (transitive) dependencies/imports in ontologies from previous iterations of Archivo crawls, but also employs vocabulary usage reports in VoID files, ontology repositories, and user inclusion requests. As such, it tries to crawl the Web of Ontologies, a subset of the LOD cloud, and a study in 2022 [9] has shown that it improves the accessibility of the terminological context (property & class IRIs) for 80% of the triples in LOD-a-lot and for 45% of the used terms.

To the best of our knowledge, none of the efforts specifically focus on and evaluate the long tail of Linked Data.

3. Methods and Approach

3.1. Linked Data Access

Linked Data access follows the principles of web architecture³, a multitude of standards, protocols, and rules, including the use of URIs (Uniform Resource Identifiers) and IRIs (Internationalized Resource Identifiers) to identify resources, and data models, like RDF (Resource Description Framework), to represent data.

² http://webdatacommons.org/structureddata/2022-12/stats/stats.html
³ https://www.w3.org/standards/webarch/

As the main actors of the Linked Data web architecture, the implementation of clients and servers plays an important role in enabling the publication, discovery, and consumption of Linked Data. A Linked Data server provides access to resources through HTTP(S) IRIs. Clients, on the other hand, are responsible for consuming and processing Linked Data from servers and thus retrieve a local sub-graph of the globally accessible Linked Data graph. The Linked Data consumption process involves the following major phases between client and server.

1. IRI dereferencing: The client sends a GET request to the server identified by the HTTP(S) IRI and follows redirects.
2. Representation selection: The server responds with a representation of the resource in a particular format (such as plain RDF serialization formats, RDFa, or JSON-LD).
3. Representation parsing: The client parses the representation (the payload contained in the HTTP body) to extract the RDF or other structured data from it.
4. Follow-your-nose: The client might dereference any IRI it finds in the structured data to retrieve additional resources.
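The four phases can be illustrated with a small breadth-first client. The following is a minimal sketch and not the client used in this work: the in-memory FAKE_WEB dictionary stands in for real HTTP servers, the regular expression only covers IRI and plain literal objects in N-Triples, and all names (FAKE_WEB, dereference, parse_ntriples, follow_your_nose) are hypothetical.

```python
import re
from collections import deque

# Hypothetical stand-in for the Web: IRI -> (status, headers, body).
FAKE_WEB = {
    "http://example.org/a": (303, {"Location": "http://example.org/a.nt"}, ""),
    "http://example.org/a.nt": (
        200,
        {"Content-Type": "application/n-triples"},
        "<http://example.org/a> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/b> .\n",
    ),
    "http://example.org/b": (
        200,
        {"Content-Type": "application/n-triples"},
        '<http://example.org/b> <http://www.w3.org/2000/01/rdf-schema#label> "b" .\n',
    ),
}

# Simplified N-Triples line: IRI subject, IRI predicate, IRI or plain literal object.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|".*")\s*\.\s*$')


def dereference(iri, max_redirects=10):
    """Phases 1 and 2: request the IRI, follow redirects, return the response."""
    for _ in range(max_redirects + 1):
        status, headers, body = FAKE_WEB.get(iri, (404, {}, ""))
        if 300 <= status < 400 and "Location" in headers:
            iri = headers["Location"]
            continue
        return status, headers, body
    return None, {}, ""  # redirect limit exceeded


def parse_ntriples(body):
    """Phase 3: extract triples, skipping lines that do not parse."""
    triples = []
    for line in body.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            triples.append(match.groups())
    return triples


def follow_your_nose(seed, max_hops=2):
    """Phase 4: breadth-first dereferencing of IRIs found in object position."""
    graph, seen, frontier = [], {seed}, deque([(seed, 0)])
    while frontier:
        iri, hop = frontier.popleft()
        status, _headers, body = dereference(iri)
        if status != 200:
            continue
        for s, p, o in parse_ntriples(body):
            graph.append((s, p, o))
            if o.startswith("<") and hop < max_hops:
                target = o[1:-1]
                if target not in seen:
                    seen.add(target)
                    frontier.append((target, hop + 1))
    return graph
```

Calling follow_your_nose("http://example.org/a") on the toy data collects the triples of both documents into one local sub-graph, since the owl:sameAs object of the first document is dereferenced in the next hop.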

3.2. Access Mechanisms and Pitfalls

Several mechanisms allow for flexibility in requesting and serving Linked Data and therefore lead to an increased variety and complexity for both server and clients. Furthermore, implementations and setups may not adhere fully to expectations and specifications, which can result in accessibility failures when trying to fetch data from the GGG. A third area, that is highly relevant for access, is retrieval performance and throughput, i.e. how much data can be requested in what time.

• Redirects and Links. Servers can use redirects (HTTP code 3xx) as well as link rel="alternate" ... in the HTTP header and the HTML <meta> tag to point to the data document. The correct server configuration is a common pitfall; long(er) redirect chains decrease the performance of servers and clients; loops prevent accessibility.
• Serialization Variety. There is no single mandatory format specified in Linked Data; rather, a multitude of RDF serialization formats (e.g. Turtle, N-Triples, RDF/XML) and HTML-embedded formats such as RDFa or JSON-LD exist that the server could use in its response; clients should support them to be able to retrieve all parts of the GGG.
• Content Negotiation. As not all servers/clients can deal with all formats, clients may send a prioritized list of formats in the Accept header. The server should select a supported format in favor of the client's request. However, there are no guarantees on how the server selects the response format, thus requiring the client to be flexible and to employ trial-and-error heuristics; especially manual proxy/rewrite rule configurations on the server side are a common source of errors.
• Parsing. The retrieved serialization can contain erroneous elements, syntactical errors, or a deviant serialization format that was incorrectly reported, requiring fault-tolerant parsing methods (e.g. skipping erroneous parts).
• Access Limits & Performance. Servers can be at capacity, resulting in timeout errors, or they can apply rate limits indicated with an HTTP 429 error; clients need to obey the robots.txt and the Retry-After headers (which might not be correctly computed by the server). Overall, a low speed of responding to requests (due to server capacity or configured limits) can be a large bottleneck when accessing the GGG.
• IRI Normalization. URIs/IRIs can be represented and encoded in different forms, but merging interlinked subgraphs of the GGG fetched from several servers requires string equality. IRI normalization can be necessary to make use of the links (see Section 3.3).

To create a practical Linked Data client⁴, we used a bottom-up approach and made changes to the configuration and code based on what we learned from Sections 4 and 5.
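To make the content negotiation mechanism from the list above more concrete, the following minimal sketch shows how a well-behaved server could pick a response format from a prioritized Accept header. It is an illustration under simplifying assumptions (only the q parameter is interpreted, no other media type parameters), not a description of how any particular server behaves; parse_accept, preference, and select_format are hypothetical names.

```python
def parse_accept(header):
    """Map each media range of an Accept header to its q-value (default 1.0)."""
    prefs = {}
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_range = fields[0]
        if not media_range:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs[media_range] = q
    return prefs


def preference(candidate, prefs):
    """q-value for a concrete media type: exact match, then type/*, then */*."""
    if candidate in prefs:
        return prefs[candidate]
    type_wildcard = candidate.split("/")[0] + "/*"
    if type_wildcard in prefs:
        return prefs[type_wildcard]
    return prefs.get("*/*", 0.0)


def select_format(accept_header, supported):
    """Return the supported media type the client prefers most, or None."""
    prefs = parse_accept(accept_header)
    best = max(supported, key=lambda c: preference(c, prefs), default=None)
    if best is None or preference(best, prefs) <= 0:
        return None
    return best


ACCEPT = (
    "application/n-triples;q=1, text/turtle;q=0.9, application/n-quads;q=0.9, "
    "application/ld+json;q=0.8, application/rdf+xml;q=0.7, application/trig;q=0.7, "
    "*/*;q=0.5"
)
```

With this header, a server that only supports RDF/XML and Turtle would answer with Turtle, while a server that only serves HTML is still matched through the */* range. Since real servers give no such guarantee, a client still has to inspect the returned Content-Type and payload.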

3.3. Handling IRI identity mismatch

Many RDF tools such as graph databases, RDF libraries, and reasoners often require exact matches for IRIs in order to correctly identify them as the same, whereas Linked Data does not as it has mechanisms like redirects (e.g. for HTTPS upgrades).

While some of the IRI representation variety can be normalized by syntactical transformation on the IRI string which is tackled in RFC standards, others need to be further canonicalized by dereferencing and matching the IRI in the delivered RDF response.

IRI Syntax Normalization is a set of rules that transform an IRI into a normal form that allows equality checks. The RFC⁵ includes case normalization, character normalization, percent-encoding normalization, path segment normalization, and scheme-based normalization. However, we argue that this is not sufficient for Linked Data in practice.
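A rough illustration of these syntactic rules is sketched below with Python's standard library. It is a simplified approximation of the RFC 3986 Section 6 rules and not the normalization used in our client: userinfo is dropped, IPv6 hosts and Unicode character normalization are not handled, and normalize_iri and its helpers are hypothetical names.

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
DEFAULT_PORTS = {"http": 80, "https": 443}


def _normalize_percent(text):
    """Decode percent-encoded unreserved characters, uppercase other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def _remove_dot_segments(path):
    """Resolve '.' and '..' segments of an absolute path."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            if len(output) > 1:  # never pop the leading empty segment
                output.pop()
            continue
        output.append(segment)
    if segments[-1] in (".", ".."):
        output.append("")  # keep the trailing slash
    return "/".join(output)


def normalize_iri(iri):
    """Case, percent-encoding, path segment, and scheme-based normalization."""
    parts = urlsplit(iri)
    scheme = parts.scheme.lower()
    netloc = (parts.hostname or "").lower()
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{netloc}:{parts.port}"
    path = _remove_dot_segments(_normalize_percent(parts.path)) or "/"
    query = _normalize_percent(parts.query)
    fragment = _normalize_percent(parts.fragment)
    return urlunsplit((scheme, netloc, path, query, fragment))
```

For example, normalize_iri("HTTP://Example.ORG:80/a/%7Euser/../b") yields "http://example.org/a/b". The HTTP and HTTPS variants of an identifier, however, remain two distinct strings after normalization, which is one reason why syntactic rules alone do not resolve identity mismatches.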

Canonicalization & Consistently Dereferenceable IRIs (CDIRI). An identity mismatch that needs additional canonicalization occurs when the normalized form of the IRI that was supposed to be dereferenced does not occur, or is not described, as an element of the Linked Data document dereferenced for it. Subsequently, we define an IRI as the consistently dereferenceable IRI (CDIRI) for a resource if its normal form matches the normalized IRI of the resource (most commonly used in subject position) in the data, or in simple words: what you request is what you get. In our implementation, we treat the CDIRI as the canonical IRI for the referenced resource, or for its parts if several CDIRIs are present, which is the case for fragment (#) IRIs. Determining the CDIRIs and using them as a replacement for all IRIs in third-party Linked Data that provide owl:sameAs or other relationships to a resource makes the resulting merged local graphs of the linking and the linked source connected (e.g. such that a SPARQL pattern that follows an owl:sameAs link and then matches ?p ?o2 on the link target would succeed). In the following, we show 2 examples of RDF resources, listing the actual entity IDs in the RDF and the IRIs used in third-party Linked Data (the CDIRI of each example is the entity ID named at its end):

1. http://dbpedia.org/resource/Björk, https://dbpedia.org/resource/Björk, and https://dbpedia.org/page/Björk (HTML view) redirect, resolve, or link to RDF using http://dbpedia.org/resource/Björk as entity ID.

2. http://d-nb.info/gnd/1140180746 and https://d-nb.info/gnd/1140180746 nowadays redirect/resolve to RDF using https://d-nb.info/gnd/1140180746 as entity ID.

Note that the DNB identifiers were switched to HTTPS several years ago. However, datasets linking to the legacy DNB HTTP identifiers are still widespread. With respect to DBpedia, the HTML page IRI is sometimes confused with the entity ID. Working with CDIRIs thus supports cleaning up the connection of sub-graphs provided by different Linked Data providers and increases accessibility.

We developed the Pinguin⁶ canonicalization algorithm for Linked Data clients. It collects all intermediate IRIs, normalizes them, creates a surjective mapping of IRIs to the CDIRI, and validates the CDIRI by resolving it again.
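The steps named above (collect, normalize, map, validate) could look roughly as follows. This is a hypothetical sketch and not the Pinguin implementation: it assumes that the CDIRI is searched among the IRIs of the redirect chain only, so cases such as HTML page IRIs would need further heuristics; fake_fetch returns hard-coded illustration data modelled on the DNB example instead of real responses, and normalize is a trivial placeholder for IRI syntax normalization.

```python
# Hard-coded illustration (not a real log): requested IRI ->
# (redirect chain including the requested IRI, subject IRIs found in the RDF).
FAKE_RESPONSES = {
    "http://d-nb.info/gnd/1140180746": (
        ["http://d-nb.info/gnd/1140180746", "https://d-nb.info/gnd/1140180746"],
        {"https://d-nb.info/gnd/1140180746"},
    ),
    "https://d-nb.info/gnd/1140180746": (
        ["https://d-nb.info/gnd/1140180746"],
        {"https://d-nb.info/gnd/1140180746"},
    ),
}


def fake_fetch(iri):
    """Hypothetical resolver; unknown IRIs yield no RDF subjects."""
    return FAKE_RESPONSES.get(iri, ([iri], set()))


def normalize(iri):
    """Placeholder for IRI syntax normalization (assumption: whitespace only)."""
    return iri.strip()


def canonicalize(seed, fetch=fake_fetch):
    """Return a mapping of all IRIs seen for `seed` to its CDIRI, or {}."""
    chain, subjects = fetch(seed)
    candidates = [normalize(iri) for iri in chain]
    described = {normalize(subject) for subject in subjects}
    # Assumption: prefer the last IRI of the chain that the document describes.
    cdiri = next((c for c in reversed(candidates) if c in described), None)
    if cdiri is None:
        return {}
    # Validation: requesting the CDIRI must again return data describing it.
    _chain, subjects_again = fetch(cdiri)
    if cdiri not in {normalize(subject) for subject in subjects_again}:
        return {}
    return {candidate: cdiri for candidate in candidates}
```

In this toy setting, the legacy HTTP identifier and the HTTPS identifier are both mapped to the HTTPS IRI, so links using either form end up on the same node after replacement.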

⁵ Normalization of URIs in RFC 3986 Section 6, and for IRIs in RFC 3987 Section 5
⁶ Named after the random resonance of Ping URI to the German word Pinguin

4. Evaluation

4.1. IRI Seed Selection

We picked two sets of IRIs that cover a huge spectrum of domains to study the long tail.

Source 1 (LAL) from LOD-a-lot: For the first source of IRI seeds for crawling, we used LOD-a-lot [5], a compact archive of RDF dumps. It contains more than 28.36 billion triples with 3.21 billion distinct subjects. According to a previous study [9], the data still contains noisy and erroneous data (e.g. IRIs containing prefixes although supposed to be absolute, usage of subject IRIs in property position etc.). As we are interested in IRIs that enable us to retrieve RDF via Linked Data, we filtered LOD-a-lot for all rdf:type statements (3,321,354,308 triples) and then collected all distinct IRIs (2,911,686,622) that occur in the subject position of those. This also removed all invalid and non-absolute IRIs, as confirmed via the Java 11 URL checker.
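The filtering step can be sketched as follows. This is a simplified illustration that operates on an in-memory iterable of (subject, predicate, object) string tuples instead of the HDT archive; the restriction to http and https schemes and the check via urlsplit are assumptions of this sketch that merely stand in for the URL checker mentioned above, and seed_iris is a hypothetical name.

```python
from urllib.parse import urlsplit

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_absolute_http_iri(iri):
    """Assumed validity check: absolute IRI with http(s) scheme and a host."""
    try:
        parts = urlsplit(iri)
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def seed_iris(triples):
    """Distinct subjects of rdf:type statements that pass the validity check."""
    return {
        subject
        for subject, predicate, _object in triples
        if predicate == RDF_TYPE and is_absolute_http_iri(subject)
    }
```

Subjects such as the prefixed name ex:a are dropped by the check, and a subject that is typed several times is counted only once.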

Source 2 (DEL) from DBpedia: As the second source, we used the DBpedia external links 2022-09 (DEL) dataset⁷, which contains all hyperlinks to external websites from the Wikitext of articles of 137 Wikipedia versions. The links point to over 33.5 million distinct IRIs (2 invalid IRIs were removed).

Since almost 3 billion URLs would pose a significant challenge (in terms of time, traffic, and storage requirements) for a crawling experiment, we decided to study the distribution of URLs per domain (FQDN, fully qualified domain name) in order to understand whether we can shrink it without limiting the number of domains. The rationale behind this is that our focus is on studying the accessibility of Linked Data for the domains instead of a full analysis of the (payload) data of each domain.

As can be seen in Figure 1 (left), the distribution of the URL counts per domain for the top-100k domains (the 100,000 domains having the most URLs) follows a power law distribution. This type of statistical distribution is characterized by the pattern that a small number of items occur frequently (called the "head"), while a large number of items occur rarely (denoted as the "long tail"). In our case, the items correspond to the domains that occur in the host part of the seed URLs.

⁷ https://databus.dbpedia.org/dbpedia/generic/external-links/2022.09.01

The curve of LAL is steeper compared to DEL in log-log scale, implying that the URL counts decrease more rapidly for LAL as the rank increases, indicating a greater inequality in the distribution. Moreover, LAL top-k domains have more URLs per domain than DEL until rank 2,802. Figure 1 (right) shows what portion of the overall number of URLs of each dataset is contained in the cumulative counts of the top-k ranks. While for LAL the top-15 domains accumulate over 90 % of all LAL URLs, for DEL they only contain approximately 21 %. In other words, the head of LAL is much shorter, with higher IRI counts, compared to DEL. In order to reduce the number of URLs, we decided to sample URLs of the head and to limit every domain to 2000 URLs. Subsequently, we used Waterman's Algorithm R for random sampling to pick 2000 URLs for each of the top-k domains that contain more than 2000 URLs, which splits the domains into a subset that is sampled (2k+) and a subset that is not (2k-) (see Table 1). For LAL, this resulted in sampling URLs for the top-1492 domains that cover 99.63 % of the URLs and reducing the number of 2k+ URLs from 2.901 billion to 2.948 million, whereas for DEL the top-1152 domains covering 45.99 % were reduced from 15.422 million to 2.304 million. As a result, 13.699 million (LAL) and 20.412 million (DEL) URLs were selected for the seeds. The resulting seeds are quite complementary since only 46,810 (0.14 %) of the URLs and 115,313 of the domains (2.08 %) overlap between the two sources.
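The per-domain cap can be illustrated with a streaming variant of Algorithm R (reservoir sampling) that keeps one reservoir per domain. This is a minimal sketch under the assumption that the domain is taken from the host part of each URL; it is not the sampling code used for the experiment, and cap_per_domain is a hypothetical name.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_domain(urls, limit=2000, seed=0):
    """Keep at most `limit` uniformly sampled URLs per domain (Algorithm R)."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # domain -> sampled URLs
    seen = defaultdict(int)         # domain -> number of URLs seen so far
    for url in urls:
        domain = urlsplit(url).hostname or ""
        index = seen[domain]        # 0-based position within the domain stream
        seen[domain] += 1
        if index < limit:
            # The first `limit` URLs of a domain fill its reservoir.
            reservoirs[domain].append(url)
        else:
            # Later URLs replace a random slot with probability limit/(index+1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < limit:
                reservoirs[domain][slot] = url
    return dict(reservoirs)
```

Domains with at most 2000 URLs pass through unchanged (the 2k- subset), while larger domains are reduced to exactly 2000 URLs (the 2k+ subset) in a single pass and without knowing the number of URLs per domain in advance.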

4.2. Crawling Results

From the samples described in Section 4.1, we removed the IRI fragment identifiers and deduplicated the resulting IRIs. We then used our custom crawler implementation that can deal with scheduling millions of domains for parallel crawling.

Crawling Parameters: Several parameters can customize the crawling process. We describe the most important parameters of our setup below and refer to our paper repository⁸ for transparency and reproducibility. A single request's timeout is set to 10 seconds. The default request delay is 100 ms. The crawler obeys Retry-After headers up to a delay of 10 s. The system follows a maximum of 10 redirects. IRIs of one domain are requested sequentially, and the total timeout for one domain is max time = 2 * iriCountForDomain + 13 seconds; otherwise, a DomainTooSlow exception occurs. The crawler is configured in this way because we consider this an appropriate setting performance-wise for an effective GGG, given that network and server performance have increased significantly since the inception of Linked Data in 2006. After encountering more than 50 exceptions of either Java IOException or request timeout exceptions (signaled by Java ConnectTimeoutException, HttpConnectTimeoutException, or HttpTimeoutException) in a row for one domain, the crawling of that domain is stopped. Subsequently, all remaining IRIs of that domain are discarded and marked with the TooManyFailuresInARowException. Moreover, we only store resource payloads that are smaller than 10 megabytes. Each request contains the following Accept request header (adapted from Apache Jena, but giving priority to line-based and triple formats for fault-tolerant parsing, i.e. pruning invalid lines):

    application/n-triples;q=1, text/turtle;q=0.9, application/n-quads;q=0.9,
    application/ld+json;q=0.8, application/rdf+xml;q=0.7, application/trig;q=0.7,
    */*;q=0.5 # covers text/html with embedded JSON-LD
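To make the interplay of these thresholds easier to follow, the sketch below encodes the stated values and two of the rules: the per-domain time budget and the stop after more than 50 failures in a row. It is an illustration only and not the crawler implementation; CrawlConfig, domain_time_budget_s, and crawl_domain are hypothetical names, the interpretation of 10 megabytes as 10,000,000 bytes is an assumption, and request outcomes are simulated as booleans instead of real HTTP requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlConfig:
    request_timeout_s: float = 10.0
    request_delay_s: float = 0.1
    max_retry_after_s: float = 10.0
    max_redirects: int = 10
    max_failures_in_a_row: int = 50
    max_resource_bytes: int = 10_000_000  # assumption: 10 MB = 10^7 bytes


def domain_time_budget_s(iri_count):
    """Total time budget for one domain: 2 * iriCountForDomain + 13 seconds."""
    return 2 * iri_count + 13


def crawl_domain(outcomes, config=CrawlConfig()):
    """Simulate one domain; `outcomes` holds True (ok) or False (IO/timeout).

    Returns the number of processed IRIs and the reason for stopping.
    """
    failures_in_a_row = 0
    processed = 0
    for ok in outcomes:
        processed += 1
        failures_in_a_row = 0 if ok else failures_in_a_row + 1
        if failures_in_a_row > config.max_failures_in_a_row:
            # Remaining IRIs of this domain would be discarded.
            return processed, "TooManyFailuresInARow"
    return processed, "completed"
```

For a sampled domain with 2000 IRIs, the budget amounts to 4013 seconds, i.e. roughly two seconds per IRI on average. A single successful request resets the failure counter, so only uninterrupted failure sequences stop a domain.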

Table 2 displays the analyzed log data of four seed partitions. The table is split into two parts. The first part shows the distribution of HTTP response status codes, while the second part displays the accessibility issues that prevented the completion of the HTTP request.

In terms of HTTP status codes, it is worth noting that status code 200 has an average recall rate of only 16%, with the highest value being 42.7% for DEL2k+ and the lowest for LAL2k- at a mere 8.4%. The second most frequently returned status code is 404 Not Found, with an overall percentage of 6.3%. It is also important to mention that 2.4% or 809 thousand IRIs end up causing too many redirects and therefore remain unresolved with a 3xx code. When it comes to overall accessibility, DEL was slightly more accessible than LAL, with 83.1% for the 2k+ seed and 28.6% for the 2k- seed, as opposed to LAL's 58.7% for the 2k+ and 21.2% for the 2k- seed.

Out of the 33.1 million IRIs that were requested, 67.7% failed due to accessibility issues. The crawler encountered two types of issues preventing the completion of the HTTP request: those related to the remote server and those related to our crawling requirements. The former includes DNS record inaccessibility and resource reachability, while the latter is caused by exceeding manually set thresholds such as payload size (MaxResourceSize). The table shows that over 95% of the total exceptions come from the top four accessibility issues. The most common issue is the UnreliableDomain exception (ranging from 20 to 54%), which can be triggered by DomainTooSlow and TooManyFailuresInARow exceptions. The second most common exception is the ConnectException (31.5%), which indicates the unreachability of a host due to DNS retrieval or a missing web server under the resolved IP address and port. The IOException (14.6%) indicates that the remote server unexpectedly closed an HTTP request, e.g., the response stream. Further, 12.9% of the issues were related to RequestTimeouts that lasted for more than 10 seconds. Two exceptions related to IRI resolution are InvalidRedirectIRI and PrivateIPSkipped. InvalidRedirectIRI denotes that the initial well-formed IRI redirects to an invalid IRI, while PrivateIPSkipped indicates skipped requests due to the IRI resolving to a private IP range. Other Java errors, which account for 1.4%, are primarily caused by issues with the Java HTTP library, such as HTTP message syntax or SSL problems.

Table 3 analyzes the body of successfully fetched resources (status code 200). 15k responses with code 200 but an empty body could not be analyzed. 26% of LAL2k+ and 42% of DEL2k+ IRIs had payload in the body. A significantly lower portion, but with a similar ratio between LAL and DEL, was measured for the tails with 8.4% and 16.2%, respectively. In a subsequent step, we used our failure-tolerant parser to extract Linked Data from the payload. It was configured to parse all plain RDF formats (including plain JSON-LD, but excluding RDFa) and JSON-LD embedded in HTML. For LAL2k+ and LAL2k-, 37 vs. 44% of the content could be parsed, making up only a total of around 10 vs. 4% of the IRIs leading to Linked Data. Since DEL is likely to contain more resources that are not described using Linked Data, it is not surprising that these numbers are also low with 8 vs. 3.7%. The final measure of the accessibility of a Linked Data IRI is given in row 3. The existence of the CDIRI in the parsed document is necessary to retrieve information about it and to further navigate in the GGG (exploiting incoming and outgoing links). For LAL this is the case for 8 vs. 1.1% of the IRIs, in contrast to 5.7 vs. 3.2% for DEL.

5. Conclusion

To shed light on the Linked Data accessibility healthiness of the long tail, we sent requests to IRIs from 5,661,415 distinct domains. We compared accessibility statistics for IRIs from the head and tail of LOD-a-lot (containing IRIs from dumped RDF resources) to Wikipedia external links (containing manually curated IRIs, mostly intended for browsers, not necessarily with RDF data) extracted with DBpedia. We discovered that the head and tail of these Wikipedia links were also significantly impacted by link rot or very slow speeds, with only 43% and 16%, respectively, returning HTTP 200 status codes. For LAL, we actually expected much lower and more divergent numbers than the observed 26% and 8%, since on the one hand it contains a portion of dumped resources that were never accessible via Linked Data, and on the other hand it was published before 2017, several years earlier than the Wikipedia links (based on the assumption that age increases link rot).

To address the identified accessibility challenges, an infrastructure for improving resource availability and refined Linked Data consumption strategies seem to be potential steps toward fostering a more usable and accessible Linked Data web.

Acknowledgements: This work was partially supported by grants from the German Federal Ministry for Economic Affairs and Climate Action (BMWK) to the projects KISS (01MK22001A) and OpenFlaaS (100594042), by the European Research Council for the project ScienceGRAPH (819536), as well as by the Federal Ministry of Education and Research of Germany and by the Sächsische Staatsministerium für Wissenschaft Kultur und Tourismus in the program Center of Excellence for AI-research "Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig", project identification number: ScaDS.AI

[1] Herrera, Hogan, T. Käfer, BTC-2019: the 2019 billion triple challenge dataset, in: ISWC 2019, volume 11779 of LNCS, Springer, 2019, pp. 163-180. doi:10.1007/978-3-030-30796-7_11.

[2] Käfer, Abdelrahman, Umbrich, P. O'Byrne, A. Hogan, Observing linked data dynamics, in: The Semantic Web: Semantics and Big Data, ESWC 2013, volume 7882 of LNCS, Springer, 2013, pp. 213-227. doi:10.1007/978-3-642-38288-8_15.

[3] Beek, et al., Lod laundromat: A uniform way of publishing other people's dirty data, in: ISWC, Springer, 2014, pp. 213-228. doi:10.1007/978-3-319-11964-9_14.

[4] J. D. Fernández, M. A. Martínez-Prieto, Gutiérrez, Polleres, Arias, Binary rdf representation for publication and exchange (hdt), in: Journal of Web Semantics, volume 19, Elsevier, 2013, pp. 22-41. doi:10.1016/j.websem.2013.01.002.

[5] Beek, J. D. Fernández, Verborgh, Lod-a-lot: A single-file enabler for data science, in: SEMANTICS 2017, ACM, 2017, pp. 181-184. doi:10.1145/3132218.3132241.

[6] Meusel, Petrovski, Bizer, The webdatacommons microdata, rdfa and microformat dataset series, in: ISWC 2014, volume 8796 of LNCS, Springer, 2014, pp. 277-292. doi:10.1007/978-3-319-11964-9_18.

[7] Brinkmann, Primpeli, Bizer, The web data commons schema.org data set series, in: WWW 2023 Companion, ACM, 2023, pp. 136-139. doi:10.1145/3543873.3587331.

[8] Frey, Streitmatter, Götz, Hellmann, N. Arndt, DBpedia archivo: A web-scale interface for ontology archiving under consumer-oriented aspects, in: Semantic Systems, volume 12378 of LNCS, Springer, 2020, pp. 19-35. doi:10.1007/978-3-030-59833-4_2.

[9] Frey, Streitmatter, Arndt, Hellmann, Reproducibility crisis in the LOD cloud? studying the impact of ontology accessibility and archiving as a counter measure, in: ISWC 2022, volume 13489 of LNCS, Springer, 2022, pp. 91-107. doi:10.1007/978-3-031-19433-7_6.