<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>End-to-End Assessment of Product Review Helpfulness Using Subjective and Objective Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuta Nakajima</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Ptaszynski</string-name>
          <email>michal@mail.kitami-it.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fumito Masui</string-name>
          <email>f-masui@mail.kitami-it.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kitami Institute of Technology</institution>
          ,
          <addr-line>165 koencho, kitami, Hokkaido, Japan 090-8507</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <fpage>82</fpage>
      <lpage>93</lpage>
      <abstract>
        <p>With the spread of the internet, the volume of reviews on services, shopping, and word-of-mouth websites has grown annually, becoming a significant source of information for decision-making. However, not all reviews are equally useful to potential buyers. A certain number of reviews are low-value or spam. Therefore, presenting only highly useful reviews to users would support more effective and efficient decision-making. In this study, we address this issue by analyzing the features of useful reviews and proposing a method to support user decision-making. First, after surveying existing research on review helpfulness classification, we propose a scoring method that focuses on the amount of information in a document (text volume) as a key feature of useful reviews. We also examined the concept of review helpfulness in detail, and created a dataset containing reviews annotated with subjectively perceived helpfulness and related objective features. In a binary classification experiment using Transformer-based models to categorize review helpfulness, we obtained strong results, with an F1 score exceeding 80%. Furthermore, to show users which parts of a review are useful, we built a multi-label classification model to automatically extract the features of helpfulness. This model demonstrated its ability to effectively capture the characteristics of useful reviews, achieving an F1 score of over 80% for four of the core helpfulness-related features defined in this research.</p>
      </abstract>
      <kwd-group>
        <kwd>Online Product Reviews</kwd>
        <kwd>Review Helpfulness Prediction</kwd>
        <kwd>Multi-label Classification</kwd>
        <kwd>Transformers</kwd>
        <kwd>Lexical Density</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, with the spread of the internet, user reviews are posted across all types of media, such as
online shopping and accommodation booking sites, broadly influencing the sharing of word-of-mouth
information and user decision-making. However, users must browse through a massive number of
reviews for products they are considering, which requires a great deal of their time and effort. When
the number of reviews is too large to read, users often proceed with their purchase consideration
without reading most of the reviews. However, unread reviews may contain valuable information,
while those read first frequently contain redundant or low-quality information that is not helpful for
decision-making. Moreover, there is an increasing number of spam reviews (reviews unrelated to the
product, reviews that serve as advertisements for other products, etc.) and fake reviews (reviews written
by individuals who have not purchased or used the product, either of their own volition or by request,
to excessively praise or criticize the product). Therefore, it is important for users considering a product
purchase to read only high-quality, useful reviews, creating a need for the development of methods to
determine review helpfulness.</p>
      <p>In this study, we investigate the characteristics of highly useful reviews and establish a definition
to determine not only whether a review is useful, but also which aspects of it are considered useful.
Furthermore, similar to previous research, we aim to automatically select highly useful reviews from
a large collection, thereby providing users with a metric for decision-making during purchases and
supporting them without requiring them to read all the reviews.</p>
      <p>The remainder of this paper is structured as follows. We begin by reviewing related work in Section
2. Section 3 details our complete methodology, including our framework for defining helpfulness, the
novel ‘lds‘ score for data sampling, and the annotation process used to create our dataset. In Section 4,
we present our main experimental results, covering both the binary helpfulness classification and our
primary multi-label feature classification task. Following this, we discuss the broader implications of
our findings in Section 5. Finally, we summarize our contributions and outline directions for future
research in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Research on review helpfulness prediction has a long history, beginning with models that used
engineered features from review content and metadata and shifting, in more recent studies, towards deep
learning models that learn features directly from text.</p>
      <p>
        Kim et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] pioneered this area by using a Support Vector Regression (SVR) model with features
like reviewer history, review length, and unigrams. Our work differs by employing a multi-label
classification framework for explainability rather than regression, and by using end-to-end Transformer
models to learn semantic representations, which reduces the need for manual feature engineering.
      </p>
      <p>
        Mudambi and Schuff [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] confirmed that review length, together with the explicitly expressed rating, are
strong predictors of helpfulness, particularly when considering the product type (search vs. experience).
Building on their findings, our research focuses more on the semantic content of the review. We also
introduce the ‘lds‘ score, a metric that considers lexical diversity while penalizing short texts, as a more
sophisticated measure of informativeness than length alone.
      </p>
      <p>
        Zhang and Tran [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed an efficient linear model using review length to address the "cold-start"
problem for new reviews. We advance this by using non-linear Transformer architectures that better
interpret complex semantics. Furthermore, our multi-label framework provides an explainable output
identifying why a review is useful, an improvement over a single, uninterpretable ranking score.
      </p>
      <p>
        Pan and Zhang [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] demonstrated the importance of reviewer metadata, such as post history, in
predicting helpfulness. In contrast, our work deliberately focuses only on the review’s content. This
design choice makes our model more universally applicable, especially on platforms where reviewer
metadata is unavailable, and relies on deep learning to extract all necessary signals from the text itself.
      </p>
      <p>
        Sasaki et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] established eight criteria for helpfulness in Japanese reviews and applied an SVM
to morphological features. Our work adopts their definitional approach but validates it on a larger,
crowd-sourced dataset and uses more advanced deep learning models for classification.
      </p>
      <p>
        The shift to deep learning in this area of research is represented by Qu et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], who used a
Convolutional Neural Network (CNN) to learn features from word embeddings. Our study represents
the next methodological step by employing Transformer architectures. The self-attention mechanism
in Transformers is better suited to modeling the long-range contextual dependencies within a review
compared to the local feature detection of CNNs.
      </p>
      <p>
        Zhang and Lin [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] addressed multilingual helpfulness prediction for English and German reviews
using a combination of language-dependent and independent features. While their work focused on
multilingual breadth, our research focuses on monolingual depth. We propose a
detailed set of six helpfulness criteria for Japanese and an explainable multi-label detection model.
      </p>
      <p>
        Sun et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] studied the concept of "informativeness," highlighting its complexity and dependence
on product type. Our work improves on their approach by proposing the ‘lds‘ score as a concrete metric
for informativeness and using it as a strategic tool to sample high-quality data for further annotation.
      </p>
      <p>
        Saumya et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] also applied a CNN to predict a continuous helpfulness score. Our work advances
this by using a more powerful Transformer architecture and, more importantly, by formulating the
problem as an explainable multi-label classification task rather than a regression task that produces a
single, unexplainable score.
      </p>
      <p>
        Malik [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] found that review-specific content features were the strongest predictors of helpfulness,
more so than reviewer or product-type features. Our work validates this finding by focusing exclusively
on content. We demonstrate that state-of-the-art language models can extract a rich set of predictive
signals from the review text alone, making our system independent of external metadata.
      </p>
      <p>
        Soda et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] aimed to show users not just whether a customer review is useful or not, but in what
way. They proposed seven perspectives for evaluating Japanese reviews’ helpfulness and automated
three of them. Our multi-label classification approach shares this goal of explainability and demonstrates
high performance across four distinct criteria.
      </p>
      <p>
        Finally, recent work by Mayda and Uğurlu [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] demonstrated the effectiveness of Transformers
for Turkish reviews. Our study confirms their findings on Japanese data while contributing a novel
multi-label framework for explainability, which is often missing in purely performance-focused studies.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Summary and Contributions</title>
        <p>Prior research on review helpfulness evolved from traditional machine learning with engineered features
to end-to-end deep learning models, typically for regression or binary classification tasks. Our work
builds on this foundation by using Transformers not just for prediction, but to create an explainable
system. We move beyond a single predictive score to identify the specific, human-understandable
characteristics that make a review valuable.</p>
        <p>The main contributions of this research are:
1. A novel information score (‘lds‘) based on lexical density and text length, designed for efficiently
sampling information-rich reviews from large datasets.
2. A refined set of six objective criteria of review helpfulness for Japanese.
3. An explainable multi-label framework based on those six criteria that identifies why a review
is useful.
4. A new, publicly available dataset of over a thousand Japanese reviews, fully annotated
with both binary usefulness and multi-label feature labels for explainable review analysis.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We designed our approach to create an explainable, data-driven system for assessing review helpfulness.
This involved three main stages: first, establishing a clear and comprehensive framework for what
constitutes a useful review; second, developing a novel sampling method to efficiently curate
high-quality data; and third, creating a high-quality, manually annotated dataset to train and evaluate our
models. This section details each of these stages.</p>
      <sec id="sec-3-1">
        <title>3.1. A Framework for Review Helpfulness</title>
        <p>
          Building on prior research into the characteristics of helpful reviews [
          <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
          ], we define a useful review
as one that satisfies a set of specific, identifiable criteria. These criteria were developed by synthesizing
the findings of previous studies. Based on a preliminary analysis of review usefulness [
          <xref ref-type="bibr" rid="ref13">13, 14</xref>
          ], we
established the following six conditions as characteristic of a helpful review. However,
instead of assuming that all of the conditions must be met, we analyzed which exact combinations of the
following features make a review feel helpful. See Section 3.3.1 for details of this analysis.
        </p>
        <p>Specifically, for a review to be considered helpful it must contain the following features in various
combinations.</p>
        <p>A1 A basis is provided for the evaluative expression. The review explains why a product, or its
aspect, was rated positively or negatively, going beyond simple evaluative statements.
A2 There are multiple mentions of the review target. The review remains focused on the
product itself, and does not digress about other products.</p>
        <p>A3 The star rating and the evaluation in the review body are consistent. The sentiment of the
text aligns with the given star rating, ensuring the review is coherent.</p>
        <p>A4 The main part of the review has a sufficient amount of information. The review offers
enough detail for a reader to make an informed decision.</p>
        <p>A5 The review title contains the polarity and its target. The title functions as an effective
summary of the review’s core message.</p>
        <p>A6 It is possible to infer whether the reviewer actually used the product. The text contains
descriptions of first-hand experience with the product.</p>
        <p>
          Egawa’s work on text structure suggests that the title often serves as a reliable summary of a review’s
main point [15]. We hypothesize that a well-written and informative title is a strong indicator of
a well-written, informative review body. On the other hand, we deliberately excluded criteria such
as "comparison with other products" and "readability", which were used in some prior works [
          <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
          ].
This decision was based on findings that "comparison" sentences are often too rare to be a reliable
feature, and "readability" is too subjective and lacks a concrete, consistently applicable definition in the
existing literature. Instead, to also cover the lexical richness of the review, we proposed an ‘lds‘ score
for pre-filtering low-quality reviews, as described in the following sections, which we used in the
creation of the dataset used in this research.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Curation and LDS Sampling</title>
        <p>To create a high-quality dataset for annotation without manually reading through millions of reviews,
we first developed a method to sample information-rich content from large, unlabeled corpora. This
was necessary because a purely random sample would likely be dominated by short, uninformative,
and ultimately not useful reviews. For this we used data from the Amazon Review Dataset1 and the
Rakuten Ichiba Review Dataset [16].</p>
        <p>We based our sampling method on a novel information score that adapts the concept of lexical density with an additional penalty for overly short texts.</p>
        <p>Lexical density [17, 18, 19] measures the vocabulary richness of a text and is defined as the ratio of the number of unique words (u) to the total word count (n), as shown in Equation 1. A higher score indicates a more diverse vocabulary.</p>
        <p>ld = u / n (1)</p>
        <p>However, lexical density alone is biased towards shorter texts, which naturally have fewer repeated words. To counteract this, we introduce a penalty for short sentences. We first normalize the word count (n) using a logarithmic scale to reduce the impact of extreme outliers (Equation 2).</p>
        <p>n′ = log2(n) (2)</p>
        <p>We then create a short sentence coefficient (ssc) by dividing this value by the maximum log-transformed word count in the corpus (Equation 3). This coefficient approaches 1 for the longest reviews and is smaller for shorter ones.</p>
        <p>ssc = log2(n) / max(log2(n)) (3)</p>
        <p>Finally, we define our proposed review information score (lds) by multiplying the lexical density by this short sentence coefficient (Equation 4). This score balances vocabulary richness with review length, favoring texts that are both lexically diverse and sufficiently long.</p>
        <p>lds = ld × ssc (4)</p>
        <p>Next, we applied the newly defined information score to create an initial dataset of review candidates. We used one month of data from the Rakuten Dataset (January 2019), from which we randomly extracted 10,000 review samples with 2 or more thumbs-up, indicating that two or more users found these reviews helpful, and another 10,000 samples with 0 thumbs-up (no users found the review helpful), and used this data for validation of the proposed information score and for the extraction of training data for the machine learning experiments.</p>
        <p>1 http://web.archive.org/web/20201127140619/https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_JP_v1_00.tsv.gz, accessed on 2025-10-17.</p>
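        <p>For illustration, the scoring and sampling pipeline above can be sketched in Python. This is a minimal sketch under our own naming, assuming tokenization (e.g. by a Japanese morphological analyzer) is performed beforehand; it is not the exact implementation used in this study.</p>

```python
import math

def lds_score(tokens, max_log_count):
    """Review information score: lexical density (Eq. 1) scaled by
    the short sentence coefficient (Eq. 3), as in Eq. 4."""
    n = len(tokens)                      # total word count
    if n < 2:
        return 0.0                       # degenerate texts get no score
    ld = len(set(tokens)) / n            # lexical density (Eq. 1)
    ssc = math.log2(n) / max_log_count   # short sentence coefficient (Eq. 3)
    return ld * ssc                      # lds (Eq. 4)

def top_k_reviews(tokenized_reviews, k):
    """Rank reviews by lds and keep the k most information-rich ones."""
    max_log = max(math.log2(len(t)) for t in tokenized_reviews if len(t) > 1)
    scored = sorted(((lds_score(t, max_log), t) for t in tokenized_reviews),
                    key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

        <p>Applied to the 20,000-sample pool, a call such as top_k_reviews(reviews, 1000) would yield a candidate set for annotation, favoring texts that are both lexically diverse and sufficiently long.</p>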
        <p>The reason for setting the threshold to "2 or more thumbs-up" was that there was an overwhelming
number of reviews with "1 or more thumbs-up" (38,734), which would account for most of the reviews
with thumbs-up (54,861), making the random extraction biased. Reviews with "2 or more thumbs-up"
accounted for 16,127 reviews with thumbs-up counts from 2 to 159, which assured a balanced and varied
source for extraction.</p>
        <p>Finally, each review sample from the initial dataset was assigned the ’lds’ information score. For our
main experiment, we applied this score to the verification dataset and selected the top 1,000 unique
reviews for manual annotation. This practical application of an advanced informativeness metric saves
significant time and resources in the annotation process by pre-filtering for content that is more likely
to be useful.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Annotation Process and Dataset Creation</title>
        <p>Using the ‘lds‘ score, we selected the top 1,000 reviews from our verification dataset for manual
annotation. The goal was to create a gold-standard dataset for training and evaluating automatic
classification models for subjective helpfulness and its representative objective categories.</p>
        <p>The annotation was conducted by 20 annotators (3 males, 17 females, all in their 20s to 40s) recruited
through the CrowdWorks crowdsourcing platform2. Each annotator was provided with a detailed set of
guidelines, instructing them to perform two tasks for each review sample:
1. Assign a binary label ("helpful" or "not helpful") based on their subjective judgment of whether
the review would be helpful in a purchasing decision.
2. Assign a multi-label annotation by selecting all applicable criteria from our six-point framework
(described in Section 3.1).</p>
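        <p>The resulting two-part annotation can be pictured as a simple record; the field and class names below are illustrative, not taken from the released dataset.</p>

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedReview:
    """One annotated sample; names are hypothetical, for illustration only."""
    text: str
    helpful: bool                                # Task 1: subjective binary label
    criteria: set = field(default_factory=set)   # Task 2: subset of {1..6} (A1-A6)

    def label_vector(self):
        """Six binary indicator labels for the multi-label classifier."""
        return [1 if i in self.criteria else 0 for i in range(1, 7)]
```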
        <p>To assure the high quality of the dataset, each review sample was annotated by two different people. The average inter-annotator agreement (kappa value) across all pairs of annotators was 0.571, with a standard deviation of 0.069, indicating moderate and stable agreement. Considering that the agreement was calculated over the whole annotation task, namely both the subjective helpfulness label and the set of objective features, making the task highly demanding, this level of agreement can be considered sufficiently high and suggests consistent annotations from all participants. Finally, disagreements were resolved through discussion with a well-trained super-annotator.</p>
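        <p>As a reference for the kappa figure above, pairwise agreement in the style of Cohen's kappa can be computed as follows; this is a generic sketch, not the evaluation script used in this study.</p>

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    who labeled the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # observed agreement: fraction of items where both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement by chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```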
        <sec id="sec-3-3-1">
          <title>3.3.1. Analysis of the Annotated Dataset</title>
          <p>The final annotated dataset consists of 574 "useful" and 426 "not useful" reviews. An analysis of the
co-occurrence of our defined features reveals important patterns. Table 2 shows the top five most
frequent combinations of features for useful and not useful reviews, respectively.</p>
          <p>These tables reveal a clear and compelling pattern. The combination ‘1+2+3+4+6‘, which includes
providing a basis for evaluation, mentioning the target, being consistent, having sufficient information,
and describing actual use, was found in 263 reviews, and every single one of them was labeled "useful."
This combination represents a ’gold standard’ for a high-quality review. In stark contrast, reviews that
matched none of our features or only matched the "consistency" feature (Label 3) were overwhelmingly
labeled "not useful." This analysis validates that our framework successfully captures the core elements
that align with human perceptions of helpfulness and provides a strong empirical basis for our multi-label
classification task.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Final Dataset Composition and Limitations</title>
          <p>The final dataset is composed of 1,200 annotated reviews from the Rakuten dataset (1,000 for training,
200 for evaluation), along with a separate 192-review set from Amazon used for cross-domain testing.
Each entry includes the review text and two sets of labels: a binary "useful/not useful" label and six
binary labels for our defined features. The final composition of the dataset is presented in Figure 3.</p>
          <p>A key limitation, however, is the sampling bias introduced by our ‘lds‘ scoring method. The dataset
is intentionally enriched with reviews that are longer and more lexically diverse. Although this
pre-filtering phase can function as a component of the review helpfulness estimation method as a whole, in
practice models trained on this data may not generalize perfectly to a random, unfiltered sample of
all reviews. This trade-off between sample quality and representativeness should be considered when
interpreting the model’s performance.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>To evaluate our framework, we conducted two main experiments. In the first experiment we tested the
overall concept of helpfulness by training a binary classifier to distinguish "helpful" from "not helpful"
reviews. In the second, more fine-grained experiment we evaluated our primary contribution, namely,
an explainable multi-label model designed to identify the specific characteristics of a useful review
based on our six-point framework.</p>
      <sec id="sec-4-1">
        <title>4.1. Preliminary Investigation</title>
        <p>First, we prepared review sentences and had two annotators verify if they met our definitions, using a
small subset of 192 reviews from the Amazon review dataset3. The annotators also annotated whether
the reviews were helpful or not based on these definitions.</p>
        <p>The result was 96 helpful and 96 not helpful reviews. A1 to A6 in Table 4 correspond to the definition
numbers in our study.</p>
        <p>From the results in Table 4, for definition "A6. It is possible to infer whether the reviewer actually
used the product," 100% of reviews judged as helpful met this criterion, but many not helpful reviews
also met this criterion. For "A1. A basis is provided for the evaluative expression," "A2. There are multiple
mentions of the review target," and "A3. The star rating and the evaluation in the review body are
consistent," a high percentage of useful reviews met these criteria, while only a few not useful reviews
did. The item "A5. The review title contains the polarity and its target" was met by more than half of
the helpful reviews, confirming a certain degree of effectiveness. For "A4. The review sentence has a
sufficient amount of information," the number of matches was generally low in this annotation. For this
preliminary study we set the threshold to 6-7 sentences or over 200 characters, which could influence
this result. However, the difference between helpful and not helpful reviews was the strongest for
this item, suggesting that the concept of pre-filtering, which we proposed in Section 3.2, while not
solving the review helpfulness problem alone due to its high miss rate, could function as a
powerful pre-filtering tool when a large number of reviews is available.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Setup</title>
        <p>Both experiments were built on state-of-the-art Transformer architectures. We selected a range of
publicly available Japanese language models from the HuggingFace platform4, as listed below,
to ensure a comprehensive evaluation.</p>
        <p>Model 1 izumi-lab/electra-small-japanese-discriminator
Model 2 ku-nlp/roberta-base-japanese-char-wwm
Model 3 hiroshi-matsuda-rit/bert-base-japanese-basic-char-v2
Model 4 tohoku-nlp/bert-base-japanese-char-v2</p>
        <p>All models were fine-tuned using a consistent set of hyperparameters, with learning rates of 1e-4, 1e-5,
and 2e-5. The primary evaluation metric was the F1-score, supplemented by accuracy, precision, and
recall. For the multi-label task, we report these metrics for each individual label.</p>
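        <p>For concreteness, the per-label metrics reported for the multi-label task can be computed as below; this is a generic sketch of per-label precision, recall, and F1, not our exact evaluation code.</p>

```python
def per_label_metrics(y_true, y_pred, num_labels=6):
    """Precision, recall, and F1 computed separately for each label.
    y_true / y_pred are lists of 0/1 vectors of length num_labels."""
    results = []
    for j in range(num_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        results.append({"precision": prec, "recall": rec, "f1": f1})
    return results
```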
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experiment 1: Binary Classification of Subjectively Perceived Helpfulness</title>
        <p>The first experiment was designed to assess whether a model could learn a generalizable concept
of review helpfulness. We framed this as a binary classification task in a challenging cross-domain
setting. The model was fine-tuned on the 1,000 reviews from our annotated Rakuten training dataset
and evaluated on the 192-review Amazon test dataset. This setup tests the model’s ability to transfer its
learned knowledge from one e-commerce platform to another.
4.3.1. Results</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Experiment 2: Multi-label Classification of Objective Helpfulness Features</title>
        <p>The second experiment was central to our goal of creating an explainable system for extracting helpful
reviews. We framed this as an in-domain, multi-label classification task to train a model that can
automatically detect which of our six defined helpfulness criteria are present in a given review. For this
task, we used our full annotated Rakuten dataset of 1,200 reviews, split into a 1,000-review training set
and a 200-review test set. The input to the model was a single text sequence created by concatenating
the product name, review rating, title, and body.
4.4.1. Results</p>
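        <p>The single-sequence input described above can be built with a simple helper; the separator string below is our assumption for illustration, not necessarily the delimiter used in the original setup.</p>

```python
def build_model_input(product_name, rating, title, body, sep=" [SEP] "):
    """Concatenate the four review fields, in the order used in our
    experiments, into one text sequence for the Transformer."""
    return sep.join([product_name, str(rating), title, body])
```

        <p>Flattening the fields this way is simple, but it also means the model receives no explicit signal about which span is the title, which is relevant to the Label 5 results discussed in Section 5.</p>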
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Case Study: Model Application at Scale</title>
        <p>To demonstrate the practical application of our models, we conducted an exploratory case study on a
large, unlabeled dataset of Rakuten reviews from January 2017. We first applied our binary helpfulness
classifier and then used our multi-label model to analyze the features of the reviews within each class.
This is not a formal validation of accuracy, but a qualitative analysis to see if the patterns learned from
our small annotated dataset hold true at a much larger scale.</p>
        <p>Our binary model classified 213,142 reviews as "helpful" and 735,393 as "not helpful." We then
analyzed the feature combinations predicted by the multi-label model for each class. The results, shown
in Table 7, reveal patterns that are remarkably consistent with our findings from the manual annotation
(Table 2). The combination ‘1+2+3+4+6‘ is overwhelmingly associated with "helpful" reviews, while
reviews with "no matching items" or only "Label 3" are strongly associated with "not helpful" reviews.
This qualitative consistency suggests that our models have learned meaningful and generalizable
patterns of review quality that align with human judgments, even when applied to a massive, unlabeled
dataset.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Interpretation of Results and Comparison with Prior Work</title>
        <p>
          Our Transformer models achieved an F1-score of over 0.83 in a cross-domain evaluation, confirming the
effectiveness of deep learning for this task, as shown in recent work [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The successful generalization
from the Rakuten to the Amazon dataset suggests the model learned robust, platform-independent
linguistic features of review usefulness.
        </p>
        <p>
          The primary contribution of our work is its explainable multi-label framework. High F1-scores
(over 0.9) for criteria such as "provides a basis for evaluation" (Label 1) and "content indicates actual
use" (Label 6) demonstrate that core components of a quality review are linguistically detectable. This
provides a data-driven method for achieving the explainability goals of earlier work [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The lower
performance for "sufficient amount of information" (Label 4) likely stems from the inherent subjectivity
of this criterion and data imbalance, while the failure on "title contains evaluation" (Label 5) was a
direct result of input representation, highlighting an area for future improvement.
        </p>
        <p>
          A key finding is that information quality, not just quantity, is pivotal. The presence of "sufficient
information" (Label 4) was the crucial differentiator for usefulness, offering a more nuanced insight
than the simple "review length" heuristic used in previous studies [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ]. Our ‘lds‘ score provides a
better proxy for this qualitative informativeness than word count alone, a concept explored also in prior
work [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Furthermore, by focusing only on review content, our approach offers broader applicability
than methods that depend on reviewer metadata [
          <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
          ].
        </p>
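<p>The exact formula of the ‘lds‘ score is defined earlier in the paper and is not reproduced here. As a toy illustration of why a density-style score can separate informative reviews from padded ones of similar length, the sketch below combines Ure's lexical density (content-word ratio) with type-token ratio; the stopword list and the product combination are assumptions made purely for illustration.</p>

```python
def lds_like_score(tokens, function_words):
    """Toy information-density score: lexical density (content-word
    ratio, after Ure) multiplied by type-token ratio. This is NOT the
    paper's lds formula, only an illustration of the general idea."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in function_words]
    lexical_density = len(content) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return lexical_density * type_token_ratio

# Illustrative English stopword list; a real system would derive
# content words from a POS tagger (for Japanese, a morphological
# analyzer such as MeCab).
FUNCTION_WORDS = {"the", "a", "is", "it", "and", "very"}

dense = "battery lasts two days camera struggles indoors".split()
padded = "it is very very good it is very very good".split()
# The dense review outscores the padded one despite similar length.
assert lds_like_score(dense, FUNCTION_WORDS) > lds_like_score(padded, FUNCTION_WORDS)
```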
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Methodological Considerations and Limitations</title>
        <p>Our study has several limitations that provide clear avenues for future research. Firstly, our dataset
was curated using the ‘lds‘ score, which introduces a sampling bias towards information-dense reviews.
Consequently, the model’s performance may not generalize to unfiltered data rich in short, simple texts,
although the ‘lds‘ score itself could function as an effective pre-filter in a practical system.</p>
        <p>Secondly, the binary classification was a cross-domain evaluation (training on Rakuten, testing on
Amazon). While the strong results suggest generalization, this setup makes it difficult to separate model
performance from the efects of domain shift. An in-domain evaluation is needed to establish a clearer
performance baseline.</p>
        <p>
          Finally, our modeling approach has technical limitations. The model’s failure to detect title features
(Label 5) was caused by concatenating all text fields. In the future we plan to use structured inputs
to solve this issue. Additionally, our content-only focus, while making the model more universally
applicable, ignores reviewer metadata, which other studies have shown to be a strong predictor of
helpfulness [
          <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
          ]. Incorporating such metadata could further improve performance.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Theoretical and Ethical Implications</title>
        <p>
          From a theoretical perspective, our work frames "helpfulness" as a multi-faceted construct rather than
a monolithic score, providing a data-driven validation for multi-perspective frameworks like that of
Soda et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Furthermore, our findings suggest that information density and lexical diversity, as
captured by our ‘lds‘ score, are more precise indicators of a review’s quality than the simple review
length heuristic used in many previous studies [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ], shifting the focus from quantitative to qualitative
content metrics.
        </p>
        <p>Ethically, while the intended application is to empower consumers by reducing information overload,
deploying such a system always carries significant risks. These include bias amplification, where
the model could systematically favor certain writing styles and marginalize others; the potential for
malicious actors to game the system by crafting fake reviews optimized to our criteria; and algorithmic
gatekeeping, where the system might suppress unconventionally written but helpful reviews. Therefore,
transparent communication of the criteria for "helpfulness" is essential to ensure such systems promote
fair discourse rather than implicitly censoring it.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>This study presented and validated an end-to-end framework for both predicting the subjective
helpfulness of online product reviews and providing an objective, explainable basis for that prediction.
We introduced a multi-label classification approach based on a refined set of six usefulness criteria. To
facilitate this, we developed a novel ‘lds‘ score for sampling information-rich data and created a new,
manually annotated dataset of over a thousand Japanese reviews.</p>
      <p>Our experiments demonstrate the effectiveness of this framework. A Transformer-based binary
classifier achieved a high F1-score of 0.838 in a challenging cross-domain evaluation, indicating that
it learned generalizable features of usefulness. More importantly, our multi-label model successfully
identified key criteria with F1-scores exceeding 0.9, confirming that these aspects are linguistically
distinct and can be reliably detected. A key finding from our analysis is that while many reviews
contain basic structural elements, the presence of "sufficient information" was the critical feature that
distinguished helpful from non-helpful reviews, which is a more nuanced insight than the simple
correlation with review length.</p>
      <p>In conclusion, this research validates a multi-faceted, explainable approach as a powerful method
for assessing review quality. By moving beyond a single predictive score, our framework provides a
foundation for developing more transparent and effective systems to help consumers navigate the vast
landscape of online feedback.</p>
      <p>Future work will focus on two main areas, namely, dataset expansion and methodological refinement.
A primary priority is to create a larger and more representative dataset that is not limited by our ‘lds‘
sampling bias. On the methodological side, we also plan to address the failure in classifying title features
(Label 5) by exploring structured inputs that distinguish the title from the body. Furthermore, a formal
in-domain evaluation will be conducted to establish a clear performance baseline, complementing our
current cross-domain results. Once these improvements are implemented, we will leverage the refined
models to semi-automatically expand the annotated dataset, further scaling our research.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Gemini 2.5 Pro in order to correct grammar and
spelling.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.-M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pantel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chklovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pennacchiotti</surname>
          </string-name>
          ,
          <article-title>Automatically assessing review helpfulness</article-title>
          ,
          <source>in: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06</source>
          ,
          Association for Computational Linguistics,
          <year>2006</year>
          , p.
          <fpage>423</fpage>
          . URL: http://dx.doi.org/10.3115/1610075.1610135. doi:10.3115/1610075.1610135.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mudambi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuff</surname>
          </string-name>
          ,
          <article-title>What makes a helpful online review? a study of customer reviews on amazon.com</article-title>
          ,
          <source>in: MIS Quarterly</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>185</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Helpful or unhelpful: A linear approach for ranking product reviews</article-title>
          ,
          <source>Journal of Electronic Commerce Research</source>
          <volume>11</volume>
          (
          <year>2010</year>
          )
          <fpage>220</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Born unequal: a study of the helpfulness of user-generated product reviews</article-title>
          ,
          <source>Journal of retailing 87</source>
          (
          <year>2011</year>
          )
          <fpage>598</fpage>
          -
          <lpage>612</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Seki</surname>
          </string-name>
          ,
          <article-title>Shōhin rebyū wo taishō to shita yūyō-sei no teigi to hanbetsu [definition and discrimination of usefulness for product reviews]</article-title>
          ,
          <source>in: DEIM Forum B5-1</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <article-title>Review helpfulness assessment based on convolutional neural network</article-title>
          ,
          <source>arXiv preprint arXiv:1808.09016</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Predicting the helpfulness of online product reviews: A multilingual approach</article-title>
          ,
          <source>Electronic Commerce Research and Applications</source>
          <volume>27</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>Helpfulness of online reviews: Examining review informativeness and classification thresholds by search products and experience products</article-title>
          ,
          <source>Decision Support Systems</source>
          <volume>124</volume>
          (
          <year>2019</year>
          )
          <fpage>113099</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saumya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <article-title>Predicting the helpfulness score of online reviews using convolutional neural network</article-title>
          ,
          <source>Soft Computing</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>10989</fpage>
          -
          <lpage>11005</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. S. I.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>Predicting users' review helpfulness: the role of significant review and reviewer characteristics</article-title>
          ,
          <source>Soft Computing</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>13913</fpage>
          -
          <lpage>13928</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Soda</surname>
          </string-name>
          ,
          <article-title>Shōhin rebyū no fukusū no kanten kara no yūyō-sei no hyōka [Evaluation of the usefulness of product reviews from multiple perspectives]</article-title>
          ,
          <source>Master's thesis</source>
          ,
          <source>School of Advanced Science and Technology, Japan Advanced Institute of Science and Technology</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>İ.</given-names>
            <surname>Mayda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Uğurlu</surname>
          </string-name>
          ,
          <article-title>E-ticaret Sitelerindeki Türkçe Müşteri Yorumlarının Faydalılık Tahmini [Predicting the Usefulness of Turkish Consumer Reviews on E-commerce Websites]</article-title>
          ,
          <source>in: 2024 Innovations in Intelligent Systems and Applications Conference (ASYU)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . URL: http://dx.doi.org/10.1109/ASYU62119.2024.10757106. doi:10.1109/asyu62119.2024.10757106.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakajima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ptaszynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Masui</surname>
          </string-name>
          ,
          <article-title>Rebyū no yūyō-sei ni kansuru chōsa oyobi rebyū-bun no jōhō-ryō sukoa no teian [an investigation of review helpfulness and a proposal for an information score for review sentences]</article-title>
          ,
          <source>in: LAU Technical Reports (Summer 2023), Language Acquisition and Understanding</source>
          , Sapporo, Japan,
          <year>2023</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakajima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ptaszynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Masui</surname>
          </string-name>
          ,
          <article-title>Rebyū no yūyō-sei ni okeru tokuchō bunseki oyobi transformers o mochiita rebyū no yūyō-sei hantei [feature analysis of review helpfulness and helpfulness classification using transformers]</article-title>
          ,
          <source>in: LAU Technical Reports (Summer 2024), Language Acquisition and Understanding</source>
          , Kushiro, Japan,
          <year>2024</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Egawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Konno</surname>
          </string-name>
          ,
          <article-title>Bunshō kōsei o kōryo shita rebyū p/n bunrui shuhō no teian [a proposal of a p/n classification method for reviews considering document structure]</article-title>
          ,
          <source>in: IEICE Conferences Archives, The Institute of Electronics, Information and Communication Engineers</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] Rakuten Group, Inc.,
          <article-title>Rakuten dētasetto [rakuten dataset]</article-title>
          , https://doi.org/10.32130/idr.2.0,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ure</surname>
          </string-name>
          ,
          <article-title>Lexical density and register differentiation</article-title>
          ,
          <source>Applications of linguistics</source>
          <volume>23</volume>
          (
          <year>1971</year>
          )
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eronen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ptaszynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Masui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smywiński-Pohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Leliwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wroczynski</surname>
          </string-name>
          ,
          <article-title>Improving classifier training efficiency for automatic cyberbullying detection with feature density</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <fpage>102616</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ptaszynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yagahara</surname>
          </string-name>
          ,
          <article-title>Terminology extraction device, terminology extraction method and program</article-title>
          , Patent no.: 7557770 (2024-09-19).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>