<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimizing End-of-Line Product Testing through AI: A Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Athina Tsanousa</string-name>
          <email>atsan@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evangelos Bektsis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilias Gialampoukidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanos Vrochidis</string-name>
          <email>stefanos@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies Institute, Center for Research and Technology Hellas</institution>
          ,
          <addr-line>6</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Smart manufacturing enables industries to simplify and improve their procedures, resulting in fewer damaged products and more sustainable operations. One common industrial process is End-of-Line (EoL) testing, which verifies the functionality of a finished product. This work provides an exhaustive comparison of various AI approaches for classifying products in EoL testing, evaluating how they respond to the characteristics of such data, including variability in values and imbalanced samples. The experiments relate to the Whirlpool pilot use case of the i4Q project, which examines whether certain sensors are related to a faulty produced part.</p>
      </abstract>
      <kwd-group>
        <kwd>End-of-Line testing</kwd>
        <kwd>smart manufacturing</kwd>
        <kwd>machine learning</kwd>
        <kwd>ensemble learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>End of Line (EoL) testing is a process conducted at the conclusion of the manufacturing or
assembly process, evaluating the functionality of the final product [1]. The product is tested against
specific characteristics, varying among different industries, and its functionality is measured
through different methods, including sensors. When sensor values fall outside specific limits, the
product is characterized as faulty. This identification process can be reframed as a classification
problem where the objective is to categorize each manufactured item into one of two classes:
"faulty" or "non-faulty."</p>
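The limit-based labeling described above can be sketched in a few lines; the sensor names and thresholds below are illustrative assumptions, not the pilot's actual limits:

```python
# Illustrative sensor limits (NOT the pilot's actual thresholds)
LIMITS = {"temperature": (20.0, 75.0), "power": (0.0, 2400.0)}

def label_product(readings: dict) -> str:
    """Label a product 'faulty' if any sensor reading falls outside its limits."""
    for sensor, (low, high) in LIMITS.items():
        value = readings[sensor]
        if not (low <= value <= high):
            return "faulty"
    return "non-faulty"

print(label_product({"temperature": 40.0, "power": 1800.0}))  # non-faulty
print(label_product({"temperature": 90.0, "power": 1800.0}))  # faulty
```

Framing the task this way turns the rule-based check into a supervised learning target: the sensor readings become features and the label becomes the class to predict.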
      <p>In the context of EoL testing in industrial manufacturing, data acquisition presents a
multifaceted set of challenges, ranging from the integrity and uniformity of data to security
concerns, that can impact both the reliability and utility of the derived insights. However, through
the application of various techniques, including data preprocessing, compression, encryption, and
machine learning, many of these challenges can be addressed, enabling more accurate and efficient
identification of defects and enhancing overall product quality.</p>
      <p>In traditional manufacturing environments, the EoL tests might involve manual inspections or
specific machinery designed to test particular attributes of the product. However, with the advent of
data science techniques, this can now be approached as a machine learning classification task. The
attributes or features extracted from the product during EoL tests can serve as the input variables to
the classification model. Converting this fault diagnosis into a classification problem offers several
advantages. First, it allows for automation, thereby potentially reducing the time and human
resource investment required for EoL testing. Second, it opens up the possibility of employing a
range of advanced machine learning techniques like late fusion models, which could improve the
accuracy of the fault detection process. Third, it facilitates real-time or near-real-time analysis,
enabling immediate corrective actions if necessary.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>In the present work, standard classification algorithms are compared with late fusion methods
to assess which approach performs better in predicting faulty products from the sensors used in
EoL testing.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>Applications focused on EoL testing data are limited in the literature. Although the
methodologies do not differ greatly from those of other smart manufacturing applications, certain
characteristics of EoL data need to be considered. The authors of [1] investigate the efficiency
of several methods in dealing with analytical challenges such as small sample size, imbalanced data,
heterogeneous products, and overlapping samples in EoL data. They applied several existing
ensemble algorithms, preprocessing techniques, and sampling techniques for imbalanced data. The
major contribution of that paper is a thorough comparative study of these methods to identify
whether they can cope with the analytical challenges of EoL data. They also investigate how
classification techniques may help quality engineers determine the cause of a quality issue by
automatically recommending the most likely faulty components, which is a challenging
classification problem.</p>
      <p>Since industrial monitoring procedures require multiple data sources to reach a decision,
fusion is often applied in the relevant literature. The authors of [2] proposed a multi-task
multi-sensor fusion network (M2FN) to improve fault diagnosis performance. The proposed method first
uses convolutional neural networks (CNN) to extract and fuse features from raw vibration and
current signals. Then, to improve the discriminative ability of the learned features, a multi-task
learning (MTL) module is designed: instead of a single classification task for fault diagnosis, it
jointly optimizes feature learning and classification through one classification task with the
cross-entropy (CE) loss function and one deep metric learning task with the center loss function.</p>
      <p>Feature-level fusion is also employed in [3] with the use of a neural network (NN): the
authors extract features from the sensors using multiple generators and embed the features in
a latent space. Following a procedure that includes adversarial learning, the fused features are fed
into the classifier. The system proposed in [4] consists of both early and late fusion modules. The
authors propose a multimodal deep learning-based fault detection model that combines tabular and
time series data. They exploit deep learning methods (multilayer perceptron (MLP), CNN, gated
recurrent unit) that are combined at an early fusion level, and then apply a late fusion module.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section describes the methodology applied, the dataset used for the experiments, and the
results. The EoL dataset was used in the respective Whirlpool pilot of the i4Q project. The use case
implemented in this pilot, specifically regarding the EoL dataset, is to replace a physical test with a
virtual one performed by an AI (Artificial Intelligence)-enabled set of i4Q solutions. In this context,
machine and deep learning algorithms were applied to the EoL data to predict faulty cases, and the
best one in terms of performance and ease of applicability was integrated into the pilot’s
infrastructure. In this paper, however, a more research-oriented approach is presented, showcasing
a comparative analysis of multiple late fusion methods that were tested on the EoL data alongside
standard classifiers.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset</title>
      <p>The data collection for this dataset took place over the course of several years within the
controlled environment of a factory lab, ensuring that measurements were acquired under
standardized conditions. A set of sensors was strategically placed on the dishwashing machines to
capture essential parameters, such as water usage, temperature, and power consumption. These
sensors played a pivotal role in monitoring the performance of the dishwashers. The primary
objective of this dataset was to assess and classify the performance of the dishwashers as either
functioning correctly or encountering issues. Therefore, the dataset's target variable was binary,
with one class indicating that the dishwasher was working correctly and the other representing
instances where issues were detected during the test cycles. More information regarding the data
cannot be disclosed due to privacy issues.</p>
      <p>In industries like manufacturing, it is common for a substantial majority of products to
perform as intended (in this case, dishwashers working correctly) while a smaller fraction exhibits
faults or issues, leading to a significant imbalance in the dataset. This poses a challenge for
machine learning models, as they tend to be biased towards predicting the majority class due to its
prevalence. To tackle the class imbalance effectively, two key strategies were employed. First,
recognizing the significance of correctly identifying defective products, cost-sensitive learning
techniques were applied. These techniques assign different misclassification costs to each class,
emphasizing the importance of accurately identifying the minority class; the model is penalized
more for misclassifying defective products, thus encouraging better performance on the minority
class. Second, a balanced sampling strategy was adopted to mitigate the effects of class imbalance.
Instead of relying solely on the majority class samples, an equal or similar ratio of defective
(minority) and non-defective (majority) products was included in the dataset used for training and
evaluation. This balanced approach allowed the model to learn from both classes effectively,
improving its ability to make accurate predictions for defective products.</p>
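The two strategies can be sketched with scikit-learn on synthetic data; the class ratio, feature count, and choice of model below are illustrative assumptions, not the pilot's configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 950 non-faulty (0), 50 faulty (1)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 950 + [1] * 50)

# Strategy 1: cost-sensitive learning via class weights --
# misclassifying the minority (faulty) class is penalized more heavily.
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)

# Strategy 2: balanced sampling -- upsample the minority class so both
# classes contribute a similar number of training examples.
X_min, X_maj = X[y == 1], X[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
```

In practice the two approaches can also be combined, as class weighting and resampling address the imbalance from complementary directions (loss function versus data distribution).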
    </sec>
    <sec id="sec-5">
      <title>3.2. Data preprocessing</title>
      <p>Following data collection, a rigorous data preprocessing phase was executed to ensure data
quality and consistency. The dataset underwent data cleaning, which involved addressing missing
values and correcting errors. Missing data were handled using statistical imputation or flagged for
further review when imputation wasn't reliable. Erroneous values or outliers were identified and
rectified to maintain data integrity. Data normalization via Min-Max scaling was applied to
standardize measurements across different features, preventing variable scales from biasing the
analyses.</p>
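A minimal sketch of this preprocessing chain with scikit-learn, assuming mean imputation (the paper mentions statistical imputation without specifying the statistic) and toy readings:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy sensor readings with a missing value (NaN) to impute
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 400.0]])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill gaps with the column mean
    ("scale", MinMaxScaler()),                   # rescale each feature to [0, 1]
])
X_clean = prep.fit_transform(X)  # each column now spans [0, 1]
```

Chaining the steps in a Pipeline ensures the same imputation statistics and scaling ranges fitted on the training data are reused on test data, avoiding leakage.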
      <p>As part of the experimentation process, a crucial step involved generating a representative
sample from the original dataset. It's essential to clarify that this sample selection process was not
driven by chronological considerations; instead, it aimed to ensure a diverse representation of data
points from various time points. The objective was to create a sample that effectively captured the
dataset's inherent patterns and characteristics, irrespective of their temporal sequence. This
approach was instrumental in framing the problem as a classification task. To achieve this, a
random sample was drawn from the dataset, encompassing data points from different temporal
instances.</p>
      <p>The original dataset was substantial, with an initial size of 2.3 gigabytes. The
preprocessing steps mentioned above significantly reduced the dataset's size; the resulting dataset,
after preprocessing and before sampling, was still substantial in volume (approximately 800 MB).</p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Experiments and Results</title>
      <p>This section describes the two approaches adopted. The first applies classifiers to the
whole set of available sensors, concatenated, while the second applies classifiers to each individual
sensor and then combines the results with late fusion methods.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3.1. Standard classification approach</title>
      <p>The choice of classifiers for this experimentation was driven by the need to assess their
effectiveness in classifying dishwasher performance based on the dataset. Here is a brief overview
of the classifiers used: Light Gradient Boosting Machine (LGBM) is a gradient boosting framework
known for its efficiency and speed. It's particularly well-suited for large datasets and can handle
complex relationships in the data. XGBoost (eXtreme Gradient Boosting) is another gradient
boosting algorithm with strong predictive performance, often considered a benchmark in machine
learning competitions. MLP is a type of artificial neural network known for its ability to capture
complex patterns in data. It's a versatile choice for classification tasks but may require careful
tuning. Decision trees (DT) are interpretable models that partition the data based on feature values.
They are known for their simplicity and ease of understanding. TabNet is a relatively new
interpretable deep learning model designed for tabular data. It combines elements of decision trees
and deep learning. Random Forest is an ensemble method that combines multiple decision trees. It's
known for its robustness and ability to handle noisy data. SVM (Support Vector Machines) is a
powerful classifier that can handle both linear and non-linear data. The non-linear kernel version is
used to capture complex relationships in the data. The linear kernel version of SVM is suitable for
problems where the data is linearly separable.</p>
      <p>The experimentation process aimed to evaluate the performance of each classifier on the entire
dataset. After data preprocessing, each classifier was trained on a subset of the dataset, using a
portion of the data for training and the rest for testing. This allowed us to assess their performance
in terms of training accuracy and generalization to unseen data (test accuracy). A noteworthy
addition to the experimentation process was the use of grid search for hyperparameter tuning. Grid
search involved systematically exploring a range of hyperparameter values for each classifier to
identify the combination that yielded the best performance. This hyperparameter optimization step
was essential to fine-tune the models for optimal accuracy and generalization.</p>
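A grid search of this kind can be sketched as follows; the dataset, search space, and model below are illustrative assumptions, since the paper does not list the actual hyperparameter grids:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced stand-in for the EoL data
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Hypothetical search space -- not the grid used in the paper
grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

# Exhaustively evaluate every combination with cross-validation
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)          # best combination found
print(search.score(X_te, y_te))     # generalization to held-out data
```

Evaluating the tuned model on a held-out test split, as above, separates hyperparameter selection from the generalization estimate.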
      <p>Table 1 provides an overview of the results achieved by each classifier, revealing the superiority
of Random Forests, while boosting classifiers like LGBM and XGBoost also performed quite well.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3.2. Late Fusion approach</title>
      <p>This analysis provides insights into how different classifiers, such as Logistic Regression,
K-Nearest Neighbors, Naive Bayes, Decision Tree, and Support Vector Classifier, perform across
features such as voltage, current, power, phase, temperature, ambient temperature, and water, and
whether applying fusion methods to combine them improves the final performance metrics. Table 2
provides an overview of the classifier results before any fusion method is applied, offering a clear
comparison of the best performing models for each feature in terms of test accuracy and balanced
test accuracy.</p>
      <p>This comprehensive evaluation of classifiers across different features is essential in smart
manufacturing EoL testing. It reveals significant variations in performance based on the feature,
with models like Logistic Regression, Naive Bayes (NB), and Decision Tree often showing a more
balanced performance, whereas K-Nearest Neighbors (KNN), despite high accuracy, frequently
struggles with balanced accuracy. The consistently low performance of the Support Vector
Classifier across all features suggests its limited applicability in this context. These insights are
fundamental in understanding the strengths and weaknesses of each classifier and form a basis for
the subsequent selection of appropriate models and fusion methods for optimizing EoL testing
processes.</p>
      <sec id="sec-8-1">
        <title>Decision Tree</title>
        <p>The results of the classification obtained from the individual sensors were afterwards combined
with late fusion methods. The purpose of this approach was to examine whether more information
could be extracted from the data in this way.</p>
        <p>The late fusion methods adopted were the following. The Simple Average Fusion method
combines predictions from multiple models by calculating their average; it is straightforward and
effective in reducing model variance but may struggle with imbalanced datasets or conflicting
model predictions. Max (maximum) Voting Fusion chooses the class with the most votes from the
different models; it uses the strength of each model and works well when the models are diverse,
but can be less effective if all models make similar errors. Min (minimum) Voting Fusion, the
opposite of max voting, takes the least common prediction as the final outcome; this can be useful
in avoiding commonly occurring classes but may be unreliable if the less common classes do not
lead to accurate predictions. Majority Voting Fusion requires most models to agree, selecting the
class with the majority of votes; this balances out extreme predictions and works well when models
complement each other, though its effectiveness may decrease if there is no clear majority or if the
models are imbalanced. Ensemble Averaging Fusion computes a weighted average of the models'
outputs, where the weights are typically determined by the individual performance of each
classifier; combining the models' strengths in this way can improve accuracy and handle varied
data characteristics, but it is more complex to implement and needs a careful selection of the
models to be averaged.</p>
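The fusion rules described above can be sketched as follows; this is an illustrative implementation, not the pilot's code, and it assumes each per-sensor model outputs either hard labels or class-probability vectors:

```python
import numpy as np

def simple_average(probs):
    """Average per-model class probabilities, shape (n_models, n_samples, n_classes)."""
    return np.mean(probs, axis=0).argmax(axis=1)

def max_voting(preds):
    """Pick the most voted class per sample; preds has shape (n_models, n_samples).
    With two classes this coincides with majority voting."""
    def most(v):
        return int(np.bincount(v, minlength=2).argmax())
    return np.apply_along_axis(most, 0, preds)

def min_voting(preds):
    """Pick the least frequently voted class per sample, among classes actually voted for."""
    def least(v):
        counts = np.bincount(v, minlength=2).astype(float)
        counts[counts == 0] = np.inf  # ignore classes nobody voted for
        return int(counts.argmin())
    return np.apply_along_axis(least, 0, preds)

def weighted_average(probs, weights):
    """Ensemble averaging: weight each model's probabilities, e.g. by validation score."""
    w = np.asarray(weights, dtype=float)[:, None, None]
    return np.sum(w * probs, axis=0).argmax(axis=1)
```

For example, with three per-sensor models voting `[0, 0, 1]` on a sample, max voting returns class 0 while min voting returns class 1.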
        <p>The performance of these models is assessed using metrics such as train/test accuracy and
balanced train/test accuracy (Table 3). Logistic Regression showed varied performance across
different fusion methods, with ensemble averaging showing more balanced accuracy and F1 scores.
This suggests its effectiveness in handling class imbalances, a key issue for datasets in
manufacturing. K-Nearest Neighbors maintained high train and test accuracy across different fusion
methods but had lower balanced accuracy. This indicates potential issues with data imbalance,
important in the accuracy-focused environment of EoL testing. Naive Bayes varied in performance,
with ensemble averaging showing better balanced accuracy. This variation suggests the model's
suitability for datasets with inherent biases, often found in the manufacturing sector. Decision Tree
showed consistent performance across all fusion methods, with balanced accuracy and F1 scores
indicating its effective management of both classes in the dataset, crucial for EoL testing
applications. Table 3 provides a clear and concise overview of the best performing fusion methods
for each model in terms of test accuracy and balanced test accuracy; across the models, Min. voting
and Simple average emerged as the best performing fusion methods.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4. Conclusions</title>
      <sec id="sec-9-5">
        <p>Comparing the results of Table 3 with those presented in Table 1, it is evident that there is a
small improvement in performance when using late fusion instead of applying classifiers to all
sensors.</p>
        <p>This paper presented the experiments related to a pilot use case of the i4Q project. The pilot
refers to a white goods company that, through i4Q, intends to replace the lab tests of EoL testing
with virtual ones, based on AI predictive algorithms. The paper is not solely focused on the
implementation for the pilot, but on the research work done to investigate whether a late fusion
method could outperform the other methods.</p>
        <p>The exhaustive experiments applied to these data revealed that although late fusion employs
advanced methods to combine information from multiple sources, it yields only a small improvement in
performance over classifiers applied to the concatenated data. In our experience, this has also been
observed in other applications with real data. The choice of a late fusion framework
over a single classifier therefore depends not only on performance metrics but also on other factors,
such as ease of integration and deployment and time effectiveness.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgements</title>
      <p>This work was supported by the i4Q project, funded by the European Union’s Horizon
2020 Research and Innovation Program under Grant Agreement No. 958205.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Hirsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reimann</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitschang</surname>
          </string-name>
          ,
          <article-title>"Data-Driven Fault Diagnosis in End-of-Line Testing of Complex Products,"</article-title>
          <source>2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA)</source>
          , Washington, DC, USA,
          <year>2019</year>
          , pp.
          <fpage>492</fpage>
          -
          <lpage>503</lpage>
          , doi: 10.1109/DSAA.2019.00064.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>M2FN: An end-to-end multi-task and multi-sensor fusion network for intelligent fault diagnosis</article-title>
          .
          <source>Measurement</source>
          ,
          <volume>204</volume>
          ,
          <fpage>112085</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2022</year>
          , October).
          <article-title>Adversarial Embedding Fusion Network for Multi-sensor Fusion Fault Diagnosis of Wheelset Bearings</article-title>
          . In
          <source>2022 Global Reliability and Prognostics and Health Management (PHM-Yantai)</source>
          (pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>J. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ku</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>A multimodal deep learning-based fault detection model for a plastic injection molding process</article-title>
          .
          <source>IEEE Access</source>
          ,
          <volume>9</volume>
          ,
          <fpage>132455</fpage>
          -
          <lpage>132467</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>