<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weak signal detection for occupational safety</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Skander Ghazzai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Grigori</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raja Rebai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Air France</institution>
          ,
          <addr-line>45 rue de Paris, Tremblay-en-France</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Paris Dauphine - PSL University, CNRS, LAMSADE, Pl. du Maréchal de Lattre de Tassigny</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we address the challenge of detecting weak signals within the working environment using textual data. Our objective is to set up a decision support system to assist occupational safety experts in detecting weak signals. To achieve this, we adopt a method that combines portfolio maps and interpretation models. The portfolio maps integrate both structured and unstructured data, providing a holistic view of the potential safety risks in the workplace. The interpretation model further helps in comprehending and categorizing these signals accurately. We leverage the input of human experts on the potential weak signals to populate a dataset that serves as the basis for training a machine learning model. This model is designed to automate and optimize the detection and assessment of weak signals in the future. Our preliminary results demonstrate that this approach not only efficiently identifies weak signals but also offers the potential for continuous improvement in occupational safety management.</p>
      </abstract>
      <kwd-group>
        <kwd>Weak signals</kwd>
        <kwd>Decision support system</kwd>
        <kwd>Portfolio maps</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In both daily life and the corporate world, weak signals
are pervasive, often manifesting as early indicators that
we overlook until their implications become evident.
For instance, the onset of illness might be preceded by
subtle signs, such as feeling unusually cold, which we
dismiss until more significant symptoms develop. The
definition of weak signals varies across different fields, each
adapting the concept to its own context and to the specific
early indicators it seeks to identify and
interpret. Originally used to detect preliminary signs of potential
threats in the domain of military intelligence, the concept of
weak signals has since been applied in various domains.
Ansoff (1975) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] described weak signals as the earliest
indicators of significant yet unrecognized changes or trends,
and, later, Godet (1986) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] portrayed them as developing
changes with potentially significant future implications,
necessitating contextual understanding. This paper draws on
this fundamental insight to address the critical need for
early detection of potential safety issues in occupational
settings. Recognizing the value of weak signals, many companies are
leveraging their untapped textual data to enhance
operational efficiency and employee safety. However, the manual
process of weak signal
detection can be burdensome for safety analysts. To
address this, we propose a novel approach combining text
mining techniques and machine learning to automate weak
signal detection from reports of work-related accidents. This aims
to reduce the workload of analysts and support informed
decision-making. This study underscores the importance
of proactively addressing weak signals to stay competitive
and ensure employee safety in today’s rapidly evolving
environment. In this paper, we outline the methodology,
detailing how we can effectively detect weak signals and
enrich their understanding through contextual information.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and related work</title>
      <p>
        This section begins with a definition of weak signals,
followed by a discussion of the methodologies
employed for their detection and interpretation. A signal,
by definition, is a function that conveys information about
a phenomenon. In this context, a signal can be understood
as a specific instance or data point within a dataset that
embodies meaningful information or patterns indicative
of underlying phenomena. A "weak signal", by contrast, is often
regarded as an early, ambiguous issue or opportunity that serves
as an initial indication of change. Due to their low intensity,
fragmented nature, or potential noise within the data, these
precursor signs often prove challenging to detect. A weak
signal is by definition not universally recognized, and the
nature of weak signals varies according to the domain. Their
detection therefore requires supervision by an expert who can
navigate the complexities of the data and contextualize these
signals within the broader domain-specific landscape. For
example, in marketing, a weak signal could be a subtle shift
in consumer behavior, while in health contexts, it could be a
subtle symptom hinting at a more significant medical issue.
The concept of weak signals is rooted in the field of strategic
foresight and is frequently used to support decision-making
and strategic planning by anticipating potential future
developments. Embracing data-driven decision-making
processes enhances the ability to navigate through uncertainty
with informed, evidence-based strategies. This alignment
with data-driven processes ensures a comprehensive
analysis of available data, facilitating more accurate predictions
and strategic actions in the face of uncertain future trends.
Hiltunen [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduces a 3D spatial model for the
definition of weak signals. This model, based on Peirce’s model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], consists of
three components:
        <list list-type="bullet">
          <list-item><p>"Signal" refers to the frequency or visibility of
signals. In other words, it quantifies how often a signal
appears or how visible it is.</p></list-item>
          <list-item><p>"Issue" represents the number of occurrences or
contexts in which the signal appears, indicating the
extent to which the signal spreads.</p></list-item>
          <list-item><p>"Interpretation" signifies the level of understanding
of signals by information users, i.e., the degree to
which receivers comprehend these signals.</p></list-item>
        </list>
      </p>
      <p>
        This definition has proven its effectiveness in many studies
over the years [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Techniques for detecting weak
signals can be broadly classified into four categories [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]:
statistics-based methods, graph theory, machine learning,
and semantic expert knowledge. Each method has its
strengths and limitations. In the context of weak signal
detection, graph theory is instrumental in analyzing the
topology of graphs and structures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], serving as a method
to uncover weak signals. Although effective, graph theory
requires a high level of expertise for correct
interpretation. Machine learning techniques, on the other hand,
can offer powerful predictive capabilities, but the "black
box" nature of some of these models often makes it
challenging for the expert to comprehend the reasoning behind
specific predictions, especially on sensitive topics [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Semantic expert knowledge relies heavily on an individual
expert’s knowledge and intuition. This dependence makes
it less consistent and challenging to scale up due to
varying expertise levels. Lastly, statistics-based methods, while
often robust and dependable, may fail to capture
qualitative nuances. The effectiveness of all these methods depends
heavily on the quality and consistency of the data and of the
expert knowledge. In this research, we concentrate on
statistics-based methods, specifically portfolio map methods.
This choice stems from
the cyclical and seasonal nature of the work-related accident
data available to us, as well as from the preferences expressed by
experts in the field.
      </p>
      <p>
        Portfolio maps, a form of statistics-based methodology,
are used to detect weak signals and identify emerging trends
and issues. They serve as visual tools designed to manage
and track multiple weak signals simultaneously.
Key characteristics of portfolio maps include:
        <list list-type="bullet">
          <list-item><p>They typically use a two-dimensional matrix to plot
signals based on their level of
uncertainty/ambiguity and impact/importance. As per Ansoff [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], strong
signals are highly impactful and certain, while weak
signals are either low in impact or high in
uncertainty.</p></list-item>
          <list-item><p>Portfolio maps are relatively easy to interpret for
those who are unfamiliar with advanced data
analytics or graph theory, making them accessible to a
wider range of decision-makers.</p></list-item>
          <list-item><p>These maps facilitate proactive measures against
potential risks or disruptions by visualizing the relative
position of various signals and aiding in
prioritization and decision-making.</p></list-item>
        </list>
      </p>
      <p>
        The foundational work on portfolio maps was carried out
in the context of business opportunities for solar cells by
Yoon [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Yoon proposed a quantitative approach based on
keyword text mining to identify weak signal topics, thereby
providing an effective method for measuring "signals" and
"issues" as per Hiltunen’s approach. Two novel metrics were
introduced in that paper, namely the Degree of Visibility
(DoV) and the Degree of Diffusion (DoD). The DoV aims
to evaluate the visibility of a signal, while the DoD focuses
on the issue-level aspect of a keyword, enabling the
identification of emerging weak signals or trends that might not
be immediately visible through conventional text-mining
approaches. The DoV for keyword <italic>i</italic> during period <italic>j</italic> is
calculated as follows:
      </p>
      <disp-formula><tex-math>\mathrm{DoV}_{i,j} = \frac{\mathrm{TF}_{i,j}}{NN_{j}} \times \left(1 - tw \times (n - j)\right)</tex-math></disp-formula>
      <p>
        Here TF<sub>i,j</sub> is the number of appearances of keyword <italic>i</italic> in period <italic>j</italic>,
NN<sub>j</sub> is the total number of documents in period <italic>j</italic>, <italic>n</italic> is the number of
periods, and <italic>tw</italic> is a time weight. This formula suggests that
the influence of a keyword is related to its frequency, and that
recent instances are given greater significance. Similarly,
the DoD for keyword <italic>i</italic> in period <italic>j</italic> is calculated as follows:
      </p>
      <disp-formula><tex-math>\mathrm{DoD}_{i,j} = \frac{\mathrm{DF}_{i,j}}{NN_{j}} \times \left(1 - tw \times (n - j)\right)</tex-math></disp-formula>
      <p>
        Here DF<sub>i,j</sub> represents the document frequency of
keyword <italic>i</italic> during period <italic>j</italic> (the number of
documents in which the keyword appears). This formula
represents the spread of the signal, with a higher DoD value
signifying a wider distribution of the issue. These two
metrics are employed to generate two keyword portfolio maps:
Keywords Emergence Map (KEM) and Keywords Issue Map
(KIM). In these two-dimensional maps, the horizontal axis
indicates the average term
or document frequency, and the vertical axis represents
the average growth rate of DoV or DoD. The maps
assist in identifying weak signals: words with a currently
low frequency but a high growth rate, suggesting a potentially
swift future escalation in significance. In Figure 1, the
median values on the X and Y axes divide each map into four
quadrants, and keywords are classified according to the
quadrant in which they fall. This division facilitates the automatic and dynamic
categorization of keywords. The maps can subsequently
be interpreted by experts, and common terms found in the
same area are categorized as either a weak signal, a strong
signal, or a well-known yet not strong signal. According to
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], this method surpasses human experts when dealing
with large textual datasets. Weak signals are represented
in both the KIM and KEM maps as keywords situated in the top-left
quadrant. However, prior research using portfolio maps
has revealed some limitations, such as the difficulty of
distinguishing mixed signals on each map and the potential
for a single keyword to carry multiple meanings. The definition
of weak signals provided earlier introduces an
interpretational dimension. By definition, a weak signal has a
low level of interpretation. If it were otherwise, it would
be widely understood and thus would not qualify as a "weak"
signal. According to Hiltunen, this dimension includes the
context within which individuals anticipate potential future
events. However, the fragmented and incomplete nature of
weak signals can make it challenging for experts to
distinguish between genuine weak signals and noise. To address
this issue, researchers have developed various techniques,
as outlined below.
      </p>
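      <p>As an illustration, the DoV and DoD definitions can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation of the cited work: the function name <monospace>degree_metrics</monospace>, the toy corpus, and the time weight of 0.05 are illustrative choices, and NN is taken to be the number of documents in each period.</p>
      <preformat>
```python
from collections import Counter


def degree_metrics(docs_by_period, tw=0.05):
    """Return (dov, dod), two dicts keyed by (keyword, period).

    docs_by_period: list of periods, each a list of tokenized documents.
    tw: time weight; 0.05 is an illustrative value (assumption).
    """
    n = len(docs_by_period)  # number of periods
    dov, dod = {}, {}
    for j, docs in enumerate(docs_by_period, start=1):
        nn_j = len(docs)  # NN_j: number of documents in period j
        if nn_j == 0:
            continue
        # TF_{i,j}: every appearance counts; DF_{i,j}: once per document.
        tf = Counter(tok for doc in docs for tok in doc)
        df = Counter(tok for doc in docs for tok in set(doc))
        weight = 1 - tw * (n - j)  # recent periods weigh more
        for kw in tf:
            dov[(kw, j)] = tf[kw] / nn_j * weight
            dod[(kw, j)] = df[kw] / nn_j * weight
    return dov, dod


# Hypothetical toy corpus: two periods of tokenized accident reports.
periods = [
    [["slip", "stairs"], ["burn", "oven", "burn"]],
    [["slip", "ice"], ["slip", "ramp"], ["burn", "oven"]],
]
dov, dod = degree_metrics(periods)
print(dov[("burn", 1)], dod[("burn", 1)])
```
      </preformat>
      <p>In the first period the keyword "burn" appears twice but in a single document, so its DoV is twice its DoD: the first metric counts occurrences, the second counts the documents reached.</p>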
      <p>
        <bold>Multiword Analysis.</bold> This technique, enhanced by
natural language processing, determines how frequently
a specific word or phrase (a potential weak
signal or a related keyword) occurs in the presence of other
specific words or phrases. This analysis, also known as
co-occurrence analysis, provides insights into the relationships
and contexts between different signals by quantifying
the association strength between terms in a given
context [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Previous studies [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have
indicated that during the analysis of the portfolio map
categorization, interpretability often suffers because keywords
related to weak signals tend to be isolated terms.
Consequently, the lack of context and relationships among these
keywords restricts the scope and depth of the information
that can be obtained. Techniques like multiword
analysis can therefore play a crucial role in mitigating this interpretability
issue by providing additional context.
      </p>
      <p>
        <bold>Degree of Transmission.</bold> This concept has been
introduced to incorporate the third dimension of Hiltunen’s
future sign model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], known as the interpretation. The
approach in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] proposed a novel metric, the Degree of
Transmission (DoT), specifically designed to assess the
significance of terms within various sources, such as
ScienceDirect, New York Times, and Twitter, from which keywords
are automatically extracted. The Degree of Transmission is
calculated as follows:
      </p>
      <disp-formula><tex-math>\mathrm{DoT}_{i} = \sum \mathrm{Hindex}_{\mathrm{journal}}</tex-math></disp-formula>
      <p>
        Here, DoT<sub>i</sub> represents the degree of transmission for term <italic>i</italic>,
while Hindex<sub>journal</sub> refers to the H-index of the journal
in which the term appears. The H-index, which measures
the impact and productivity of scholarly works, is summed
across all the journals where a term is present, providing an
overall measure of the term’s influence in the scientific
literature. This approach has demonstrated promising results
when both the Degree of Diffusion (DoD) and the Degree of
Visibility (DoV) are multiplied by their corresponding DoT
to enhance interpretability in the portfolio maps.
      </p>
      <p>
        <bold>Topic Modeling.</bold> The work in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] proposes a methodology for weak
signal detection using LDA and Word2Vec. The approach
is based on clustering topics at multi-level documents and
extracting significant descriptors (weighted lists of words).
This model has the advantage of proposing a method for
detecting weak signals based on tree-multi-clustering,
unlike other works that are essentially based on the portfolio
maps method but do not consider the temporal aspect.
      </p>
      <p>
        <bold>Limitations of existing methods.</bold> Many current
methodologies rely on expert opinions or manual evaluations,
which presents significant scalability challenges for large
datasets or real-time analysis. The adaptability of these
methods to different domains and datasets is also essential. By
addressing these limitations, we aim to develop a more
robust and versatile method for weak signal detection and
analysis within the realm of occupational safety management.
Our approach integrates the strengths of portfolio maps and
machine learning algorithms, offering an improved solution
for detecting and interpreting weak signals over time.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In the following, we describe the approach adopted for
developing a novel model for detecting weak signals in
occupational safety management. This model aims to overcome
the limitations of previous work, which lacks context,
requires time-consuming analysis, and remains static over
time (i.e., does not leverage user input for improvement).
Our model combines text mining with a portfolio map
approach and a machine-learning algorithm to identify and
interpret weak signals more efficiently. The portfolio map
approach visualizes the current state of an organization or
system, analyzing the relationships between different
elements to spot potential risks or opportunities. We use
the concept of a weak signal consistent with Hiltunen’s
definition, with a portfolio map that includes two axes:
Visibility and Issue. To facilitate interpretation, we developed a
user interface that helps users confirm or discard
potential weak signals and lets them restrict the analysis to
a given context. To further leverage expert input, each time
a weak signal is confirmed, we populate a database with
that specific weak signal, along with all the factors that contributed
to its identification.</p>
      <p>In the subsequent sections, we outline the systematic steps
of our weak signal detection process. We start
by preparing the documents, ensuring they are ready
for further analysis. Following this, we calculate new
metrics to construct portfolio maps, an essential tool that aids in
signal identification. After creating these maps, we perform
a future sign classification. This step not only assists
experts in interpreting the data but also provides a structured
framework for identifying potential weak signals. The next
step is the interpretation stage, where we provide the expert
with an interface that helps them distinguish weak signals from
noise and strong signals. Finally, using the labeled data, we
develop and train a machine-learning model. This model,
designed to automatically detect and assess weak signals,
introduces an element of automation and efficiency to the
process. Through this comprehensive approach, we aim to
enhance both the effectiveness and the precision of weak signal
detection.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preparation</title>
        <p>In this section, we describe the steps taken to prepare the
dataset for the study, including data cleaning and
preprocessing. The dataset used in this research was collected
directly from the company’s (Air France) records. It contains
a structured part related to the user and an unstructured
part, the "Verbatim" textual data that describes the
event.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Data Cleaning</title>
          <p>For this study, we work with a French data corpus and
follow a standard preprocessing workflow to ensure data
quality. First, we correct misspelled keywords and remove
punctuation marks and special characters from the text.
Additionally, we convert all the text to lowercase to ensure
consistency in our analysis. These data-cleaning steps help
to improve the accuracy and reliability of our subsequent
analyses.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Data preprocessing</title>
          <p>
            To maximize the value of the information extracted from
the text and reduce its complexity, we tokenize the text,
splitting it into individual words. We use the NLTK word
tokenizer, a widely used tool for natural
language processing. Once the text is tokenized, we perform
lemmatization and part-of-speech tagging to refine the data
further. This involves identifying the root form of each
word, as well as its grammatical function in the sentence.
We use the FrenchLefffLemmatizer and the camembert-ner NER
model, which was fine-tuned from CamemBERT [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] on the
wikiner-fr dataset. Finally, we remove stop words, i.e., common
words that do not add significant meaning to the text. In
French, common stop words include "le," "la," "de," and "des."
Removing these words helps to reduce noise and improve
the accuracy of our analysis. Overall, our data preprocessing
workflow ensures that the data is clean, consistent, and
ready for our model, while minimizing the number of keywords.
          </p>
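          <p>The cleaning and preprocessing steps can be illustrated with a short, dependency-free Python sketch. It is deliberately simplified and is not the pipeline used in the study: a regular expression stands in for the NLTK tokenizer, the stop list is a tiny illustrative subset, and spelling correction, lemmatization, and tagging are omitted. The name <monospace>preprocess</monospace> is hypothetical.</p>
          <preformat>
```python
import re

# Tiny illustrative stop list (assumption); a real French list is much longer.
STOP_WORDS = {"le", "la", "les", "l", "de", "des", "du", "d",
              "un", "une", "et", "sur"}


def preprocess(text):
    """Lowercase, drop punctuation and digits, tokenize, remove stop words."""
    # Keep runs of letters only, including common accented French letters.
    tokens = re.findall(r"[a-zà-öø-ÿ]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("Chute de l'agent sur la passerelle, le sol était glissant."))
```
          </preformat>
          <p>Splitting on the apostrophe turns "l'agent" into "l" and "agent", which is why the elided article "l" appears in the stop list.</p>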
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Adding contextual information</title>
        <p>The process of detecting weak signals is a meticulous task,
as each piece of information can potentially be valuable.
To add contextual data to each event, we enrich the
descriptive information with knowledge extracted from the
enterprise database. Each sentence describing the event is
represented as a combination of keywords, i.e., Sentence =
{Keyword<sub>1</sub>, ..., Keyword<sub>n</sub>}. We augment these keywords
with contextual information such as metadata (related
to the place and time) and user-related data (role,
department, number of past events, etc.): Keyword<sub>i</sub> →
{Metadata, User-related data}. This methodology allows
us to capture the contextual details associated with each
keyword, enabling us to use all the available data efficiently.
As an example, consider a sentence with the structure
"Keyword1 event place Keyword2". After our data
preparation, each keyword carries specific contextual
information. For instance, the keywords could carry
metadata such as ["Location", "Time"] and user-related
information such as ["Role", "Experience", "Age"]. By integrating
this information, we can calculate metrics like the Degree of
Transmission, which allows us to identify potentially
hazardous events more effectively. Moreover, this methodology
can support data-driven decision-making and a more
robust detection of critical events.</p>
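        <p>One possible way to represent such enriched keywords is sketched below in Python. The class name, the function name, and all field names and values are hypothetical and serve only to illustrate the idea of attaching event metadata and user-related data to every keyword of a description.</p>
        <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class KeywordOccurrence:
    """A keyword together with the context of the event it came from."""
    keyword: str
    metadata: dict = field(default_factory=dict)  # e.g. location, time
    user: dict = field(default_factory=dict)      # e.g. role, experience


def enrich(keywords, metadata, user):
    """Attach a copy of the event context to each keyword of a description."""
    return [KeywordOccurrence(k, dict(metadata), dict(user)) for k in keywords]


# Hypothetical event: two keywords extracted from one report.
occurrences = enrich(
    ["keyword1", "keyword2"],
    metadata={"location": "hangar", "time": "2023-01-12T06:30"},
    user={"role": "mechanic", "experience_years": 4},
)
print(occurrences[0])
```
        </preformat>
        <p>Copying the context into each occurrence keeps keywords independent of one another, at the cost of some redundancy; a reference to a shared event record would be an equally valid design.</p>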
        <sec id="sec-3-2-2">
          <title>3.2.1. Feature Engineering: Topic Modeling for Event Typologies</title>
          <p>To extract and use as much information as possible from
the textual data at hand, we collaborated with domain
experts to fine-tune a CamemBERT model, a transformer-based
language model specifically optimized for French language
processing. This fine-tuning process involved training the
model to categorize accident descriptions based on a
pre-established set of typologies denoted as T. Crafted and
validated by field experts, these typologies span a wide array
of accident categories. The resulting model acts as a
classifier, capable of processing unstructured text descriptions of
accidents and categorizing each into its corresponding
typology. Additionally, an "Unknown" category is provided
for instances where the model cannot confidently assign a
description to any of the known typologies.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Contextualized portfolio maps</title>
        <p>In this part, we describe how we
construct the portfolio maps. The first step is the data
representation, which assigns each keyword a numerical value for its issue
and its visibility.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Data representation</title>
          <p>We want to represent numerically the visibility and issue of
every keyword while staying consistent with Hiltunen’s
definition. We propose a new approach that represents
the Visibility of every keyword as the number
of different populations (categories of users) that used the
keyword, multiplied by its degree of visibility (DoV). To capture the
uncertainty of a given signal, we propose the Issue notion,
represented by the degree of diffusion multiplied by the number of
different event typologies in which the keyword has been used.
The new formulas
that we introduce with the contextualization are as follows:</p>
          <disp-formula><tex-math>\mathrm{Visibility}_{i,j} = \mathrm{DoV}_{i,j} \times nP_{i,j}</tex-math></disp-formula>
          <disp-formula><tex-math>\mathrm{Issue}_{i,j} = \mathrm{DoD}_{i,j} \times nT_{i,j}</tex-math></disp-formula>
          <p>where DoV<sub>i,j</sub> is the degree of visibility of keyword
<italic>i</italic> in period <italic>j</italic>, DoD<sub>i,j</sub> is the degree of diffusion of
keyword <italic>i</italic> in period <italic>j</italic>, nT<sub>i,j</sub> is the number of different
typologies in which keyword <italic>i</italic> is found in period <italic>j</italic>, nP<sub>i,j</sub>
is the number of different populations that used
keyword <italic>i</italic> in period <italic>j</italic>, <italic>i</italic> is the index of the keyword, and <italic>j</italic> is
the index of the period. The data derived from the textual
documents we analyze offers a variety of information.
Details such as the frequency of accident reports made by an
individual and the experience of the employee contribute to
this rich context. Inspired by the H-index, we created a new
Degree of Transmission (DoT) metric that takes this
personalized, user-related data into account. By associating such
additional context with all keywords, we aim to
enhance the performance of our weak signal detection. By
combining Yoon’s original definition with the new
contextualization, we seek a more precise representation
of the impact/importance and the uncertainty/ambiguity of
every keyword.</p>
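          <p>The two contextualized metrics can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the system's code: DoV and DoD are assumed to be precomputed per (keyword, period), each event is assumed to carry its period, the population of its reporter, and its typology, and the function name, population labels, and values are hypothetical.</p>
          <preformat>
```python
from collections import defaultdict


def contextual_scores(events, dov, dod):
    """Compute Visibility = DoV * nP and Issue = DoD * nT.

    events: iterable of (period, population, typology, keywords).
    dov, dod: dicts (keyword, period) -> precomputed score.
    Returns (visibility, issue), both keyed by (keyword, period).
    """
    populations = defaultdict(set)  # (kw, period) -> user categories seen
    typologies = defaultdict(set)   # (kw, period) -> event typologies seen
    for period, population, typology, keywords in events:
        for kw in keywords:
            populations[(kw, period)].add(population)
            typologies[(kw, period)].add(typology)
    visibility = {k: dov[k] * len(populations[k]) for k in populations}
    issue = {k: dod[k] * len(typologies[k]) for k in typologies}
    return visibility, issue


# Hypothetical events for a single period.
events = [
    (1, "ground crew", "fall", ["ice", "ramp"]),
    (1, "cabin crew", "fall", ["ice"]),
    (1, "ground crew", "collision", ["ramp"]),
]
dov = {("ice", 1): 0.5, ("ramp", 1): 0.4}
dod = {("ice", 1): 0.5, ("ramp", 1): 0.25}
visibility, issue = contextual_scores(events, dov, dod)
print(visibility, issue)
```
          </preformat>
          <p>In this toy example "ice" is used by two populations within one typology, so only its Visibility is amplified, whereas "ramp" spans two typologies and only its Issue score is doubled.</p>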
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Portfolio maps categorization</title>
          <p>After representing our signals with the metrics previously
designed, we construct two portfolio maps:
            <list list-type="bullet">
              <list-item><p>The Keyword Emergence Map (KEM), where the X-axis
is mapped to the geometric mean of the term
frequency, and the increasing rate of visibility of
each keyword is mapped to the Y-axis.</p></list-item>
              <list-item><p>The Keyword Issue Map (KIM), where the X-axis is mapped
to the geometric mean of the term’s document appearance,
and the increasing rate of the issue of each keyword
is mapped to the Y-axis.</p></list-item>
            </list>
          </p>
          <p>Using these portfolio maps, we can identify
potential weak signals in terms of both visibility and
issue. Typically, keywords classified as potential weak signals
exhibit below-average term frequency and document
appearance, but appear in different topics
and are used by different populations, which results in
above-average increases in issue and visibility rates, indicative of
emerging trends. We then identify the
weak signal candidates common to both keyword maps and
treat these as our prime weak signal contenders. In
the subsequent phase, these candidates are presented to
domain experts for interpretation and validation.</p>
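          <p>The categorization step can be sketched in Python as follows. The sketch makes several assumptions that are not fixed by the text: the increasing rate is taken as the geometric mean of period-over-period ratios minus one, the maps are split at the median of each axis, and all values are strictly positive. The function names and the toy data are hypothetical.</p>
          <preformat>
```python
from statistics import geometric_mean, median


def growth_rate(series):
    """Average period-over-period growth of a strictly positive series."""
    ratios = [later / earlier for earlier, later in zip(series, series[1:])]
    return geometric_mean(ratios) - 1


def weak_candidates(freq, metric):
    """Keywords in the top-left quadrant: low frequency, high growth.

    freq, metric: dicts keyword -> list of per-period positive values.
    """
    x = {k: geometric_mean(v) for k, v in freq.items()}
    y = {k: growth_rate(metric[k]) for k in freq}
    x_med, y_med = median(x.values()), median(y.values())
    return {k for k in freq if x_med > x[k] and y[k] > y_med}


# Hypothetical per-period values for four keywords over three periods.
tf = {"ice": [1, 1, 2], "slip": [8, 9, 9], "oven": [5, 5, 4], "ramp": [2, 2, 2]}
vis = {"ice": [0.1, 0.2, 0.6], "slip": [2.0, 2.1, 2.2],
       "oven": [1.0, 0.9, 0.8], "ramp": [0.3, 0.3, 0.3]}
df = {"ice": [1, 1, 2], "slip": [6, 7, 7], "oven": [4, 4, 3], "ramp": [2, 2, 2]}
iss = {"ice": [0.1, 0.3, 0.9], "slip": [1.5, 1.5, 1.6],
       "oven": [0.8, 0.8, 0.7], "ramp": [0.2, 0.3, 0.5]}

kem = weak_candidates(tf, vis)   # Keyword Emergence Map candidates
kim = weak_candidates(df, iss)   # Keyword Issue Map candidates
candidates = kem.intersection(kim)
print(sorted(candidates))
```
          </preformat>
          <p>Only keywords that fall in the top-left quadrant of both maps are kept, which mirrors the intersection of KEM and KIM candidates described above.</p>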
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Interpretation</title>
        <p>Interpretation refers to the ability to understand the
significance of the information extracted from the portfolio map
categorization. As a result of the whole system, experts
have access to two outputs that help them in the
decision-making process:
          <list list-type="bullet">
            <list-item><p>A list of potential weak signals represented in the
Keyword Issue Map, depending on their Degree of
Diffusion and Degree of Transmission.</p></list-item>
            <list-item><p>A list of potential weak signals represented in the
Keyword Emergence Map, depending on their
Degree of Visibility and Degree of Transmission.</p></list-item>
          </list>
        </p>
        <p>Our goal in this part is to minimize the risk of
misinterpreting a weak signal by maximizing the level of interpretability
of every signal.</p>
        <sec id="sec-3-4-1">
          <title>3.4.1. Domain-specific</title>
          <p>Working with all the potential weak signals listed
in the portfolio maps can be challenging and
time-consuming. To treat the keywords more efficiently,
we devised a way to rank the
list of potential weak signals. With the help of experts, we
developed a set of ranking rules reflecting the urgency
with which the results have to be treated, as follows:
            <list list-type="order">
              <list-item><p>The list of potential weak signals represented in both
the Keyword Emergence Map and the Keyword Issue
Map.</p></list-item>
            </list>
By applying those rules, we focus our attention on the
most relevant keywords, thus reducing the time and effort
required to identify and analyze weak signals.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Multi-word analysis</title>
          <p>
            After ranking, the output words from the Portfolio Maps
emerge as potential weak signals or terms related to weak
signals. This stage presents a challenge due to the various
interpretations a single keyword might encompass, a
problem highlighted in previous studies [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. To mitigate
this issue, we implement multi-word expression analysis,
a technique designed to refine our results. This method
examines the words that appear immediately before and
after the identified term in every instance, excluding
common stopwords. Consequently, we generate a co-occurrence
relationship related to a keyword, ranking it based on its
increasing rate of visibility and issues.
          </p>
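          <p>A minimal Python sketch of this neighbor-based co-occurrence count is given below. It reflects one reading of the description, stated here as an assumption: stop words are removed first, so the neighbors of a term are its nearest non-stop words. The function name and the example documents are hypothetical.</p>
          <preformat>
```python
from collections import Counter


def neighbours(documents, term, stop_words=frozenset()):
    """Count the words directly before and after `term`.

    documents: list of tokenized documents.
    Stop words are dropped before adjacency is evaluated (assumption).
    """
    counts = Counter()
    for doc in documents:
        tokens = [tok for tok in doc if tok not in stop_words]
        for pos, tok in enumerate(tokens):
            if tok != term:
                continue
            if pos > 0:  # word immediately before the term
                counts[(tokens[pos - 1], term)] += 1
            if len(tokens) > pos + 1:  # word immediately after the term
                counts[(term, tokens[pos + 1])] += 1
    return counts


docs = [
    ["chute", "sur", "sol", "glissant"],
    ["sol", "glissant", "hangar"],
    ["nettoyage", "du", "sol"],
]
pairs = neighbours(docs, "sol", stop_words={"sur", "du"})
print(pairs.most_common(1))
```
          </preformat>
          <p>The resulting pairs can then be ranked, for instance by their increasing rates of visibility and issue, to give the expert the context in which a candidate keyword is used.</p>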
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4.3. Predictive modeling</title>
          <p>
            Identifying weak signals in portfolio maps and interpreting
the results is a complex, multi-step task: a signal must not only
be found in the intersection of the KIM and KEM
but also be validated through interpretation and
confirmation by the expert. To use knowledge from the past,
we aim to leverage every expert input to create a model
that predicts the probability of a keyword being related to a
weak signal, using the following procedure:
            <list list-type="order">
              <list-item><p>The expert selects a keyword from the portfolio map
found in the intersection of the KIM and KEM.</p></list-item>
              <list-item><p>Using the interpretation model, the expert labels the
signal (related to a weak signal or not).</p></list-item>
              <list-item><p>The newly labeled signal is added to the dataset.</p></list-item>
            </list>
After every use of our model, we gather the expert’s
feedback (label) and incorporate it into our existing dataset.
Once a substantial amount of new information has been
accumulated, we recalibrate our model in light of this freshly
acquired data. After creating the model, we aim to integrate
its output into the portfolio maps. This model predicts
the probability of other keywords being a weak signal
without taking into consideration the position of the keyword in
any of the portfolio maps and is thus complementary to the
portfolio maps. This approach is supported by previous
research in the field of predictive modeling, which has shown
that combining expert input with machine learning can lead
to more accurate predictions and better decision-making
[
            <xref ref-type="bibr" rid="ref18">18</xref>
            ][
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
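          <p>The feedback procedure described above can be sketched as a small loop that stores each expert label and triggers a recalibration once enough new labels have accumulated. This is an illustrative sketch only: the class name FeedbackLoop, the threshold retrain_every, and the placeholder retrain step are assumptions, since the paper does not specify the batch size or the classifier.</p>
          <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class FeedbackLoop:
    """Collects expert labels and recalibrates after enough new ones."""

    retrain_every: int = 3  # hypothetical threshold, not from the paper
    dataset: list = field(default_factory=list)
    pending: int = 0  # labels added since the last recalibration
    retrain_count: int = 0

    def add_label(self, keyword, is_weak_signal):
        # Step 3 of the procedure: store the newly labeled signal.
        self.dataset.append((keyword, bool(is_weak_signal)))
        self.pending += 1
        if self.pending >= self.retrain_every:
            self.retrain()

    def retrain(self):
        # Placeholder: a real system would refit its classifier here.
        self.retrain_count += 1
        self.pending = 0


loop = FeedbackLoop(retrain_every=3)
# Made-up, anonymized keywords and labels, used only to exercise the loop.
for kw, label in [("kw_a", True), ("kw_b", False), ("kw_c", False), ("kw_d", True)]:
    loop.add_label(kw, label)
```
          </preformat>
          <p>With a threshold of three, the four labels above cause one recalibration and leave one label pending for the next batch.</p>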
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Evaluation and Results</title>
      <p>In this section, we present a preliminary evaluation of our
model’s effectiveness in detecting potential weak signals
and preventing future accidents, using real-world
occupational safety data. We then show the results of our approach
applied to our dataset.</p>
      <sec id="sec-5-1">
        <title>4.1. Data Representation</title>
        <p>Event reporting in occupational safety management follows a
strict protocol, generating a dynamic database that is
continuously updated with new reports. These reports
contain both structured and unstructured data relating to the
event and the user. Due to GDPR constraints, we will not
delve into the structured data, which primarily includes
user-related information such as employee profiles, work
details, and accident history. The primary focus of our study
lies in the unstructured data, namely the event label and
description, which are rich sources of textual
information. This unstructured segment can provide valuable
insights into weak signals when analyzed effectively.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Portfolio maps applied to our dataset</title>
        <p>The first step of our application is the construction of the
portfolio maps (Figures 2 and 3). We represent the growth
rate of every signal keyword with our designed metrics
and then employ portfolio map categorization. This
process allows us to identify a set of potential weak signals.
Given the sensitive nature of the data, continuous
monitoring and updating are vital. Each new accident report
can substantially change both the composition of our
portfolio maps and the list of potential weak signals, and
each keyword can shift the dynamics of the portfolio,
leading to new discoveries.</p>
        <p>In the following, we analyze the
difference between our newly developed metrics and the
classical ones, distinguishing between strong signals and
potentially weak signals. Our approach here is
instance-by-instance, and we aim to compare the rates of
increase of these metrics. We focus our analysis on the
keyword ‘salarie’ (French for ‘employee’) because it consistently
appears at the start of each report and thereby provides a common
reference point. We conduct a time-series analysis, examining the
metrics associated with the keyword ‘salarie’ throughout June
(Figures 4 and 5). Each metric calculation encompasses only
past and present instances, intentionally leaving out future
ones. This methodology allows us to compare the
traditional metrics (DOV and DOD) with the
newly developed ones (‘Visibility’ and ‘Issue’) and to see
the rate at which each increases over time, which informs the
construction of our portfolio maps.</p>
        <p>In our comparison of
the ‘Degree of Visibility’ (DOV) and ‘Visibility’ metrics, we
first notice that DOV exhibits a sharper peak, which
suggests a higher rate of increase for DOV than for
‘Visibility’. The keyword ‘salarie’ appears
more frequently in reports over time, indicating its growing
visibility, but this rise does not produce
equal growth in the ‘Visibility’ metric. As ‘Visibility’ is
the product of DOV and the diversity of typologies in which
the keyword appears, a rapid increase in DOV is not matched
by ‘Visibility’ if the keyword ‘salarie’ does not
appear in a variety of typologies; a keyword that stays within the
same typologies corresponds to a well-known signal. This
divergence highlights the more nuanced understanding offered
by the ‘Visibility’ metric, which takes into account not just
keyword frequency but also the diversity of contexts
(typologies) in which the keyword is present. Conversely, there
are instances in which ‘Visibility’ peaks significantly higher
than DOV. This occurs when the keyword, despite being
used less frequently, is present in a wider range of contexts,
leading to a sharper rise in ‘Visibility’ than in DOV.
‘Issue’ and DOD behave similarly to
‘Visibility’ and DOV, respectively.</p>
        <p>We have implemented our
system as an auxiliary tool for a proactive approach. It is
updated automatically whenever the list of potential weak signals
changes or when the model predicts that a keyword might
be a weak signal. This facilitates early
detection and prevention efforts, reinforcing the system’s
contribution to accident prevention.</p>
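        <p>To make the relation between a frequency-based metric and ‘Visibility’ concrete, the following sketch recomputes both after every report, using only past and present reports. It is a simplified illustration under stated assumptions: the frequency ratio stands in for DOV without any time weighting, the diversity term is assumed to be the share of known typologies in which the keyword has appeared, and the reports are invented.</p>
        <preformat>
```python
def running_metrics(reports, keyword, n_typologies):
    """Return (dov, visibility) after each report, ignoring future reports.

    reports: chronologically ordered list of (typology, text) pairs.
    n_typologies: total number of typologies, assumed to be known.
    """
    occurrences = 0
    seen_typologies = set()
    results = []
    for n_reports, (typology, text) in enumerate(reports, start=1):
        hits = text.lower().split().count(keyword)
        occurrences += hits
        if hits:
            seen_typologies.add(typology)
        dov = occurrences / n_reports  # simplified frequency ratio (assumption)
        diversity = len(seen_typologies) / n_typologies  # assumed normalisation
        results.append((dov, dov * diversity))
    return results


# Invented reports: the keyword is frequent but confined to few typologies.
reports = [
    ("fall", "employee slipped on the stairs"),
    ("fall", "employee fell from a ladder"),
    ("handling", "box dropped during unloading"),
    ("handling", "employee strained back lifting a box"),
]
series = running_metrics(reports, "employee", n_typologies=4)
```
        </preformat>
        <p>In this toy series the frequency ratio starts at its maximum while the product stays low, because the keyword has appeared in only one typology; the product rises only once the keyword shows up in a second typology.</p>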
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Interpretation</title>
        <p>Despite our efforts to include contextualized metrics, we
acknowledged the need for an approach that offers a
broader, global view of the system. To address this,
we developed a dynamic dashboard that lets the
expert select specific data from the structured dataset,
including user-related and event-related information. We
transformed all available data into selectable variables,
thereby augmenting the depth and relevance of the
contextualization process. The dynamic dashboard
facilitates the confirmation of weak signals by enabling experts
to filter and choose data from a comprehensive set of
structured variables:
          <list list-type="order">
            <list-item><p>Type of periods</p></list-item>
            <list-item><p>Sources: Population</p></list-item>
            <list-item><p>Affiliation: where the accident report was submitted</p></list-item>
            <list-item><p>Topic: typologies of the event</p></list-item>
            <list-item><p>Type of keyword (named entity recognition, NER)</p></list-item>
          </list>
Figure 6 presents a partial
view of the dynamic dashboard interface to illustrate how it
supports data contextualization and the identification of
weak signals. This segment
showcases a subset of the configurable options available to
analysts: ‘Type of periods’, ‘Sources: Population’,
‘Affiliation’ (the submission source of the accident
report), ‘Topic’ (the typologies of the event),
and ‘Type of Keyword’, which leverages Named Entity
Recognition (NER) for deeper analysis. This figure
represents only a fraction of the full
interface, chosen to demonstrate the flexibility
and depth of analysis without disclosing sensitive variable
details. Through these selectable options, analysts can
tailor their examination of the
data, which improves the precision and relevance of their
findings and accelerates the confirmation
of weak signals. Combining
textual and structured data enhances the
detection and interpretation of weak signals and supports a
more thorough understanding of underlying
trends and relationships.</p>
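        <p>The filtering behaviour of such a dashboard can be approximated in a few lines. The sketch below is only an illustration: the field names (period, affiliation, topic, keyword) mirror the options listed above but are hypothetical, as are the records and the function name filter_events.</p>
        <preformat>
```python
def filter_events(events, **selected):
    """Keep events whose fields match every selected value.

    A selection may be a single value or a collection of accepted values.
    """
    def matches(event):
        for name, wanted in selected.items():
            accepted = wanted if isinstance(wanted, (set, list, tuple)) else {wanted}
            if event.get(name) not in accepted:
                return False
        return True

    return [e for e in events if matches(e)]


# Invented, anonymized records with hypothetical field names.
events = [
    {"period": "2023-06", "affiliation": "site_a", "topic": "fall", "keyword": "kw_1"},
    {"period": "2023-06", "affiliation": "site_b", "topic": "handling", "keyword": "kw_1"},
    {"period": "2023-07", "affiliation": "site_a", "topic": "fall", "keyword": "kw_2"},
]
june_falls = filter_events(events, period="2023-06", topic={"fall"})
```
        </preformat>
        <p>Each keyword argument plays the role of one selectable variable, so combining several arguments narrows the view in the same way as combining several dashboard filters.</p>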
        <p>Multi-word analysis. To show more accurate and
informative details about the detected term of interest than
an analysis based only on single words, we perform a
multi-word analysis. This enhances overall interpretability by
showing the keywords that co-occur
with the selected signal. Figure 7 shows the multi-word
analysis for a potential weak signal. Because of GDPR
constraints, we had to anonymize the keywords and the events
associated with them.</p>
        <p>
          Machine learning. We start by estimating the probability
of a keyword’s occurrence based on the available metadata.
This process involves the use of CamemBERT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
tokenization and embedding in combination with the structured
data. The resulting information is then presented to the
expert for portfolio map interpretation. Each time a keyword
is selected by the expert, we enrich a dataset that will, in
the future, enable us to develop a more effective
machine-learning model. However, due to the low absolute frequency
of weak signals within our dataset, the effectiveness of the
current approach could not be assessed. Nonetheless, it
allows continual refinement and improvement of the model.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>This paper has introduced new metrics and
processes tailored to detecting weak signals in
occupational safety management. By integrating a
portfolio map approach and machine learning algorithms, the
proposed model aims to enhance the efficiency of the
detection process. We have adopted a contextualized
approach designed to support analysts and facilitate informed
decision-making. This approach captures
contextual aspects and makes fuller use of the available data.
By combining expert knowledge with machine
learning techniques, our model identifies weak
signals and potential emergent trends. Future research should
focus on further refining the model, incorporating
additional contextual information, and using more advanced
machine-learning techniques to enhance the detection of
weak signals. Regarding the model’s interpretability, we
anticipate that integrating large language
models, such as ChatGPT, could make a significant
contribution. Such integration could provide
richer, more nuanced interpretations of the detected weak
signals, thereby enabling safety management professionals
to make more informed decisions. Furthermore, evaluating
our model’s adaptability across a variety of domains and
industries would help establish how broadly it applies
to occupational safety management.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. I.</given-names>
            <surname>Ansoff</surname>
          </string-name>
          ,
          <article-title>Managing strategic surprise by response to weak signals</article-title>
          ,
          <source>California Management Review</source>
          <volume>18</volume>
          (
          <year>1975</year>
          )
          <fpage>21</fpage>
          -
          <lpage>33</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.2307/41164635</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Godet</surname>
          </string-name>
          ,
          <source>From Anticipation to Action: A Handbook of Strategic Prospective</source>
          , UNESCO Publishing,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hiltunen</surname>
          </string-name>
          ,
          <article-title>The future sign and its three dimensions</article-title>
          ,
          <source>Futures</source>
          <volume>40</volume>
          (
          <year>2007</year>
          )
          <fpage>247</fpage>
          -
          <lpage>260</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.futures.2007.08.021</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Peirce</surname>
          </string-name>
          ,
          <article-title>Some consequences of four incapacities</article-title>
          ,
          <source>Journal of Speculative Philosophy</source>
          <volume>2</volume>
          (
          <year>1868</year>
          )
          <fpage>140</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Roh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Exploring signals for a nuclear future using social big data</article-title>
          ,
          <source>Sustainability</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <elocation-id>5563</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/su12145563</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Krigsholm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Riekkinen</surname>
          </string-name>
          ,
          <article-title>Applying text mining for identifying future signals of land administration</article-title>
          ,
          <source>Land</source>
          <volume>8</volume>
          (
          <year>2019</year>
          ). doi:
          <pub-id pub-id-type="doi">10.3390/land8120181</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <article-title>Detecting weak signals for long-term business opportunities using text mining of web news</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>39</volume>
          (
          <year>2012</year>
          )
          <fpage>12543</fpage>
          -
          <lpage>12550</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.eswa.2012.04.059</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Camara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kotzinos</surname>
          </string-name>
          ,
          <article-title>Weak signal detection and identification in large data sets: A review of methods and applications</article-title>
          ,
          <source>ResearchGate</source>
          ,
          <year>2021</year>
          .
          doi:
          <pub-id pub-id-type="doi">10.13140/RG.2.2.20808.24327/1</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Jamra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Savonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Leclercq</surname>
          </string-name>
          ,
          <article-title>Beam: A network topology framework to detect weak signals</article-title>
          ,
          <source>International Journal of Advanced Computer Science and Applications</source>
          (
          <year>2022</year>
          ). doi:
          <pub-id pub-id-type="doi">10.14569/ijacsa.2022.0130402</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <article-title>Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>1</volume>
          (
          <year>2018</year>
          )
          <fpage>206</fpage>
          -
          <lpage>215</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1038/s42256-019-0048-x</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Roh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Exploring signals for a nuclear future using social big data</article-title>
          ,
          <source>Sustainability</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <elocation-id>5563</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/su12145563</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Romero-Frías</surname>
          </string-name>
          ,
          <article-title>Exploring web keyword analysis as an alternative to link analysis: a multi-industry case</article-title>
          ,
          <source>Scientometrics</source>
          (
          <year>2012</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1007/s11192-012-0640-x</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.-J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>Identification of future signal based on the quantitative and qualitative text mining: A case study on ethical issues in artificial intelligence</article-title>
          ,
          <source>Quality and Quantity: International Journal of Methodology</source>
          <volume>52</volume>
          (
          <year>2018</year>
          )
          <fpage>653</fpage>
          -
          <lpage>667</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>A Study on the Development Direction of the New Energy Industry Through the Internet of Things - Searching for Future Signals Using Text Mining</article-title>
          ,
          <source>Technical Report, Korea Energy Economics Institute</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Griol-Barres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Milla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cebrián</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Millet</surname>
          </string-name>
          ,
          <article-title>Detecting weak signals of the future: A system implementation based on text mining and natural language processing</article-title>
          ,
          <source>Sustainability</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.3390/su12198141</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maitre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Menard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chiron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouju</surname>
          </string-name>
          ,
          <article-title>Détection de signaux faibles dans des masses de données faiblement structurées</article-title>
          ,
          <source>RIDoWS</source>
          <volume>3</volume>
          (
          <year>2019</year>
          )
          . doi:
          <pub-id pub-id-type="doi">10.21494/ISTE.OP.2020.0463</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J. O.</given-names>
            <surname>Suárez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dupont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>Villemonte de la Clergerie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Seddah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>CamemBERT: A tasty French language model</article-title>
          , in:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7203</fpage>
          -
          <lpage>7219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Megahed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Mahajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Incorporating experts' judgment into machine learning models</article-title>
          ,
          <source>Expert Systems With Applications</source>
          (
          <year>2023</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1016/j.eswa.2023.120118</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Boyacı</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Canyakmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>de Véricourt</surname>
          </string-name>
          ,
          <article-title>Human and machine: The impact of machine input on decision making under cognitive limitations</article-title>
          ,
          <source>Management Science</source>
          (
          <year>2023</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1287/mnsc.2023.4744</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>