<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Word Embeddings and Transformers to Extract Semantics from Building Regulations Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Odinakachukwu Okonkwo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amna Dridi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edlira Vakaj</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computing, Engineering and Built Environment, Birmingham City University</institution>
          ,
          <addr-line>B4 7XG, Birmingham</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>176</fpage>
      <lpage>188</lpage>
      <abstract>
<p>In recent years, interest in knowledge extraction in the architecture, engineering and construction (AEC) domain has grown dramatically. Along with advances in the AEC domain, a massive amount of data is collected from sensors, project management software, drones and 3D scanning. However, construction regulatory knowledge has been maintained primarily in the form of unstructured text. Natural Language Processing (NLP) has recently been introduced to the construction industry to extract underlying knowledge from unstructured data. For instance, NLP can be used to extract key information from construction contracts and specifications, identify potential risks, and automate compliance checking. It is considered impractical for construction engineers and stakeholders to author formal, accurate, and structured building regulatory rules. Moreover, previous efforts on extracting knowledge from unstructured text in the AEC domain have mainly focused on basic concepts and hierarchies for ontology engineering using traditional NLP techniques, rather than digging deeply into the nature of the NLP techniques used and their abilities to capture semantics from building regulations text. In this context, this paper focuses on the development of a semantic-based testing approach that studies the performance of modern NLP techniques, namely word embeddings and transformers, on extracting semantic regularities from building regulatory text. Specifically, this paper studies the ability of word2vec, BERT, and Sentence BERT (SBERT) to extract semantic regularities from the British building regulations at both word and sentence levels. The UK building regulations code has been used as a dataset. A ground truth of semantic relations has been manually curated from the well-established Brick Ontology to test the ability of the proposed NLP techniques to capture the semantic regularities of building regulatory text. Both quantitative and qualitative analyses have been performed, and the obtained results show that modern NLP techniques can reliably capture semantic regularities from building regulations text at both word and sentence levels, with an accuracy that reaches 80% at the word level and 100% at the sentence level.</p>
      </abstract>
      <kwd-group>
        <kwd>AEC domain</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>BERT</kwd>
        <kwd>word2vec</kwd>
        <kwd>Sentence BERT</kwd>
        <kwd>Word embeddings</kwd>
        <kwd>Transformers</kwd>
        <kwd>Building regulations</kwd>
        <kwd>Semantic regulations</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The automation of activities that were once done manually has become quite prevalent in our world. Textual data is now processed automatically to extract interesting insights and information, saving time and manpower. One widely growing domain that fosters this digitisation process is the Architecture, Engineering and Construction (AEC) domain. Natural Language Processing (NLP), as a branch of Machine Learning (ML) dealing with textual data, has recently skyrocketed in the AEC domain, essentially with the success of word embeddings [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and transformers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Word embeddings [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are numerical representations of words in a high-dimensional vector
space. Transformers, on the other hand, are a type of neural network architecture for processing
sequences of text [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Together, they have enabled significant advances in a wide range of
NLP applications, among them applications in the AEC domain such as safety management [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
automated compliance checking [
        <xref ref-type="bibr" rid="ref10 ref4 ref5 ref6 ref7 ref8 ref9">4, 5, 6, 7, 8, 9, 10</xref>
        ], public opinion analysis [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], building
design [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], contract management [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ], and others [
        <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
        ].
      </p>
      <p>
        Due to their ability to reduce the gap between human and computer language comprehension, word embeddings and transformers have typically been used as features for different ML tasks dealing with diverse aspects of the AEC domain, such as text classification [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and clustering [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
Despite the sensitivity of the hyper-parameters of these NLP techniques and their characteristic of being data- and task-dependent [
        <xref ref-type="bibr" rid="ref21">21</xref>
], only a few studies deeply investigate the capabilities of word embeddings and transformers to capture the semantic regularities within a domain-specific text [
        <xref ref-type="bibr" rid="ref22">22</xref>
]. Given that it is important in the construction domain, specifically with regulatory text, to extract the required and exact information, an in-depth analysis of these NLP techniques applied to building regulations text is urgently needed.
      </p>
      <p>
        This paper examines the ability of word embeddings and transformers to extract semantic meaning from building regulatory text. Semantics is important here because the information in a phrase or piece of text is stored in organised sequences, with the semantic arrangement of words expressing the meaning of the text. This also implies that the integrity of the semantic meaning of sentences must be maintained during text extraction. To this end, word2vec [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and Sentence BERT (SBERT) [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] have been used as word embedding and transformer techniques, respectively, to test their ability to capture semantics from the regulatory text in the AEC domain. To make our point, we propose training these models on the UK building regulations code; moreover, we propose using common-sense knowledge manually curated from a well-established ontology in the built environment domain, namely the Brick Ontology [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. As a result, this work adds breadth to the debate on the strengths of using modern NLP techniques for knowledge extraction from regulatory text in the construction industry. Although transformers have been widely used in similar works [
        <xref ref-type="bibr" rid="ref26 ref5 ref7 ref8">5, 7, 8, 26</xref>
        ], none of the existing work has studied their suitability for the task or how effective they are at capturing the semantic regularities of domain-specific language use. This is especially important when it comes to critical downstream tasks like information extraction and rule generation.
With regard to the word embeddings literature, many researchers have studied the capabilities of these techniques to capture the semantic regularities within a domain-specific language, such as the medical domain [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] or the scientific domain [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. However, to the best of our knowledge, no such work exists for AEC-related language. Additionally, going beyond the aforementioned literature, this work also studies the suitability of transformers to capture semantic regularities at both word and sentence levels. To the best of our knowledge, the proposed work represents the first attempt to methodically test the ability of word embeddings and transformers to capture semantics in building text.
      </p>
      <p>We list the major contributions of this work as follows: (i) we propose the accuracy of word2vec, BERT, and SBERT in capturing the semantic regularities within building text as an objective to measure while learning the models, (ii) we create an analogy dataset for the building regulations text by manually curating the Brick Ontology, and (iii) we evaluate our work quantitatively and qualitatively on a corpus generated from the UK building regulations code at both word and sentence levels. Our embeddings detected interesting semantic relations in the AEC domain such as “meter is to electricity as consumer_unit is to consumer” and “room is a type of space as door is a type of fitting”. The obtained results are, therefore, both promising and insightful.</p>
      <p>The rest of the paper is organised as follows. Section 2 summarises the existing approaches to NLP in the construction industry and gives an overview of work that has attempted to use word embeddings and transformers in the AEC domain. Section 3 presents our methodology and describes the proposed word embedding and transformer techniques. Section 4 describes the dataset we created from the UK building regulations code and the analogy dataset we created from the Brick Ontology as a gold standard, and presents and discusses the obtained results. Finally, in Section 5 we conclude and draw future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>This section summarises the literature on both NLP in the AEC domain and semantic search with word embeddings and transformers, hence covering the two topics of this paper.</p>
      <sec id="sec-2-1">
        <title>2.1. NLP in AEC domain</title>
        <p>
          Zhang and El-Gohary were among the first researchers to apply NLP for automated Information Extraction (IE) in the AEC domain. They used a set of pattern-matching-based IE rules utilising a series of syntactic and semantic text features in the patterns of the building rules. They also utilised an ontology to support the identification of semantic text features. The IE algorithms they built were tested on extracting quantitative requirements from the 2009 International Building Code and achieved precision and recall of 0.969 and 0.944, respectively. However, they opined that the use of Machine Learning algorithms for text processing yielded lower precision and recall than manually coded rules, which require more human effort [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
However, because the manual process lacked both flexibility and scalability (whenever the building rules change, the coded rules must be adjusted), Zhang and El-Gohary [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], therefore, proposed the use of deep neural networks for extracting semantic and syntactic IE aspects from AEC regulation documents. The suggested approach performed well, with an average accuracy of 93% and a recall of 92.9% [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. In another work, Zhang et al.[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] proposed an autonomous technique for hazard inference using construction scene graphs and the C-BERT network. First, computer vision was used to produce construction scene graphs with interaction-level scene descriptions that included entities, characteristics, and their interactions. Second, the C-BERT network was designed to infer potential hazards by combining the scene graphs with domain information such as building rules. Five separate working settings were employed to illustrate the validity of the suggested method, which achieved an identification accuracy of 97.82%. It offered an effective mechanism for merging visual information and domain knowledge for automated safety monitoring and paved the way for large-scale multi-modal information fusion within the industry.
        </p>
        <p>
          In the same context of rule automation, Zhou et al. [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] proposed an automated rule interpretation system for automated compliance checking that interprets sentences into single-requirement and multi-requirement rules. The parsing accuracy for basic sentences was 99.6%, exceeding the state of the art, with 91.0% parsing precision for complicated sentences, which are challenging for current algorithms to handle.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Word Embeddings and Transformers for Semantic Search in AEC domain</title>
        <p>
          A.J.P. Tixier et al. [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] applied word embedding techniques (word2vec) to an 11-million-word corpus obtained from the construction domain and derived word vectors from the process. They explored the embedding space created by the vectors and affirmed that the word vectors were able to capture meaningful semantics related to construction-specific concepts. They evaluated the performance of the vectors against ones trained on a 100B-word corpus (Google News) within the confines of an injury report classification task; without any parameter tuning, their embeddings gave competitive results and outperformed the Google News vectors in many cases.
        </p>
        <p>
          Yuan et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] devised a technique for determining phrase similarity based on the BERT model, and compared it with the classic ALBERT, ESIM, and BIMPM models. Their experimental findings demonstrated that the BERT model calculates text similarity with an accuracy of 87%, which is clearly superior to the other models. Simultaneously, a synonym model was trained using Word2Vec to extract target-word-related synonyms.
        </p>
        <p>
          Risch and Krestel [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] applied domain-specific word embedding techniques to the automated analysis of patent applications. Here, they compared new patent applications to existing patents in the same class. However, one challenge was patent-specific language use, especially in phrases and vocabulary. To account for this, the authors proposed pre-trained word embeddings specific to the patent domain. They trained the model on a massive dataset of more than 5 million patents and assessed its classification performance. To this end, the authors presented a deep learning technique for automated patent categorization based on gated recurrent units and the trained word embeddings. Experiments on a conventional evaluation dataset indicated that the strategy improved patent categorization accuracy by 17% compared to state-of-the-art methods.
        </p>
        <p>
          It is important that the semantic meaning of sentences be preserved when extracting information [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. This is why we adopt word embedding techniques (word2vec) and Bidirectional Encoder Representations from Transformers (BERT) to preserve the context of sentences during knowledge extraction from the regulatory text in the AEC domain.
        </p>
      <p>It is important to note that in all the aforementioned literature, BERT representations or word embeddings were used as features for Machine Learning models, assuming that these techniques are capable of representing the semantic regularities within the natural language of regulatory text in the AEC domain. In our case, however, the research problem is different: we aim to test the ability of word embeddings and transformers to capture the semantic regularities within the regulatory text in the AEC domain before applying them to downstream reasoning tasks, such as information extraction and rule generalisation. Given that these tasks are critical and that the models should be accurate enough to capture the semantics of the language, this paper represents a step beyond what others have done, ensuring that the NLP techniques used to represent the regulatory text are suitable for capturing its semantic regularities.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This study focuses on modern NLP techniques, namely word2vec, BERT, and Sentence BERT, applied to regulatory text in the construction industry. To this end, we study the capabilities of these techniques to capture the semantic knowledge embedded within the regulatory text at both word and sentence levels. The aim of this study is to methodically inform the choice of appropriate NLP techniques as features to represent a domain-specific text, such as the building regulations text. This task is important to guarantee the suitability of these techniques for representing the regulatory text in the AEC domain when they are applied in critical downstream tasks like information extraction or rule generalisation.</p>
      <sec id="sec-3-1">
        <title>3.1. Semantics at word level</title>
        <p>
          In order to capture the semantics at word level from the regulatory text in the AEC domain, we propose to use word2vec [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and BERT [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          3.1.1. word2vec
        </p>
        <p>
          Word2vec is a neural network-based approach that uses unsupervised learning to create word embeddings, which are vector representations of words in a high-dimensional space. The algorithm works by training a neural network to predict the context words given a target word, or vice versa. In other words, given a large corpus of text data, word2vec learns to represent each word in the corpus as a vector in a multi-dimensional space, such that words that are semantically similar are placed closer together in this space [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>
          Word2vec has two main architectures: the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. In CBOW, the algorithm predicts the target word from its context, while in Skip-gram, the algorithm predicts the context words given a target word. Previous results reported in the literature have shown that the Skip-gram [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] model not only produces useful word representations but is also efficient to train. For this reason, we focus on it to build our embeddings for regulatory text in the AEC domain in this study.
        </p>
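        <p>To make the Skip-gram objective concrete, the (target, context) training pairs it learns from can be sketched as follows. This is a minimal illustrative sketch: the sentence fragment and window size are invented assumptions, not the corpus or hyper-parameters used in this work.</p>

```python
# Minimal sketch: generating Skip-gram (target, context) training pairs.
# The sentence and the window size are illustrative assumptions only.

def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs within a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "fire safety requirements apply to every dwelling".split()
pairs = skipgram_pairs(tokens, window=2)
print(pairs[:4])
# A neural network is then trained to predict the context word from
# the target word for each such pair; the learned input weights become
# the word vectors.
```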
        <p>
          However, one drawback of the word2vec model is that it does not take into account the context in which a word is used, as words sometimes have different meanings in different sentences. For example, consider (i) “workers' rights on the site” and (ii) “the right side of the building”. Word2vec will assign the same vector to the word “right” in both cases. This shortcoming gave rise to BERT, as discussed in the next section.
        </p>
        <p>
          3.1.2. BERT
        </p>
        <p>
          BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based neural network model for NLP tasks. It was developed by Google AI Language in 2018 and is one of the most popular and powerful pre-trained language models [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Unlike traditional NLP models, which process text in a linear manner (from left to right or right to left), BERT is designed to process text in both directions, using a bidirectional approach. This allows BERT to capture the context and meaning of words more accurately, and to better understand the relationships between words and sentences [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ].
        </p>
        <p>
          Figure 1 displays the BERT methodology we used. The BERT model comes pre-trained with about 110 million parameters [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. We import the Python library pytorch_transformer, from which we import a BertTokenizer and the pre-trained BertModel. Then, we fine-tune the model on our UK building regulations corpus in order to achieve better performance.
        </p>
        <p>Transformers offers a variety of classes for using BERT on various tasks (token classification, text classification, etc.). Here, we utilise the fundamental BertModel, which is a decent option if all we want from BERT is to extract embeddings, with no specified output requirement. We evaluate the model by testing how it predicts missing words in sentences, replacing those words with the [MASK] token. We tested both the pre-trained model and the fine-tuned model in order to assess the importance of fine-tuning the BERT model for our task.</p>
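        <p>The [MASK]-based probing described above can be sketched with the Hugging Face transformers fill-mask pipeline. This is a hypothetical sketch: the generic bert-base-uncased checkpoint and the example sentence are illustrative stand-ins, not the fine-tuned model or corpus used in this work.</p>

```python
# Hypothetical sketch of the [MASK]-prediction test using the Hugging Face
# transformers fill-mask pipeline. The checkpoint and sentence are
# illustrative; this work fine-tunes BERT on the UK building regulations.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("Every dwelling must be fitted with a smoke [MASK].")

# Each prediction carries a candidate token and its probability score.
for p in predictions[:5]:
    print(p["token_str"], round(p["score"], 3))
```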
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Semantics at sentence level</title>
        <p>
          In order to capture the semantics at sentence level from the regulatory text in the AEC domain, we propose to use Sentence BERT (SBERT) [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <p>
          3.2.1. Sentence BERT
        </p>
        <p>
          Sentence-BERT (SBERT) is a variation of the BERT (Bidirectional Encoder Representations from Transformers) model specifically designed to generate sentence embeddings, which are vector representations of sentences in a high-dimensional space [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. The idea behind SBERT is to leverage the power of pre-trained transformer-based models like BERT for sentence-level tasks, such as semantic similarity, the task we focus on in this work. Unlike BERT, which generates a fixed-length vector representation for each word in the input sequence, SBERT generates a fixed-length vector representation for each sentence in the input. SBERT achieves this by applying pooling techniques such as max-pooling or mean-pooling over the output of the last layer of the transformer, which results in a sentence-level embedding [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ].
        </p>
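        <p>The pooling step can be illustrated in a few lines of NumPy. This is a minimal sketch: the token-embedding matrix is random stand-in data for the output of the transformer's last layer, and the dimensionality is an illustrative assumption.</p>

```python
# Minimal sketch of mean-pooling and max-pooling over a transformer's
# last-layer token embeddings to obtain one sentence-level vector.
# The token embeddings below are random stand-in data.
import numpy as np

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 384))  # 12 tokens, 384 dimensions

mean_pooled = token_embeddings.mean(axis=0)  # one vector per sentence
max_pooled = token_embeddings.max(axis=0)    # alternative pooling strategy

print(mean_pooled.shape, max_pooled.shape)  # (384,) (384,)
```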
        <p>Figure 2 displays the process methodology for the SBERT model. First, we converted our PDF to a text file and preprocessed the data in the text file using nltk to tokenize the sentences. Then, regular expressions were used to clean the data; this time, we did not remove stop words, so that we could maintain the semantic relationships between words in the sentences. The output of preprocessing was a corpus containing only sentences.</p>
        <p>The sentence transformer model SBERT was trained on our dataset to create embeddings. We then queried the model to extract sentences related to a search query, and the results returned only words and sentences similar to the query.</p>
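        <p>The querying step can be sketched as a cosine-similarity search over sentence embeddings. This is a minimal sketch: the three-dimensional vectors and the sentences are invented stand-ins for real SBERT outputs and corpus sentences.</p>

```python
# Minimal sketch of semantic search: rank corpus sentences by cosine
# similarity between their embeddings and a query embedding.
# The embeddings below are toy stand-ins for real SBERT outputs.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

corpus = {
    "smoke alarms shall be installed in every dwelling": np.array([0.9, 0.1, 0.0]),
    "external walls must resist the spread of fire": np.array([0.7, 0.6, 0.1]),
    "stairs shall have a minimum headroom of 2m": np.array([0.0, 0.2, 0.9]),
}
query_vec = np.array([1.0, 0.2, 0.0])  # e.g. an embedded "fire detection" query

ranked = sorted(corpus, key=lambda s: cosine(corpus[s], query_vec), reverse=True)
print(ranked[0])
```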
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Semantic regularities in the building regulations text</title>
        <p>
          Word embeddings and transformers owe their success to their ability to capture syntactic and semantic regularities in natural language. Interestingly, they represent each relationship by a relation-specific vector offset [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. For example, the famous analogy “king is to queen as
man is to woman” is encoded in the vector space by the vector arithmetic “king - man + woman
= queen”. More specifically, the word analogy task aims at answering the question “man is to
woman as king is to — ?” given the two pairs of words that share a relation (“man:woman”,
“king:queen”), where the identity of the fourth word (“queen”) is hidden.
        </p>
        <p>Motivated by the ability of modern NLP techniques to extract semantic knowledge from textual data without any prior domain knowledge, this ability is evaluated on a domain-specific text, namely the regulatory text of the AEC domain. The aim is to assess to what extent these NLP techniques are able to correctly represent the semantic knowledge in regulatory text, given the complexity of construction-industry regulations compared to natural language.</p>
        <p>The semantic extraction methodology adopted at the word level is to query for building-related regularities captured in the vector model through simple vector subtraction and addition. More formally, given two pairs of words (word_a : word_a') and (word_b : word_b'), the aim is to answer the question “word_a is to word_a' as word_b is to —?”. Thus, the vector of the hidden word word_b' will be the vector (word_a' − word_a + word_b), suggesting that the analogy question can be solved by optimising:
arg max_{word_b' ∈ V} similarity(word_b', word_a' − word_a + word_b)   (1)
where V is the vocabulary and similarity is the cosine similarity measure.</p>
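        <p>The vector arithmetic of the analogy task can be sketched with toy embeddings. The four three-dimensional vectors below are invented purely so that the classic analogy resolves; real models use vectors of hundreds of dimensions learned from a corpus.</p>

```python
# Minimal sketch of solving "word_a is to word_a' as word_b is to --?"
# by maximising cosine similarity to v(word_a') - v(word_a) + v(word_b).
# The toy vectors below are invented for illustration only.
import numpy as np

vocab = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, a_prime, b):
    target = vocab[a_prime] - vocab[a] + vocab[b]
    # Exclude the three query words, as is conventional for analogy tasks.
    candidates = {w: v for w, v in vocab.items() if w not in {a, a_prime, b}}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "woman", "king"))  # queen
```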
        <p>
          This task is challenging for building text language, as no gold standard is available to evaluate the efficacy of word embeddings and transformers in identifying linguistic regularities in unstructured regulatory text in the AEC domain, unlike existing work that uses either the gold standard
defined by Mikolov et al. [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] for general natural language tasks or predefined ontologies like the NDF-RT (National Drug File - Reference Terminology) ontology for the medical domain. Although various building-related ontologies exist, such
as ifcOWL [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], the Building Topology Ontology (BOT) [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], the Building Product Ontology
(BPO) [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], etc., automatic mapping between the terminology used in our data source (UK Regulations) and the ontology concepts was hard or infeasible in most cases. To overcome this problem, we propose to use the Brick Ontology [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], which is a semantic metadata standard representing the physical and logical entities in buildings, together with a minimal set of relationships that capture the connections between entities. The Brick Ontology was useful because it essentially replaces unstructured labels with a semi-structured set of tags, which guarantees, to some extent, the mapping between the concepts in our corpora and these tags. To build our ground truth, we manually curate relationships related to the building regulations domain from the Brick Ontology and define a test set of analogy questions as semantic questions following the relation described above, after verifying that the concepts present in every semantic question exist in our corpora. This verification step is necessary to guarantee that all the extracted relationships can be tested and to fairly assess the performance of our models. The semantic questions are formed based on the semantic relationships between concepts in the ontology, such as “is-a”, “hasPart”, “isPartOf”, “isContainedIn”, “isTypeOf”, etc. For example, “roof” and “parapet” are considered components of the building elements “wall” and “balcony”, respectively. Accordingly, the analogical question should be “roof is part of wall as parapet is part of —?”. To answer correctly, the model should identify the missing term, counted as a correct match, by finding the word “balcony”, whose vector representation is closest to the vector (“roof” - “wall” + “parapet”) according to cosine similarity. Similarly, for the semantic relationship “room is a type of space as door is a type of fitting”, given the terms “room”, “space”, and “door”, the model should be able to predict the term “fitting”. Recall that, given the specificity and complexity of the domain language and the interchangeability of technical terms, instead of using exact correspondence as the correct match, we adopt an approximate correspondence that considers an answer correct if it belongs to the top 10 nearest words given by cosine similarity, in order to guarantee the applicability of the generated embeddings to regulatory text in the AEC domain. This approach is based on the work published by Dridi et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] on word2vec hyper-parametrisation in the scientific domain.
        </p>
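        <p>The approximate-correspondence scoring can be sketched as follows. This is a minimal sketch: nearest_words is a stand-in for a real embedding model's ranked cosine-similarity neighbours, and the analogy questions are illustrative.</p>

```python
# Minimal sketch of top-10 approximate-correspondence accuracy: an
# analogy answer counts as correct if the expected word appears among
# the model's 10 nearest neighbours. nearest_words is a stand-in for
# a real embedding model's ranked output.

def accuracy_at_k(questions, nearest_words, k=10):
    """questions: list of (a, a_prime, b, expected_b_prime) tuples."""
    correct = 0
    for a, a_prime, b, expected in questions:
        if expected in nearest_words(a, a_prime, b)[:k]:
            correct += 1
    return correct / len(questions)

# Toy stand-in: pretend the model always returns this ranked list.
def nearest_words(a, a_prime, b):
    return ["balcony", "wall", "roof", "parapet", "floor"]

questions = [
    ("roof", "wall", "parapet", "balcony"),  # correct: "balcony" is returned
    ("room", "space", "door", "fitting"),    # wrong: "fitting" is not returned
]
print(accuracy_at_k(questions, nearest_words))  # 0.5
```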
        <p>The methodology described above applies to word-level semantics. For sentence-level semantics, we propose to query the SBERT model with sentences extracted from the UK building regulations manual, and to test its ability to capture the semantic meaning of sentences, including their context and relationships with other sentences.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>To show how word embedding techniques and transformers can extract semantics from text
data, we used the UK Building Regulations Code document (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1082748/Merged_Approved_Documents__Jun2022_.pdf),
which is publicly available in PDF format. The document contains 18 chapters, each related to a specific
building domain such as construction, fire safety, energy efficiency, etc. It contains
building rules that guide designers and builders on the specifications and guidelines they must
adhere to. All chapters in the document were used for our work. The
document was first converted to a text file and then pre-processed before training
the NLP models. The pre-processing consists of (i) removal of all punctuation and lower-casing
of the corpus; (ii) removal of stop-words using the Stanford NLP stop-word list, enriched with a list of
irrelevant keywords extracted from the UK building regulations code, such as “online version”, “edition”, “for
use in England”, etc.; and (iii) construction of a bag-of-words where words are either unigrams, used
for standard word2vec training, or bigrams, used for word2phrase learning. The use of bigrams
is justified by their frequent occurrence in the building text, e.g. “fire safety”, “energy efficiency”, etc.
Table 1 summarises the statistics of the dataset.</p>
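        <p>As a minimal sketch (not the authors' exact pipeline), the pre-processing steps above can be illustrated as follows; the stop-word sets here are placeholder stand-ins for the Stanford NLP list and the extracted domain-irrelevant keywords:</p>

```python
# Minimal sketch of steps (i)-(ii): lower-casing, punctuation removal,
# and stop-word filtering. The stop-word lists are illustrative placeholders.
import re

STOP_WORDS = {"the", "of", "and", "to", "in", "for"}   # stand-in for the Stanford NLP list
EXTRA_STOPS = {"online", "version", "edition"}         # stand-in for domain-irrelevant keywords

def preprocess(text: str) -> list[list[str]]:
    sentences = []
    for line in text.lower().split("\n"):
        tokens = re.findall(r"[a-z]+", line)           # drop punctuation and digits
        tokens = [t for t in tokens if t not in STOP_WORDS | EXTRA_STOPS]
        if tokens:
            sentences.append(tokens)
    return sentences

corpus = preprocess("Fire safety: the means of escape.\nEnergy efficiency in buildings.")
# corpus -> [['fire', 'safety', 'means', 'escape'], ['energy', 'efficiency', 'buildings']]
```

        <p>For step (iii), bigram phrases such as “fire safety” can then be learned over these token lists with a word2phrase-style collocation detector (e.g. gensim's Phrases model).</p>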
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and Discussion</title>
        <p>The evaluation of the chosen NLP techniques for extracting semantic knowledge from the
building regulations text at both word and sentence levels was performed following the methodology
described in Section 3.3.</p>
        <p>As described in Section 3.3, the word-level semantic task queries for regularities
captured in the embedding model through simple vector subtraction and addition for the word2vec
and BERT models. The preliminary analogy dataset we created contains 100 analogical questions
extracted from the Brick Ontology as described in Section 3.3, and it serves as our ground
truth. The dataset is made publicly available for further use and enrichment
(https://github.com/salomena/Word-Embeddings-/blob/Semantic-relationships/Semantic%20Relationships).
After querying the models with the selected words/phrases, we calculate the similarity scores between the
query embeddings and the result embeddings using the cosine similarity metric. This gives a
measure of how similar each returned word/phrase from the UK building regulations code is
to the query. A summary of the models' accuracy is displayed in Table 2. For BERT, both the
pre-trained and fine-tuned models were used.</p>
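        <p>The analogy query via vector arithmetic can be sketched as follows; the toy embeddings are hypothetical stand-ins for the word2vec/BERT vectors trained on the regulations corpus:</p>

```python
# Sketch of the analogy query "a is to b as c is to ?" scored by cosine
# similarity. The 3-dimensional toy embeddings below are illustrative only.
import numpy as np

emb = {
    "fire_alarm":  np.array([1.0, 0.1, 0.0]),
    "fire":        np.array([1.0, 0.0, 0.0]),
    "smoke":       np.array([0.9, 0.0, 0.2]),
    "smoke_alarm": np.array([0.9, 0.1, 0.2]),
    "boiler":      np.array([0.0, 1.0, 0.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c):
    # query vector: a - b + c, e.g. fire_alarm - fire + smoke ~ smoke_alarm
    q = emb[a] - emb[b] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(q, emb[w]))

print(analogy("fire_alarm", "fire", "smoke"))  # -> "smoke_alarm"
```

        <p>With gensim, the same query is expressed as <monospace>model.wv.most_similar(positive=[a_word, c_word], negative=[b_word])</monospace>, which ranks the vocabulary by the same cosine score.</p>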
        <p>
          Table 2 shows that the accuracies of the word2vec and BERT models are both promising and prove
the ability of these two models to capture the semantic regularities within the regulatory text
in the AEC domain. Both word2vec and the fine-tuned BERT perform better than the
pre-trained BERT because they have been trained on our dataset. These findings are insightful and
promising because they clearly show that both word embeddings (word2vec) and transformers
(BERT and SBERT) are able to capture semantic regularities in AEC-related regulatory text,
with an accuracy of 80% for BERT at the word level, although it was observed in [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ] that
BERT performed poorly on domain-specific words.
        </p>
        <p>
          In addition to this quantitative analysis, a qualitative analysis has been performed with
the word2vec model. It consists of t-distributed stochastic neighbor embedding
(t-SNE) [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ] visualisations of words.
        </p>
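        <p>The visualisation step can be sketched as follows; the random vectors below stand in for the trained word2vec embeddings of a query word and its nearest neighbours:</p>

```python
# Sketch of projecting word vectors to 2-D with t-SNE for plotting.
# `vectors` is a random stand-in for real word2vec embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
words = ["legislation", "regulation", "requirement", "guidance", "electricity"]
vectors = rng.normal(size=(len(words), 50))

# perplexity must be smaller than the number of samples
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
for w, (x, y) in zip(words, coords):
    print(f"{w}: ({x:.1f}, {y:.1f})")
```

        <p>The resulting 2-D coordinates can then be scattered and labelled (e.g. with matplotlib) to produce plots like Figures 3a and 3b.</p>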
        <p>Figure 3: t-SNE visualisations of (a) the word “legislation” and (b) the word “electricity”.</p>
        <p>For instance, Figures 3a and 3b represent the vector offsets of the two words “legislation”
and “electricity”, respectively. It can be clearly seen from the plots that the words surrounding
each query word are semantically close in meaning. This confirms that modern NLP techniques,
namely word2vec and BERT, can reliably extract semantic knowledge from the building
regulations text.</p>
        <p>For the sentence-level semantic task, SBERT has been evaluated as described in Section 3.3.
A set of sentences was selected from the manual for the UK building regulations as queries,
the semantic search was performed, and the returned results were evaluated with the
cosine similarity measure. Interestingly, the model achieves 100% accuracy at the sentence level, as
shown in Table 2. These results are very promising and confirm the previous findings on the
capability of word embeddings and transformers to capture semantic knowledge at both
word and sentence levels.</p>
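        <p>The ranking logic of this semantic search can be sketched as follows; in the paper the sentence vectors come from SBERT (e.g. via the sentence-transformers library), whereas here small hand-made vectors stand in so the ranking is reproducible:</p>

```python
# Sketch of sentence-level semantic search: rank corpus sentences by cosine
# similarity to a query embedding. The 3-D vectors are illustrative stand-ins
# for SBERT encodings.
import numpy as np

corpus = {
    "Escape routes must be kept clear at all times.":      np.array([0.9, 0.1, 0.0]),
    "External walls shall resist the spread of fire.":     np.array([0.8, 0.2, 0.1]),
    "Fixed building services should be energy efficient.": np.array([0.0, 0.1, 0.9]),
}

def semantic_search(query_vec, top_k=2):
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    ranked = sorted(corpus, key=lambda s: cos(query_vec, corpus[s]), reverse=True)
    return ranked[:top_k]

# query vector standing in for SBERT's encoding of a fire-safety question
hits = semantic_search(np.array([1.0, 0.1, 0.0]))
print(hits[0])  # the escape-routes sentence ranks first
```

        <p>With sentence-transformers, the stand-in vectors would instead be produced by encoding the query and corpus sentences with a pre-trained SBERT model before applying the same cosine ranking.</p>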
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>Despite being popular and achieving state-of-the-art performance in tasks related to
knowledge extraction and semantic search, word embeddings and transformers are still used
intuitively, without proper testing of their ability to capture semantic regularities in
domain-specific text. From this perspective, and aiming to provide reliable information extraction from
regulatory text in the AEC domain, this work explored three models, namely word2vec, BERT
and Sentence BERT, and tested their reliability for the extraction of semantics from building
text at both word and sentence levels. The UK building regulations code was used as a
dataset to apply and test the models, and the Brick Ontology was used as a ground truth to
create the semantic relationships. The obtained results were insightful and promising. This
work adds breadth to automation in the construction industry, which has started to rely heavily on ML
and NLP techniques to deal with massive amounts of textual data. Applied to regulatory
text, and despite the sensitivity of the domain, our work has demonstrated the ability of modern
NLP techniques to effectively capture semantic knowledge.</p>
      <p>As a short-term objective, we plan to expand our ground truth of semantic analogies in the
building domain by combining the knowledge extracted from different resources, and to see how
the models perform on a larger dataset. As a long-term objective, we plan to leverage these NLP
techniques to extract information and auto-generate rules from the building regulations.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgements</title>
      <p>This work is partially funded by the European Union’s Horizon Europe research and innovation
programme under grant agreement no. 101056973 (ACCORD). UK participants in the Horizon Europe
project [ACCORD] are supported by UKRI grant numbers [10040207] (Cardiff University),
[10038999] (Birmingham City University) and [10049977] (Building Smart International).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>in: NIPS</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Neural Information Processing Systems</source>
          , NIPS'17, Curran Associates Inc., Red Hook, NY, USA,
          <year>2017</year>
          , p.
          <fpage>6000</fpage>
          -
          <lpage>6010</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shrestha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Morshed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pradhananga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <article-title>Leveraging accident investigation reports as leading indicators of construction safety using text classification</article-title>
          ,
          <source>in: Conference: ASCE Construction Research Congress (CRC)</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>490</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Salama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>El-Gohary</surname>
          </string-name>
          ,
          <article-title>Semantic modeling for automated compliance checking</article-title>
          , in: Computing in Civil Engineering (
          <year>2011</year>
          ),
          <year>2011</year>
          , pp.
          <fpage>641</fpage>
          -
          <lpage>648</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , N. El-Gohary,
          <article-title>Extraction of construction regulatory requirements from textual documents using natural language processing techniques</article-title>
          , in: Computing in Civil Engineering (
          <year>2012</year>
          ),
          <year>2012</year>
          , pp.
          <fpage>453</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>El-Gohary</surname>
          </string-name>
          ,
          <article-title>Information transformation and automated reasoning for automated compliance checking in construction, Computing in civil engineering 8 (</article-title>
          <year>2013</year>
          )
          <fpage>701</fpage>
          -
          <lpage>708</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>J. Zhang,</surname>
          </string-name>
          <article-title>Automated code compliance checking in the construction domain using semantic natural language processing and logic-based reasoning</article-title>
          , University of Illinois at UrbanaChampaign,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>El-Gohary</surname>
          </string-name>
          ,
          <article-title>Semantic nlp-based information extraction from construction regulatory documents for automated compliance checking</article-title>
          ,
          <source>Journal of Computing in Civil Engineering</source>
          <volume>30</volume>
          (
          <year>2016</year>
          )
          <fpage>04015014</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
          ,
          <article-title>An unsupervised sentence embedding method by mutual information maximization</article-title>
          , arXiv preprint arXiv:2009.12061 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Automatic construction site hazard identification integrating construction scene graphs with bert based domain knowledge, Automation in Construction 142 (</article-title>
          <year>2022</year>
          )
          <fpage>104535</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Boyd</surname>
          </string-name>
          ,
          <article-title>Social media: A phenomenon to be analyzed</article-title>
          ,
          <source>Social Media+ Society</source>
          <volume>1</volume>
          (
          <year>2015</year>
          )
          <fpage>2056305115580148</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>An integrated system of text mining technique and case-based reasoning (tm-cbr) for supporting green building design</article-title>
          ,
          <source>Building and Environment</source>
          <volume>124</volume>
          (
          <year>2017</year>
          )
          <fpage>388</fpage>
          -
          <lpage>401</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Automated classification of building information modeling (bim) case studies by bim use based on natural language processing (nlp) and unsupervised learning</article-title>
          ,
          <source>Advanced Engineering Informatics</source>
          <volume>41</volume>
          (
          <year>2019</year>
          )
          <fpage>100917</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Padhy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jagannathan</surname>
          </string-name>
          , V. Delhi,
          <article-title>Application of natural language processing to automatically identify exculpatory clauses in construction contracts</article-title>
          ,
          <source>Journal of Legal Affairs and Dispute Resolution in Engineering and Construction</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>04521035</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Indurkhya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Damerau</surname>
          </string-name>
          ,
          <article-title>Text mining: predictive methods for analyzing unstructured information</article-title>
          , Springer Science &amp; Business Media,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.-H.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Automated management of green building material information using web crawling and ontology</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>102</volume>
          (
          <year>2019</year>
          )
          <fpage>230</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jallan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ashuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clevenger</surname>
          </string-name>
          ,
          <article-title>Application of natural language processing and text mining to identify patterns in construction-defect litigation cases</article-title>
          ,
          <source>Journal of Legal Affairs and Dispute Resolution in Engineering and Construction</source>
          <volume>11</volume>
          (
          <year>2019</year>
          )
          <fpage>04519024</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F. C.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rodrigues</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ben-Akiva</surname>
          </string-name>
          ,
          <article-title>Text analysis in incident duration prediction</article-title>
          ,
          <source>Transportation Research Part C: Emerging Technologies</source>
          <volume>37</volume>
          (
          <year>2013</year>
          )
          <fpage>177</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          , Machine learning in
          <source>automated text categorization, ACM computing surveys (CSUR) 34</source>
          (
          <year>2002</year>
          )
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Al Qady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kandil</surname>
          </string-name>
          ,
          <article-title>Automatic clustering of construction project documents based on textual similarity, Automation in Construction 42 (</article-title>
          <year>2014</year>
          )
          <fpage>36</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hoos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Leyton-Brown</surname>
          </string-name>
          ,
          <article-title>An efficient approach for assessing hyperparameter importance</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML'14</source>
          , JMLR.org,
          <year>2014</year>
          , p.
          <source>I-754-I-762.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dridi</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Gaber</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Azad</surname>
          </string-name>
          , J. Bhogal,
          <article-title>k-nn embedding stability for word2vec hyperparametrisation in scientific text</article-title>
          ,
          <source>in: International Conference on Discovery Science</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>328</fpage>
          -
          <lpage>343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          , arXiv preprint arXiv:1908.10084 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>G.</given-names>
            <surname>Fierro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Culler</surname>
          </string-name>
          ,
          <article-title>Beyond a house of sticks: Formalizing metadata tags with brick</article-title>
          ,
          <source>in: The 6th ACM International Conference on Systems for Energy-Efficient Buildings</source>
          , Cities, and Transportation, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , p.
          <fpage>125</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>El-Gohary</surname>
          </string-name>
          ,
          <article-title>A deep neural network-based method for deep information extraction using transfer learning strategies to support automated compliance checking</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>132</volume>
          (
          <year>2021</year>
          )
          <fpage>103834</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Miñarro-Giménez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Marín-Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Samwald</surname>
          </string-name>
          ,
          <article-title>Applying deep learning techniques on medical corpora from the world wide web: a prototypical system and evaluation</article-title>
          ,
          <source>CoRR abs/1502.03682</source>
          (
          <year>2015</year>
          ). arXiv:1502.03682.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Y.-C. Zhou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>J.-R.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.-Z.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Integrating nlp and context-free grammar for complex rule interpretation towards automated compliance checking</article-title>
          ,
          <source>Computers in Industry</source>
          <volume>142</volume>
          (
          <year>2022</year>
          )
          <fpage>103746</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>A. J.-P. Tixier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Vazirgiannis</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          <string-name>
            <surname>Hallowell</surname>
          </string-name>
          ,
          <article-title>Word embeddings for the construction domain</article-title>
          ,
          <source>arXiv preprint arXiv:1610.09333</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. Guo,</surname>
          </string-name>
          <article-title>Research on text similarity calculation based on bert and word2vec</article-title>
          ,
          <source>in: ICETIS 2022; 7th International Conference on Electronic Technology and Information Science</source>
          , VDE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Risch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krestel</surname>
          </string-name>
          ,
          <article-title>Domain-specific word embeddings for patent classification</article-title>
          ,
          <source>Data Technologies and Applications</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>Intelligent question answering method for construction safety hazard knowledge based on deep semantic mining</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>145</volume>
          (
          <year>2023</year>
          )
          <fpage>104670</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>CoRR abs/1810.04805</source>
          (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C. J.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <article-title>SBERT-WK: a sentence embedding method by dissecting BERT-based word models</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>28</volume>
          (
          <year>2020</year>
          )
          <fpage>2146</fpage>
          -
          <lpage>2157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zweig</surname>
          </string-name>
          ,
          <article-title>Linguistic regularities in continuous space word representations</article-title>
          ,
          <source>in: HLT-NAACL</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pauwels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Terkaj</surname>
          </string-name>
          ,
          <article-title>EXPRESS to OWL for construction industry: towards a recommendable and usable ifcOWL ontology</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>63</volume>
          (
          <year>2016</year>
          )
          <fpage>100</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lefrançois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pauwels</surname>
          </string-name>
          ,
          <article-title>BOT: the building topology ontology of the W3C linked building data group</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2020</year>
          )
          <fpage>143</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sprenger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Rüppel</surname>
          </string-name>
          ,
          <article-title>Building product ontology: Core ontology for linked building product data</article-title>
          ,
          <source>Automation in Construction</source>
          <volume>133</volume>
          (
          <year>2022</year>
          )
          <fpage>103927</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>LEGAL-BERT: the muppets straight out of law school</article-title>
          ,
          <source>arXiv preprint arXiv:2010.02559</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>L. van der</given-names>
            <surname>Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Visualizing data using t-SNE</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          )
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>