<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating Synthetic Oracle Datasets to Analyze Noise Impact: A Study on Building Function Classification Using Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shanshan Bai</string-name>
          <email>shanshan.bai@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Kruspe</string-name>
          <email>anna.kruspe@hm.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoxiang Zhu</string-name>
          <email>xiaoxiang.zhu@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution />
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Munich Center for Machine Learning</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Munich University of Applied Sciences</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technical University of Munich</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tweets provide valuable semantic context for earth observation tasks and serve as a complementary modality to remote sensing imagery. In building function classification (BFC), tweets are often collected using geographic heuristics and labeled via external databases, an inherently weakly supervised process that introduces both label noise and sentence-level feature noise (e.g., irrelevant or uninformative tweets). While label noise has been widely studied, the impact of sentence-level feature noise remains underexplored, largely due to the lack of clean benchmark datasets for controlled analysis. In this work, we propose a method for generating a synthetic oracle dataset using an LLM, designed to contain only tweets that are both correctly labeled and semantically relevant to their associated buildings. This oracle dataset enables systematic investigation of noise impacts that are otherwise difficult to isolate in real-world data. To assess its utility, we compare model performance using Naïve Bayes and mBERT classifiers under three configurations: real vs. synthetic training data, and cross-domain generalization. Results show that noise in real tweets significantly degrades the contextual learning capacity of mBERT, reducing its performance to that of a simple keyword-based model. In contrast, the clean synthetic dataset allows mBERT to learn effectively, outperforming Naïve Bayes by a large margin. These findings highlight that addressing feature noise is more critical than model complexity in this task. Our synthetic dataset offers a novel experimental environment for future noise injection studies and is publicly available on GitHub.</p>
      </abstract>
      <kwd-group>
        <kwd>Noise</kwd>
        <kwd>Tweets</kwd>
        <kwd>Synthetic Data Generation</kwd>
        <kwd>Building Function Classification</kwd>
        <kwd>LLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Building Function Classification (BFC) aims to determine the functional purpose of buildings, such as
commercial or residential use, using various data sources. Traditionally, remote sensing imagery has
been the dominant modality for this task. However, such imagery often lacks the semantic granularity
needed to distinguish nuanced urban functions at the building level, for example, differentiating between
dormitories and hotels.</p>
      <p>Textual data from social media, such as geo-tagged tweets, offers a complementary perspective.
Tweets sent from the same location as a building may reveal human activities and behaviors that
provide contextual clues about building use. For instance, a tweet like "Great coffee at this café!" implies
a commercial function, while "Silent night at the dorm." suggests residential use.</p>
      <p>Despite this promise, social media datasets collected for BFC often rely on weak supervision—tweets
are heuristically matched to buildings (e.g., tweets sent within a 50-meter radius of a building are
assigned to it) and labeled using voluntarily provided building tags in external databases such as
OpenStreetMap (OSM). This process introduces multiple types of noise: (1) label noise, from incorrect
or outdated building tags; and (2) sentence-level feature noise, in the form of irrelevant, ambiguous,
or uninformative tweets. While label noise has been extensively studied—often through controlled
label-flipping experiments—sentence-level feature noise remains harder to investigate. This is because
it requires access to a dataset that is known to contain only relevant and correctly labeled examples,
something that human annotators often cannot guarantee due to the subjective and implicit nature of
semantic relevance.</p>
      <p>To address this, we explore the feasibility of using an LLM to generate a synthetic oracle dataset for BFC:
a noise-free benchmark in which all tweets are both correctly labeled and semantically relevant to their
associated buildings. We define an oracle dataset as an idealized collection designed not for deployment,
but to serve as a clean experimental environment for systematically analyzing the effects of noise.</p>
      <p>
        It is also important to acknowledge that geo-tagged tweets are no longer widely accessible due to
platform changes (e.g., Twitter removing precise geolocation tagging [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). However, our focus is not on
Twitter per se, but on the broader challenge of noise in weakly supervised user-generated datasets—an
issue that persists across many social media applications.
      </p>
      <p>Our key contributions are as follows:</p>
      <p>• A Synthetic Oracle Tweets Dataset for Building Function Classification: We construct a clean,
LLM-generated dataset that enables controlled experiments on feature noise, facilitating analysis
that is otherwise infeasible using real-world data.
• A Data Generation Pipeline: We introduce a reproducible method for generating synthetic datasets,
guided by real-world building and tweet distributions to ensure statistical realism.
• Evaluation of Data Quality: We compare the correctness and diversity of the synthetic data against
real-world datasets using both classifier performance and linguistic metrics (Self-BLEU,
perplexity).
• Insights into Feature Noise Impact: Our results show that handling feature noise is more critical
than increasing model complexity for BFC tasks.</p>
      <p>The remainder of this paper is structured as follows: Section 2 reviews related work on building
function classification and synthetic data generation. Section 3 describes our data generation pipeline.
Section 4 presents the evaluation methodology and results. Section 5 analyzes the quality and
characteristics of the dataset. Section 6 summarizes our findings, and Section 7 outlines directions for future
research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>Building Function Classification with Tweets</title>
        <p>
          Recent studies have investigated the feasibility of using geo-tagged tweets for Building Function
Classification (BFC), confirming their potential while emphasizing key challenges related to noise and
data quality. Häberle et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] were among the first to treat BFC as a text classification task, assigning
tweets located near buildings as inputs. However, their use of FastText limited robustness in multilingual
urban settings. In contrast, Häberle et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] shifted focus from sentence-level classification to feature
engineering, using tweet embeddings to extract function-indicative features. While some keywords
were found to correlate strongly with building types, overlapping lexical features introduced noise,
reducing overall reliability.
        </p>
        <p>
          To enhance classification accuracy, Häberle et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] proposed a decision-level fusion strategy,
integrating textual features from tweets with remote sensing imagery. Their results showed that while
tweets offered useful contextual cues, the discriminative power largely came from imagery. More
recently, Häberle [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] scaled BFC globally, reframing it as a multilingual classification task using mBERT.
Despite achieving moderate success (55% accuracy across three functional categories), noise in the
tweet data—both in labeling and content—remained a significant bottleneck.
        </p>
        <p>
          A consistent finding across these studies is that tweets, while informative, are inherently noisy.
Label noise often originates from incorrect or incomplete tags in OpenStreetMap (OSM) [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ], and
geo-tagging mismatch occurs when tweets are attributed to the wrong building due to spatial heuristics
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Additionally, sentence-level feature noise—such as irrelevant tweets—remains a major challenge [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
While label noise has been the subject of many experimental studies, there has been limited systematic
analysis of sentence-level feature noise and its effects on model performance.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Synthetic Data Generation with LLMs</title>
        <p>
          Large Language Models (LLMs) have shown impressive ability to generate text that captures patterns
and structures from real-world corpora [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Recent work has explored using LLMs to create synthetic
datasets via instruction-tuned prompting conditioned on class labels [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ], showing that LLMs can
serve as scalable alternatives to human annotators. However, category-conditioned generation poses a
critical challenge: the synthetic text distribution often diverges from real data in linguistic diversity,
topical focus, or context relevance [13, 14].
        </p>
        <p>To address these issues, researchers have proposed human-in-the-loop refinement strategies [ 15] and
model-guided feedback loops [16], which adjust prompts based on classifier performance. Others have
explored persona-based synthesis [17, 18, 19, 20] to introduce style and perspective variation. While
these approaches enhance fluency and diversity, they are not explicitly designed for tasks requiring
spatial and contextual grounding—such as aligning tweets with building-specific metadata in urban
environments.</p>
        <p>Despite these advances, there remains a gap in applying LLM-based data generation to fine-grained,
geospatially anchored classification tasks like BFC. Most prior work assumes structured label spaces
and ignores the semantic constraints imposed by geographic or functional context. Our work addresses
this gap by conditioning LLM prompts on real-world building attributes (e.g., function, name, location)
and matching tweet language distributions. This enables the creation of a high-quality oracle dataset
that supports controlled analysis of noise in text-based BFC.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Synthetic Tweets Generation Pipeline</title>
      <p>This section presents our three-step pipeline for generating synthetic multilingual tweets using a large
language model (LLM), conditioned on building-level metadata. The pipeline includes: (1) retrieving
metadata, (2) cleaning it, and (3) prompting the LLM to generate contextually grounded tweets.</p>
      <sec id="sec-3-1">
        <title>3.1. Step 1: Retrieving Metadata</title>
        <p>
          We define metadata as structured descriptive information associated with each building—such as its
function, location, and the intended number and language of tweets to be generated. This metadata
is either inherited from a prior labeled dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] or retrieved from OSM, ensuring that the resulting
synthetic dataset reflects the statistical characteristics of real-world data.
        </p>
        <p>
          • Building Attributes:
– Building_tag: A fine-grained functional label from OSM (e.g., "restaurant",
"apartment"), distinct from the binary ground-truth classification ("commercial" or
"residential").
– Building_name: A descriptive identifier used in tweet content (e.g., "Merlex Auto Group").
– Building_city: The city where the building is located.
• Tweets Language Distribution: A list specifying the language of each synthetic tweet to be
generated. For instance, ["English", "English"] indicates two English tweets should be
generated for that building. These values are inherited from the dataset in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], ensuring alignment
with real-world tweet frequency and language usage.
        </p>
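        <p>For illustration, the per-building metadata described above can be represented as a small record. The following sketch is our own illustration, not code from the pipeline; the field set follows the JSON example in Section 3.2.</p>

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BuildingMetadata:
    """Per-building metadata used to condition synthetic tweet generation (illustrative)."""
    building_city: str   # city where the building is located
    building_tag: str    # fine-grained OSM tag, e.g. "restaurant", "apartment"
    building_name: str   # descriptive identifier used in tweet content
    tweet_languages: List[str] = field(default_factory=list)  # one entry per tweet to generate

    @property
    def n_tweets(self) -> int:
        # The length of the language list fixes how many tweets the LLM must produce.
        return len(self.tweet_languages)

example = BuildingMetadata("WashingtonDC", "Retail", "Merlex Auto Group",
                           ["English", "English"])
```

        <p>Here, `example.n_tweets` is 2, matching the two-element language list inherited from the real-world distribution.</p>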
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Step 2: Preprocessing Metadata</title>
        <p>Since building attributes derived from OSM can be noisy, and we also want to control the maximum
number of tweets per building, we apply a cleaning pipeline to the previously collected
metadata. This step ensures that the LLM generates synthetic tweets without introducing unwanted
noise and yields a balanced dataset.</p>
        <p>• Removing formatting artifacts: We sanitize building names and tags by stripping special characters
(e.g., underscores _, slashes /) to prevent malformed prompts.
• Filtering generic or erroneous tags: Entries with non-informative tags such as "yes" or "roof"
are excluded, as they do not convey meaningful function.
• Ensuring a unique tag: Many buildings are associated with multiple use-specific tags that suggest
different functional categories (e.g., both "church" and "restaurant"); such buildings are excluded to
avoid ambiguity.
• Ensuring label-tag consistency: For buildings pre-labeled as "commercial" or "residential"
(from prior datasets), we remove those whose OSM tags clearly contradict the assigned label.
For example, if a building labeled "commercial" has the tag "mosque", it is removed from the
dataset.
• Restricting to at most five tweets per building: This step applies to the number of synthetic tweets
to be generated and also to real-world tweets. We cap the number at five to balance the dataset
across buildings and to control for prompt length and generation consistency. This helps avoid
overrepresentation of a few buildings during training or evaluation phases.</p>
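        <p>The filtering rules above can be sketched as a single record-level filter. This is a hypothetical sketch: the tag sets, the contradiction map, and the field names are illustrative stand-ins for the actual rules and schema, and tags are normalized to lowercase for comparison.</p>

```python
import re

GENERIC_TAGS = {"yes", "roof"}  # non-informative OSM tags (examples from the text)
# Illustrative label/tag conflicts; the real consistency rules may differ.
CONTRADICTIONS = {"commercial": {"mosque", "church", "apartment"},
                  "residential": {"restaurant", "retail"}}
MAX_TWEETS = 5

def clean_record(record):
    """Return a sanitized metadata record, or None if the building should be dropped."""
    # Strip formatting artifacts (underscores, slashes) from the building name.
    name = re.sub(r"[_/\\]", " ", record["Building_name"]).strip()
    # Discard generic tags; require exactly one remaining tag (unique-tag rule).
    tags = {t.lower() for t in record["Building_tags"]} - GENERIC_TAGS
    if len(tags) != 1:
        return None
    tag = next(iter(tags))
    # Label-tag consistency: drop buildings whose tag contradicts the prior label.
    label = record.get("label")
    if label and tag in CONTRADICTIONS.get(label, set()):
        return None
    # Cap at five tweets per building.
    langs = record["Tweets_distribution"][:MAX_TWEETS]
    return {**record, "Building_name": name, "Building_tag": tag,
            "Tweets_distribution": langs}
```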
        <p>The preprocessed metadata is stored in structured JSON format. An example is shown below:
{"Building_city": "WashingtonDC",
"Building_tag": "Retail",
"Building_name": "Merlex Auto Group",
"Tweets_distribution": ["English", "English"]}</p>
        <p>While these preprocessing steps help reduce ambiguity and maintain consistency, we acknowledge
that they may exclude buildings with complex, multi-functional roles (e.g., shopping malls with food courts,
theaters, and offices). Although this limits the coverage of our analysis, our focus in this study is to
establish a controlled, noise-free experimental setting—not to comprehensively model all real-world
function types.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Step 3: Generating Tweets using an LLM</title>
        <p>We use the Llama-3.3-70B-Instruct-bnb-4bit model from Hugging Face1 to generate
multilingual tweets. Each prompt consists of two key components:
• System Prompt: Defines the overall task, outlining style and formatting guidelines for tweet
generation. The system prompt also includes a one-shot example demonstrating the desired tweet
format. This remains constant for all buildings.
• User Prompt: Contains building-specific metadata, to ensure diverse and contextually relevant
outputs.</p>
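        <p>The two components can be assembled into a chat-style prompt for the instruct model. A minimal sketch (the system text is abridged here; the full prompt is reproduced in the text):</p>

```python
import json

SYSTEM_PROMPT = ("Generate tweets as if they were posted by real Twitter users "
                 "in a specific building. ...")  # abridged; see the full system prompt

def build_messages(metadata: dict) -> list:
    """Pack the constant system prompt and per-building metadata into a message pair."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        # The user turn carries only building-specific metadata, serialized as JSON.
        {"role": "user", "content": json.dumps(metadata, ensure_ascii=False)},
    ]

msgs = build_messages({"Building_city": "WashingtonDC", "Building_tag": "Retail",
                       "Building_name": "Merlex Auto Group",
                       "Tweets_language_distribution": ["English", "English"]})
```

        <p>Keeping the system prompt constant while varying only the user turn is what lets the same instructions produce diverse, building-specific outputs.</p>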
        <p>The model is available at https://huggingface.co/unsloth/Llama-3.3-70B-Instruct-bnb-4bit. The system prompt is as follows:</p>
        <sec id="sec-3-3-2">
          <title>System Prompt</title>
          <p>Generate tweets as if they were posted by real Twitter users in a specific building. Tweets should
be sent from the type of building described in the ’building tag’. Ensure that each tweet reflects a
unique perspective or experience. Imagine switching personas for each tweet, simulating thoughts
from different types of users, such as tourists, professionals, or families. Consider varying the tone
(e.g., humorous, cynical, formal, casual), the length (short and concise, or longer and detailed), and
the use of mentions or hashtags. Highlight varied aspects of the building, such as its architecture,
services, location, history, or events. You must generate only one tweet in each language specified
under ’tweet language distribution,’ written directly in that language.</p>
          <p>Example:
{"Building_city": "WashingtonDC",
"Building_tag": "Retail",
"Building_name": "Merlex Auto Group",
"Tweets_language_distribution": ["English", "English"]}
["Bought new rims here at Merlex Auto yesterday, totally transformed my ride! @merlexautogroup
#AutoCare", "Merlex Auto Group really knows how to treat car lovers right. The staff? Super
knowledgeable. The selection? If you’re in DC and thinking about upgrading your ride, this is the
place! #CarShopping #DCLuxury"]</p>
          <p>
            Using this three-step pipeline, we generate a synthetic oracle dataset covering 6,000 real-world
buildings across 41 global cities. The final dataset includes 15,222 tweets in 45 languages. We ensure
that the distributions of building types and tweet language frequencies closely mirror those of the real-world
dataset used in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. A quantitative comparison between real and synthetic tweets is provided in Table 1.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset Evaluations and Results</title>
      <p>In this section, we evaluate the quality of the generated synthetic dataset along two key dimensions:
diversity and correctness. Diversity ensures that the dataset captures a broad range of sentence
structures and vocabulary variations, rather than overly repetitive content that could oversimplify the
classification task. Correctness assesses whether the synthetic dataset fulfills its intended purpose as an
oracle dataset, containing only perfectly aligned tweets that semantically correspond to their respective
target buildings.</p>
      <p>Since each building in the synthetic dataset mirrors a real-world building exactly, and the tweet
distributions in terms of volume and language match their real-world counterparts, we evaluate correctness
and diversity by comparing the synthetic dataset against the real-world dataset.</p>
      <sec id="sec-4-1">
        <title>4.1. Diversity Evaluation</title>
        <p>To assess diversity, we use 4-gram Self-BLEU [21] as the primary metric, following [22]. Self-BLEU
measures how similar each sentence is to the rest of the dataset by calculating BLEU [21] scores for
every sentence against all others. A lower Self-BLEU score indicates higher diversity, suggesting that
the dataset contains a richer variety of expressions. The results are reported in Table 2.</p>
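        <p>For readers unfamiliar with the metric, Self-BLEU can be sketched as follows. This is a simplified reimplementation (uniform 1- to 4-gram weights, clipped precision, brevity penalty), not the exact implementation of [21]:</p>

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Simplified BLEU: clipped n-gram precision, geometric mean, brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            return 0.0
        max_ref = Counter()  # per-n-gram maximum count over all references
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        prec = clipped / sum(cand_ngrams.values())
        if prec == 0:
            return 0.0
        log_prec += math.log(prec) / max_n
    closest = min(refs, key=lambda r: abs(len(r) - len(cand)))
    bp = 1.0 if len(cand) > len(closest) else math.exp(1 - len(closest) / max(len(cand), 1))
    return bp * math.exp(log_prec)

def self_bleu(sentences, max_n=4):
    """Average BLEU of each sentence against all others; lower means more diverse."""
    return sum(bleu(s, sentences[:i] + sentences[i + 1:], max_n)
               for i, s in enumerate(sentences)) / len(sentences)
```

        <p>A corpus of near-duplicate sentences scores close to 1, while lexically disjoint sentences score near 0, which is why a lower Self-BLEU indicates higher diversity.</p>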
        <p>As a secondary measure, we compute perplexity [23]. While traditionally used to assess how well a
language model predicts text, perplexity can also serve as an indirect proxy for vocabulary alignment
between the synthetic and real-world datasets. Specifically, we compute unigram perplexity using a
unigram language model pre-trained on a held-out set of 100,000 real-world tweets from buildings
not included in the experimental dataset. By evaluating perplexity for both datasets, a similar score
between synthetic and real-world data suggests that the synthetic dataset shares comparable linguistic
distribution and vocabulary complexity. Conversely, a significant deviation would indicate substantial
differences in word usage. The perplexity results are also included in Table 2.</p>
        <p>(Sample real-world and synthetic tweets for example buildings, originally shown alongside Table 2, are omitted here.)</p>
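        <p>The unigram perplexity computation can be sketched in a few lines. Note that the smoothing scheme below (add-one) is our assumption for illustration; the paper does not specify how unseen words are handled:</p>

```python
import math
from collections import Counter

def unigram_perplexity(train_tweets, eval_tweets):
    """Perplexity of eval_tweets under a unigram model fit on train_tweets."""
    counts = Counter(w for t in train_tweets for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    log_sum, n_words = 0.0, 0
    for tweet in eval_tweets:
        for w in tweet.lower().split():
            p = (counts[w] + 1) / (total + vocab)  # add-one (Laplace) smoothing
            log_sum += math.log(p)
            n_words += 1
    return math.exp(-log_sum / n_words)
```

        <p>Evaluated on both datasets, a similar perplexity indicates comparable vocabulary distributions, while a large gap would signal diverging word usage.</p>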
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Correctness Evaluation</title>
        <p>A direct way to evaluate correctness is by testing the dataset’s utility in a downstream classification task.
We compare two distinct classifiers: Naïve Bayes (NB) and multilingual BERT (mBERT). These models
were chosen to contrast a traditional statistical classifier (NB), which relies on word co-occurrence
patterns, with a transformer-based language model (mBERT), which leverages contextual embeddings
for nuanced language understanding. We conduct evaluations under three configurations:
• Real-world: Train and test models using real-world data to establish a baseline.
• Synthetic: Train and test models using synthetic data to determine if the dataset achieves
correctness.
• Cross-domain: Train models on synthetic data and test on real-world data to evaluate
generalization performance.</p>
        <p>The first two configurations allow us to assess whether the synthetic dataset can function as a valid
oracle dataset. The cross-domain evaluation is particularly insightful in determining how well models
trained on noise-free synthetic data adapt to noisy real-world conditions.</p>
        <p>
          The real-world dataset used in our experiments was derived from [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], filtered to include the same
6,000 buildings as the synthetic set. For each building, we retain the same number of tweets in the same
languages, resulting in a dataset of 15,222 real tweets. We split both datasets at the building level to
prevent data leakage, allocating 10,729 tweets from 80% of buildings for training and validation, and
2,279 tweets from the remaining 20% of buildings for testing. We fine-tune bert-base-multilingual-cased
for five epochs using the Adam optimizer with a learning rate of 5e-6, a batch size of 16, a warm-up
ratio of 0.01, and a weight decay of 0.01 to prevent overfitting. All training hyperparameters remain
consistent across experiments. The results are presented in Table 3.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussions</title>
      <p>Our diversity evaluation reveals a trade-off between structural variation and lexical realism in the
synthetic dataset. The higher self-BLEU score of the synthetic tweets (48.37%) compared to the
real-world tweets (40.78%) suggests that the synthetic content is more repetitive at the sentence structure
level. However, the perplexity scores—4.49 for synthetic vs. 4.36 for real—indicate that the vocabulary
distribution is comparable across both datasets. This suggests that while the LLM may favor certain
patterns or templates in sentence generation, it still captures a realistic range of word usage and
topic-relevant terms.</p>
      <p>In the real-world configuration, mBERT (65%) performs only marginally better than Naïve Bayes
(64%). This finding suggests that the contextual capabilities of transformer models are suppressed by
the high level of noise in real tweets. Naïve Bayes, which relies on surface-level word co-occurrences,
performs nearly as well—indicating that complex models may be overkill in noisy, weakly supervised
settings where semantic signals are diluted.</p>
      <p>By contrast, in the synthetic configuration, mBERT (91%) significantly outperforms Naïve Bayes (84%).
This performance gap highlights how the noise-free, semantically aligned tweets in the oracle dataset
allow the transformer model to leverage its full potential. The performance gain achieved by mBERT also
reflects the presence of fine-grained semantic cues that word-frequency based models cannot exploit.
This validates the intended function of our synthetic dataset as an oracle environment—models capable
of contextual understanding should perform well when noise is removed.</p>
      <p>These findings also point to a key insight: noise handling is more critical than model complexity for
improving classification accuracy in weakly supervised text settings. Even the best models fail to learn
effectively when irrelevant or ambiguous inputs dominate the data.</p>
      <p>We also note that the cross-domain evaluation shows that models trained on the synthetic oracle dataset
generalize poorly to noisy real-world tweets. This result underscores the significant impact of the domain
shift introduced by removing noise in the LLM-generated dataset. It is important to recognize that our
synthetic dataset is not designed to replace real-world data. Rather, it serves a distinct purpose: to create
a controlled, noise-free environment for systematic experimentation—particularly for studying the
isolated effects of label or feature noise via noise injection strategies. We also acknowledge the potential
risk of semantic drift in LLM-generated data—that is, the possibility that synthetic tweets may reflect
biases, stereotypes, or linguistic abstractions learned by the model rather than replicating authentic
user behavior. However, our goal is not to reproduce social media behavior with full fidelity. Instead,
we intentionally trade off some realism for semantic clarity and control, enabling cleaner experimental
analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This study investigates the feasibility of using an LLM to generate a synthetic oracle dataset for text
classification tasks. We focus on the domain of building function classification (BFC) using geo-tagged
tweets, a modality that offers semantic richness but suffers from label noise and, more critically,
sentence-level feature noise—irrelevant, ambiguous, or uninformative tweets that obscure learning
signals.</p>
      <p>While label noise has been extensively studied through noise injection experiments, sentence-level
feature noise remains underexplored due to the lack of a truly clean dataset. Human annotators often
disagree on what constitutes relevance in social media text, making it difficult to filter out such noise
consistently. To address this, we construct an oracle dataset using an LLM, where all tweets are
guaranteed to be correctly labeled and semantically aligned with their associated building types. This
provides a controlled, noise-free environment suitable for systematic experimentation.</p>
      <p>Our evaluations show that:
• The synthetic dataset, while slightly less diverse in sentence structure (higher Self-BLEU), exhibits
comparable lexical richness (similar perplexity) to real-world tweets.
• On real-world data, sophisticated models like mBERT perform only marginally better than simple
classifiers like Naïve Bayes, due to noise degrading contextual learning.
• On synthetic data, mBERT significantly outperforms Naïve Bayes, validating that the synthetic
dataset retains meaningful semantic distinctions and fulfills its oracle role.</p>
      <p>These findings suggest that addressing feature noise may be more critical than increasing model
complexity for tasks involving weakly supervised text. While our synthetic dataset is not intended
to replicate real user behavior, it offers a valuable testbed for controlled experiments that would be
infeasible with noisy real-world data.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Future Work</title>
      <p>One key challenge in synthetic data generation is balancing diversity and semantic fidelity. Our
findings show that while our dataset maintains realistic vocabulary usage, its sentence structure is less
varied than real-world tweets. Future work could explore techniques such as controlled paraphrasing,
persona variation, or prompt augmentation to increase structural diversity without compromising
label alignment. Reducing this domain gap would make synthetic datasets more useful not only for
experimentation but also for training and transfer learning.</p>
      <p>Our synthetic oracle dataset provides a clean, well-defined starting point for studying the impact
of noise in text classification. Future research can build on this by systematically injecting different
types and levels of noise—such as label flipping, irrelevant content, or off-topic language—to isolate
their individual and combined effects on model performance.</p>
      <p>Given that our synthetic tweets are grounded in real cities and building names, future work could
use this dataset to examine how geospatial context influences text classification. For example, studies
could compare performance with and without location-specific references, or investigate how city-level
variation affects model generalization. This has implications not only for BFC, but also for broader
geo-aware NLP tasks such as place-based sentiment analysis or urban event detection.</p>
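      <p>One way to run the with/without-location comparison is to ablate location mentions before classification. The sketch below is a hedged illustration: the gazetteer, placeholder token, and function name are assumptions, and a real study would draw the term list from the building names and city lists used to generate the synthetic tweets.</p>

```python
import re

# Hypothetical gazetteer of location references.
LOCATION_TERMS = ["Baxter Hall", "Munich", "Marienplatz"]

def strip_locations(text, terms=LOCATION_TERMS, placeholder="[LOC]"):
    """Replace known location mentions with a neutral placeholder so a
    classifier sees the tweet without its geospatial references."""
    # Replace longer terms first so multi-word names are not partially matched.
    for term in sorted(terms, key=len, reverse=True):
        text = re.sub(re.escape(term), placeholder, text, flags=re.IGNORECASE)
    return text

strip_locations("Just posted a photo @ Baxter Hall in Munich.")
# -> "Just posted a photo @ [LOC] in [LOC]."
```

      <p>Training and evaluating on the original versus the ablated tweets would quantify how much of a model's accuracy depends on explicit location cues rather than on functional language.</p>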
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT to identify and correct grammatical
errors, typos, and other writing mistakes, in order to improve the clarity and professionalism of the
writing. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
      <p>
[13] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot
learners, 2020. URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.
[14] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg,
A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen,
K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy,
K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Goel, N. Goodman, S. Grossman, N. Guha,
T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain,
D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. Krass,
R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L.
Li, X. Li, T. Ma, A. Malik, C. D. Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair,
A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut,
L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich,
H. Ren, F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih,
K. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu, J. Wu,
Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang,
L. Zheng, K. Zhou, P. Liang, On the opportunities and risks of foundation models, 2022. URL:
https://arxiv.org/abs/2108.07258. arXiv:2108.07258.
[15] T. Schick, H. Schütze, Generating datasets with pretrained language models, 2021. URL: https://arxiv.org/abs/2104.07540. arXiv:2104.07540.
[16] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,
K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder,
P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human
feedback, 2022. URL: https://arxiv.org/abs/2203.02155. arXiv:2203.02155.
[17] M. Shanahan, K. McDonell, L. Reynolds, Role-play with large language models, 2023. URL: https://arxiv.org/abs/2305.16367. arXiv:2305.16367.
[18] J. Li, N. Mehrabi, C. Peris, P. Goyal, K.-W. Chang, A. Galstyan, R. Zemel, R. Gupta, On the steerability
of large language models toward data-driven personas, 2024. URL: https://arxiv.org/abs/2311.04978.
arXiv:2311.04978.
[19] T. Ge, X. Chan, X. Wang, D. Yu, H. Mi, D. Yu, Scaling synthetic data creation with 1,000,000,000
personas, 2024. URL: https://arxiv.org/abs/2406.20094. arXiv:2406.20094.
[20] H. K. Choi, Y. Li, PICLe: Eliciting diverse behaviors from large language models with persona
in-context learning, 2024. URL: https://arxiv.org/abs/2405.02501. arXiv:2405.02501.
[21] Y. Zhu, S. Lu, L. Zheng, J. Guo, W. Zhang, J. Wang, Y. Yu, Texygen: A benchmarking platform for
text generation models, 2018. URL: https://arxiv.org/abs/1802.01886. arXiv:1802.01886.
[22] J. Ye, J. Gao, Q. Li, H. Xu, J. Feng, Z. Wu, T. Yu, L. Kong, ZeroGen: Efficient zero-shot learning via
dataset generation, 2022. URL: https://arxiv.org/abs/2202.07922. arXiv:2202.07922.
[23] D. Jurafsky, J. H. Martin, Speech and Language Processing, Stanford University, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Häberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rode-Hasinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Abdulahhad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Changes in Twitter geolocations: Insights and suggestions for future usage</article-title>
          , in:
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahimi</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>221</lpage>
          . URL: https://aclanthology.org/2021.wnut-1.24/. doi:10.18653/v1/2021.wnut-1.24.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Häberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Werner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Building type classification from social media texts via geospatial text mining</article-title>
          (
          <year>2019</year>
          )
          <fpage>10047</fpage>
          -
          <lpage>10050</lpage>
          . URL: https://ieeexplore.ieee.org/document/8898836.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Häberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Werner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Geo-spatial text-mining from twitter - a feature space analysis with a view toward building classification in urban regions</article-title>
          ,
          <source>European Journal of Remote Sensing</source>
          <volume>52</volume>
          (
          <year>2019</year>
          )
          <fpage>2</fpage>
          -
          <lpage>11</lpage>
          . URL: https://doi.org/10.1080/22797254.2019.1586451. doi:10.1080/22797254.2019.1586451.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Häberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Can linguistic features extracted from geo-referenced tweets help building function classification in remote sensing?</article-title>
          ,
          <source>ISPRS Journal of Photogrammetry and Remote Sensing</source>
          <volume>188</volume>
          (
          <year>2022</year>
          )
          <fpage>255</fpage>
          -
          <lpage>268</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0924271622001058. doi:10.1016/j.isprsjprs.2022.04.006.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Häberle</surname>
          </string-name>
          ,
          <article-title>Fusion of remote sensing images and social media text messages for building function classification</article-title>
          (
          <year>2022</year>
          ) 134. URL: https://mediatum.ub.tum.de/1637448.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Frenay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Verleysen</surname>
          </string-name>
          ,
          <article-title>Classification in the presence of label noise: A survey</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>25</volume>
          (
          <year>2014</year>
          )
          <fpage>845</fpage>
          -
          <lpage>869</lpage>
          . doi:10.1109/TNNLS.2013.2292894.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Building attributes recognition with noisy and incomplete labels</article-title>
          , in:
          <source>2024 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2024)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>230</fpage>
          -
          <lpage>233</lpage>
          . doi:10.1109/IGARSS53475.2024.10641987.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lamsal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harwood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Read</surname>
          </string-name>
          ,
          <article-title>Where did you tweet from? inferring the origin locations of tweets based on contextual information</article-title>
          ,
          <source>in: 2022 IEEE International Conference on Big Data (Big Data)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>3935</fpage>
          -
          <lpage>3944</lpage>
          . URL: http://dx.doi.org/10.1109/BigData55660.2022.10020460. doi:10.1109/BigData55660.2022.10020460.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kersten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klan</surname>
          </string-name>
          ,
          <article-title>Review article: Detection of actionable tweets in crisis events</article-title>
          ,
          <source>Natural Hazards and Earth System Sciences</source>
          <volume>21</volume>
          (
          <year>2021</year>
          )
          <fpage>1825</fpage>
          -
          <lpage>1845</lpage>
          . URL: https://nhess.copernicus.org/articles/21/1825/2021/. doi:10.5194/nhess-21-1825-2021.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Delétang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ruoss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Duquenne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Catt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Genewein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mattern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Grau-Moya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Wenliang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aitchison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Orseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          , Language modeling is compression,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2309.10668. arXiv:2309.10668.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>In-context autoencoder for context compression in a large language model</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2307.06945. arXiv:2307.06945.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>Generating datasets with pretrained language models</article-title>
          , in:
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>6943</fpage>
          -
          <lpage>6951</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.555/. doi:10.18653/v1/2021.emnlp-main.555.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>