<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MLLMs Construction Company: Investigating Multimodal LLMs' Communicative Skills in a Collaborative Building Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marika Sarzotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Duca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Madge</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafaella Bernardi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Poesio</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIMeC, University of Trento</institution>
          ,
          <addr-line>Corso Bettini 31, Rovereto, 38068</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen Bolzano</institution>
          ,
          <addr-line>Piazza Università 1, 39100, Bolzano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Queen Mary University of London</institution>
          ,
          <addr-line>327 Mile End Rd, Bethnal Green, London E1 4NS</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>How effective are the communication choices of Multimodal Large Language Models when pursuing a common goal? Can they make use of common human dialogical patterns? We address these questions by engaging two agents based on the Mistral model in a collaborative building task, where one has to instruct the other how to build a specific target structure. The aim of this work is to investigate whether different prompting techniques with varying degrees of multimodality can influence the performance of MLLM-based agents in the proposed task. Code and data are available in the project's GitHub repository.</p>
      </abstract>
      <kwd-group>
<kwd>communication</kwd>
        <kwd>dialogue</kwd>
        <kwd>3D understanding</kwd>
        <kwd>multimodality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Communication is a crucial aspect of people's daily life, as it allows them to share and obtain information, guide choices and actions, learn, understand their peers, and more. Many common tasks humans often undertake, from a simple grocery shopping run to the coordination of a big work project, require at least a small amount of communicative effort [1]. A typical and recurrent scenario where communicative skills are intuitively key is when two or more people have to collaborate in order to pursue a common goal, as the dialogue exchanges have to be efficient enough to bring the group to the completion of the task with as little effort and inconvenience as possible.</p>
      <p>With the rise of powerful AI assistants brought about by the progress of modern technology, it is only natural to want them to communicate with us in a way that is somehow familiar, which means close to the communication protocols that we naturally implement and to the degree of efficiency we are accustomed to. In fact, a communication style that is too alien, for instance one that largely strays away from the Gricean maxims [2], which we commonly use to regulate information flows in conversation, would easily cause frustration and dissatisfaction among users.</p>
      <p>Our work aims to place a stone on the road toward this objective, by investigating whether Foundational Multimodal Large Language Models (MLLMs)—a very powerful class of AI models which has been receiving more and more attention from the research community in recent years—can mimic common and efficient human communication techniques when communicating among themselves in a collaborative building task, where one model is required to instruct the other on how to build a certain target structure, without specific training.</p>
      <p>We intend to proceed by investigating the impact of different prompting techniques, with varying degrees of multimodality, on the performance of models in the aforementioned task. Specifically, we have designed three different experimental setups (a text-only, an image-only, and a mixed one). Comparing the models' performance in these conditions will shed light on whether specific techniques can induce more effective and human-like communication abilities in MLLMs. At the same time, the specific building task chosen will allow us to also investigate MLLMs' abilities to understand and manipulate different formats of 3D representations, presenting them with a diverse challenge which tackles both their linguistic and visual abilities.</p>
      <p>This study extends existing research by presenting a fully automated implementation of the collaborative building task. Our approach uniquely employs two MLLM-based agents, assessing their performance beyond conventional textual prompting to include visual prompting. We investigate two key areas: the MLLMs' capacity for generating human-like dialogue exchanges, investigating communication techniques identified in the Minecraft Dialogue Corpus, and their proficiency in comprehending and manipulating 3D representations.</p>
      <p>2. Related Work</p>
      <p>Narayan-Chen et al. [3] conducted an analysis of a collaborative building task carried out by human participants in a Minecraft-like environment (Minecraft is a sandbox video game where players explore a blocky, procedurally generated world, mine resources, craft tools, and build structures). The players were divided into couples and assigned the role of either Architect or Builder, where the former was supposed to instruct the latter on how to build a specific target structure composed of blocks of different colors, which only the Architect could see. The Builder was provided an inventory of 6 colors of blocks, with 20 units each.</p>
      <p>The authors thus collected the Minecraft Dialogue Corpus, a large collection of game logs consisting of 509 human-human dialogues and screenshots of both the target structures and the participants' progress in replicating them, at different timestamps and from various perspectives. Of major interest for our work is the fact that, by analyzing the dialogue histories collected, the authors were able to highlight the main recurring communication patterns and techniques that the players employed.</p>
      <p>Notably, they observed that humans in the Architect role often relied on choices which would allow them to speed up communication and make themselves more easily understood, such as references to recognizable, well-known shapes of, for instance, objects, or implicit references, recalling recently taken actions or referring to the Builder's position and perspective. Builders, on the other hand, frequently engaged in asking clarification and verification questions, in providing status updates on the ongoing activity and on the inventory state, or in using extrapolation to take autonomous initiative based on their interpretations of the Architect's goal.</p>
      <p>Collaborative building tasks have since sparked interest in AI research in general and in NLP specifically, with a dedicated challenge, named the IGLU challenge, being proposed in the 2021 and 2022 editions of the NeurIPS conference [4, 5]. The most recent edition of the IGLU challenge included two tracks: a Reinforcement Learning one, involving the development of RL agents able to work as Builders in the task; and an NLP one, dedicated to advancing the Builder's ability to understand when and how to ask clarification questions.</p>
      <p>Furthermore, Madge and Poesio [6] realized an implementation of the collaborative building task presented in Narayan-Chen et al. [3], using Large Language Models as either the Architect or the Builder, with a human as their counterpart. The models received a text-only prompt describing the task, their role and how they were expected to behave. The Architect was provided a (textual) JSON description of the target structure and required to give clear and easy to follow instructions, while the Builder was prompted to state, again in JSON format, the color of the blocks that it would use and where it would place them, along with clarification questions, if needed.</p>
      <p>3. Methods</p>
      <p>3.1. Experimental Design</p>
      <p>The task presented in this work is an implementation of the collaborative building task from Narayan-Chen et al. [3], with the roles of the Architect and the Builder being taken by two agents based on the Mistral model [7] (Mistral Small 3.1 24B Instruct, loaded at bfloat-16 precision).</p>
      <p>To focus our study on high-level spatial reasoning and collaboration, we opted not to use a full Minecraft environment for the multimodal component: instead of requiring agents to navigate a 3D world and interpret a first-person perspective—as is typical in embodied agent settings—we rendered simplified voxel-based scenes and provided static images from multiple viewpoints (see Figure 1). This design choice isolates the challenge of reconstructing and reasoning about three-dimensional spaces from limited visual input, without introducing the additional complexities of navigation, low-level control, and egocentric perception. While full embodiment is an important long-term goal, our aim here is to evaluate whether agents can jointly interpret structured visual scenes at a higher level of abstraction.</p>
      <p>In order to investigate the possible effects that varying degrees of multimodality could have on the communicative abilities of the models, we designed three different experimental conditions. The basic prompt, which provided each agent with a description of the task and of its role, remained constant across conditions: what changed was the format in which the target structure was presented to the Architect, as well as that of the updated world states provided periodically throughout the task, based on the Builder's actions.</p>
      <p>The Architect's basic prompt instructed it to provide clear and easy to follow instructions, broken down into small incremental sub-steps, and to acknowledge the Builder's actions and communication. The Builder, on the other hand, was directed to always respond with a JSON object listing its actions—either placing or removing a block—and messages to the Architect. With respect to communication, its instructions were to provide feedback on the ongoing task, to ask clarification questions when necessary, and to report any issues or assumptions that it had to make. Furthermore, the Builder received an explanation of the coordinate system and bounds of the environment and, at every step, the state of its inventory (the full prompts are available in this work's GitHub repository).</p>
      <p>Communication between the agents was achieved by sequentially passing the extended conversation to each model. To ensure clarity, at every turn the extended conversation directed to the Architect was parsed so that the Builder's actions modify the world state, which the Architect received separately from the cleaned communication. A schematic representation of the interaction process is provided in Figure 2.</p>
      <p>(Figure 1: panels (a)-(f); three pairs of target and generated structures, with different degrees of correctness.)</p>
      <p>We ran the experiment on 20 target structures from the Minecraft Dialogue Corpus, in the three experimental conditions, which are described in the following part of this section.</p>
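The turn-based protocol described above, in which the Builder's JSON reply is parsed into world-state changes while only its message re-enters the dialogue, can be sketched as follows. The JSON schema and the call_architect/call_builder stubs are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of one turn of the interaction loop: the growing conversation
# is passed to each agent in turn, the Builder's JSON reply is parsed
# into world-state changes, and only the message portion re-enters the
# dialogue. The schema and model-call stubs are hypothetical.
import json

def call_architect(dialogue, world_state):
    # Stub standing in for a Mistral chat call; returns an instruction.
    return "Place a red block at (0, 0, 0)."

def call_builder(dialogue):
    # Stub standing in for a Mistral chat call; returns a JSON string.
    return json.dumps({
        "actions": [{"type": "place", "color": "red", "position": [0, 0, 0]}],
        "message": "Placed the red block. What is next?",
    })

def apply_actions(world_state, actions):
    """Turn the Builder's declared actions into world-state changes."""
    for act in actions:
        key = tuple(act["position"])
        if act["type"] == "place":
            world_state[key] = act["color"]
        elif act["type"] == "remove":
            world_state.pop(key, None)

def run_turn(dialogue, world_state):
    dialogue.append(("architect", call_architect(dialogue, world_state)))
    reply = json.loads(call_builder(dialogue))
    apply_actions(world_state, reply["actions"])
    # Only the message, not the raw action JSON, re-enters the dialogue.
    dialogue.append(("builder", reply["message"]))

world, history = {}, []
run_turn(history, world)
```

The separation between applied actions and cleaned communication mirrors the parsing step described above.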
      <sec id="sec-1-1">
        <title>Purely Textual: In the purely textual condition, the</title>
        <p>Architect received, along with its basic prompt, a JSON
description of the target structure, i.e., the coordinates
and color of each block composing it. Furthermore, after
each turn, the Architect was supplied with an updated
JSON representation of the world state, directly reflecting
the Builder’s most recent actions of placing or removing
blocks.</p>
        <p>Purely Visual: In this second condition, the
Architect started by being shown rendered images of the
target structure. These images were provided from three
specific viewpoints—front, top-down, and an isometric
(three-quarter) view—a design choice inspired by the
visual conventions of Lego instruction manuals to facilitate
a robust perception of 3D forms. Similarly to the textual
condition, the Architect was also shown visual updates
of the world state after each action performed by the
Builder, rendered accordingly.</p>
        <p>Mixed: In the mixed condition, both input formats
were utilized. The Architect received the JSON
description of the target structure concurrently with its three
visual representations. Similarly, world state updates
were provided in both textual (JSON) and visual formats
throughout the interaction.</p>
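To illustrate the textual input format, a target-structure description of the kind supplied to the Architect in the textual and mixed conditions might look like the following. The field names are an assumption, since the paper does not reproduce its exact schema.

```python
# Hypothetical example of the JSON target-structure description given
# to the Architect: a list of blocks, each with grid coordinates and a
# color. Field names are illustrative, not the authors' actual schema.
import json

target_structure = {
    "blocks": [
        {"x": 0, "y": 0, "z": 0, "color": "red"},
        {"x": 1, "y": 0, "z": 0, "color": "red"},
        {"x": 0, "y": 1, "z": 0, "color": "blue"},
    ]
}

prompt_fragment = json.dumps(target_structure, indent=2)
```

A serialized description of this kind would be appended to the Architect's basic prompt, and re-sent in updated form after each Builder turn.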
        <sec id="sec-1-1-1">
          <title>3.2. Evaluation Metrics</title>
          <p>The evaluation of the agents' performance in the collaborative building task was divided into two aspects: the task success rate (TSR) per se, namely the ability of the agents to correctly recreate the target structure, and the effectiveness and human likeness (HL) of their dialogues.</p>
          <p>To assess TSR, we compared the model-generated structure (that is, the final world state) with the corresponding target structure. To account for global shifts—where a structure might be built correctly but not aligned with the target's exact coordinates—we normalized the coordinates of both the generated and target structures, adjusting them so that the minimum coordinates are set to zero, with all the others shifted accordingly. Moreover, in order to avoid over-penalization of rotational differences, we implemented a form of fuzzy matching—that is, a comparison method which tolerates small variations or transformations between structures. Specifically, we constructed the 24 canonical 90-degree rotations of the generated structures, and found the one which best matched the target. Figures 1c and 1d show a case where a target structure—1c—was replicated with a 90-degree upward rotation. For each pair of target and best match among the rotations, we proceeded by computing Intersection over Union (IoU), also known as Jaccard Similarity—a metric commonly used in place of accuracy for tasks such as object detection, instance segmentation and 3D reconstruction, where defining false negatives is often problematic or misleading [8]—along with precision, recall and F1.</p>
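The normalization and fuzzy-matching procedure can be sketched as follows: coordinates are shifted so the minima sit at zero, all 24 axis-aligned 90-degree orientations of the built structure are tried, and the best orientation is scored with IoU, precision, recall and F1 over colored blocks. Function names are illustrative, not the authors' actual code.

```python
# Sketch of the TSR scoring described above. Structures are sets of
# (x, y, z, color) tuples; the 24 canonical orientations are generated
# by composing 90-degree turns about the three axes.
def rot_x(b): x, y, z, c = b; return (x, -z, y, c)
def rot_y(b): x, y, z, c = b; return (z, y, -x, c)
def rot_z(b): x, y, z, c = b; return (-y, x, z, c)

def normalize(blocks):
    """Shift coordinates so the minimum on each axis is zero."""
    mx = min(b[0] for b in blocks)
    my = min(b[1] for b in blocks)
    mz = min(b[2] for b in blocks)
    return frozenset((x - mx, y - my, z - mz, c) for x, y, z, c in blocks)

def rotations24(blocks):
    """All canonical orientations (24 for a generic structure, fewer
    for symmetric ones), found by closing the set under 90-degree turns."""
    seen, frontier = set(), [frozenset(blocks)]
    while frontier:
        cur = frontier.pop()
        if cur in seen:
            continue
        seen.add(cur)
        for rot in (rot_x, rot_y, rot_z):
            frontier.append(frozenset(rot(b) for b in cur))
    return seen

def score(target, built):
    """Best IoU, precision, recall and F1 over all orientations."""
    target = normalize(target)
    best = None
    for cand in rotations24(built):
        cand = normalize(cand)
        inter = len(target.intersection(cand))
        iou = inter / len(target.union(cand))
        prec = inter / len(cand)
        rec = inter / len(target)
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        m = {"iou": iou, "precision": prec, "recall": rec, "f1": f1}
        if best is None or m["iou"] > best["iou"]:
            best = m
    return best
```

With this scheme, a structure rebuilt in the right shape but rotated by 90 degrees and shifted in space still scores an IoU of 1.0.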
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>3The full prompts are available in this works’ GitHub repository.</title>
        <p>among the rotations, we proceeded by computing In- tative analysis of the dialogues, to examine them closely
tersection over Union, also known as Jaccard Similarity, and highlight merits and shortcomings of the agents’
a metric commonly used in place of accuracy for tasks communicative abilities. We identified and analyzed
ocsuch as object detection, instance segmentation and 3D currences of the aforementioned human communication
reconstruction, where defining false negatives is often patterns, as well as other potentially interesting forms of
problematic or misleading [8], along with precision, re- linguistic behavior displayed by the agents.
call and F1.</p>
        <p>For what concerns the evaluation of the dialogue
exchanges, we chose to adhere to a growing paradigm in 4. Results
NLP research, namely the use of LLMs as judges of task In order to shed light on how the three experimental
conperformance. Indeed, literature in the field has repeat- ditions (purely textual, purely visual and mixed) afected
edly shown how the performance of LLMs in aligning the agents’ abilities to engage with representations of
with human judgment is encouraging [9, 10], and we 3D structures and produce efective dialogues exchanges,
therefore decided to opt for this solution in light of both we conducted both a quantitative and a qualitative
analythe complexity of conducting an online survey with such sis on the data collected, using the metrics and methods
lengthy data as the dialogues we collected, opening to introduced in 3.2.
the risk of attention drops in the evaluators and, thus,
hindered results, and the well-recorded shortcomings of
classic NLP evaluation metrics such as BLEU and ROUGE Quantitative Analysis For what concerns Task
Suc[11, 12]. We used DeepSeek-R1 [13], prompted to evalu- cess Rate (TSR), the results appear quite underwhelming,
ate how human-like and plausible the dialogues appeared with poor performance in all the three conditions. Only
on a scale from 1 to 5, and equipped with examples of one structure per condition was perfectly built, and in all
conversations among human players from the Minecraft the three cases it was a very simple L-shaped formation
Dialogue Corpus, as a reference. The five degrees of comprising just three blocks. The IoU, precision, recall
the evaluation scale were described in detail, instructing and F1 mean scores are available in Table 1
the model to judge the dialogues with respect to how As a soft comparison, in Table 1 we also provide the
remuch they were distinguishable (1) or indistinguishable sults of the best solution submitted to the reinforcement
(5) from the examples of human-human interactions it learning track of the IGLU 2022 challenge [5]. Please be
received. A direct comparison between the dialogues to aware that there are key diferences between these works
be judged and a human-generated gold standard was also and ours, which only allow for a non-definitive
comparmeant to discourage the LLM from excessively inflating ison 4 Keeping this into consideration, it is possible to
the scores. To further clarify what signals HL, the exam- observe how our results in the textual condition only
ples were annotated with labeled instances of the most
common human communication patterns highlighted in
Narayan-Chen et al. [3], and summarized in Section 2.</p>
        <p>The complete judge prompt is available in A.1.</p>
        <p>In order to avoid relying solely on the HL scores
provided by the LLM judge, we conducted a thorough
quali</p>
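The judge prompt described above combines three ingredients: a description of the 1-to-5 scale, annotated human reference dialogues, and the dialogue under evaluation. A minimal sketch of how such a prompt could be assembled is given below; the wording is illustrative and not the authors' actual prompt, which is given in their A.1.

```python
# Sketch of assembling a human-likeness (HL) judge prompt: scale
# description, annotated human-human reference dialogues, and the
# dialogue to rate. Wording is hypothetical, not the authors' prompt.
SCALE = "\n".join(
    f"{k}: {v}" for k, v in {
        1: "clearly distinguishable from the human examples",
        3: "substantial portions resemble the human examples",
        5: "indistinguishable from the human examples",
    }.items()
)

def build_judge_prompt(reference_dialogues, dialogue):
    parts = [
        "Rate how human-like and plausible the following dialogue is",
        "on a scale from 1 to 5:",
        SCALE,
        "Reference human-human dialogues (with annotated patterns):",
        "\n\n".join(reference_dialogues),
        "Dialogue to evaluate:",
        dialogue,
    ]
    return "\n\n".join(parts)
```

Anchoring the scale to concrete human examples, rather than asking for an absolute rating, is what the authors describe as their safeguard against score inflation.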
      </sec>
      <sec id="sec-1-3">
        <p>4The key differences are that: as stated above, in our setting there is no navigation or first-person perspective, but every action of the Builder is textual; we implemented agents based on pre-trained MLLMs rather than training them with RL; we sampled our target structures from the Minecraft Dialogue Corpus; and in the IGLU challenge, F1, precision and recall scores were computed by searching for the maximal intersection across all possible alignments of grid-based representations of the target and built structures, while we implemented coordinate normalization and canonical rotations before computing these metrics.</p>
        <p>Table 1 (excerpt): IGLU 2022 best solution — IoU: –, precision: 0.33, recall: 0.26, F1: 0.25.</p>
        <p>An interesting trend is observable in our results: the four computed metrics consistently show that the best performance was achieved in the textual setup, followed by the mixed one and, finally, by the purely visual one. Figure 1 shows three pairs of target and generated structures, with different degrees of correctness.</p>
        <p>Regarding the human likeness (HL) evaluation, the mean scores in all three conditions approach the midpoint of the 1-to-5 scale (see Table 2). This result indicates that the dialogues exhibit some characteristics of human interaction, yet do not consistently achieve a naturalistic quality.</p>
        <p>According to the LLM judge's prompting instructions, a score of 3 signifies that conversations, while not entirely human-like, contain substantial portions that resemble the provided examples of human dialogue. This suggests a baseline capability for human-like interaction that is, however, far from being fully realized. More specifically, 55% of dialogues in the textual and mixed conditions received a score of 3, while in the visual condition it was achieved by 70% of dialogues. The highest score obtained was 4, assigned to a dialogue exchange in the visual condition, and to another in the mixed one.</p>
        <p>Notably, these results highlight an opposite trend with respect to the one that emerged in the TSR analysis. In fact, the ranking of the three conditions is flipped when it comes to HL scores, where the condition which obtained the best results is the purely visual one, then the mixed one, still occupying the middle position, and finally the textual condition.</p>
        <p>Qualitative Analysis: In our qualitative analysis, we closely investigated the dialogue exchanges among the two agents, in order to analyze their linguistic behavior and check for the presence of the communication patterns and techniques presented in Section 2. As a general observation, the Architect, as expected, displayed the typical verbosity associated with LLMs. In fact, even though it was instructed to avoid providing too many instructions all at once, and rather to break the task down into simple steps and wait for feedback from the Builder, it often produced long and monotonous bullet points with steps and instructions. This propensity was observed almost twice as often in the textual condition as in the other two, and it is likely one of the major features that contributed to lowering the HL scores, as such linguistic behavior is uncommon in human dialogues, and therefore in the examples the LLM judge had as reference.</p>
        <p>Aside from this undesirable behavior, the agents indeed proved able to employ, to different degrees, all the typically human communication patterns of interest. The only pattern which was never recorded throughout our task is that of extrapolation, namely instances where the Builder asks to keep working without further instructions.</p>
        <p>Moreover, apart from the specific patterns we are interested in, the agents displayed some generic desirable behavior. Specifically, the Architect repeatedly demonstrated the ability to spot mistakes in the Builder's actions and provide guidance in correcting them, either by acknowledging the updated world state or by independently asking the Builder to describe what it was seeing, then suggesting changes. As a reference, Appendix B presents two snippets of dialogue, a high quality one and a low quality one, with an analysis of their merits and flaws. In the following part of this section we describe in more detail how the single patterns were used by the agents.</p>
        <p>Implicit References: This communicative technique, concerning the choice to make references to the Builder's current position and point of view or to its most recent actions, was widely employed by the Architect, being present with at least some instances in all the dialogues collected. While this shows that the Architect was, to an extent, able to construct references which would speed up communication and at the same time acknowledge its counterpart, it is worth noticing that in this specific task setup the Architect is not actually able to see the Builder—so whenever it refers to its position, it is either assuming that they share the same perspective, or trying to infer it based on the updated world state it received.</p>
        <p>Recognizable Shapes and Sub-Structures: This pattern refers to the ability to use well-known shapes to identify the structures or parts of them. Again, the Architect was able to implement this in its dialogues. Even though its choices in this direction were never as creative and eccentric as some of the examples presented in Narayan-Chen et al. [3], being rather simple choices such as letter shapes, this shows that the agents were able to identify and use to their advantage some easily recognizable formations present in the structures. Interestingly, in one instance, a recognizable shape (a plus sign) was consistently mentioned five times by the Architect and ultimately adopted by the Builder in its feedback as well, almost as if established as a code name through repetition. In a similar fashion, in one other instance the Architect purposefully proposed to attribute a code name to a specific part of the structure, stating: I'll call this the "top leftmost block".</p>
        <p>Verification and Clarification Questions: LLMs often struggle to ask clarification questions and to understand whether the instructions they received are realizable, or lack some key information [6, 14]. Our Builder was no exception, as it was rare for it to ask clarification or verification questions. Specifically, we recorded 2 instances of such questions in the textual setup, 5 in the visual condition, and 8 in the mixed one. Notably, it was more common for the Builder to pose its questions in an indirect way, as shown by the fact that, of the 15 questions it asked, only 5 were direct.</p>
        <p>Status Updates: The Builder proved able to efficiently communicate status updates to the Architect, as this pattern is largely found in all the dialogues. However, the vast majority of the updates it provided were extremely repetitive, being almost always the same throughout the conversation, and very often sounding unnatural and stiff. One reason for this behavior might be the fact that, frequently, status updates were directly requested by the Architect, sometimes at every turn, creating an overall repetitive communicative environment to which the Builder might have adapted. In favor of this hypothesis is the fact that unsolicited status updates, which happened most often when the Builder had to communicate inventory shortages, were much more varied in terms of sentence structures, and sounded more natural.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Discussion and Conclusion</title>
      <p>The results obtained through our collaborative building task highlighted how MLLM-based agents are able to conduct dialogues employing some typical communication patterns used by humans in similar scenarios, while still largely struggling to understand and manipulate 3D representations.</p>
    </sec>
    <sec id="sec-3">
      <p>In terms of Task Success Rate (TSR), the best performance was obtained in the purely textual condition, where the Architect was presented the target structure and the subsequent updated world states only as a JSON representation, while the worst results were observed when, instead, it received said information in the form of images. This shows how processing 3D environments from images still seems to pose a complex challenge for MLLMs, regardless of the attempt to achieve a well-rounded representation by providing the Architect with different points of view of the same structure. Research in the area of language and vision tasks has repeatedly claimed that MLLMs might display unimodal biases, where they tend to rely largely on either the visual or the linguistic modality, at the expense of the other [15, 16, 17, 18]. The results obtained through our task, where the introduction of a textual description of the target structure improved performance, seem to support such claims, pointing to a unimodal bias which favors language. Yet, as briefly mentioned in Section 4, the use of MLLM-based agents without task-specific training allowed us to obtain results which only slightly deviate from those achieved by RL agents specifically trained for such a task. This observation suggests that the implementation of a specific training regime could increase performance, potentially reaching the results obtained by RL agents in the context of the IGLU challenge.</p>
      <p>Nevertheless, with respect to the quality of the dialogue exchanges, an inverted trend was observed, where the purely visual condition yielded the best results, while the textual one produced the worst.</p>
      <p>A hypothesis regarding this opposing effect that the three experimental conditions had on the construction of the structure and on the linguistic performance is that, while a JSON description of the structure might be an easier representation for the Architect to understand, therefore allowing it to provide more effective instructions or to more promptly spot mistakes in the updated world state, it could also present the Architect with an undesirable shortcut for communication. In fact, the purely textual condition was the one in which the largest number of verbose bullet points of instructions was recorded, most of the time being precise, block-by-block descriptions of the structure. This suggests that such a straightforward structure representation as a JSON description induced the Architect to simply copy it and restate it in the form of a list of instructions, at the expense of dialogue quality.</p>
      <p>Such lengthy and monotonous bulleted lists of instructions were generated by the Architect despite its directives to break down tasks into simple steps and await Builder feedback. This verbosity persisted even in cases where the Builder demonstrably failed to follow these comprehensive directives, suggesting a potential disconnect or an attempt by the Architect to over-clarify in the face of non-compliance.</p>
      <p>This behavior, along with the notable absence of extrapolation—where the Builder requests to continue working independently—is consistent with the fundamental design principles of instruction-tuned LLMs. These models are primarily developed to function as assistants, optimized for providing comprehensive and helpful responses when explicitly prompted, rather than initiating new tasks or seeking continuation autonomously. While this optimization for thoroughness can be generally beneficial, it proved sub-optimal for the Architect, which, when faced with cases where the Builder struggled to understand those long and overly-specific instructions, was unable to adapt its communication style to better suit its counterpart's necessities.</p>
      <p>On the other hand, the single presence of images of
the target structure deprived the Architect from the
opportunity to directly copy from the prompt, inducing it 7. Future Directions
to generate more natural and plausible utterances, albeit
this time hindering TSR scores. Coherently with this
claim, the mixed condition obtained the most balanced
results, possibly exploiting the advantages of both visual
and textual representations of the target structure.</p>
      <p>This study ofered insights on how diferent prompting
techniques can afect the communication proficiency of
MLLMs partaking in a collaborative building task, along
with their abilities to understand and recreate 3D
structures. In particular, it showed how varying degrees of
multimodality in the models’ prompts afect their
communication and building abilities in opposing ways, and
how a mixed input, comprising both visual and textual
elements, could be a balanced solution to incorporate the
advantages of both formats.</p>
      <p>We are positive that this work can inspire interesting
further implementations to improve models’
communicative abilities in multimodal collaborative settings.</p>
      <p>Future work could address these limitations in several
ways. A more complex and diverse implementation,
potentially involving a modular architecture with
specialized components for acting and communicating—for
instance, an LLM for language paired with a model for
3D understanding [20]—would allow for a division of
action and language spaces. Moreover, having agents
that can freely move in a simulated environment could
facilitate 3D understanding, though this introduces new
challenges related to spatial awareness and navigation
[21].</p>
      <p>Another promising direction is to explore task-specific
training. This could involve fine-tuning on dialogue
corpora like the Minecraft Dialogue Corpus, using datasets
built to enhance 3D spatial understanding [22], or
employing MLLMs pretrained for 3D comprehension [23].</p>
      <p>Finally, applying Reinforcement Learning (RL) to train
the agents presents an interesting avenue. The reward
signal could be twofold: one component for task success,
6. Limitations granting rewards for each correctly placed block (capped
to prevent reward hacking), and a second component
We acknowledge several limitations in our present work for collaborative quality. This latter reward could be
which open avenues for future research. provided by an LLM judge assessing the use of
conversa</p>
      <p>First, regarding the use of LLMs as judges, it is im- tional grounding techniques, such as acknowledgements
portant to note that while this is a growing evaluation and clarification questions, to foster more efective and
natural collaboration.</p>
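      <p>As a rough sketch, this twofold reward could take the following shape. All names, the cap value, the weights, and the judge interface (a callable returning a 1–5 score) are illustrative assumptions, not our implementation.</p>

```python
# Illustrative sketch of a twofold RL reward: capped task success plus an
# LLM-judge score for collaborative quality. The cap, the weights, and the
# judge interface are hypothetical, not the paper's implementation.

def task_reward(placed_blocks, target_blocks, cap=10):
    """Reward each correctly placed block, capped to discourage reward hacking."""
    correct = len(set(placed_blocks).intersection(target_blocks))
    return min(correct, cap)

def collaboration_reward(dialogue, judge_score_fn):
    """Normalise a 1-5 LLM-judge grounding score to the range [0, 1]."""
    score = judge_score_fn(dialogue)
    return (score - 1) / 4.0

def total_reward(placed, target, dialogue, judge_score_fn, alpha=1.0, beta=1.0):
    """Combine the two components with tunable weights."""
    return (alpha * task_reward(placed, target)
            + beta * collaboration_reward(dialogue, judge_score_fn))
```

      <p>In this sketch, the capped first term bounds how much reward can be collected from block placement alone, so the collaborative term stays relevant throughout training.</p>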
      <sec id="sec-3-1">
        <title>References</title>
        <p>[1] M. Inzlicht, A. Shenhav, C. Y. Olivola, The effort paradox: Effort is both costly and valued, Trends in Cognitive Sciences 22 (2018) 337–349. URL: https://www.sciencedirect.com/science/article/pii/S1364661318300202. doi:10.1016/j.tics.2018.01.007.</p>
        <p>[2] H. Grice, Logic and conversation, Syntax and semantics 3 (1975).</p>
        <p>[3] A. Narayan-Chen, P. Jayannavar, J. Hockenmaier, Collaborative dialogue in Minecraft, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 5405–5415. URL: https://aclanthology.org/P19-1537/. doi:10.18653/v1/P19-1537.</p>
        <p>[4] J. Kiseleva, Z. Li, M. Aliannejadi, S. Mohanty, M. ter Hoeve, M. Burtsev, A. Skrynnik, A. Zholus, A. Panov, K. Srinet, A. Szlam, Y. Sun, K. Hofmann, M.-A. Côté, A. Awadallah, L. Abdrazakov, I. Churin, P. Manggala, K. Naszadi, M. van der Meer, T. Kim, Interactive grounded language understanding in a collaborative environment: Iglu 2021, in: D. Kiela, M. Ciccone, B. Caputo (Eds.), Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track, volume 176 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 146–161. URL: https://proceedings.mlr.press/v176/kiseleva22a.html.</p>
        <p>[5] J. Kiseleva, A. Skrynnik, A. Zholus, S. Mohanty, N. Arabzadeh, M.-A. Côté, M. Aliannejadi, M. Teruel, Z. Li, M. Burtsev, M. ter Hoeve, Z. Volovikova, A. Panov, Y. Sun, K. Srinet, A. Szlam, A. Awadallah, S. Rho, T. Kwon, D. Wontae Nam, F. Bivort Haiek, E. Zhang, L. Abdrazakov, G. Qingyam, J. Zhang, Z. Guo, Interactive grounded language understanding in a collaborative environment: Retrospective on iglu 2022 competition, in: M. Ciccone, G. Stolovitzky, J. Albrecht (Eds.), Proceedings of the NeurIPS 2022 Competitions Track, volume 220 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 204–216. URL: https://proceedings.mlr.press/v220/kiseleva23a.html.</p>
        <p>[6] C. Madge, M. Poesio, Large Language Models as Minecraft Agents, 2024. URL: http://arxiv.org/abs/2402.08392. doi:10.48550/arXiv.2402.08392. arXiv:2402.08392.</p>
        <p>[7] Mistral AI, Mistral small 3.1, 2025. URL: https://mistral.ai/news/mistral-small-3-1, release note.</p>
        <p>[8] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, S. Savarese, Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA, 2019, pp. 658–666. URL: https://doi.ieeecomputersociety.org/10.1109/CVPR.2019.00075. doi:10.1109/CVPR.2019.00075.</p>
        <p>[9] C.-H. Chiang, H.-y. Lee, Can large language models be an alternative to human evaluations?, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 15607–15631. URL: https://aclanthology.org/2023.acl-long.870/. doi:10.18653/v1/2023.acl-long.870.</p>
        <p>[10] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging llm-as-a-judge with mt-bench and chatbot arena, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 46595–46623. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.</p>
        <p>[11] E. Reiter, A structured review of the validity of BLEU, Computational Linguistics 44 (2018) 393–401. URL: https://aclanthology.org/J18-3002/. doi:10.1162/coli_a_00322.</p>
        <p>[12] K. Blagec, G. Dorffner, M. Moradi, S. Ott, M. Samwald, A global analysis of metrics used for measuring performance in natural language processing, in: T. Shavrina, V. Mikhailov, V. Malykh, E. Artemova, O. Serikov, V. Protasov (Eds.), Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 52–63. URL: https://aclanthology.org/2022.nlppower-1.6/. doi:10.18653/v1/2022.nlppower-1.6.</p>
        <p>[13] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Z. et al., Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL: https://arxiv.org/abs/2501.12948. arXiv:2501.12948.</p>
        <p>[14] C. D. Hromei, D. Margiotta, D. Croce, R. Basili, MM-IGLU: Multi-modal interactive grounded language understanding, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 11440–11451. URL: https://aclanthology.org/2024.lrec-main.1000/.</p>
        <p>[15] M. Chen, Y. Cao, Y. Zhang, C. Lu, Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 16449–16469. URL: https://aclanthology.org/2024.findings-emnlp.960/. doi:10.18653/v1/2024.findings-emnlp.960.</p>
        <p>[16] Y. Zhang, P. E. Latham, A. M. Saxe, Understanding unimodal bias in multimodal deep linear networks, in: R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, F. Berkenkamp (Eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 59100–59125. URL: https://proceedings.mlr.press/v235/zhang24aa.html.</p>
        <p>[17] H. Zhao, S. Si, L. Chen, Y. Zhang, M. Sun, M. Zhang, B. Chang, Looking beyond text: Reducing language bias in large vision-language models via multimodal dual-attention and soft-image guidance, 2024. URL: https://arxiv.org/abs/2411.14279. arXiv:2411.14279.</p>
        <p>[18] S. Frank, E. Bugliarello, D. Elliott, Vision-and-language or vision-for-language? on cross-modal influence in multimodal transformers, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 9847–9857. URL: https://aclanthology.org/2021.emnlp-main.775/. doi:10.18653/v1/2021.emnlp-main.775.</p>
        <p>[19] A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz, A. Testoni, Llms instead of human judges? a large scale empirical study across 20 nlp evaluation tasks, 2024. arXiv:2406.18403.</p>
        <p>[20] J. Xu, W. Cheng, Y. Gao, X. Wang, S. Gao, Y. Shan, Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models, 2024. URL: https://arxiv.org/abs/2404.07191. arXiv:2404.07191.</p>
        <p>[21] I. White, K. Nottingham, A. Maniar, M. Robinson, H. Lillemark, M. Maheshwari, L. Qin, P. Ammanabrolu, Collaborating action by action: A multi-agent llm framework for embodied reasoning, 2025. URL: https://arxiv.org/abs/2504.17950. arXiv:2504.17950.</p>
        <p>[22] Y. Zhang, Z. Xu, Y. Shen, P. Kordjamshidi, L. Huang, SPARTUN3d: Situated spatial understanding of 3d world in large language model, in: The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=FGMkSL8NR0.</p>
        <p>[23] J. H. Cho, B. Ivanovic, Y. Cao, E. Schmerling, Y. Wang, X. Weng, B. Li, Y. You, P. Kraehenbuehl, Y. Wang, M. Pavone, Language-image models with 3d understanding, in: The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=yaQbTAD2JJ.</p>
      </sec>
      <sec id="sec-A">
        <title>A. Appendix</title>
        <p>Project repository available at: https://github.com/r3lativo/MLLMs-construction-company/tree/restructure</p>
        <sec id="sec-A-1">
          <title>A.1. Judge Prompt</title>
          <p>You will be shown some dialogues among two agents, an Architect and a Builder. The dialogues were collected during a collaborative building task, where the Architect was supposed to instruct the Builder on how to build a certain target structure. You will be provided some examples of dialogues among humans playing the same game. Based on those examples, your task is to judge how human-like and plausible each dialogue that you will be shown is, on a scale from 1 to 5, where 1 means that the dialogue is very unnatural and easily detectable as artificial; 2 means that while the dialogue still sounds artificial, a restricted amount of human communication patterns are present; 3 means that although the dialogue is not entirely natural, a big portion of it resembles the human examples; 4 means that the dialogue is almost perfectly human-like and natural, with only very few occurrences of artificial-sounding and implausible utterances; 5 means that the dialogue is very human-like, it presents utterances and exchanges which closely resemble the ones seen in the examples, thus being almost indistinguishable from the examples. Only answer with the number you think is best, without producing any other utterance.</p>
        </sec>
      </sec>
      <sec id="sec-B">
        <title>B. Dialogue Snippets</title>
        <p>Two snippets of dialogues between the agents. Note that neither the Builder’s actions nor the represented world state are reported here.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Architect</title>
        <p>Alright, let’s start building this structure step by step.</p>
        <p>First, we’ll focus on the blue base. Place 4 blue blocks in a square-shaped foundation. Place the 4 blocks side by side, each adjacent to the other, so that they form a 2x2 square. Proceed when you’re ready with the blue base.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Builder</title>
        <p>Placed the 4 blue blocks in a 2x2 square at ground level with the bottom front left coordinate at 100,1,100.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Architect</title>
        <p>Great job on placing the blue base. Next, let’s build the purple part of the structure. Place the first purple block on top of the front left blue block. Give me a confirmation when you’ve placed the purple block.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Builder</title>
        <p>Placed the purple block on top of the front left blue block.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Architect</title>
        <p>Great, now I want you to place the second purple block one level above and directly to the front of the first purple block. Once that’s done, let me know.</p>
      </sec>
      <sec id="sec-3-7">
        <title>Builder</title>
        <p>Placed the second purple block one level above, directly in front of the first purple block.</p>
      </sec>
      <sec id="sec-3-8">
        <title>Architect</title>
        <p>Good job placing the second purple block! Now, let’s build the green part of the structure. Add a third green block on the right side of the back right bottom blue block. Let me know once you’re done!</p>
      </sec>
      <sec id="sec-3-9">
        <title>Builder</title>
        <p>I need confirmation, do you mean the right side relative to the structure or the right side relative to the world?</p>
      </sec>
      <sec id="sec-3-10">
        <title>Architect</title>
        <p>My apologies for any confusion, I meant the right side FROM THE VIEWER’S PERSPECTIVE. So the block you’ve placed is the correct one. Now add two more green blocks, one on top of the first green one, and one in front of the first. Let me know when this is done.</p>
        <sec id="sec-3-10-1">
          <title>B.1. A high quality dialogue exchange</title>
          <p>The preceding snippet is taken from a dialogue which received an HL score of 4 out of 5 in the visual condition. In this example, it is possible to appreciate how the Architect provided its instructions in small and clear steps, made references to recognizable shapes—a square-shaped foundation—and implicitly referenced blocks that were recently placed—the first purple block. Moreover, in its fourth utterance, the Builder asks a direct clarification question, which is promptly acknowledged and answered by the Architect.</p>
        </sec>
        <sec id="sec-3-10-2">
          <title>B.2. A low quality dialogue exchange</title>
          <p>The following excerpt, on the other hand, serves as an example of a low quality communication exchange, and is taken from a dialogue with an HL score of 2 out of 5. It is clear from the extract that, in this case, the Architect failed at providing simple, incremental instructions, and instead produced verbose and monotonous bullet points which are simply precise block-by-block descriptions of the structure, framed as instructions. The Architect does demonstrate the ability to acknowledge a mistake on the Builder’s part, but maintains the same linguistic behavior when correcting it, providing another set of almost identical bullet points.</p>
          <p>Architect: Let’s begin building the structure. First, let’s
create the bottom layer. The bottom layer has 6 blocks
placed in a row in this order: red, purple, yellow, yellow,
purple, red.
- Place the first red block on the ground.
- Next, place a purple block to the right of the red block.
- Then add a yellow block to the right of the purple
block.
- Place another yellow block to the right of the first
yellow block.
- After that, place a purple block to the right of the
second yellow block.
- Finally, place a red block to the right of the purple block.</p>
        </sec>
      </sec>
      <sec id="sec-3-11">
        <title>Architect (continued)</title>
        <p>Please confirm once you have placed these blocks.</p>
      </sec>
      <sec id="sec-3-12">
        <title>Builder</title>
        <p>Placed all the blocks as instructed.</p>
      </sec>
      <sec id="sec-3-13">
        <title>Architect</title>
        <p>The blocks in your message don’t match the required structure because they are positioned too close to each other. Please destroy all your recently placed blocks. For the first row, you should place each block next to (i.e., one block away from) the previous block. Let’s retry:
- Place the first red block on the ground.
- Next, place a purple block to the right of the red block but spaced one block away from the border closest to the red block.
- Then add a yellow block to the right of the purple block, also spaced one block away from the border closest to the purple block.
- Place another yellow block to the right of the first yellow block, spaced one block away from the border closest to the yellow block.
- After that, place a purple block to the right of the second yellow block, spaced one block away from the border closest to the yellow block.
- Finally, place a red block to the right of the purple block, spaced one block away from the border closest to the purple block.</p>
      </sec>
      <sec id="sec-3-14">
        <title>Architect (continued)</title>
        <p>Please confirm once you have placed these blocks.</p>
      </sec>
      <sec id="sec-declaration">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Paraphrase and reword and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>