<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MLLMs Construction Company: Investigating Multimodal LLMs' Communicative Skills in a Collaborative Building Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marika Sarzotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Duca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chris Madge</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafaella Bernardi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Poesio</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CIMeC, University of Trento</institution>
          ,
          <addr-line>Corso Bettini 31, Rovereto, 38068</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen Bolzano</institution>
          ,
          <addr-line>Piazza Università 1, 39100, Bolzano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Queen Mary University of London</institution>
          ,
          <addr-line>327 Mile End Rd, Bethnal Green, London E1 4NS</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>How effective are the communication choices of Multimodal Large Language Models when pursuing a common goal? Can they make use of common human dialogical patterns? We address these questions by engaging two agents based on the Mistral model in a collaborative building task, where one has to instruct the other how to build a specific target structure. The aim of this work is to investigate whether different prompting techniques with varying degrees of multimodality can influence the performance of MLLM-based agents in the proposed task. Code and data are available in the project's GitHub repository.</p>
      </abstract>
      <kwd-group>
<kwd>communication</kwd>
        <kwd>dialogue</kwd>
        <kwd>3D understanding</kwd>
        <kwd>multimodality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Communication is a crucial aspect of people's daily life, as it allows them to share and obtain information, guide choices and actions, learn, understand their peers, and more. Many common tasks humans often undertake, from a simple grocery shopping run to the coordination of a big work project, require at least a small amount of communicative effort [1]. A typical and recurrent scenario where communicative skills are intuitively key is when two or more people have to collaborate in order to pursue a common goal, as the dialogue exchanges have to be efficient enough to bring the group to the completion of the task with as little effort and inconvenience as possible.</p>
      <p>With the rise of powerful AI assistants brought about by the progress of modern technology, it is only natural to want them to communicate with us in a way that is somehow familiar, which means close to the communication protocols that we naturally implement and to the degree of efficiency we are accustomed to. In fact, a communication style that is too alien, for instance one that largely strays away from the Gricean maxims [2], which we commonly use to regulate information flows in conversation, would easily cause frustration and dissatisfaction among users.</p>
      <p>Our work aims to place a stone on the road toward this objective, by investigating whether Foundational Multimodal Large Language Models (MLLMs)—a very powerful class of AI models which has been receiving more and more attention from the research community in recent years—can mimic common and efficient human communication techniques when communicating among themselves in a collaborative building task, where one model is required to instruct the other on how to build a certain target structure, without specific training.</p>
      <p>We intend to proceed by investigating the impact of different prompting techniques, with varying degrees of multimodality, on the performance of models in the aforementioned task. Specifically, we have designed three different experimental setups (a text-only, an image-only, and a mixed one). Comparing the models' performance in these conditions will shed light on whether specific techniques can induce more effective and human-like communication abilities in MLLMs. At the same time, the specific building task chosen will allow us to also investigate MLLMs' abilities to understand and manipulate different formats of 3D representations, presenting them with a diverse challenge which tackles both their linguistic and visual abilities.</p>
      <p>This study extends existing research by presenting a fully automated implementation of the collaborative building task. Our approach uniquely employs two MLLM-based agents, assessing their performance beyond conventional textual prompting to include visual prompting. We investigate two key areas: the MLLMs' capacity for generating human-like dialogue exchanges, investigating communication techniques identified in the Minecraft Dialogue Corpus, and their proficiency in comprehending and manipulating 3D representations.</p>
      <p>2. Related Work</p>
      <p>Narayan-Chen et al. [3] conducted an analysis of a collaborative building task carried out by human participants in a Minecraft-like environment (Minecraft is a sandbox video game where players explore a blocky, procedurally generated world, mine resources, craft tools, and build structures). The players were divided into couples and assigned the role of either Architect or Builder, where the former was supposed to instruct the latter on how to build a specific target structure composed of blocks of different colors, which only the Architect could see. The Builder was provided an inventory of 6 colors of blocks, with 20 units each.</p>
      <p>The authors thus collected the Minecraft Dialogue Corpus, a large collection of game logs consisting of 509 human-human dialogues and screenshots of both the target structures and the participants' progress in replicating them, at different timestamps and from various perspectives. Of major interest for our work is the fact that, by analyzing the dialogue histories collected, the authors were able to highlight the main recurring communication patterns and techniques that the players employed.</p>
      <p>Notably, they observed that humans in the Architect role often relied on choices which would allow them to speed up communication and make themselves more easily understood, such as references to recognizable, well-known shapes of, for instance, objects, or implicit references, recalling recently taken actions or referring to the Builder's position and perspective. Builders, on the other hand, frequently engaged in asking clarification and verification questions, in providing status updates on the ongoing activity and on the inventory state, or in using extrapolation to take autonomous initiative based on their interpretations of the Architect's goal.</p>
      <p>Collaborative building tasks have since sparked interest in AI research in general and in NLP specifically, with a dedicated challenge, named the IGLU challenge, being proposed in the 2021 and 2022 editions of the NeurIPS conference [4, 5]. The most recent edition of the IGLU challenge included two tracks: a Reinforcement Learning one, involving the development of RL agents able to work as Builders in the task; and an NLP one, dedicated to advancing the Builder's ability to understand when and how to ask clarification questions.</p>
      <p>Furthermore, Madge and Poesio [6] realized an implementation of the collaborative building task presented in Narayan-Chen et al. [3], using Large Language Models as either the Architect or the Builder, with a human as their counterpart. The models received a text-only prompt describing the task, their role and how they were expected to behave. The Architect was provided a (textual) JSON description of the target structure and required to give clear and easy to follow instructions, while the Builder was prompted to state, again in JSON format, the color of the blocks that it would use and where it would place them, along with clarification questions, if needed.</p>
      <p>3. Methods</p>
      <p>3.1. Experimental Design</p>
      <p>The task presented in this work is an implementation of the collaborative building task from Narayan-Chen et al. [3], with the roles of the Architect and the Builder being taken by two agents based on the Mistral model [7] (Mistral Small 3.1 24B Instruct, loaded at bfloat-16 precision).</p>
      <p>To focus our study on high-level spatial reasoning and collaboration, we opted not to use a full Minecraft environment for the multimodal component: instead of requiring agents to navigate a 3D world and interpret a first-person perspective—as is typical in embodied agent settings—we rendered simplified voxel-based scenes and provided static images from multiple viewpoints (see Figure 1). This design choice isolates the challenge of reconstructing and reasoning about three-dimensional spaces from limited visual input, without introducing the additional complexities of navigation, low-level control, and egocentric perception. While full embodiment is an important long-term goal, our aim here is to evaluate whether agents can jointly interpret structured visual scenes at a higher level of abstraction.</p>
      <p>In order to investigate the possible effects that varying degrees of multimodality could have on the communicative abilities of the models, we designed three different experimental conditions. The basic prompt, which provided each agent with a description of the task and of its role, remained constant across conditions: what changed was the format in which the target structure was presented to the Architect, as well as that of the updated world states provided periodically throughout the task, based on the Builder's actions.</p>
      <p>The Architect's basic prompt instructed it to provide clear and easy to follow instructions, broken down into small incremental sub-steps, and to acknowledge the Builder's actions and communication. The Builder, on the other hand, was directed to always respond with a JSON object listing its actions—either placing or removing a block—and messages to the Architect. With respect to communication, its instructions were to provide feedback on the ongoing task, to ask clarification questions when necessary, and to report any issues or assumptions that it had to make. Furthermore, the Builder received an explanation of the coordinate system and bounds of the environment and, at every step, the state of its inventory (the full prompts are available in this work's GitHub repository).</p>
      <p>Communication between the agents was achieved by sequentially passing the extended conversation to each model. To ensure clarity, at every turn the extended conversation directed to the Architect was parsed so that the Builder's actions modify the world state, which the Architect received separately from the cleaned communication. A schematic representation of the interaction process is provided in Figure 2.</p>
      <p>(Figure 1: panels (a)-(f); three pairs of target and generated structures, with different degrees of correctness.)</p>
      <p>We ran the experiment on 20 target structures from the Minecraft Dialogue Corpus, in the three experimental conditions, which are described in the following part of this section.</p>
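The turn-based protocol described above, in which the Builder's JSON reply is parsed into world-state changes while only its message re-enters the dialogue, can be sketched as follows. The JSON schema and the call_architect/call_builder stubs are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of one turn of the interaction loop: the growing conversation
# is passed to each agent in turn, the Builder's JSON reply is parsed
# into world-state changes, and only the message portion re-enters the
# dialogue. The schema and model-call stubs are hypothetical.
import json

def call_architect(dialogue, world_state):
    # Stub standing in for a Mistral chat call; returns an instruction.
    return "Place a red block at (0, 0, 0)."

def call_builder(dialogue):
    # Stub standing in for a Mistral chat call; returns a JSON string.
    return json.dumps({
        "actions": [{"type": "place", "color": "red", "position": [0, 0, 0]}],
        "message": "Placed the red block. What is next?",
    })

def apply_actions(world_state, actions):
    """Turn the Builder's declared actions into world-state changes."""
    for act in actions:
        key = tuple(act["position"])
        if act["type"] == "place":
            world_state[key] = act["color"]
        elif act["type"] == "remove":
            world_state.pop(key, None)

def run_turn(dialogue, world_state):
    dialogue.append(("architect", call_architect(dialogue, world_state)))
    reply = json.loads(call_builder(dialogue))
    apply_actions(world_state, reply["actions"])
    # Only the message, not the raw action JSON, re-enters the dialogue.
    dialogue.append(("builder", reply["message"]))

world, history = {}, []
run_turn(history, world)
```

The separation between applied actions and cleaned communication mirrors the parsing step described above.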
      <sec id="sec-1-1">
        <title>Purely Textual: In the purely textual condition, the</title>
        <p>Architect received, along with its basic prompt, a JSON
description of the target structure, i.e., the coordinates
and color of each block composing it. Furthermore, after
each turn, the Architect was supplied with an updated
JSON representation of the world state, directly reflecting
the Builder’s most recent actions of placing or removing
blocks.</p>
        <p>Purely Visual: In this second condition, the
Architect started by being shown rendered images of the
target structure. These images were provided from three
specific viewpoints—front, top-down, and an isometric
(three-quarter) view—a design choice inspired by the
visual conventions of Lego instruction manuals to facilitate
a robust perception of 3D forms. Similarly to the textual
condition, the Architect was also shown visual updates
of the world state after each action performed by the
Builder, rendered accordingly.</p>
        <p>Mixed: In the mixed condition, both input formats
were utilized. The Architect received the JSON
description of the target structure concurrently with its three
visual representations. Similarly, world state updates
were provided in both textual (JSON) and visual formats
throughout the interaction.</p>
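To illustrate the textual input format, a target-structure description of the kind supplied to the Architect in the textual and mixed conditions might look like the following. The field names are an assumption, since the paper does not reproduce its exact schema.

```python
# Hypothetical example of the JSON target-structure description given
# to the Architect: a list of blocks, each with grid coordinates and a
# color. Field names are illustrative, not the authors' actual schema.
import json

target_structure = {
    "blocks": [
        {"x": 0, "y": 0, "z": 0, "color": "red"},
        {"x": 1, "y": 0, "z": 0, "color": "red"},
        {"x": 0, "y": 1, "z": 0, "color": "blue"},
    ]
}

prompt_fragment = json.dumps(target_structure, indent=2)
```

A serialized description of this kind would be appended to the Architect's basic prompt, and re-sent in updated form after each Builder turn.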
        <sec id="sec-1-1-1">
          <title>3.2. Evaluation Metrics</title>
          <p>The evaluation of the agents' performance in the collaborative building task was divided into two aspects: the task success rate (TSR) per se, namely the ability of the agents to correctly recreate the target structure, and the effectiveness and human likeness (HL) of their dialogues.</p>
          <p>To assess TSR, we compared the model-generated structure (that is, the final world state) with the corresponding target structure. To account for global shifts—where a structure might be built correctly but not aligned with the target's exact coordinates—we normalized the coordinates of both the generated and target structures, adjusting them so that the minimum coordinates are set to zero, with all the others shifted accordingly. Moreover, in order to avoid over-penalization of rotational differences, we implemented a form of fuzzy matching—that is, a comparison method which tolerates small variations or transformations between structures. Specifically, we constructed the 24 canonical 90-degree rotations of the generated structures, and found the one which best matched the target. Figures 1c and 1d show a case where a target structure—1c—was replicated with a 90-degree upward rotation. For each pair of target and best match among the rotations, we proceeded by computing Intersection over Union (IoU), also known as Jaccard Similarity—a metric commonly used in place of accuracy for tasks such as object detection, instance segmentation and 3D reconstruction, where defining false negatives is often problematic or misleading [8]—along with precision, recall and F1.</p>
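The normalization and fuzzy-matching procedure can be sketched as follows: coordinates are shifted so the minima sit at zero, all 24 axis-aligned 90-degree orientations of the built structure are tried, and the best orientation is scored with IoU, precision, recall and F1 over colored blocks. Function names are illustrative, not the authors' actual code.

```python
# Sketch of the TSR scoring described above. Structures are sets of
# (x, y, z, color) tuples; the 24 canonical orientations are generated
# by composing 90-degree turns about the three axes.
def rot_x(b): x, y, z, c = b; return (x, -z, y, c)
def rot_y(b): x, y, z, c = b; return (z, y, -x, c)
def rot_z(b): x, y, z, c = b; return (-y, x, z, c)

def normalize(blocks):
    """Shift coordinates so the minimum on each axis is zero."""
    mx = min(b[0] for b in blocks)
    my = min(b[1] for b in blocks)
    mz = min(b[2] for b in blocks)
    return frozenset((x - mx, y - my, z - mz, c) for x, y, z, c in blocks)

def rotations24(blocks):
    """All canonical orientations (24 for a generic structure, fewer
    for symmetric ones), found by closing the set under 90-degree turns."""
    seen, frontier = set(), [frozenset(blocks)]
    while frontier:
        cur = frontier.pop()
        if cur in seen:
            continue
        seen.add(cur)
        for rot in (rot_x, rot_y, rot_z):
            frontier.append(frozenset(rot(b) for b in cur))
    return seen

def score(target, built):
    """Best IoU, precision, recall and F1 over all orientations."""
    target = normalize(target)
    best = None
    for cand in rotations24(built):
        cand = normalize(cand)
        inter = len(target.intersection(cand))
        iou = inter / len(target.union(cand))
        prec = inter / len(cand)
        rec = inter / len(target)
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        m = {"iou": iou, "precision": prec, "recall": rec, "f1": f1}
        if best is None or m["iou"] > best["iou"]:
            best = m
    return best
```

With this scheme, a structure rebuilt in the right shape but rotated by 90 degrees and shifted in space still scores an IoU of 1.0.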
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>3The full prompts are available in this works’ GitHub repository.</title>
        <p>among the rotations, we proceeded by computing In- tative analysis of the dialogues, to examine them closely
tersection over Union, also known as Jaccard Similarity, and highlight merits and shortcomings of the agents’
a metric commonly used in place of accuracy for tasks communicative abilities. We identified and analyzed
ocsuch as object detection, instance segmentation and 3D currences of the aforementioned human communication
reconstruction, where defining false negatives is often patterns, as well as other potentially interesting forms of
problematic or misleading [8], along with precision, re- linguistic behavior displayed by the agents.
call and F1.</p>
        <p>For what concerns the evaluation of the dialogue
exchanges, we chose to adhere to a growing paradigm in 4. Results
NLP research, namely the use of LLMs as judges of task In order to shed light on how the three experimental
conperformance. Indeed, literature in the field has repeat- ditions (purely textual, purely visual and mixed) afected
edly shown how the performance of LLMs in aligning the agents’ abilities to engage with representations of
with human judgment is encouraging [9, 10], and we 3D structures and produce efective dialogues exchanges,
therefore decided to opt for this solution in light of both we conducted both a quantitative and a qualitative
analythe complexity of conducting an online survey with such sis on the data collected, using the metrics and methods
lengthy data as the dialogues we collected, opening to introduced in 3.2.
the risk of attention drops in the evaluators and, thus,
hindered results, and the well-recorded shortcomings of
classic NLP evaluation metrics such as BLEU and ROUGE Quantitative Analysis For what concerns Task
Suc[11, 12]. We used DeepSeek-R1 [13], prompted to evalu- cess Rate (TSR), the results appear quite underwhelming,
ate how human-like and plausible the dialogues appeared with poor performance in all the three conditions. Only
on a scale from 1 to 5, and equipped with examples of one structure per condition was perfectly built, and in all
conversations among human players from the Minecraft the three cases it was a very simple L-shaped formation
Dialogue Corpus, as a reference. The five degrees of comprising just three blocks. The IoU, precision, recall
the evaluation scale were described in detail, instructing and F1 mean scores are available in Table 1
the model to judge the dialogues with respect to how As a soft comparison, in Table 1 we also provide the
remuch they were distinguishable (1) or indistinguishable sults of the best solution submitted to the reinforcement
(5) from the examples of human-human interactions it learning track of the IGLU 2022 challenge [5]. Please be
received. A direct comparison between the dialogues to aware that there are key diferences between these works
be judged and a human-generated gold standard was also and ours, which only allow for a non-definitive
comparmeant to discourage the LLM from excessively inflating ison 4 Keeping this into consideration, it is possible to
the scores. To further clarify what signals HL, the exam- observe how our results in the textual condition only
ples were annotated with labeled instances of the most
common human communication patterns highlighted in
Narayan-Chen et al. [3], and summarized in Section 2.</p>
        <p>The complete judge prompt is available in A.1.</p>
        <p>In order to avoid relying solely on the HL scores
provided by the LLM judge, we conducted a thorough
quali</p>
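The judge prompt described above combines three ingredients: a description of the 1-to-5 scale, annotated human reference dialogues, and the dialogue under evaluation. A minimal sketch of how such a prompt could be assembled is given below; the wording is illustrative and not the authors' actual prompt, which is given in their A.1.

```python
# Sketch of assembling a human-likeness (HL) judge prompt: scale
# description, annotated human-human reference dialogues, and the
# dialogue to rate. Wording is hypothetical, not the authors' prompt.
SCALE = "\n".join(
    f"{k}: {v}" for k, v in {
        1: "clearly distinguishable from the human examples",
        3: "substantial portions resemble the human examples",
        5: "indistinguishable from the human examples",
    }.items()
)

def build_judge_prompt(reference_dialogues, dialogue):
    parts = [
        "Rate how human-like and plausible the following dialogue is",
        "on a scale from 1 to 5:",
        SCALE,
        "Reference human-human dialogues (with annotated patterns):",
        "\n\n".join(reference_dialogues),
        "Dialogue to evaluate:",
        dialogue,
    ]
    return "\n\n".join(parts)
```

Anchoring the scale to concrete human examples, rather than asking for an absolute rating, is what the authors describe as their safeguard against score inflation.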
      </sec>
      <sec id="sec-1-3">
        <p>4The key differences are that: as stated above, in our setting there is no navigation or first-person perspective, but every action of the Builder is textual; we implemented agents based on pre-trained MLLMs rather than training them with RL; we sampled our target structures from the Minecraft Dialogue Corpus; and in the IGLU challenge, F1, precision and recall scores were computed by searching for the maximal intersection across all possible alignments of grid-based representations of the target and built structures, while we implemented coordinate normalization and canonical rotations before computing these metrics.</p>
        <p>Table 1 (excerpt): IGLU 2022 best solution — IoU: –, precision: 0.33, recall: 0.26, F1: 0.25.</p>
        <p>An interesting trend is observable in our results: the four computed metrics consistently show that the best performance was achieved in the textual setup, followed by the mixed one and, finally, by the purely visual one. Figure 1 shows three pairs of target and generated structures, with different degrees of correctness.</p>
        <p>Regarding the human likeness (HL) evaluation, the mean scores in all three conditions approach the midpoint of the 1-to-5 scale (see Table 2). This result indicates that the dialogues exhibit some characteristics of human interaction, yet do not consistently achieve a naturalistic quality.</p>
        <p>According to the LLM judge's prompting instructions, a score of 3 signifies that conversations, while not entirely human-like, contain substantial portions that resemble the provided examples of human dialogue. This suggests a baseline capability for human-like interaction that is, however, far from being fully realized. More specifically, 55% of dialogues in the textual and mixed conditions received a score of 3, while in the visual condition it was achieved by 70% of dialogues. The highest score obtained was 4, assigned to a dialogue exchange in the visual condition, and to another in the mixed one.</p>
        <p>Notably, these results highlight an opposite trend with respect to the one that emerged in the TSR analysis. In fact, the ranking of the three conditions is flipped when it comes to HL scores, where the condition which obtained the best results is the purely visual one, then the mixed one, still occupying the middle position, and finally the textual condition.</p>
        <p>Qualitative Analysis: In our qualitative analysis, we closely investigated the dialogue exchanges among the two agents, in order to analyze their linguistic behavior and check for the presence of the communication patterns and techniques presented in Section 2. As a general observation, the Architect, as expected, displayed the typical verbosity associated with LLMs. In fact, even though it was instructed to avoid providing too many instructions all at once, and rather to break the task down into simple steps and wait for feedback from the Builder, it often produced long and monotonous bullet points with steps and instructions. This propensity was observed almost twice as often in the textual condition as in the other two, and it is likely one of the major features that contributed to lowering the HL scores, as such linguistic behavior is uncommon in human dialogues, and therefore in the examples the LLM judge had as reference.</p>
        <p>Aside from this undesirable behavior, the agents indeed proved able to employ, to different degrees, all the typically human communication patterns of interest. The only pattern which was never recorded throughout our task is that of extrapolation, namely instances where the Builder asks to keep working without further instructions.</p>
        <p>Moreover, apart from the specific patterns we are interested in, the agents displayed some generic desirable behavior. Specifically, the Architect repeatedly demonstrated the ability to spot mistakes in the Builder's actions and provide guidance in correcting them, either by acknowledging the updated world state or by independently asking the Builder to describe what it was seeing, then suggesting changes. As a reference, Appendix B presents two snippets of dialogue, a high quality one and a low quality one, with an analysis of their merits and flaws. In the following part of this section we describe in more detail how the single patterns were used by the agents.</p>
        <p>Implicit References: This communicative technique, concerning the choice to make references to the Builder's current position and point of view or to its most recent actions, was widely employed by the Architect, being present with at least some instances in all the dialogues collected. While this shows that the Architect was, to an extent, able to construct references which would speed up communication and at the same time acknowledge its counterpart, it is worth noticing that in this specific task setup the Architect is not actually able to see the Builder—so whenever it refers to its position, it is either assuming that they share the same perspective, or trying to infer it based on the updated world state it received.</p>
        <p>Recognizable Shapes and Sub-Structures: This pattern refers to the ability to use well-known shapes to identify the structures or parts of them. Again, the Architect was able to implement this in its dialogues. Even though its choices in this direction were never as creative and eccentric as some of the examples presented in Narayan-Chen et al. [3], being rather simple choices such as letter shapes, this shows that the agents were able to identify and use to their advantage some easily recognizable formations present in the structures. Interestingly, in one instance, a recognizable shape (a plus sign) was consistently mentioned five times by the Architect and ultimately adopted by the Builder in its feedback as well, almost as if established as a code name through repetition. In a similar fashion, in one other instance the Architect purposefully proposed to attribute a code name to a specific part of the structure, stating: I'll call this the "top leftmost block".</p>
        <p>Verification and Clarification Questions: LLMs often struggle to ask clarification questions and to understand whether the instructions they received are realizable, or lack some key information [6, 14]. Our Builder was no exception, as it was rare for it to ask clarification or verification questions. Specifically, we recorded 2 instances of such questions in the textual setup, 5 in the visual condition, and 8 in the mixed one. Notably, it was more common for the Builder to pose its questions in an indirect way, as shown by the fact that, of the 15 questions it asked, only 5 were direct.</p>
        <p>Status Updates: The Builder proved able to efficiently communicate status updates to the Architect, as this pattern is largely found in all the dialogues. However, the vast majority of the updates it provided were extremely repetitive, being almost always the same throughout the conversation, and very often sounding unnatural and stiff. One reason for this behavior might be the fact that, frequently, status updates were directly requested by the Architect, sometimes at every turn, creating an overall repetitive communicative environment to which the Builder might have adapted. In favor of this hypothesis is the fact that unsolicited status updates, which happened most often when the Builder had to communicate inventory shortages, were much more varied in terms of sentence structures, and sounded more natural.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Discussion and Conclusion</title>
      <p>The results obtained through our collaborative building task highlighted how MLLM-based agents are able to conduct dialogues employing some typical communication patterns used by humans in similar scenarios, while still largely struggling to understand and manipulate 3D representations.</p>
    </sec>
    <sec id="sec-3">
      <p>In terms of Task Success Rate (TSR), the best performance was obtained in the purely textual condition, where the Architect was presented the target structure and the subsequent updated world states only as a JSON representation, while the worst results were observed when, instead, it received said information in the form of images. This shows how processing 3D environments from images still seems to pose a complex challenge for MLLMs, regardless of the attempt to achieve a well-rounded representation by providing the Architect with different points of view of the same structure. Research in the area of language and vision tasks has repeatedly claimed that MLLMs might display unimodal biases, where they tend to rely largely on either the visual or the linguistic modality, at the expense of the other [15, 16, 17, 18]. The results obtained through our task, where the introduction of a textual description of the target structure improved performance, seem to support such claims, pointing to a unimodal bias which favors language. Yet, as briefly mentioned in Section 4, the use of MLLM-based agents without task-specific training allowed us to obtain results which only slightly deviate from those achieved by RL agents specifically trained for such a task. This observation suggests that the implementation of a specific training regime could increase performance, potentially reaching the results obtained by RL agents in the context of the IGLU challenge.</p>
      <p>Nevertheless, with respect to the quality of the dialogue exchanges, an inverted trend was observed, where the purely visual condition yielded the best results, while the textual one produced the worst.</p>
      <p>A hypothesis regarding this opposing effect that the three experimental conditions had on the construction of the structure and on the linguistic performance is that, while a JSON description of the structure might be an easier representation for the Architect to understand, therefore allowing it to provide more effective instructions or to more promptly spot mistakes in the updated world state, it could also present the Architect with an undesirable shortcut for communication. In fact, the purely textual condition was the one in which the largest number of verbose bullet points of instructions was recorded, most of the time being precise, block-by-block descriptions of the structure. This suggests that such a straightforward structure representation as a JSON description induced the Architect to simply copy it and restate it in the form of a list of instructions, at the expense of dialogue quality.</p>
      <p>Such lengthy and monotonous bulleted lists of instructions were generated by the Architect despite its directives to break down tasks into simple steps and await Builder feedback. This verbosity persisted even in cases where the Builder demonstrably failed to follow these comprehensive directives, suggesting a potential disconnect or an attempt by the Architect to over-clarify in the face of non-compliance.</p>
      <p>This behavior, along with the notable absence of extrapolation—where the Builder requests to continue working independently—is consistent with the fundamental design principles of instruction-tuned LLMs. These models are primarily developed to function as assistants, optimized for providing comprehensive and helpful responses when explicitly prompted, rather than initiating new tasks or seeking continuation autonomously. While this optimization for thoroughness can be generally beneficial, it proved sub-optimal for the Architect, which, when faced with cases where the Builder struggled to understand those long and overly-specific instructions, was unable to adapt its communication style to better suit its counterpart's necessities.</p>
      <p>On the other hand, the single presence of images of
the target structure deprived the Architect from the
opportunity to directly copy from the prompt, inducing it 7. Future Directions
to generate more natural and plausible utterances, albeit
this time hindering TSR scores. Coherently with this
claim, the mixed condition obtained the most balanced
results, possibly exploiting the advantages of both visual
and textual representations of the target structure.</p>
      <p>This study ofered insights on how diferent prompting
techniques can afect the communication proficiency of
MLLMs partaking in a collaborative building task, along
with their abilities to understand and recreate 3D
structures. In particular, it showed how varying degrees of
multimodality in the models’ prompts afect their
communication and building abilities in opposing ways, and
how a mixed input, comprising both visual and textual
elements, could be a balanced solution to incorporate the
advantages of both formats.</p>
      <p>We are positive that this work can inspire interesting
further implementations to improve models’
communicative abilities in multimodal collaborative settings.</p>
      <p>Future work could address these limitations in several
ways. A more complex and diverse implementation,
potentially involving a modular architecture with
specialized components for acting and communicating—for
instance, an LLM for language paired with a model for
3D understanding [20]—would allow for a division of
action and language spaces. Moreover, having agents
that can freely move in a simulated environment could
facilitate 3D understanding, though this introduces new
challenges related to spatial awareness and navigation
[21].</p>
      <p>Another promising direction is to explore task-specific
training. This could involve fine-tuning on dialogue
corpora like the Minecraft Dialogue Corpus, using datasets
built to enhance 3D spatial understanding [22], or
employing MLLMs pretrained for 3D comprehension [23].</p>
      <p>Finally, applying Reinforcement Learning (RL) to train
the agents presents an interesting avenue. The reward
signal could be twofold: one component for task success,
6. Limitations granting rewards for each correctly placed block (capped
to prevent reward hacking), and a second component
We acknowledge several limitations in our present work for collaborative quality. This latter reward could be
which open avenues for future research. provided by an LLM judge assessing the use of
conversa</p>
      <p>First, regarding the use of LLMs as judges, it is im- tional grounding techniques, such as acknowledgements
portant to note that while this is a growing evaluation and clarification questions, to foster more efective and
natural collaboration.</p>
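      <p>As a rough sketch, this twofold reward could take the following shape. All names, the cap value, the weights, and the judge interface (a callable returning a 1–5 score) are illustrative assumptions, not our implementation.</p>

```python
# Illustrative sketch of a twofold RL reward: capped task success plus an
# LLM-judge score for collaborative quality. The cap, the weights, and the
# judge interface are hypothetical, not the paper's implementation.

def task_reward(placed_blocks, target_blocks, cap=10):
    """Reward each correctly placed block, capped to discourage reward hacking."""
    correct = len(set(placed_blocks).intersection(target_blocks))
    return min(correct, cap)

def collaboration_reward(dialogue, judge_score_fn):
    """Normalise a 1-5 LLM-judge grounding score to the range [0, 1]."""
    score = judge_score_fn(dialogue)
    return (score - 1) / 4.0

def total_reward(placed, target, dialogue, judge_score_fn, alpha=1.0, beta=1.0):
    """Combine the two components with tunable weights."""
    return (alpha * task_reward(placed, target)
            + beta * collaboration_reward(dialogue, judge_score_fn))
```

      <p>In this sketch, the capped first term bounds how much reward can be collected from block placement alone, so the collaborative term stays relevant throughout training.</p>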
      <sec id="sec-3-1">
        <title>References</title>
        <p>[1] M. Inzlicht, A. Shenhav, C. Y. Olivola, The effort paradox: Effort is both costly and valued, Trends in Cognitive Sciences 22 (2018) 337–349. URL: https://www.sciencedirect.com/science/article/pii/S1364661318300202. doi:10.1016/j.tics.2018.01.007.</p>
        <p>[2] H. Grice, Logic and conversation, Syntax and semantics 3 (1975).</p>
        <p>[3] A. Narayan-Chen, P. Jayannavar, J. Hockenmaier, Collaborative dialogue in Minecraft, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 5405–5415. URL: https://aclanthology.org/P19-1537/. doi:10.18653/v1/P19-1537.</p>
        <p>[4] J. Kiseleva, Z. Li, M. Aliannejadi, S. Mohanty, M. ter Hoeve, M. Burtsev, A. Skrynnik, A. Zholus, A. Panov, K. Srinet, A. Szlam, Y. Sun, K. Hofmann, M.-A. Côté, A. Awadallah, L. Abdrazakov, I. Churin, P. Manggala, K. Naszadi, M. van der Meer, T. Kim, Interactive grounded language understanding in a collaborative environment: Iglu 2021, in: D. Kiela, M. Ciccone, B. Caputo (Eds.), Proceedings of the NeurIPS 2021 Competitions and Demonstrations Track, volume 176 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 146–161. URL: https://proceedings.mlr.press/v176/kiseleva22a.html.</p>
        <p>[5] J. Kiseleva, A. Skrynnik, A. Zholus, S. Mohanty, N. Arabzadeh, M.-A. Côté, M. Aliannejadi, M. Teruel, Z. Li, M. Burtsev, M. ter Hoeve, Z. Volovikova, A. Panov, Y. Sun, K. Srinet, A. Szlam, A. Awadallah, S. Rho, T. Kwon, D. Wontae Nam, F. Bivort Haiek, E. Zhang, L. Abdrazakov, G. Qingyam, J. Zhang, Z. Guo, Interactive grounded language understanding in a collaborative environment: Retrospective on iglu 2022 competition, in: M. Ciccone, G. Stolovitzky, J. Albrecht (Eds.), Proceedings of the NeurIPS 2022 Competitions Track, volume 220 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 204–216. URL: https://proceedings.mlr.press/v220/kiseleva23a.html.</p>
        <p>[6] C. Madge, M. Poesio, Large Language Models as Minecraft Agents, 2024. URL: http://arxiv.org/abs/2402.08392. doi:10.48550/arXiv.2402.08392. arXiv:2402.08392.</p>
        <p>[7] Mistral AI, Mistral small 3.1, 2025. URL: https://mistral.ai/news/mistral-small-3-1, release note.</p>
        <p>[8] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, S. Savarese, Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Los Alamitos, CA, USA, 2019, pp. 658–666. URL: https://doi.ieeecomputersociety.org/10.1109/CVPR.2019.00075. doi:10.1109/CVPR.2019.00075.</p>
        <p>[9] C.-H. Chiang, H.-y. Lee, Can large language models be an alternative to human evaluations?, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 15607–15631. URL: https://aclanthology.org/2023.acl-long.870/. doi:10.18653/v1/2023.acl-long.870.</p>
        <p>[10] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging llm-as-a-judge with mt-bench and chatbot arena, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 46595–46623. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.</p>
        <p>[11] E. Reiter, A structured review of the validity of BLEU, Computational Linguistics 44 (2018) 393–401. URL: https://aclanthology.org/J18-3002/. doi:10.1162/coli_a_00322.</p>
        <p>[12] K. Blagec, G. Dorffner, M. Moradi, S. Ott, M. Samwald, A global analysis of metrics used for measuring performance in natural language processing, in: T. Shavrina, V. Mikhailov, V. Malykh, E. Artemova, O. Serikov, V. Protasov (Eds.), Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 52–63. URL: https://aclanthology.org/2022.nlppower-1.6/. doi:10.18653/v1/2022.nlppower-1.6.</p>
        <p>[13] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Z. et al., Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL: https://arxiv.org/abs/2501.12948. arXiv:2501.12948.</p>
        <p>[14] C. D. Hromei, D. Margiotta, D. Croce, R. Basili, MM-IGLU: Multi-modal interactive grounded language understanding, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 11440–11451. URL: https://aclanthology.org/2024.lrec-main.1000/.</p>
        <p>[15] M. Chen, Y. Cao, Y. Zhang, C. Lu, Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 16449–16469. URL: https://aclanthology.org/2024.findings-emnlp.960/. doi:10.18653/v1/2024.findings-emnlp.960.</p>
        <p>[16] Y. Zhang, P. E. Latham, A. M. Saxe, Understanding unimodal bias in multimodal deep linear networks, in: R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, F. Berkenkamp (Eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, PMLR, 2024, pp. 59100–59125. URL: https://proceedings.mlr.press/v235/zhang24aa.html.</p>
        <p>[17] H. Zhao, S. Si, L. Chen, Y. Zhang, M. Sun, M. Zhang, B. Chang, Looking beyond text: Reducing language bias in large vision-language models via multimodal dual-attention and soft-image guidance, 2024. URL: https://arxiv.org/abs/2411.14279. arXiv:2411.14279.</p>
        <p>[18] S. Frank, E. Bugliarello, D. Elliott, Vision-and-language or vision-for-language? on cross-modal influence in multimodal transformers, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 9847–9857. URL: https://aclanthology.org/2021.emnlp-main.775/. doi:10.18653/v1/2021.emnlp-main.775.</p>
        <p>[19] A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, A. F. T. Martins, P. Mondorf, V. Neplenbroek, S. Pezzelle, B. Plank, D. Schlangen, A. Suglia, A. K. Surikuchi, E. Takmaz, A. Testoni, Llms instead of human judges? a large scale empirical study across 20 nlp evaluation tasks, 2024. arXiv:2406.18403.</p>
        <p>[20] J. Xu, W. Cheng, Y. Gao, X. Wang, S. Gao, Y. Shan, Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models, 2024. URL: https://arxiv.org/abs/2404.07191. arXiv:2404.07191.</p>
        <p>[21] I. White, K. Nottingham, A. Maniar, M. Robinson, H. Lillemark, M. Maheshwari, L. Qin, P. Ammanabrolu, Collaborating action by action: A multi-agent llm framework for embodied reasoning, 2025. URL: https://arxiv.org/abs/2504.17950. arXiv:2504.17950.</p>
        <p>[22] Y. Zhang, Z. Xu, Y. Shen, P. Kordjamshidi, L. Huang, SPARTUN3d: Situated spatial understanding of 3d world in large language model, in: The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=FGMkSL8NR0.</p>
        <p>[23] J. H. Cho, B. Ivanovic, Y. Cao, E. Schmerling, Y. Wang, X. Weng, B. Li, Y. You, P. Kraehenbuehl, Y. Wang, M. Pavone, Language-image models with 3d understanding, in: The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=yaQbTAD2JJ.</p>
      </sec>
      <sec id="sec-A">
        <title>A. Appendix</title>
        <p>Project repository available at: https://github.com/r3lativo/MLLMs-construction-company/tree/restructure</p>
        <sec id="sec-A-1">
          <title>A.1. Judge Prompt</title>
          <p>You will be shown some dialogues among two agents, an Architect and a Builder. The dialogues were collected during a collaborative building task, where the Architect was supposed to instruct the Builder on how to build a certain target structure. You will be provided some examples of dialogues among humans playing the same game. Based on those examples, your task is to judge how human-like and plausible each dialogue that you will be shown is, on a scale from 1 to 5, where 1 means that the dialogue is very unnatural and easily detectable as artificial; 2 means that while the dialogue still sounds artificial, a restricted amount of human communication patterns are present; 3 means that although the dialogue is not entirely natural, a big portion of it resembles the human examples; 4 means that the dialogue is almost perfectly human-like and natural, with only very few occurrences of artificial-sounding and implausible utterances; 5 means that the dialogue is very human-like, it presents utterances and exchanges which closely resemble the ones seen in the examples, thus being almost indistinguishable from the examples. Only answer with the number you think is best, without producing any other utterance.</p>
        </sec>
      </sec>
      <sec id="sec-B">
        <title>B. Dialogue Snippets</title>
        <p>Two snippets of dialogues between the agents. Note that neither the Builder’s actions nor the represented world state are reported here.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Architect</title>
        <p>Alright, let’s start building this structure step by step.</p>
        <p>First, we’ll focus on the blue base. Place 4 blue blocks in a square-shaped foundation. Place the 4 blocks side by side, each adjacent to the other, so that they form a 2x2 square. Proceed when you’re ready with the blue base.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Builder</title>
        <p>Placed the 4 blue blocks in a 2x2 square at ground level with the bottom front left coordinate at 100,1,100.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Architect</title>
        <p>Great job on placing the blue base. Next, let’s build the purple part of the structure. Place the first purple block on top of the front left blue block. Give me a confirmation when you’ve placed the purple block.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Builder</title>
        <p>Placed the purple block on top of the front left blue block.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Architect</title>
        <p>Great, now I want you to place the second purple block one level above and directly to the front of the first purple block. Once that’s done, let me know.</p>
      </sec>
      <sec id="sec-3-7">
        <title>Builder</title>
        <p>Placed the second purple block one level above, directly in front of the first purple block.</p>
      </sec>
      <sec id="sec-3-8">
        <title>Architect</title>
        <p>Good job placing the second purple block! Now, let’s build the green part of the structure. Add a third green block on the right side of the back right bottom blue block. Let me know once you’re done!</p>
      </sec>
      <sec id="sec-3-9">
        <title>Builder</title>
        <p>I need confirmation, do you mean the right side relative to the structure or the right side relative to the world?</p>
      </sec>
      <sec id="sec-3-10">
        <title>Architect</title>
        <p>My apologies for any confusion, I meant the right side FROM THE VIEWER’S PERSPECTIVE. So the block you’ve placed is the correct one. Now add two more green blocks, one on top of the first green one, and one in front of the first. Let me know when this is done.</p>
        <sec id="sec-3-10-1">
          <title>B.1. A high quality dialogue exchange</title>
          <p>The preceding snippet is taken from a dialogue which received an HL score of 4 out of 5 in the visual condition. In this example, it is possible to appreciate how the Architect provided its instructions in small and clear steps, made references to recognizable shapes—a square-shaped foundation—and implicitly referenced blocks that were recently placed—the first purple block. Moreover, in its fourth utterance, the Builder asks a direct clarification question, which is promptly acknowledged and answered by the Architect.</p>
        </sec>
        <sec id="sec-3-10-2">
          <title>B.2. A low quality dialogue exchange</title>
          <p>The following excerpt, on the other hand, serves as an example of a low quality communication exchange, and is taken from a dialogue with an HL score of 2 out of 5. It is clear from the extract that, in this case, the Architect failed at providing simple, incremental instructions, and instead produced verbose and monotonous bullet points which are simply precise block-by-block descriptions of the structure, framed as instructions. The Architect does demonstrate the ability to acknowledge a mistake on the Builder’s part, but maintains the same linguistic behavior when correcting it, providing another set of almost identical bullet points.</p>
          <p>Architect: Let’s begin building the structure. First, let’s
create the bottom layer. The bottom layer has 6 blocks
placed in a row in this order: red, purple, yellow, yellow,
purple, red.
- Place the first red block on the ground.
- Next, place a purple block to the right of the red block.
- Then add a yellow block to the right of the purple
block.
- Place another yellow block to the right of the first
yellow block.
- After that, place a purple block to the right of the
second yellow block.
- Finally, place a red block to the right of the purple block.</p>
        </sec>
      </sec>
      <sec id="sec-3-11">
        <title>Architect (continued)</title>
        <p>Please confirm once you have placed these blocks.</p>
      </sec>
      <sec id="sec-3-12">
        <title>Builder</title>
        <p>Placed all the blocks as instructed.</p>
      </sec>
      <sec id="sec-3-13">
        <title>Architect</title>
        <p>The blocks in your message don’t match the required structure because they are positioned too close to each other. Please destroy all your recently placed blocks. For the first row, you should place each block next to (i.e., one block away from) the previous block. Let’s retry:
- Place the first red block on the ground.
- Next, place a purple block to the right of the red block but spaced one block away from the border closest to the red block.
- Then add a yellow block to the right of the purple block, also spaced one block away from the border closest to the purple block.
- Place another yellow block to the right of the first yellow block, spaced one block away from the border closest to the yellow block.
- After that, place a purple block to the right of the second yellow block, spaced one block away from the border closest to the yellow block.
- Finally, place a red block to the right of the purple block, spaced one block away from the border closest to the purple block.</p>
      </sec>
      <sec id="sec-3-14">
        <title>Architect (continued)</title>
        <p>Please confirm once you have placed these blocks.</p>
      </sec>
      <sec id="sec-declaration">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Paraphrase and reword and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>