<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CoCoMaMa: Contextual Combinatorial Multi-Armed Bandit Router for Multi-Agent Systems with Volatile Arms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonathan Rau</string-name>
          <email>j.rau.1@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonathan Bader</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Wiesner</string-name>
          <email>wiesner@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Odej Kao</string-name>
          <email>odej.kao@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Berlin</institution>
          ,
          <addr-line>Straße des 17. Juni 135, 10623 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Agentic Large Language Models (LLMs) are designed for specialized objectives using fine-tuning, prompting techniques, and tool calling to outperform general-purpose models in their expert domains. Standardization efforts like the Agent2Agent Protocol could drastically increase the number and heterogeneity of experts available via the Web. A router is required to find the best agent for any given task. However, existing LLM routing methods use a fixed-size pool of models and often rely on offline training data such as benchmarks. We propose CoCoMaMa and Neural-CoCoMaMa, a combinatorial contextual volatile multi-armed bandit approach that leverages similarities between tasks and agents by learning from online feedback. It can handle volatile arms by incorporating agent cards as defined by the Agent2Agent Protocol, without requiring changes to internal structures or retraining. Our experimental evaluation shows that CoCoMaMa and Neural-CoCoMaMa achieve better results than the respective state-of-the-art algorithms on the LLM routing dataset SPROUT and on a novel extended version of SPROUT with synthetic specialized agents.</p>
      </abstract>
      <kwd-group>
        <kwd>Multi-Agent Systems</kwd>
        <kwd>Multi-Armed Bandit</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Agent routing</kwd>
        <kwd>Online learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        options are usually referred to as sleeping [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] or volatile bandits [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Applying such MAB algorithms
to the routing problem in hMAS has, to the best of our knowledge, not been examined.
      </p>
      <p>In this paper, we propose using Contextual Combinatorial MAB with volatile arms to route tasks to
agents based on their agent card, theoretically enabling an infinite number of volatile agents to enter
and leave the pool without retraining a router.</p>
      <p>
        Contributions. This paper makes the following contributions:
• We present CoCoMaMa, a novel MAB approach that learns from online feedback and efficiently
explores and exploits similarities between tasks and agents in high-dimensional context spaces
by adaptively discretizing the context space following statistically informed decisions.
• We propose Neural-CoCoMaMa, which improves the CoCoMaMa method by leveraging the
benefits of neural networks while maintaining exploration behavior.
• We evaluate our approaches using the LLM routing dataset SPROUT [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and a novel setup with
synthetic specialized agents and compare them to three state-of-the-art contextual combinatorial
volatile MAB algorithms.
      </p>
      <p>• We provide an open-source implementation of our CoCoMaMa methods 1.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        We outline related work focusing on LLM ensemble methods. Chen et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] provide a classification
with three groups: route to an expert before the inference step, combine multiple models during
inference within the model architecture, and combine the results of different models after inference.
Treating Web Agents as black boxes rules out the possibility of ensemble methods applied during
inference. Thus, that path is neglected in the remainder of this work.
      </p>
      <p>
        Before inference: Shnitzer et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] propose repurposing benchmark datasets to learn router models
for LLM selection by training a classifier for each candidate LLM. Many similar approaches are proposed
to route to an expert from a fixed set of candidate models [
        <xref ref-type="bibr" rid="ref22 ref23 ref24 ref25 ref26 ref27 ref28">22, 23, 24, 25, 26, 27, 28</xref>
        ], while some of
them also aim to balance cost and performance. Online algorithms could also be used to train the
router. Sikeridis et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] propose using reinforcement learning to train a router based on online
user or AI feedback [
        <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
        ]. There is also recent work looking into various bandits for online LLM
routing [
        <xref ref-type="bibr" rid="ref19 ref25 ref32 ref33 ref34">32, 19, 25, 33, 34</xref>
        ]. Many of them create a task requirement vector using an embedding
model like [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] and formulate the routing problem as a contextual bandit [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        After inference: Cascading [
        <xref ref-type="bibr" rid="ref36 ref37 ref38">36, 37, 38</xref>
        ] can be used to escalate a task to a model with higher costs and
higher expected quality, in case the answer of the initial model does not meet quality requirements.
This requires feedback on the quality, which could be obtained by asking users, though using Large
Reasoning Models as a Judge is also viable [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] with limitations [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. Majority voting to select the best
answer is presented in Agent Forest [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. Regenerating an answer after querying and ranking multiple
agents was shown by Lv et al. [
        <xref ref-type="bibr" rid="ref40">40</xref>
          ].
      </p>
      <p>Contrary to the solutions above, our solution considers metadata from agent cards and is built with
large and volatile numbers of agents as routing targets in mind.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation</title>
      <p>Consider a sequence of tasks indexed by time steps t ∈ {1, 2, . . . , T}. A task is a natural-language
user query or intent that requires at least one agent response, but more responses do not hurt, e.g.,
weather retrieval or booking assistance. For each task t, there is a set of available agents A_t. Each agent
a ∈ A_t has distinct capabilities described by its agent card c_a. Agents may appear in multiple rounds,
but can only be selected once per round. To ensure distinguishability, all agents in a round must have
unique agent cards. If an agent’s capabilities or metadata change (e.g., through an update), it receives a</p>
      <sec id="sec-3-1">
        <title>1 https://github.com/dos-group/CoCoMaMa</title>
        <p>new agent card and is treated as a distinct agent. However, similar capabilities yield similar embeddings,
allowing the router to transfer prior knowledge through semantic similarity in the context space.</p>
        <p>
          Both the task t and an agent card c_a can be mapped into a multi-dimensional context space. By
combining the context of task t with that of c_a, we obtain the context x_{t,a} for the arm (t, a). The true
expected performance of agent a on task t, denoted r_{t,a}, is initially unknown. After the agent provides
an answer, a “judge” infers r_{t,a} by assigning a continuous score in [0, 1]. This feedback can come from
users or other evaluation methods [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
        <p>Because invoking and scoring agents typically incurs cost, we impose a fixed budget B that limits
how many agents may be selected at each round. Following MAB terminology, we call the subset of
chosen agents a super arm, denoted S_t ⊆ A_t, with |S_t| = B. The reward on task t is given by
R(S_t) = max_{a ∈ S_t} r_{t,a},
reflecting the requester’s interest in only the best individual performance among the selected agents. The
regret at task t is then the difference between the maximum achievable reward, i.e., max_{a ∈ A_t} r_{t,a},
and the actual reward R(S_t). The router’s objective is to select each S_t in order to minimize the
cumulative regret over all T rounds, i.e.,
min Σ_{t=1}^{T} [ max_{a ∈ A_t} r_{t,a} − R(S_t) ].</p>
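        <p>The reward and regret definitions above can be simulated in a few lines; the following sketch uses random true scores and a random stand-in router (all numbers are illustrative, not from the paper):</p>
        <p>
```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: true expected scores r[t, a] for T tasks and 5 agents.
T, n_agents, budget = 100, 5, 2
r = rng.uniform(0.0, 1.0, size=(T, n_agents))

def super_arm_reward(scores, selected):
    # Reward of a super arm: best individual score among the selected agents.
    return max(scores[a] for a in selected)

cumulative_regret = 0.0
for t in range(T):
    selected = rng.choice(n_agents, size=budget, replace=False)  # stand-in router
    optimal = super_arm_reward(r[t], range(n_agents))  # oracle: best single agent
    cumulative_regret += optimal - super_arm_reward(r[t], selected)
```
        </p>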
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Approach</title>
      <p>The core idea of contextual, combinatorial, volatile MAB algorithms applied on hypermedia Multi-Agent
Systems (hMAS) is to continuously learn and refine the understanding of the conceptual requirements
per task and capabilities of each agent based on feedback. E.g., we might have observed that weather
agent A provided a good result for the task "What is the weather going to be like in Bologna tomorrow?".
Then, the weather agent A might also perform well on the task "What is the weather going to be like
in Rome tomorrow?", because the tasks are very similar to each other. Later, we observe that weather
agent A performs badly on the task "What is the weather going to be like in Berlin?", but we also
tried weather agent B, which provides a good result. Consequently, a router might learn that requests
for weather information in Italy should be routed to weather agent A, and requests for locations in
Germany should be routed to weather agent B. Therefore, we need to extract features describing each
task and agent that allow us to exploit semantic similarities. This is described in Section 4.1. Next,
algorithms that are capable of exploring the capabilities of agents and exploiting good task-agent
assignments are covered in Section 4.2.</p>
      <sec id="sec-4-1">
        <title>4.1. Constructing the Context Space</title>
        <p>
To apply contextual bandit algorithms effectively, both the task and the agent must be mapped and
combined into suitable feature vectors that semantically describe how a specific task is assigned to a
specific agent. Using pre-trained Sentence Transformers [
          <xref ref-type="bibr" rid="ref35 ref41">35, 41</xref>
          ] to produce compact embeddings out
of a task is an established practice in LLM-routing (e.g. [
          <xref ref-type="bibr" rid="ref22 ref24">24, 22</xref>
          ]).
        </p>
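        <p>A minimal sketch of this construction (deterministic pseudo-embeddings stand in for a real sentence-transformer model, and the dimensionality is arbitrary):</p>
        <p>
```python
import numpy as np

def embed(text, dim=8):
    # Stand-in for a sentence-transformer: a deterministic pseudo-embedding
    # seeded by the text hash (illustrative only).
    seed = abs(hash(text)) % (2 ** 32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def arm_context(task, agent_card):
    # Concatenating task and agent-card embeddings preserves all information,
    # unlike adding, multiplying, or reducing them to a single similarity score.
    return np.concatenate([embed(task), embed(agent_card)])
```
        </p>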
        <p>We propose creating feature vectors for the agent cards using the same method. This yields a pair of
vectors for each task-agent combination, where semantically similar tasks and agent cards have similar
embeddings, e.g., a high cosine similarity. The two embedding vectors are concatenated to form the
unified context , for the task-arm pair. This preserves all the available information, contrary to
adding or multiplying the vectors or applying similarity metrics such as the Euclidean distance.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. The Contextual, Combinatorial, Volatile Multi-Armed Bandits</title>
        <p>Three state-of-the-art algorithms that support the contextual, combinatorial, and volatile setting were
identified and are presented briefly. This is followed by the introduction of our CoCoMaMa and
Neural-CoCoMaMa algorithms.</p>
        <p>
          4.2.1. CC-MAB
Chen et al. [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ] propose splitting the context space into evenly sized, non-overlapping regions and
balancing exploration of unknown regions with exploitation of known regions of high expected
reward in their CC-MAB algorithm. Whenever an arm is played, the statistics for the respective context
region are updated.
        </p>
        <p>
          4.2.2. ACC-UCB
Nika et al. [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] introduce the Adaptive Contextual Combinatorial Upper Confidence Bound
(ACCUCB) algorithm, which uses a tree-based approach to iteratively partition the context space into
non-overlapping regions of varying sizes using sets of hypercubes to define a region. A set containing a
single hypercube is split by creating non-overlapping sets of hypercubes with half the side length. E.g., a
context region containing a 2x2 chessboard could be split into sets based on rows, columns, or black and
white tiles. Using sets of hypercubes to define regions consumes many resources in high-dimensional
context spaces. E.g., using "all-MiniLM-L6-v2" [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] as an embedding model yields a 768-dimensional
context space, which would be split into 2^768 hypercubes. Initializing that many objects is not feasible
on standard hardware (assuming 64GB memory as of 2025).
        </p>
        <p>
          We change the implementation of ACC-UCB by using hyperrectangles, defined by a center vector
and a length vector, to mark context regions. Nodes are split at the center along the dimension with the
highest length (random selection to break ties). We term that variant High-Dimensional-ACC-UCB
(HD-ACC-UCB) for the remainder of this work. It effectively just adds a small constraint to the core
concept of Nika et al. [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ]: splitting a region into "black and white tiles" is prohibited.
        </p>
        <p>
          4.2.3. Neural-MAB
Lin et al. [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] follow a greedy selection strategy in their Neural-MAB algorithm, using two neural
networks with one hidden layer each to predict the reward of individual arms and of the super arm.
        </p>
        <p>
          4.2.4. CoCoMaMa
We hypothesize that making statistically informed decisions on the split condition and the split location
can yield better results on high-dimensional context spaces. Especially in the field of hMAS, we expect
many heterogeneous tasks and versatile agents, which require many dimensions to capture their
nuances. Thus, we propose the CoCoMaMa algorithm (Algorithm 1) as an improvement over
HD-ACC-UCB. We maintain for each leaf node h ∈ P_t the following metrics:
• x̄(h) ∈ R^D: running mean of the arm contexts (i.e., the average context vector).
• r̄(h) ∈ R: running mean of the reward.
• Cov(h) ∈ R^D: running covariances between each context dimension and the reward.
• Var(r(h)) ∈ R: running variance of the reward.
• N(h) ∈ N: number of times the node has been played.
        </p>
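        <p>The node statistics above can be maintained online in Welford style; a sketch (class and method names are illustrative):</p>
        <p>
```python
import numpy as np

class NodeStats:
    """Running statistics of a leaf node: mean context, mean reward,
    reward variance, and per-dimension context-reward covariance."""
    def __init__(self, dim):
        self.n = 0
        self.mean_x = np.zeros(dim)  # running mean of arm contexts
        self.mean_r = 0.0            # running mean of rewards
        self.m2_r = 0.0              # sum of squared reward deviations
        self.c_xr = np.zeros(dim)    # co-moment of context and reward

    def update(self, x, reward):
        # Welford-style single-pass update.
        self.n += 1
        dx = x - self.mean_x
        dr = reward - self.mean_r
        self.mean_x += dx / self.n
        self.mean_r += dr / self.n
        # dr uses the old reward mean, (reward - self.mean_r) the updated one.
        self.m2_r += dr * (reward - self.mean_r)
        self.c_xr += dx * (reward - self.mean_r)

    def reward_variance(self):
        return self.m2_r / self.n if self.n > 1 else 0.0

    def covariance(self):
        return self.c_xr / self.n if self.n > 1 else np.zeros_like(self.c_xr)

    def split_dimension(self):
        # Dimension with the highest absolute context-reward covariance.
        return int(np.argmax(np.abs(self.covariance())))
```
        </p>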
        <p>
          • p(h): the parent node, with all the associated metrics above at the state when it was split.
For each newly observed data point (x_{t,a}, r_{t,a}), the statistics for the node can be updated using
Welford’s Algorithm [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ]. We introduce the combined confidence of a node and its parent node, defined
as c(h, p(h)) := √((2 + log N(p(h))) / N(h)). The node index is defined as:
I(h) := max { r̄(h) + c(h, p(h)), r̄(p(h)) + c(h, p(h)) } (1)
A high variance of rewards for a node could indicate that splitting the node could yield a good and a
bad performing region. Therefore, it seems desirable to split the nodes with the highest potential based
on the variance. A node is split if the variance of rewards of a node is κ times bigger than the weighted
average variance of rewards of all leaf nodes:
Var(r(h)) &gt; κ · ( Σ_{h′ ∈ P_t} N(h′) · Var(r(h′)) ) / ( Σ_{h′ ∈ P_t} N(h′) ) (2)
with κ being defined as a hyperparameter. Frequent splitting could yield branches with relatively
high confidence values c(h, p(h)) for each individual node in a branch. Adding the following dampening
condition circumvents overcommitment of the algorithm to explore and split such branches:
N(h) &gt; N(p(h)) (3)
We propose selecting the dimension d* with the highest running absolute covariance between context
and reward:
d* = argmax_{d ∈ [D]} | Cov_d(h) | (4)
and splitting the region at the running mean of the arms x̄(h) in the respective dimension d*.
        </p>
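        <p>The index-based super-arm selection can be sketched as follows (the confidence term here is a generic UCB-style bonus standing in for the combined confidence; names are illustrative):</p>
        <p>
```python
import math

def node_index(mean_r, parent_mean_r, n, parent_n):
    # Best of node and parent mean reward, plus a confidence bonus that
    # shrinks as the node is played more often (generic UCB-style form).
    conf = math.sqrt((2.0 + math.log(parent_n + 1)) / max(n, 1))
    return max(mean_r + conf, parent_mean_r + conf)

def select_super_arm(arms, budget):
    """Pick the `budget` arms with the highest indices.

    `arms` maps an arm id to (mean_r, parent_mean_r, n, parent_n)."""
    ranked = sorted(arms, key=lambda a: node_index(*arms[a]), reverse=True)
    return ranked[:budget]
```
        </p>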
      </sec>
      <sec id="sec-4-3">
        <title>Algorithm 1 CoCoMaMa-Algorithm</title>
        <p>Require: budget B, split parameters κ, v1, ρ
1: Initialize: r̄_0, Var_0, N_0 = 0, P_1 = {h_{0,1}} with root node h_{0,1}
2: for t = 1, 2, . . . , T do
3: Observe available agents A_t and the task
4: Construct arm contexts x_{t,a} for each agent in A_t
5: Compute indices according to (1) for each arm
6: Select arm set S_t based on indices and budget B
7: Play arm set S_t and observe rewards r_{t,a}
8: Identify set of selected nodes
9: for each selected node h ∈ P_t do
10: Update metrics for node h
11: if ((2) and (3)) or c(h, p(h)) ≤ v1 ρ^ℓ(h) then
12: P_{t+1} ← split at x̄(h) on dimension d* (4)
13: end if
14: end for
15: end for</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.2.5. Neural-CoCoMaMa</title>
        <p>
          Instead of using r̄(h) in the calculation of the index in Equation 1, we propose predicting the
expected reward of an arm r̂(x_{t,a}) using a neural net with a single hidden layer as in Neural-MAB [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], which learns each time an outcome is observed. The index is then defined per arm as
I(x_{t,a}) := r̂(x_{t,a}) + c(h, p(h)), where h corresponds to the node the arm is in.
        </p>
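        <p>A sketch of this neural index (a small numpy network stands in for the single-hidden-layer reward predictor; sizes and learning rate are illustrative):</p>
        <p>
```python
import numpy as np

class TinyRewardNet:
    """One-hidden-layer regressor trained online with SGD."""
    def __init__(self, dim, hidden=16, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, x):
        self.h = np.tanh(x @ self.w1 + self.b1)
        return float(self.h @ self.w2 + self.b2)

    def update(self, x, reward):
        # One SGD step on squared error; learns after every observed outcome.
        err = self.predict(x) - reward
        grad_h = err * self.w2 * (1.0 - self.h ** 2)
        self.w2 -= self.lr * err * self.h
        self.b2 -= self.lr * err
        self.w1 -= self.lr * np.outer(x, grad_h)
        self.b1 -= self.lr * grad_h

def neural_index(net, x, confidence):
    # Index of an arm: predicted reward plus the confidence of its node.
    return net.predict(x) + confidence
```
        </p>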
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>
        We evaluate the outlined algorithms on two datasets. The first dataset has been introduced by Somerstep
et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and is enriched by agent cards in this work. It has a fixed size of general-purpose LLMs as
agents and highlights the ability of the algorithms to identify and exploit the better-performing agents
from a set of options with similar descriptions. For the second dataset, we add synthetic specialized
agents and derive the performance based on a mathematical definition of a task-agent fit and the
base-agent score from the first dataset [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The second experiment outlines the capabilities of the
algorithms to identify and exploit specialized agents.
      </p>
      <p>[Figure 1: Cumulative regret over arriving tasks for budgets 1–4, comparing HD-ACC-UCB, CoCoMaMa (ours), CC-MAB, Neural-CoCoMaMa (ours), Neural-MAB, and a random router.]</p>
      <sec id="sec-5-1">
        <title>5.1. Routing on SPROUT</title>
        <p>The SPROUT [19] dataset provides quality scores for answers from 13 different LLMs on over 40,000
queries from 6 benchmarks. Agent cards for each model were created based on public announcements
from the respective providers and are shown in Annex A. A random router and an oracle router, which
always greedily selects the agents with the highest true mean in each round, are used as naive baselines.
The HD-ACC-UCB, CC-MAB and Neural-MAB algorithms serve as the state-of-the-art baselines. The
experiments are conducted 10 times each with the same sequential ordering of tasks for the budgets
1,2,3,4. The decisions made by the algorithms are compared with the optimal solution made by the
oracle router. Making sub-optimal decisions yields regret and should be minimized.</p>
        <p>The plots in Figure 1 show that CC-MAB yields the same regret as the random router. CoCoMaMa
yields less cumulative regret than HD-ACC-UCB for all tested budgets. This supports our prior
hypothesis that making statistically informed splitting decisions can increase performance. Neural-CoCoMaMa
achieves even better results for all budgets, which could be attributed to a faster learning rate, as weights
on all input dimensions of the neural net can be updated after each observation, while splitting of
nodes only takes place under certain conditions. Furthermore, nodes are split on just one dimension.
Neural-CoCoMaMa matches the performance of Neural-MAB for a budget of 2 and 3, and only shows a
slightly higher regret for the other budgets.</p>
        <p>The agent selection rates in Figure 2 show that Neural-MAB does not spend significant effort
exploring all agents and instead exploits the same 3 agents. This is indicated by the selection rates
close to 1.0, where all selection rates should sum up to the budget 3.</p>
        <p>[Figure 2: Agent selection rates of HD-ACC-UCB, CoCoMaMa (ours), Neural-CoCoMaMa (ours), and Neural-MAB for the agents claude-3-5-sonnet-v1, titan-text-premier-v1, gpt-4o, gpt-4o-mini, granite-3-2b, granite-3-8b, llama-3-1-70b, llama-3-1-8b, llama-3-2-1b, llama-3-2-3b, llama-3-3-70b, llama-3-405b, and mixtral-8x7b-v01.]</p>
        <p>Always picking the same 3 agents is not a bad strategy in this case, as each of them is among the best performing agents
in over 70% of the tasks. In 1% of the tasks, Mixtral is the unique best-performing model, but it is
never selected by Neural-MAB. All other algorithms follow a design where they are actively trying to
explore cases where other models might perform better than their known best educated guesses. Our
CoCoMaMa methods spend more effort on exploration than the greedy Neural-MAB, and provide a
sharper distinction between good and bad performing agents compared to HD-ACC-UCB.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Routing on SPROUT with Specialized Experts</title>
        <p>The SPROUT dataset does not contain many queries for which a unique best-performing agent can be
identified, and the average response quality of many models is high. This will most likely not be the
case for highly specialized WebAgents. Therefore, we add synthetic specialized agents to the
SPROUT dataset.</p>
        <p>Let s denote the index of the task at which we begin adding new specialized agents. If t ≥ s, a new
agent is added every m rounds to all following sets of available agents A_t. If t &gt; g, the new agent is a
strong expert, and a weak expert otherwise. A new expert is always based on a random base agent by
copying its agent card embedding. The value at e random dimensions is set to 1 for strong experts,
and to 0.9 for weak experts, to signal their specialization in certain areas. The possible expert dimensions
are limited to 50% of the used dimensions from the embeddings. The true reward of an agent performing
a task depends 80% on the task-agent fit and 20% on the base-agent score. The task-agent fit f_{t,a} is
computed by matching the value u of the task embedding at the dimension with the highest value
with the respective value v at the same dimension of the agent card embedding, using the following
equation:
f_{t,a} = { σ(5 · u · v), if a is a specialized expert; 0, otherwise }</p>
        <p>where σ denotes the logistic function. Using 0.9 for weak experts and 1.0 for strong experts should mimic the
behavior that innovative agents based on new technologies promising full integration are introduced
and advertised, but they have flaws due to being early adopters. The second generation overcomes those
issues. The base agent score is taken from the SPROUT dataset. It is multiplied by 0.1 for specialized
agents in case the task-agent fit is below 0.6. Reducing the base agent score for specialists should
mimic behavior, where an agent is tasked to answer "I don’t know" on questions outside their domain.
Duplicated agents in  are not permitted and the generation of a specialized agent is skipped for the
respective round.</p>
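        <p>The fit and reward computation can be sketched as follows (constants follow the text above; function names are illustrative):</p>
        <p>
```python
import math
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def task_agent_fit(task_emb, agent_emb):
    # Match the task's strongest dimension against the agent card embedding
    # at the same dimension.
    d = int(np.argmax(task_emb))
    u, v = task_emb[d], agent_emb[d]
    return logistic(5.0 * u * v)

def true_reward(task_emb, agent_emb, base_score, specialized):
    # 80% task-agent fit, 20% base-agent score; specialists are penalized
    # outside their domain (fit below 0.6), mimicking "I don't know" answers.
    fit = task_agent_fit(task_emb, agent_emb)
    penalized = specialized and not fit >= 0.6
    if penalized:
        base_score = 0.1 * base_score
    return 0.8 * fit + 0.2 * base_score
```
        </p>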
        <p>The experimental results using s = 2000, m = 200, g = 6000, e = 5 for expert generation are shown
in Figure 3, displaying the average reward. For both budgets B = 1 and B = 4, the optimal average
reward achieved by an oracle router increases continuously from below 0.2 to 0.45 after the strong
experts are introduced at t = 6000. Many algorithms yield decreasing average rewards after the weak
experts are introduced at t = 2000 and do not select strong experts for tasks in their respective domains
often enough to show an increase in average reward for a budget of 1. However, as the chance of
randomly picking a good task-agent match rises with a higher budget, all algorithms except Neural-MAB show
increasing average performance after t = 6000 for the budget 4. The curves for Neural-CoCoMaMa
show that it is resilient to the introduction of weak experts and is the best algorithm at exploring and
exploiting the strong experts.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The CoCoMaMa approach shows that statistically informed splitting of nodes can yield better results
than the respective baseline HD-ACC-UCB. However, the learning rate is limited as we only split
at one dimension each time. Neural-MAB is much faster at converging on a well-performing agent.
However, it might become trapped in a local optimum and never escape it due to a lack of exploration.
Neural-CoCoMaMa combines efficient exploration with a fast learning rate. It outperforms all other
methods on SPROUT with synthetic agents. It matches the performance of the best-performing
state-of-the-art algorithm, Neural-MAB, on the basic SPROUT dataset, while offering more explainability
regarding the routing decision. Before letting an agent execute a task, the expected performance of
the agent (in all neural-based approaches) and historical insights (in CoCoMaMa-based approaches
and also HD-ACC-UCB and CC-MAB to a limited extent) can be communicated to the client. E.g., the
CoCoMaMa approaches can provide historical variance, average performance and confidence scores
for the context region of the task-agent pair. This allows operators to escalate tasks to humans before
wasting resources on an agent when expected performance is low and highly variable.</p>
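      <p>Such a pre-routing escalation check could look like this (thresholds are illustrative, not from the paper):</p>
      <p>
```python
def should_escalate(expected, variance, perf_floor=0.4, var_ceiling=0.05):
    # Escalate to a human when the expected performance for the task-agent
    # pair is low and its historical variance is high (thresholds illustrative).
    return (not expected >= perf_floor) and variance > var_ceiling
```
      </p>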
      <p>A current limitation is that playing a good task-agent match at least once is required to start learning.
Drastically decreasing the chance of finding a fit by increasing the number of agents and making the
required specializations more granular (more embedding dimensions, lower e) would require a lot of data
to train an efficient router. Providing better data in the agent cards may mitigate the problem. The agent
cards already contain an example request. This is useful for humans, but it only resembles very small
data points for a router. In the future, the agent developers could define entire context regions using
hyperrectangles and provide respective average performance, confidence, and variance. Furthermore,
different clients could share their aggregated reviews over the web. New federated learning approaches
may then be used at the router to overcome cold starts and data scarcity.</p>
      <p>
        Berners-Lee et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] stress the importance of ontologies and linked data for knowledge
representations in the Semantic Web. Ciortea et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] hint at designing structured query capabilities to search
for the matching agent in a web-scale hMAS containing billions of agents. The CoCoMaMa approach
does not make using structured queries obsolete. Instead, those approaches can go hand in hand, as
filtering using a query could drastically narrow down the action space per task, and CoCoMaMa can
then step in to decide on the best agent based on historical feedback. Future work could combine both
approaches and further improve the results in experiments similar to Section 5.2. Learning based on the
feedback could also uncover routing policies, which are not possible to identify using structured queries,
as the data in the agent card might not be that informative. This can be observed in the experiments in Section 5.1.
      </p>
      <p>
        Next, the A2A Protocol [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] does not specify how incurred costs should be communicated. Doing so
would allow the router to balance costs against performance. Using monetary budget constraints
per task, instead of budgets for the number of agents to play, would open an interesting research area
for routing in hMAS.
      </p>
      <p>
        Lastly, the algorithms may be hardened against edge cases. E.g., an agent with underlying randomness
might yield rewards with a high variance, causing the router to split often without gaining any
information. Variance-aware UCB algorithms [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ] are an active research domain and may also
be incorporated to further improve the CoCoMaMa methods. Adversarial providers of agents might
attempt to craft their agent cards to benefit from high expected rewards in already well-established
domains, to generate traffic for their mediocre services, or to badmouth the competition. A good router should
be robust enough to recover from such attacks. Transferring insights from trust scores and digital
signatures to agent routing in hMAS is also open future work.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>
        In this work, we introduced CoCoMaMa and Neural-CoCoMaMa, two contextual combinatorial volatile
multi-armed bandit approaches tailored for the dynamic and heterogeneous landscape of agentic LLMs.
By leveraging task-agent similarities and online feedback, our methods address key limitations in
existing routing strategies, particularly their reliance on static model pools and offline training data.
Our approach is compatible with the A2A Protocol [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and accommodates agent volatility through
standardized agent cards, making it a promising fit for scalable, decentralized hMAS.
      </p>
      <p>
        Experimental results on the SPROUT [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] dataset demonstrated performance equal to the best-performing
state-of-the-art method, while improving the explainability of the routing decisions.
Neural-CoCoMaMa is the only method capable of exploring and exploiting strong niche experts without
suffering as much from the introduction of weak experts to the pool of available agents in our second
experimental dataset.
      </p>
      <p>Importantly, our approach complements rather than replaces structured search methods.
Combining CoCoMaMa with query-based filters could significantly reduce the action space, enabling
efficient feedback-driven selection within semantically scoped agent sets. Additionally, integrating
cost-awareness, variance-sensitive strategies, and trust mechanisms will be essential for deploying
robust routing in open, adversarial, or resource-constrained environments.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID
414984028 – SFB 1404 FONDA.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to: Grammar
and spelling check, paraphrase and reword. Further, the authors used perplexity.ai for agent cards in
Annex A in order to: Drafting content. After using these tools/services, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Agent Cards</title>
      <p>
        Three agent cards are shown below. They were built to resemble the example agent card given in the
A2A specification [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. They were not optimized for automated
tool calling [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and may lack the information required to make technically correct calls. All agent cards are
included as JSON files in the uploaded supplementary material and will be made publicly accessible
after acceptance.
      </p>
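<p>To hint at how such a card could feed a router's context, the helper below flattens a card into one string suitable for a text embedder. It is an illustrative sketch under the assumption that only name, description, and skill fields matter; it is not part of the paper's implementation.

```python
def card_to_text(card: dict) -> str:
    """Flatten an A2A-style agent card into a single string for embedding.

    Field names (name, description, skills, tags, examples) follow the
    agent card schema used in the listings; this helper is a sketch only.
    """
    parts = [card.get("name", ""), card.get("description", "")]
    for skill in card.get("skills", []):
        parts.append(skill.get("name", ""))
        parts.append(skill.get("description", ""))
        parts.extend(skill.get("tags", []))
        parts.extend(skill.get("examples", []))
    # Drop empty fields and join with spaces.
    return " ".join(p for p in parts if p)
```
</p>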
      <p>Listing 1: Claude 3.5 Sonnet Agent Card (excerpt)
{
  ...
  "version": "3.5",
  "documentationUrl": "https://docs.aws.amazon.com/bedrock/latest/userguide/modelparameters-claude.html",
  "capabilities": {
    ...
  },
  ...
}</p>
      <p>Listing 2: GPT-4o Agent Card (excerpt)
{
  "name": "GPT-4o",
  "description": "OpenAI's versatile, high-intelligence flagship model that accepts both text and image inputs with a 128K context window",
  "url": "https://platform.openai.com/docs/models/gpt-4o",
  "provider": {
    "organization": "OpenAI",
    "url": "https://openai.com"
  },
  "version": "2024-11-20",
  "documentationUrl": "https://platform.openai.com/docs/models/gpt-4o",
  "capabilities": {
    "streaming": true,
    "pushNotifications": false,
    "stateTransitionHistory": false
  },
  "defaultInputModes": ["text/plain", "image/png", "image/jpeg"],
  "defaultOutputModes": ["text/plain", "application/json"],
  "skills": [
    {
      ...
    },
    {
      "id": "image-understanding",
      "name": "Image Understanding",
      "description": "Analyze and interpret images to provide relevant information and insights",
      "tags": ["vision", "image-analysis", "multimodal"],
      "examples": [
        "What's in this image?",
        "Describe what you see in this chart",
        "Help me understand what this diagram is showing"
      ],
      "outputModes": ["application/json", "text/plain"]
    },
    {
      "id": "reasoning",
      "name": "Complex Reasoning",
      "description": "Handle complex problem-solving tasks requiring multi-step reasoning",
      "tags": ["reasoning", "problem-solving", "analysis"],
      "examples": [
        "Solve this multi-step math problem",
        "Help me debug this programming issue",
        "Analyze the logical fallacies in this argument"
      ]
    }
  ]
}</p>
      <p>Listing 3: GPT-4o mini Agent Card (excerpt)
{
  "name": "GPT-4o mini",
  ...
  "authentication": {
    "schemes": ["Bearer"]
  },
  "defaultInputModes": ["text/plain", "image/png", "image/jpeg"],
  "defaultOutputModes": ["text/plain", "application/json"],
  "skills": [
    {
      "id": "text-generation",
      "name": "Text Generation",
      "description": "Generate coherent and contextually relevant text based on input prompts",
      "tags": ["text-generation", "conversation", "content-creation"],
      "examples": [
        "Write a short story about time travel",
        "Draft an email to a colleague about project updates",
        "Create a product description for an e-commerce site"
      ]
    },
    {
      "id": "image-understanding",
      "name": "Image Understanding",
      "description": "Analyze and interpret images to provide relevant information",
      "tags": ["vision", "image-analysis", "multimodal"],
      "examples": [
        "What objects are in this image?",
        "Describe what you see in this photo",
        "What text is shown in this screenshot?"
      ]
    }
  ]
}</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shafran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>ReAct: Synergizing reasoning and acting in language models</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K. S.</given-names>
            <surname>Yau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Metagpt: Meta programming for multi-agent collaborative framework</article-title>
          ,
          <source>arXiv preprint arXiv:2308.00352</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. La</given-names>
            <surname>Malfa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shadbolt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wooldridge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <article-title>A scalable communication protocol for networks of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2410.11905</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Treude</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <article-title>Llm-based multi-agent systems for software engineering: Literature review, vision and the road ahead</article-title>
          ,
          <source>ACM Transactions on Software Engineering and Methodology</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xie</surname>
          </string-name>
          , et al.,
          <article-title>Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents</article-title>
          ,
          <source>arXiv preprint arXiv:2308.10848</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Foerster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <article-title>The ai scientist: Towards fully automated open-ended scientific discovery</article-title>
          ,
          <source>arXiv preprint arXiv:2408.06292</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lassila</surname>
          </string-name>
          ,
          <article-title>The semantic web</article-title>
          ,
          <source>Scientific American</source>
          <volume>284</volume>
          (
          <year>2001</year>
          )
          <fpage>34</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mirhoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Maziarz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Outrageously large neural networks: The sparsely-gated mixture-of-experts layer</article-title>
          ,
          <source>arXiv preprint arXiv:1701.06538</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chang</surname>
          </string-name>
          , et al.,
          <article-title>A survey of ai agent protocols</article-title>
          ,
          <source>arXiv preprint arXiv:2504.16736</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Google</surname>
          </string-name>
          , A2A: Agent2Agent Protocol, https://github.com/google/A2A,
          <year>2025</year>
          . Accessed: 2025-04-21.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shiwei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Qing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>Tptu-v2: Boosting task planning and tool usage of large language model-based agents in real-world industry systems</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>371</fpage>
          -
          <lpage>385</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          , Gorilla:
          <article-title>Large language model connected with massive apis</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>37</volume>
          (
          <year>2024</year>
          )
          <fpage>126544</fpage>
          -
          <lpage>126565</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ciortea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Boissier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ricci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <article-title>A decade in hindsight: the missing bridge between multi-agent systems and the world wide web</article-title>
          ,
          <source>in: AAMAS 2019-18th International Conference on Autonomous Agents and Multiagent Systems</source>
          ,
          <year>2019</year>
          , p.
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Robbins</surname>
          </string-name>
          ,
          <article-title>Asymptotically efficient adaptive allocation rules</article-title>
          ,
          <source>Advances in applied mathematics 6</source>
          (
          <year>1985</year>
          )
          <fpage>4</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pál</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pál</surname>
          </string-name>
          ,
          <article-title>Contextual multi-armed bandits</article-title>
          ,
          <source>in: Proceedings of the Thirteenth international conference on Artificial Intelligence and Statistics</source>
          , JMLR Workshop and Conference Proceedings,
          <year>2010</year>
          , pp.
          <fpage>485</fpage>
          -
          <lpage>492</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <article-title>Combinatorial multi-armed bandit: General framework and applications</article-title>
          , in: International conference on machine learning,
          <source>PMLR</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Niculescu-Mizil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Regret bounds for sleeping experts and bandits</article-title>
          ,
          <source>Machine learning 80</source>
          (
          <year>2010</year>
          )
          <fpage>245</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bnaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Puzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Felner</surname>
          </string-name>
          ,
          <article-title>Volatile multi-armed bandits for guaranteed targeted social crawling</article-title>
          ,
          <source>AAAI (Late-Breaking Developments)</source>
          <volume>2</volume>
          (
          <year>2013</year>
          )
          <fpage>16</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Somerstep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Polo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. M.</given-names>
            <surname>de Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mangal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bhardwaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yurochkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Maity</surname>
          </string-name>
          ,
          <article-title>Carrot: A cost aware rate optimal router</article-title>
          ,
          <source>arXiv preprint arXiv:2502.03261</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Harnessing multiple large language models: A survey on llm ensemble</article-title>
          ,
          <source>arXiv preprint arXiv:2502.18036</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shnitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Soule</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Solomon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yurochkin</surname>
          </string-name>
          ,
          <article-title>Llm routing with benchmark datasets</article-title>
          ,
          <source>in: NeurIPS 2023 Workshop on Distribution Shifts: New Frontiers with Foundation Models</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Tung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Kofman</surname>
          </string-name>
          , RoRF - Open Source LLM Router, https://www.notdiamond.ai/blog/rorf,
          <year>2024</year>
          . Accessed: 2025-03-25.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kwok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Routerdc:
          <article-title>Query-based router by dual contrastive learning for assembling large language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>37</volume>
          (
          <year>2024</year>
          )
          <fpage>66305</fpage>
          -
          <lpage>66328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Q. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bieker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Keigwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ranganath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Upadhyay</surname>
          </string-name>
          ,
          <article-title>Routerbench: A benchmark for multi-llm routing system</article-title>
          ,
          <source>arXiv preprint arXiv:2403.12031</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Llm bandit: Cost-efficient llm generation via preference-conditioned dynamic routing</article-title>
          ,
          <source>arXiv preprint arXiv:2502.02743</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Mixllm: Dynamic routing in mixed large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2502.18482</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Kadous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Routellm: Learning to route llms from preference data</article-title>
          ,
          <source>in: The Thirteenth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Routing to the expert: Efficient reward-guided ensemble of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2311.08692</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sikeridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramdass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pareek</surname>
          </string-name>
          ,
          <article-title>Pickllm: Context-aware rl-assisted large language model routing</article-title>
          ,
          <source>arXiv preprint arXiv:2412.12170</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          , et al.,
          <article-title>Judging llm-as-a-judge with mt-bench and chatbot arena</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>46595</fpage>
          -
          <lpage>46623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Szymanski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ziems</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Eicher-Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Metoyer</surname>
          </string-name>
          ,
          <article-title>Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks</article-title>
          ,
          <source>in: Proceedings of the 30th International Conference on Intelligent User Interfaces</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>952</fpage>
          -
          <lpage>966</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Kabb: Knowledge-aware bayesian bandits for dynamic expert coordination in multi-agent systems</article-title>
          ,
          <source>arXiv preprint arXiv:2502.07350</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Convergence-aware online model selection with time-increasing bandits</article-title>
          ,
          <source>in: The Web Conference 2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoveyda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Oosterhuis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          ,
          <article-title>Aqa: Adaptive question answering in a society of llms via contextual multi-armed bandit</article-title>
          ,
          <source>arXiv preprint arXiv:2409.13447</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jitkrittum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Language model cascades: Token-level uncertainty and beyond</article-title>
          ,
          <source>arXiv preprint arXiv:2404.10136</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>Large language model cascades with mixture of thoughts representations for cost-efficient reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2310.03094</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Less is more: Using multiple llms for applications with lower costs</article-title>
          ,
          <source>in: Workshop on Efficient Systems for Foundation Models @ ICML 2023</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <article-title>More agents is all you need</article-title>
          ,
          <source>arXiv preprint arXiv:2402.05120</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Urg: A unified ranking and generation method for ensembling language models</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>4421</fpage>
          -
          <lpage>4434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          [41]
          OpenAI,
          <article-title>text-embedding-3 models</article-title>
          , https://platform.openai.com/docs/guides/embeddings,
          <year>2024</year>
          . Accessed: 2025-05-20.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Contextual combinatorial multi-armed bandits with volatile arms and submodular reward</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Elahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tekin</surname>
          </string-name>
          ,
          <article-title>Contextual combinatorial volatile multi-armed bandit with adaptive discretization</article-title>
          ,
          <source>in: International Conference on Artificial Intelligence and Statistics</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>1486</fpage>
          -
          <lpage>1496</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Y.</given-names>
            <surname>Noh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Joe-Wong</surname>
          </string-name>
          ,
          <article-title>A neural-based bandit approach to mobile crowdsourcing</article-title>
          ,
          <source>in: Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Efanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Ivliev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Shagraev</surname>
          </string-name>
          ,
          <article-title>Welford's algorithm for weighted statistics</article-title>
          ,
          <source>in: 2021 3rd International Youth Conference on Radio Electronics, Electrical and Power Engineering (REEPE)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>On the precise asymptotics and refined regret of the variance-aware ucb algorithm</article-title>
          ,
          <source>arXiv preprint arXiv:2412.08843</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>