Multimodal Agents, Modality Fusion & World Models
In the past, AI systems were confined by their singular focus on one type of data. A natural language processing (NLP) model could handle text, but it was blind to images. A computer vision system could identify objects in a photo but couldn’t interpret written descriptions. These siloed systems lacked the depth that comes from viewing information through multiple lenses.
Multimodal agents break down these barriers, bringing together different “sensory” inputs into a unified framework. This allows them to analyse a video clip while grasping the context of the accompanying text, or to translate insights across images, speech, and written words.
At this point, this is hardly news: OpenAI’s ChatGPT has had multimodal capabilities for the best part of a year, synthesising images, audio and text into one user experience.
What is Modality Fusion?
But with multimodal agents now able to process different types of data, a new step for AI capabilities is emerging: modality fusion, a process that doesn’t treat each modality as a separate input but instead brings them together into a unified, multi-dimensional framework. Unlike typical multimodal approaches that combine data streams only at the output level, modality fusion uses mechanisms such as cross-attention and self-attention to merge the modalities into a shared embedding space at multiple stages of the model.
In practice, this means each type of data contributes its own perspective, enabling AI to go beyond isolated, single-modality predictions and capture complex patterns, relationships, and preferences with greater accuracy. This unified view is what makes the resulting predictions more context-aware and precise.
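To make the idea concrete, here is a minimal PyTorch sketch of cross-attention-based fusion. It is not any production system’s implementation: the encoder output sizes, embedding dimension and head count are illustrative assumptions, and in a real model the inputs would come from pretrained text and image encoders.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy fusion block: project two modalities into a shared embedding
    space, then fuse them with cross-attention followed by self-attention."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.text_proj = nn.Linear(768, dim)   # assumed text-encoder output size
        self.image_proj = nn.Linear(512, dim)  # assumed image-encoder output size
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        t = self.text_proj(text_feats)    # (batch, text_tokens, dim)
        v = self.image_proj(image_feats)  # (batch, image_patches, dim)
        # Cross-attention: each text token attends over the image patches.
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        # Self-attention refines the joint, fused representation.
        fused, _ = self.self_attn(fused, fused, fused)
        return fused                      # unified multimodal embedding

# Random tensors stand in for real encoder outputs.
text = torch.randn(2, 16, 768)
image = torch.randn(2, 49, 512)
print(CrossModalFusion()(text, image).shape)  # torch.Size([2, 16, 256])
```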
The Walmart Case Study
One recent example of this in action is how Walmart’s advanced recommendation system leverages a Triple Modality Fusion (TMF) model that combines visual, textual, and graph data. By integrating these different types of data within a Large Language Model (LLM), Walmart’s system gains a layered, comprehensive view of user preferences, enabling it to make recommendations with a level of precision that traditional models can’t match.
To gauge the effectiveness of this approach, Walmart measured how often the top recommendation aligned with users’ true preferences, a metric known as HitRate@1. For the uninitiated, a higher HitRate@1 score means that the system’s first suggestion — the top recommendation — is more likely to be exactly what the user wants. TMF achieved up to a 38% improvement in this metric over other advanced recommendation models when tested on product categories like Electronics, Pets, and Sports, meaning users were finding Walmart’s top recommendation relevant more often.
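As a quick illustration (with made-up item names, not Walmart data), HitRate@1 is simply the fraction of test cases where the single top recommendation matches the item the user actually went on to choose:

```python
def hit_rate_at_1(top_recommendations, chosen_items):
    """Share of users whose top-1 recommendation equals the item they actually chose."""
    hits = sum(1 for rec, chosen in zip(top_recommendations, chosen_items)
               if rec == chosen)
    return hits / len(chosen_items)

# Toy example: 3 of the 4 top recommendations were on target, so HitRate@1 = 0.75.
print(hit_rate_at_1(["tent", "dog leash", "dumbbells", "kayak"],
                    ["tent", "dog leash", "dumbbells", "paddle"]))
```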
This multi-dimensional insight enables Walmart’s AI to adapt to complex user behaviour patterns. For example, if a user explores outdoor gear, TMF might recommend complementary items like camping accessories or fitness equipment. By analysing the full range of visual appeal, descriptive text, and patterns in user interactions, Walmart’s system provides recommendations that feel personalised, timely, and more likely to satisfy diverse customer needs.
Studies show that multimodal models excel in capturing context, which is critical in settings where user needs are dynamic or layered. Research in areas such as medical imaging, where models combine imaging data with patient histories, confirms that integrating data types can improve diagnostic accuracy. Similarly, multimodal applications in autonomous driving, where vision and sensor data are combined, are helping vehicles better interpret and respond to their surroundings. Walmart’s success with TMF echoes this pattern, underscoring how a multi-dimensional approach can lead to significant improvements in model accuracy and responsiveness across diverse applications.
The Role of Modality Fusion in World Models
Modality fusion aligns naturally with the growing concept of world models in AI. World models are systems that can simulate, understand, and predict complex environments by integrating multiple data types. Inspired by how human brains simulate the world, world models aim to enable AI systems to make informed decisions and predictions based on an internally generated representation of reality. This idea stems from reinforcement learning and robotic control research, where agents learn within simulated environments to reduce the high costs and limitations of real-world training.
World models aim to bridge the gap between current AI capabilities and artificial general intelligence (the elusive AGI) by facilitating counterfactual reasoning: an AI’s ability to infer the outcomes of hypothetical scenarios. This form of reasoning, natural to humans but beyond the reach of most current AI systems, allows an AI to make predictions and adjust its actions even in unprecedented situations.
Multimodal Models and JEPA
Could Multimodal Large Models (MLMs) be the missing link? By combining diverse data types, researchers hope MLMs can bring AI closer to a form of AGI, enabling it to reason and make decisions with a human-like understanding of context.
Researchers are working on different approaches to building world models, including the use of large multimodal datasets or smaller datasets enriched with hierarchical planning and rule-based reasoning. To do this, they are constantly refining multimodal systems through techniques like Multimodal Chain of Thought (M-COT) and Multimodal Instruction Tuning (M-IT). M-COT enhances AI’s ability to reason sequentially across data types, while M-IT fine-tunes its ability to learn from multimodal inputs, enhancing both reasoning and adaptability.
These approaches aim to capture dynamic environmental changes and allow models to simulate various scenarios. However, the development of fully functional world models remains a challenging endeavour, as current systems still require substantial advancements in data handling, computational efficiency, and reasoning abilities to achieve the nuanced understanding seen in human cognition.
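For a flavour of what Multimodal Chain of Thought looks like in practice, here is a deliberately generic sketch of the common two-stage recipe (rationale generation, then answer inference). The `multimodal_model(text=..., image=...)` callable is a hypothetical stand-in for whatever multimodal LLM is being used, not a real library API.

```python
def multimodal_cot(question: str, image_path: str, multimodal_model) -> str:
    """Two-stage M-COT sketch: reason across modalities first, answer second."""
    # Stage 1: ask the model to reason over both the image and the question.
    rationale = multimodal_model(
        text=f"{question}\nDescribe the relevant visual evidence and reason step by step.",
        image=image_path,
    )
    # Stage 2: condition the final answer on the generated rationale.
    return multimodal_model(
        text=f"{question}\nRationale: {rationale}\nGive only the final answer.",
        image=image_path,
    )
```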
One promising approach addressing these challenges is JEPA (Joint Embedding Predictive Architecture), which learns to predict in an abstract embedding space rather than in raw pixels or tokens, and which introduces hierarchical planning. Inspired by human cognition, JEPA organises tasks into layers of subtasks, enabling AI to make predictions and decisions dynamically across high-level goals and detailed actions. By adopting this structured planning, future world models could improve their ability to simulate and respond to complex, evolving scenarios.
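To illustrate just the core prediction-in-embedding-space idea (leaving out the hierarchy, the masking strategy and the usual moving-average update of the target encoder), here is a deliberately tiny sketch; every layer size and dimension is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyJEPA(nn.Module):
    """Minimal JEPA-style setup: predict the *embedding* of a target from the
    embedding of its context, rather than reconstructing the raw input."""
    def __init__(self, input_dim: int = 128, latent_dim: int = 64):
        super().__init__()
        self.context_encoder = nn.Sequential(
            nn.Linear(input_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim))
        self.target_encoder = nn.Sequential(
            nn.Linear(input_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim))
        self.predictor = nn.Linear(latent_dim, latent_dim)

    def loss(self, context, target):
        z_context = self.context_encoder(context)
        with torch.no_grad():              # target embeddings act as a fixed training signal
            z_target = self.target_encoder(target)
        z_pred = self.predictor(z_context)
        # The prediction error lives entirely in latent space.
        return F.mse_loss(z_pred, z_target)

# Random vectors stand in for the context and the masked target region.
model = TinyJEPA()
context, target = torch.randn(8, 128), torch.randn(8, 128)
print(model.loss(context, target).item())
```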
That said, Walmart’s TMF model, and modality fusion more generally, still provide a tangible step toward this broader vision of AI systems that can understand and operate within complex, real-world environments.
The Great Unhobbling
While world models are still some way off, in the immediate future AI’s potential won’t be unlocked by creating ever more powerful models but by creatively “unhobbling” the ones we already have. Today, as proprietary and open-source models converge in foundational capabilities and become increasingly multimodal by default, the question isn’t which model to use, but how we use it. Techniques like modality fusion and chain-of-thought prompting (as seen in OpenAI’s o1) will be central to this shift, where innovation is driven by how effectively we leverage existing model capabilities rather than by pushing computational limits.