A comprehensive exploration of the multimodal large language models of 2023, which integrated vision and language capabilities and enabled AI systems to process images and text together. Learn how GPT-4 and other models of that year combined vision encoders with language models to support scientific research, education, accessibility, and creative applications.

This article is part of the free-to-read History of Language AI book.
2023: Multimodal Large Language Models
By 2023, the field of artificial intelligence had reached a pivotal moment in its evolution toward more comprehensive and capable systems. Large language models had demonstrated remarkable proficiency in understanding and generating text, while vision models had achieved impressive results in image recognition and understanding. Yet these capabilities existed in separate domains, each excelling in its own modality but unable to seamlessly combine visual and textual understanding in the way humans naturally process information. This limitation became particularly apparent as researchers and practitioners sought to build AI systems that could handle real-world tasks requiring simultaneous understanding of text, images, and other modalities. The integration of vision and language capabilities into unified multimodal large language models emerged as one of the most significant developments of 2023, fundamentally expanding the scope of what AI systems could accomplish.
The year 2023 witnessed a convergence of technical advances that made multimodal large language models feasible at scale. The success of models like GPT-3 and its successors had demonstrated that large language models could serve as powerful reasoning engines when trained on massive text corpora. Simultaneously, vision transformers and models like CLIP had shown how to effectively encode visual information into representations that captured semantic content. What remained was the challenge of combining these capabilities in a way that preserved the strengths of each modality while enabling new capabilities that emerged from their interaction. The breakthrough came from several directions: architectural innovations that enabled effective cross-modal fusion, training methodologies that leveraged vast amounts of image-text data, and scaling laws that suggested multimodal systems could benefit from the same principles that had driven language model success.
OpenAI's GPT-4, released in March 2023, represented a watershed moment in this evolution. While previous GPT models had been exclusively text-based, GPT-4 introduced sophisticated vision capabilities that allowed it to process and understand images alongside text. The model could analyze charts and graphs, describe photographs, answer questions about diagrams, and even read text from images. This capability emerged from careful architectural design that integrated a vision encoder with the language model, enabling GPT-4 to reason about visual content using the same powerful language understanding capabilities it possessed for text. The integration was not merely additive. The ability to process images and text together enabled GPT-4 to perform tasks that would have been impossible with either modality alone, such as explaining the content of a scientific diagram or analyzing a complex infographic.
Beyond GPT-4, 2023 saw the emergence of numerous other multimodal models that explored different approaches to combining vision and language. Google's PaLM-E demonstrated how visual and robotic sensor inputs could be folded into a large language model, and Gemini, announced at the end of the year, was designed to be natively multimodal from the start. Anthropic's Claude models, while initially text-only, laid important groundwork for the multimodal integration that would follow. Microsoft's Kosmos-1 and related work explored how to combine visual encoders with language models while maintaining efficiency. These developments were not isolated technical achievements but represented a broader shift toward AI systems that could engage with the rich, multimodal nature of human communication and understanding.
The broader significance of multimodal large language models extended far beyond their technical capabilities. These systems demonstrated that AI could move beyond narrow single-modality tasks toward more general understanding that mirrored human intelligence. The ability to simultaneously process visual and textual information opened new possibilities for applications ranging from scientific research and education to content creation and accessibility. Perhaps most importantly, multimodal models suggested a path toward artificial general intelligence that was more aligned with how humans actually experience and understand the world, through the integrated processing of multiple sensory modalities rather than isolated channels.
The Problem
The fundamental challenge facing researchers in 2023 was how to bridge the gap between two remarkably capable but fundamentally separate AI capabilities. On one side, large language models had achieved extraordinary proficiency in understanding and generating text. Models like GPT-3, GPT-3.5, and their successors could engage in sophisticated reasoning, answer complex questions, write creative content, and perform diverse language tasks. Yet these models were blind to visual information. They could process descriptions of images but could not directly perceive or understand visual content. On the other side, vision models had made significant advances in understanding images. Systems like CLIP could relate images to text descriptions, while vision transformers could recognize objects, scenes, and relationships in images with high accuracy. Yet these vision models lacked the sophisticated reasoning capabilities and broad knowledge that language models possessed.
This separation between vision and language capabilities created significant limitations for practical applications. Consider a user who wants to understand a scientific paper that includes complex diagrams, charts, and visualizations. A language model could process the text but would miss crucial information conveyed in the visual elements. A vision model could describe what it sees in an image but might struggle with the nuanced reasoning required to explain how the diagram relates to the paper's arguments. Neither system alone could provide the integrated understanding that the task requires. Similarly, educational applications needed AI tutors that could explain both textual content and accompanying visual aids. Content creation tools needed systems that could understand both images and text to generate coherent multimodal content. Accessibility applications required AI that could describe images for visually impaired users while maintaining sophisticated language understanding.
The problem was not merely technical but also conceptual. Traditional approaches to multimodal AI often treated vision and language as separate pipelines that were combined late in processing, resulting in shallow integration. Systems might encode images and text separately and then combine their representations, but this approach missed the rich interactions that occur when humans process multimodal information. Human understanding of an image accompanied by text involves complex bidirectional interactions: the text guides attention to relevant parts of the image, while the image provides context that disambiguates and enriches the text. Capturing these interactions required architectural and training innovations that went beyond simply concatenating vision and language features.
Previous attempts at multimodal integration had demonstrated both promise and limitations. CLIP had shown how contrastive learning could align vision and language representations in a shared space, enabling powerful applications like zero-shot image classification. Flamingo had demonstrated few-shot learning across vision-language tasks using gated cross-attention mechanisms. Yet these systems still had limitations in their reasoning capabilities and scope of understanding. CLIP excelled at relating images to text but lacked the deep reasoning abilities of large language models. Flamingo showed impressive few-shot learning but was constrained by its architecture and training approach. The challenge in 2023 was to combine the sophisticated reasoning and broad knowledge of large language models with the visual understanding capabilities of vision models in a way that preserved the strengths of both.
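For a sense of what Flamingo-style gated cross-attention looks like mechanically, the sketch below shows a simplified block in which text hidden states attend to visual features and a learned tanh gate, initialized at zero, controls how much visual signal is mixed in. This is an illustrative PyTorch approximation, not Flamingo's actual implementation; module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Simplified Flamingo-style block: text hidden states attend to visual features,
    and a tanh gate (initialized at zero) controls how much visual signal is added."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Gate starts at zero, so at initialization the pretrained language model is unchanged.
        self.attn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model); visual_feats: (batch, num_visual, d_model)
        attended, _ = self.cross_attn(
            query=self.norm(text_hidden), key=visual_feats, value=visual_feats
        )
        # Gated residual connection: visual information is blended in gradually during training.
        return text_hidden + torch.tanh(self.attn_gate) * attended


# Toy usage: 4 text tokens attending over 16 visual features.
block = GatedCrossAttentionBlock()
out = block(torch.randn(1, 4, 512), torch.randn(1, 16, 512))  # shape: (1, 4, 512)
```

Because the gate starts closed, the frozen language model behaves exactly as before training begins, and visual conditioning is learned without disrupting its existing capabilities.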
The scale of the challenge was substantial. Training multimodal systems required massive amounts of diverse image-text pairs, sophisticated architectures that could handle both modalities effectively, and computational resources that exceeded what was available to most researchers. The data requirements were particularly daunting: while text data existed in abundance on the internet, high-quality aligned image-text data was more limited and required careful curation. Architectural challenges included designing components that could process images efficiently while maintaining compatibility with language model architectures, handling variable-length visual inputs, and enabling effective cross-modal attention mechanisms. These technical barriers, combined with the conceptual challenge of achieving deep multimodal integration, made the development of truly capable multimodal large language models a significant undertaking.
The Solution: Multimodal Architecture and Training
The solution to the multimodal integration challenge involved careful architectural design, innovative training methodologies, and the strategic application of scaling principles that had proven successful in language model development. The key insight was that effective multimodal integration required going beyond simple feature concatenation to create architectures that enabled deep bidirectional interaction between vision and language modalities while preserving the powerful reasoning capabilities of large language models.
The architectural approach typically involved several key components working together. A vision encoder, often based on vision transformer architectures or convolutional neural networks, processed input images and converted them into sequences of visual tokens that could be understood by the language model. These visual tokens were embedded into the same representation space as text tokens, allowing the language model's attention mechanisms to process both modalities together. The integration was not merely about encoding images as text-like tokens but about designing embeddings and attention mechanisms that enabled the language model to reason about visual content using its existing powerful capabilities.
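To make this pipeline concrete, here is a minimal sketch in PyTorch: an image is split into patches, encoded, and projected into the language model's embedding dimension so the resulting visual tokens can be concatenated with text embeddings and processed by a single transformer. The specific components and dimensions (a shallow ViT-style encoder, a single linear projection, 4096-dimensional LM embeddings) are illustrative assumptions, not any particular model's published design.

```python
import torch
import torch.nn as nn

class VisionToLanguageAdapter(nn.Module):
    """Turns an image into a sequence of 'visual tokens' in the language model's
    embedding space, so they can be concatenated with text token embeddings."""

    def __init__(self, image_size=224, patch_size=14, vision_dim=1024, lm_dim=4096):
        super().__init__()
        # Patchify the image with a strided convolution (ViT-style patch embedding).
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=patch_size, stride=patch_size)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vision_dim, nhead=8, batch_first=True),
            num_layers=2,  # a real vision encoder would be far deeper
        )
        # A linear projection bridges the vision space and the LM embedding space.
        self.project = nn.Linear(vision_dim, lm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) -> patches: (batch, num_patches, vision_dim)
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        features = self.encoder(patches)
        return self.project(features)  # (batch, num_patches, lm_dim)


adapter = VisionToLanguageAdapter()
visual_tokens = adapter(torch.randn(1, 3, 224, 224))           # (1, 256, 4096)
text_embeds = torch.randn(1, 32, 4096)                          # stand-in for text token embeddings
multimodal_input = torch.cat([visual_tokens, text_embeds], dim=1)  # fed to the language model
```

Production systems use much larger pretrained vision encoders and more elaborate bridging modules, but the interface, a sequence of embeddings living in the language model's own space, is the same basic idea.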
GPT-4's multimodal architecture exemplified this approach. The model used a vision encoder to process images and convert them into a sequence of visual embeddings. These embeddings were then integrated into the language model's input sequence alongside text tokens. The language model's transformer architecture, with its attention mechanisms, could then process both visual and textual information together. When processing a prompt that included an image, GPT-4's attention layers could attend to relevant parts of the image while processing the text, enabling it to answer questions about images, describe visual content, and reason about the relationship between visual and textual information. This integration allowed GPT-4 to leverage its sophisticated language understanding capabilities while processing visual inputs.
The training process for multimodal large language models required careful orchestration of diverse data sources. Unlike pure language models that could be trained on vast text corpora scraped from the internet, multimodal models needed aligned image-text pairs. These pairs came from various sources: captioned images from the internet, scientific papers with figures and diagrams, books with illustrations, websites with images and text, and curated datasets of image-text pairs. The challenge was ensuring sufficient diversity and quality while maintaining alignment between visual and textual content. Training often involved techniques like contrastive learning to ensure that related images and text were close in representation space, supervised learning on specific tasks to improve performance, and scaling up data and model size following principles similar to those that had driven language model success.
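To illustrate the contrastive component of this training recipe, the sketch below implements a CLIP-style symmetric InfoNCE loss over a batch of aligned image-text pairs: matching pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. The embeddings here are random placeholders and the temperature value is an assumption; only the structure of the loss is the point.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE loss over a batch of aligned image-text pairs.
    image_embeds, text_embeds: (batch, dim), where row i of each is a matching pair."""
    # Normalize so that the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix: logits[i, j] compares image i with text j.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0))  # the diagonal holds the true pairs

    # Symmetric cross-entropy: pick the right text for each image, and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Toy batch of 8 aligned pairs in a 512-dimensional shared space.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```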
One critical aspect of training multimodal models was handling the different information densities and processing requirements of images and text. Images contained rich spatial information that required careful encoding, while text had sequential structure that language models were designed to handle. The architecture needed to balance these different requirements: processing images efficiently without losing important details, while ensuring that visual information integrated smoothly with textual reasoning. This often involved using vision encoders that could extract relevant visual features efficiently and designing interfaces between vision encoders and language models that enabled effective information flow.
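One way to see the information-density tradeoff is to count visual tokens: a patch-based encoder contributes roughly (image size / patch size) squared tokens per image, so the visual token budget grows quadratically with image side length, which is why high-resolution inputs are often downsampled or tiled. The numbers below are purely illustrative, not any specific model's configuration.

```python
def visual_token_count(image_size: int, patch_size: int) -> int:
    """Number of visual tokens a square image contributes when split into square patches."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2

# Doubling the resolution quadruples the visual token budget.
for size in (224, 448, 896):
    print(size, visual_token_count(size, patch_size=14))
# 224 -> 256 tokens, 448 -> 1024 tokens, 896 -> 4096 tokens
```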
The scaling approach applied to multimodal models followed similar principles to language model scaling but with important adaptations. While language models benefited primarily from scaling model size and training data, multimodal models required careful consideration of the relative amounts of visual and textual data, the balance between vision encoder capacity and language model capacity, and the optimal strategies for jointly training both components. Some approaches froze the vision encoder after initial training, focusing computational resources on training the language model to effectively use visual features. Other approaches jointly trained vision and language components, requiring significantly more computational resources but potentially enabling better integration.
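As a rough sketch of the "frozen encoder" strategy described above: the vision encoder's weights are kept fixed, and only the projection layer and language model receive gradient updates. The small stand-in modules below are assumptions for illustration, not a specific training recipe.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for a pretrained vision encoder, a projection layer, and a language model.
vision_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True), num_layers=2
)
projection = nn.Linear(1024, 4096)  # maps vision features into the LM embedding space
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=4096, nhead=8, batch_first=True), num_layers=2
)

# Freeze the vision encoder; train only the bridge and the language model.
for param in vision_encoder.parameters():
    param.requires_grad = False

trainable = [p for p in list(projection.parameters()) + list(language_model.parameters())
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

frozen = sum(p.numel() for p in vision_encoder.parameters())
print(f"frozen vision parameters: {frozen:,}; trainable parameters: "
      f"{sum(p.numel() for p in trainable):,}")
```

Joint training would simply leave the vision encoder's parameters trainable as well, at the cost of substantially more compute and a greater risk of degrading the pretrained visual representations.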
The solution also involved innovations in how models processed and reasoned about multimodal information. Rather than treating images and text as separate inputs that were combined, effective multimodal models learned to process them as integrated inputs to a unified reasoning system. When answering questions about an image, the model could attend to relevant parts of the image while processing the question text, enabling it to reason about visual content using its language understanding capabilities. This integration enabled capabilities like explaining scientific diagrams, analyzing charts, describing photographs in detail, and answering complex questions that required understanding both visual and textual information.
Applications and Impact
The emergence of multimodal large language models in 2023 enabled a wide range of applications that were previously impossible or required complex multi-system architectures. These applications leveraged the models' ability to understand and reason about both visual and textual information simultaneously, opening new possibilities across domains.
In scientific research and education, multimodal models found immediate applications. Researchers could upload scientific papers with complex diagrams, charts, and visualizations and receive explanations that integrated visual and textual understanding. Students learning from textbooks with illustrations could ask questions about both the text and accompanying figures, receiving explanations that connected visual and textual information. The models could analyze research figures, explain experimental results depicted in graphs, and help users understand how visual elements supported textual arguments. This capability was particularly valuable for fields like biology, chemistry, physics, and mathematics where visual information is central to understanding.
Content creation and creative applications benefited significantly from multimodal capabilities. Users could provide images as prompts for creative writing, asking models to generate stories or descriptions based on visual content. The models could analyze photographs and generate detailed captions or articles incorporating visual descriptions. Graphic designers and content creators could use these models to understand design elements, generate text that complemented visual content, and create coherent multimodal content. The ability to understand both images and text enabled more sophisticated creative workflows where visual and textual elements worked together.
Accessibility applications saw substantial improvements with multimodal models. Image description systems could now provide not just basic descriptions but sophisticated explanations of visual content tailored to user needs. The models could answer questions about images, describe complex scenes with detail, and help visually impaired users understand visual information in ways that were previously challenging. The integration of language understanding meant that image descriptions could be contextualized, detailed, and responsive to user queries rather than static and generic.
Data analysis and visualization applications leveraged multimodal models' ability to understand charts, graphs, and visualizations. Users could upload charts and ask questions about trends, relationships, or specific data points. The models could explain what visualizations showed, identify patterns, and help users interpret data. This capability was valuable for business intelligence, scientific research, and educational contexts where visual data representation is common but interpretation requires expertise.
The impact extended beyond individual applications to broader shifts in how AI systems could be deployed and used. Multimodal models reduced the need for complex multi-system architectures that combined separate vision and language models with custom integration logic. Instead, a single model could handle tasks requiring both modalities, simplifying deployment and improving performance through better integration. This shift made multimodal AI more accessible to developers and users, enabling new applications to be built more quickly and efficiently.
The performance of multimodal models on diverse tasks demonstrated their practical value. GPT-4, for example, could analyze medical images with appropriate caveats, explain scientific diagrams, describe photographs in detail, and answer questions about complex visual content. While not replacing specialized systems, these general capabilities enabled new use cases and improved existing applications. The models' ability to handle diverse tasks without task-specific training made them particularly valuable for applications with varied or evolving requirements.
The impact also extended to how AI systems were perceived and used. The ability of multimodal models to process visual information in natural language interactions made AI feel more capable and aligned with human communication patterns. Users could interact with AI systems more naturally, providing information in whatever form was most convenient, whether text, images, or both together. This naturalness of interaction was crucial for adoption and practical use.
Limitations
Despite their impressive capabilities, multimodal large language models in 2023 faced significant limitations that constrained their practical applications and highlighted areas for future improvement. Understanding these limitations was crucial for realistic expectations and appropriate deployment of these systems.
One fundamental limitation was the resolution and detail level of visual understanding. While models could process and understand images, their ability to perceive fine details was constrained by the vision encoder's resolution limits and computational requirements. High-resolution images were often downsampled before processing, potentially losing important details. This limitation was particularly problematic for applications requiring precise visual analysis, such as medical imaging, scientific diagram analysis, or reading small text in images. The models could miss subtle visual elements or fail to distinguish between similar visual patterns that humans could easily differentiate.
Reasoning about spatial relationships and geometric understanding remained challenging. While models could describe what they saw in images and answer questions about visual content, their understanding of spatial relationships, geometry, and precise visual details was less sophisticated than their textual reasoning capabilities. Tasks requiring precise spatial reasoning, understanding of perspective, or geometric calculations based on visual information could be challenging. This limitation reflected that visual understanding, while impressive, had not reached the same level of sophistication as the models' language understanding capabilities.
The training data limitations created biases and gaps in visual understanding. Models were trained on image-text pairs available on the internet, which meant they reflected the biases, perspectives, and limitations present in that data. Images from certain cultures, contexts, or domains might be underrepresented, leading to gaps in understanding. The models might struggle with visual content outside their training distribution, such as unusual artistic styles, specialized technical diagrams, or images from underrepresented contexts. These biases were not merely technical limitations but raised important questions about fairness and representation in AI systems.
Computational requirements remained substantial, limiting accessibility and scalability. Training multimodal models required significant computational resources, making it challenging for smaller organizations or researchers to develop or fine-tune these systems. Inference also required more computation than text-only models, as processing images added overhead. This computational cost limited deployment options and made it difficult to run these models on consumer hardware or in resource-constrained environments. Real-time applications or applications requiring processing of many images could be impractical.
The integration of vision and language, while impressive, was not always seamless or optimal. The models could sometimes struggle with tasks requiring deep integration of visual and textual reasoning, such as understanding complex diagrams with extensive textual annotations, following multi-step visual instructions, or reasoning about temporal sequences of images with accompanying text. The architectural choices made to enable multimodal integration sometimes involved trade-offs that limited capabilities in specific domains or applications.
Safety and reliability concerns were particularly important for multimodal systems. Vision models could be fooled by adversarial images, fail to detect subtle but important visual elements, or misinterpret visual content in ways that could have serious consequences. The combination of visual and textual understanding created new attack surfaces and failure modes that needed careful consideration. For applications where errors could have significant consequences, such as medical image analysis or safety-critical systems, these limitations required careful evaluation and appropriate safeguards.
The models' understanding of visual content, while impressive, was still not as nuanced or comprehensive as human visual understanding. They could describe images and answer questions, but their understanding lacked the depth, context awareness, and commonsense reasoning that humans bring to visual interpretation. The models might miss subtle visual cues, fail to understand context-dependent visual meanings, or struggle with visual content requiring specialized knowledge or cultural understanding.
Legacy and Looking Forward
The development of multimodal large language models in 2023 established new paradigms for AI systems and set directions for future research and development. These models demonstrated that the integration of multiple modalities was not merely a technical addition but a fundamental expansion of AI capabilities that opened new possibilities for human-AI interaction and practical applications.
The architectural innovations developed for multimodal integration influenced subsequent model designs. The approaches to integrating vision encoders with language models, handling variable-length visual inputs, and enabling cross-modal attention mechanisms became foundations for future multimodal systems. The training methodologies, data curation approaches, and scaling strategies established patterns that guided later developments. These innovations were not confined to specific models but contributed to a broader understanding of how to build effective multimodal AI systems.
The success of multimodal models in 2023 also demonstrated the value of unified architectures that could process multiple modalities together rather than requiring separate systems. This insight influenced the design of subsequent AI systems, leading to more integrated approaches that handled multiple modalities natively. The ability to reason about different types of information together proved valuable across diverse applications, suggesting that future AI systems should be designed with multimodal capabilities from the beginning rather than added as extensions.
The practical applications enabled by multimodal models showed the importance of AI systems that could engage with the rich, multimodal nature of human communication and understanding. As humans interact with information through multiple senses and modalities simultaneously, AI systems that could do the same were better aligned with human needs and communication patterns. This alignment proved crucial for adoption and practical utility, suggesting that future AI development should prioritize capabilities that match how humans actually experience and understand the world.
Looking forward, multimodal large language models established trajectories toward more capable and general AI systems. The ability to process multiple modalities together was a step toward artificial general intelligence that more closely resembled human intelligence. Future developments would explore integrating additional modalities like audio, video, and other sensor data. The principles established in 2023 for vision-language integration would be extended to these new modalities, creating systems with even broader capabilities.
The limitations identified in 2023 multimodal models also guided future research directions. Improving visual understanding resolution and detail, enhancing spatial reasoning capabilities, addressing training data biases, and reducing computational requirements became active areas of research. The safety and reliability concerns raised by multimodal systems led to increased focus on robustness, evaluation, and appropriate safeguards. These research directions would continue to shape the development of multimodal AI systems in subsequent years.
The impact of 2023 multimodal models extended beyond technical achievements to influence how the field thought about AI capabilities and development priorities. The success of these models demonstrated that significant advances could come from integrating existing capabilities in innovative ways, not just from developing new algorithms or architectures. This insight influenced research strategies and resource allocation, leading to increased emphasis on integration, scaling, and practical applications alongside fundamental research.
Modern AI systems continue to build on the foundations established by 2023 multimodal models. The integration of vision and language has become a standard capability for state-of-the-art AI systems. The architectural patterns, training approaches, and design principles developed during this period have become part of the standard toolkit for building multimodal AI systems. The applications enabled by these capabilities have become integral to how many AI systems are used, from scientific research tools to creative applications to accessibility systems.
The development of multimodal large language models in 2023 represents a crucial milestone in the evolution of AI toward more capable and general systems. These models demonstrated that combining multiple modalities was not just technically feasible but essential for building AI systems that could engage with the complexity and richness of human experience. The innovations, applications, and insights from this period continue to influence how AI systems are designed, developed, and deployed, establishing multimodal understanding as a fundamental capability for future AI systems.