A New Alternative to Multimodal LLMs Emerges

Introduction to VL-JEPA

Overview of VL-JEPA

  • Gabriel Merlo introduces VL-JEPA, a new alternative to multimodal models developed by Meta FAIR, claiming it improves upon traditional Vision Language Models (VLMs).
  • The model achieves a 2.85× reduction in decoding operations while maintaining comparable performance, and has 50% fewer trainable parameters than traditional language models that process visual input as tokens.

Evolution from Previous Models

  • VL-JEPA continues previous JEPA models such as V-JEPA, which focused on integrating text and images but were limited to tasks such as classification.
  • The new model implements the world-models philosophy, enabling simultaneous tasks such as classification and retrieval.

Potential Applications and Innovations

Dominance Over Traditional Models

  • The architecture shows potential to displace traditional large multimodal language models in various use cases.

Example Scenario

  • An illustrative example is provided where a person reacts instinctively to an object potentially falling, showcasing human-like predictive capabilities without needing extensive processing time.

Understanding Predictions with VL-JEPA

Human-Like Intuition

  • Unlike traditional language models that predict the next token in a sequence, VL-JEPA predicts meanings from visual inputs combined with textual prompts.

Embedding Space Representation

  • The model is visualized in an embedding space reduced to two dimensions, where quick predictions follow from the spatial relationships between meanings (see the sketch below).
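
A minimal numpy sketch of this lookup (the 2-D coordinates and meaning labels below are invented for illustration; the real model operates in a much higher-dimensional space):

```python
import numpy as np

# Hypothetical 2-D "meaning" embeddings; the real model uses far more dimensions.
meanings = {
    "walking towards the kitchen": np.array([0.9, 0.8]),
    "opening the fridge":          np.array([1.0, 0.6]),
    "an object is falling":        np.array([-0.7, 0.9]),
    "the room is dark":            np.array([-0.8, -0.6]),
}

def nearest_meaning(predicted: np.ndarray) -> str:
    """Return the label whose embedding is closest to the prediction (cosine)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(meanings, key=lambda label: cos(predicted, meanings[label]))

# A predicted point near the kitchen cluster resolves to a kitchen-related meaning.
print(nearest_meaning(np.array([0.85, 0.75])))  # -> "walking towards the kitchen"
```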

Practical Demonstration of Capabilities

Real-Time Predictions

  • As the model describes actions (e.g., "walking towards the kitchen"), it updates its position within the embedding space, demonstrating how closely related meanings cluster together.

Speed and Efficiency

  • The ability to make rapid predictions about potential dangers (like an object falling) illustrates why this model excels in scenarios requiring immediate responses compared to traditional text-based predictions.

Innovations in Vision Language Models

Selective Decoding and Human-like Communication

  • The discussion introduces wearable technology, such as glasses that can communicate with users through VL-JEPA, mimicking human communication styles.
  • The model is designed to intuitively understand context without generating extensive verbal constructs, similar to how humans process thoughts and make decisions.
  • Selective decoding is highlighted as a key innovation, reducing the number of decoding operations by 2.85× compared to traditional VLMs, which typically generate text uniformly.
  • Traditional VLMs use uniform sampling for text generation; however, selective decoding allows the model to respond only when significant changes are detected in visual input, enhancing efficiency.
  • This approach not only conserves computational resources but also yields a lighter, faster model, with no disadvantages observed during testing with different configurations of Llama 3.2 (a sketch of the gating idea follows this list).
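
One plausible reading of selective decoding, sketched under the assumption that the gate is a simple distance threshold on consecutive predicted embeddings (the paper's actual gating criterion may differ):

```python
import numpy as np

THRESHOLD = 0.35  # hypothetical minimum embedding shift that triggers decoding

def selective_decode(embedding_stream, decode, threshold=THRESHOLD):
    """Run the (expensive) decoder only when the predicted embedding has moved
    far enough from the last embedding that was actually verbalized."""
    last_spoken = None
    for z in embedding_stream:
        if last_spoken is None or np.linalg.norm(z - last_spoken) > threshold:
            yield decode(z)      # significant change detected: speak
            last_spoken = z
        # otherwise stay silent, skipping a full decoding pass

# Toy usage: a nearly static scene yields one utterance, a sudden change a second.
stream = [np.array([0.10, 0.10]), np.array([0.12, 0.11]), np.array([0.90, -0.40])]
for text in selective_decode(stream, decode=lambda z: f"state changed near {z}"):
    print(text)
```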

Architectural Insights

  • A simplified architecture diagram is presented, illustrating a visual input (x_v) that is encoded into a numerical representation for processing by the predictor.
  • Alongside visual input, textual queries are continuously fed into the model. For instance, queries about actions being performed by individuals captured on camera enhance predictive accuracy.
  • The predictor operates within an embedding space where training focuses on understanding meanings rather than merely predicting tokens; this distinction is crucial for effective learning (a rough sketch of the data flow follows).
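
A rough PyTorch sketch of that data flow; the module sizes, the stand-in linear encoders, and fusion by concatenation are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

D = 256  # hypothetical embedding width

class VLJEPASketch(nn.Module):
    """Toy stand-in for the described pipeline: a visual input x_v is encoded
    into a numeric representation, fused with an encoded text query, and the
    predictor outputs a point in the shared embedding space."""

    def __init__(self, visual_dim=1024, text_dim=512, d=D):
        super().__init__()
        self.visual_encoder = nn.Linear(visual_dim, d)  # stands in for a vision encoder
        self.text_encoder = nn.Linear(text_dim, d)      # stands in for a text encoder
        self.predictor = nn.Sequential(                 # predicts in embedding space
            nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d)
        )

    def forward(self, x_v, query):
        z_v = self.visual_encoder(x_v)   # numeric representation of the scene
        z_q = self.text_encoder(query)   # e.g. "what is the person doing?"
        return self.predictor(torch.cat([z_v, z_q], dim=-1))

model = VLJEPASketch()
z_pred = model(torch.randn(1, 1024), torch.randn(1, 512))
print(z_pred.shape)  # torch.Size([1, 256])
```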

Challenges in Model Training

  • An example illustrates the difference between training on token-based predictions and meaning-based predictions: token-level targets increase computational demands and hinder the model's grasp of contextual nuance.
  • When asked about turning off a light switch, responses generated from token-based training may miss alternative interpretations like "the room will get dark," leading to incomplete understanding of scenarios.
  • The need for models to grasp broader meanings rather than specific token sequences underscores the importance of context in language processing tasks (a toy contrast follows).
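
A toy contrast between the two training signals; the sentences and vectors are fabricated, and a real setup would embed text with a trained encoder:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend embeddings: the two valid answers point in nearly the same direction.
target     = np.array([0.95, 0.30])   # "the light will turn off"
paraphrase = np.array([0.90, 0.40])   # "the room will get dark"
unrelated  = np.array([-0.80, 0.50])  # "the cat is hungry"

# Meaning-based loss (1 - cosine): a close paraphrase costs almost nothing...
print(1 - cosine(paraphrase, target))  # small loss
print(1 - cosine(unrelated, target))   # large loss

# ...while a token-level comparison scores the equally valid paraphrase as
# mostly wrong (exact word match used here as a crude proxy for cross-entropy):
gold = "the light will turn off".split()
pred = "the room will get dark".split()
print(np.mean([g == p for g, p in zip(gold, pred)]))  # 0.2 -- only "the" matches
```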

Implications for Future Development

  • The discussion concludes with reflections on how current approaches could be improved by focusing on meaning prediction rather than strict token sequences during training processes.
  • Emphasizing meaningful interactions over rigid token structures could lead to more intuitive AI systems capable of better understanding human-like reasoning and responses.

Understanding the Architecture of JEPA Models

Overview of Embeddings and Model Predictions

  • The use of embeddings simplifies model training by allowing predictions to point towards a general direction in the semantic space, representing meanings such as "the lamp is off."
  • The model identifies its position within the embedding space to provide accurate responses, utilizing loss functions to calculate differences between text inputs and expected outputs.
  • By encoding input data into embeddings, the model learns to predict meanings directly, comparing its predictions against target texts that are themselves transformed into embeddings (a sketch of this loss follows).
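
A minimal sketch of that objective, assuming a 1 - cosine-similarity loss between the predicted embedding and the embedding of the target text (the paper's exact loss formulation may differ):

```python
import torch
import torch.nn.functional as F

def meaning_loss(z_pred, z_target):
    """Compare prediction and ground-truth caption in embedding space:
    1 - cosine similarity, averaged over the batch."""
    return (1 - F.cosine_similarity(z_pred, z_target, dim=-1)).mean()

# Toy batch: z_target would come from encoding the target text into embeddings.
z_pred = torch.randn(4, 256, requires_grad=True)
z_target = torch.randn(4, 256)
loss = meaning_loss(z_pred, z_target)
loss.backward()          # gradients flow back into the predictor
print(loss.item())
```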

Training Phases of JEPA Models

  • The architecture includes a decoder that selectively activates based on changes in visual input. This structure is crucial for understanding how models process information.
  • Training occurs in two phases: pretraining without queries on a large captioned dataset, followed by supervised training where queries are introduced alongside the visual inputs (a schematic loop follows).
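
The two phases could be organized roughly as below; this is a schematic loop reusing the VLJEPASketch model and meaning_loss from the sketches above, and the zero-vector stand-in for "no query" in phase 1 is an assumption:

```python
import torch

def train_phase(model, batches, optimizer, loss_fn, use_queries):
    """One pass over (visual input, query, target embedding) batches."""
    for x_v, query, z_target in batches:
        if not use_queries:                   # phase 1: captions only, no queries
            query = torch.zeros_like(query)   # hypothetical "no query" conditioning
        loss = loss_fn(model(x_v, query), z_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Phase 1: pretraining on a large caption dataset, without queries.
# train_phase(model, caption_batches, opt, meaning_loss, use_queries=False)
# Phase 2: supervised training where queries accompany the visual input.
# train_phase(model, query_batches, opt, meaning_loss, use_queries=True)
```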

Performance Comparison with Other Models

  • Results show that VL-JEPA requires significantly fewer training examples than comparable models while still achieving competitive performance metrics.
  • Its efficiency stems from predicting meaning rather than token plausibility, yielding superior benchmark results against baseline models.

Limitations and Future Directions

  • Despite strong performance, state-of-the-art (SOTA) models generally outperform VL-JEPA thanks to more extensive training and larger parameter counts; however, there are specific benchmarks where VL-JEPA excels.
  • Future evaluations should consider how VL-JEPA compares with full VLMs, especially on reasoning tasks, which may favor larger, more robust architectures.

Potential Applications and Benchmarks

  • For tasks involving live video or robotics, VL-JEPA shows significant potential as an alternative to multimodal large language models thanks to its lightweight decoder architecture.
  • On a notable benchmark called World Prediction, which tests a model's ability to infer relationships between images, VL-JEPA achieved a new SOTA score of 65.7%, indicating strong predictive capabilities.
Video description

Paper: https://arxiv.org/abs/2512.10942
My socials: X/Twitter: https://x.com/gabmfrl

00:00 VL-JEPA: Less compute, better performance
00:46 World Models philosophy and the evolution of the JEPA architecture
01:59 Example: human intuition vs. LLMs
03:32 Predicting meanings (embeddings) vs. the next token
04:22 Visualizing the embedding space and demo
06:06 Innovation: selective decoding (speaking only when necessary)
08:26 The architecture
09:15 Why it is more efficient (the light example)
13:12 Training phases: pre-training and supervised
13:49 Results analysis and data efficiency
15:22 Limitations vs. strengths
16:02 Achieves a new SOTA on World Prediction