Introduction
Hello everyone, this is Best Partner, and I'm Da Fei. A few days ago, Google DeepMind suddenly released the I/O edition of the Gemini 2.5 Pro Preview. According to the WebDev Arena leaderboard, this model has surpassed Claude 3.7 Sonnet to take the top spot, becoming the strongest model for programming. At the same time, it also ranks first on LMArena's Coding leaderboard.
Significance of Gemini's Success
This achievement not only demonstrates Google's deep technical capabilities in artificial intelligence; it also reflects how Gemini 2.5 Pro has further strengthened the model's long-context processing ability. In coding scenarios, programmers often need to work with a large number of code files. With its ultra-long context, Gemini 2.5 Pro can quickly understand the logic of an entire codebase, significantly reducing the cost of cognitive switching and improving overall collaboration efficiency. Long context also plays an important role in scenarios such as long-form writing and agent decision chains, and it has become one of the most valuable core capabilities of the Gemini model family.
Insights from Nikolay Savinov
To help everyone gain a deeper understanding of long-context technology, Google invited Nikolay Savinov, a research scientist at DeepMind and co-lead of the long-context pre-training project, onto a podcast to discuss topics such as how long context is evaluated, how it interacts with agents, and its relationship to reasoning and long output. He also offered best-practice suggestions for developers using long context and looked ahead to future trends. It is a rare opportunity for anyone who wants to understand long context in depth.
Token: The Basic Unit of Information Processing
Savinov first introduced the concept of tokens. Tokens are the basic units with which large language models process information. For text, a token is usually slightly smaller than a word: it can be a complete word, such as "apple," a fragment of a word, such as "appl," or even a punctuation mark like a comma or period. In images and audio, tokens are defined differently. Although humans are used to reading text character by character, there is a reason large language models work with tokens instead. Researchers have tried letting models generate directly at the character level to do away with tokens, but found a major drawback: generation becomes significantly slower. Since the model typically generates one token at a time, a token that represents a whole word makes generation much faster than producing characters one by one. So although character-level generation may have some potential benefits, tokens remain the mainstream approach for now.
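To make this concrete, here is a small illustration of sub-word tokenization. Gemini's own tokenizer is not used here; the open tiktoken library (an OpenAI tokenizer) is just a stand-in to show how words split into tokens, so exact token counts for Gemini would differ.

```python
# Illustrative only: Gemini uses its own tokenizer; tiktoken is a stand-in
# to show how text breaks into sub-word tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["apple", "applesauce", "unbelievably", "Hello, world."]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")

# Common short words tend to map to a single token, while longer or rarer
# words split into several sub-word pieces, and punctuation often gets its
# own token.
```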
Context Window: Input to the Model
Next, Savinov introduced the context window. The context window is the sequence of tokens fed into the large language model; it can include the current prompt, the user's previous interaction history, and even files uploaded by the user, such as videos or PDFs. The model's knowledge comes from two main sources: in-weight memory (also known as pre-training memory) and in-context memory. In-weight memory is the knowledge the model acquires during pre-training by learning from a large amount of Internet data. For common, unchanging facts, such as the fact that objects fall rather than rise, the model can rely on in-weight memory and needs no additional context. In-context memory, on the other hand, is memory explicitly provided by the user and present in the current context. The difference between the two matters a great deal: in-context memory is much easier to modify and update than in-weight memory. For example, some knowledge may have been correct at pre-training time but becomes outdated by the time of inference, in which case those facts need to be updated through the context. Of course, the role of the context is not only to update knowledge but also to supply information the model never had, such as personalized information. The model does not know the user's private details; only when the user provides them through the context can the model give a personalized answer instead of falling back on a generic, one-size-fits-all response. Another example is rare facts that appear infrequently on the Internet: the model may not remember them firmly and is prone to hallucination, and explicitly placing those facts in the context improves the accuracy of the answer. As models' learning ability improves, such knowledge may eventually be memorized completely, but for now it remains a challenge.
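As a quick illustration of the difference, the sketch below sends the same question twice: once relying on in-weight memory alone, and once with the relevant fact supplied as in-context memory. It assumes the google-generativeai Python SDK and a "gemini-1.5-pro" model name purely as placeholders; swap in whatever client and model you actually use.

```python
# A minimal sketch (SDK and model name are placeholders, not prescriptions).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")

question = "What time is my dentist appointment tomorrow?"

# In-weight memory only: the model cannot know this private fact.
print(model.generate_content(question).text)

# In-context memory: the same question, with the fact placed in the context.
context = "User calendar: dentist appointment tomorrow at 15:30 with Dr. Lee."
print(model.generate_content(f"{context}\n\nQuestion: {question}").text)
```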
Importance of Context Window Size
The size of the context window is also crucial, since it determines how much additional information can be provided. With a short-context model, users have to weigh which knowledge sources to include, because the limited window may not fit enough information. With a large enough context window, however, users do not have to be as selective about the input: they can cover more relevant knowledge, improve information recall, and effectively compensate for gaps in in-weight memory.
Relationship between Retrieval-Augmented Generation (RAG) and Long Context
When it comes to the context window, retrieval-augmented generation (RAG) and its relationship with long context inevitably come up. Savinov explained the RAG workflow as follows: first, the knowledge base is preprocessed by splitting it into smaller text chunks; then a dedicated embedding model converts each chunk into a vector. When a user query arrives, it is also converted into an embedding vector, the similarity between the query vector and the chunk vectors is computed, and the chunks most relevant to the query are extracted and packed into the large model's context. Finally, the model generates a reply based on this constructed context. Some people ask whether RAG will be absorbed into the model itself as models' retrieval abilities grow stronger, or whether RAG is simply the wrong research direction. Savinov believes neither is the case. Even after the release of Gemini 1.5 Pro, RAG has not become obsolete. Enterprise knowledge bases in particular often reach billions of tokens, far beyond today's million-token context windows, and in those cases RAG remains essential. In the long run, though, long context and RAG are more likely to be complementary than competing. The benefit of long context for RAG is that it allows the RAG system to retrieve and fit more relevant information fragments into the model's context. In the past, context limits forced RAG systems to set conservative thresholds for filtering chunks; with long context, more potentially relevant facts can be included more generously, improving the recall of useful information and creating a genuine synergy between the two.
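For reference, here is a minimal sketch of the RAG workflow Savinov describes. The embed() function is a hypothetical placeholder for whatever embedding model you use; the chunking and retrieval logic is plain numpy.

```python
# A minimal RAG sketch: chunk -> embed -> retrieve top-k -> build prompt.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Hypothetical placeholder: call your embedding model of choice."""
    raise NotImplementedError

def chunk(document: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems usually split on structure.
    return [document[i:i + size] for i in range(0, len(document), size)]

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 20) -> list[str]:
    q = embed([query])[0]
    # Cosine similarity between the query vector and every chunk vector.
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> str:
    # With a long-context model, k can be generous instead of conservative.
    context = "\n\n".join(retrieved)
    return f"Answer using the context below.\n\n{context}\n\nQuestion: {query}"
```

A long-context model mainly changes the last step: k can be set generously, so fewer borderline-relevant chunks are discarded before they reach the model.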
Application Considerations
In practical applications, the choice between short and long context often comes down to the application's latency requirements. If the application must respond in real time, such as an online chatbot, some context length may have to be sacrificed to keep responses fast. If the application can tolerate a slightly longer wait, such as document analysis or complex question answering, long context usually brings higher information recall and therefore more accurate and comprehensive answers.
Quality of Long Context
In addition, the quality of long context has always been a focus of attention. Since the "needle in a haystack" test was shown in the Gemini 1.5 Pro technical report, Google has made significant progress in long-context technology. Not only does Gemini 2.5 Pro outperform strong baselines such as GPT-4.5, Claude 3.7, and DeepSeek on 128k-context benchmarks, it also shows a clear advantage over Gemini 1.5 Pro at a context length of 1 million tokens. Still, it is worth asking whether model quality stays consistent or fluctuates across different context lengths. Savinov pointed out that the industry has previously observed a "lost in the middle" effect: when processing long context, information in the middle is more likely to be ignored. Fortunately, this effect has essentially not been observed in Gemini 1.5 Pro and later versions. However, the research team did find that on hard tasks, especially those with strong distractors, model quality declines slightly as the context grows. For example, when the model must find a specific target in a long context full of key-value pairs with similar structure, it faces a real challenge. The reason lies in a property of the attention mechanism itself: competition between tokens. If one token receives more attention, other tokens receive less. So if there are strong distractors, that is, content very similar to the target information but irrelevant, they may attract a lot of attention and leave too little for the information that actually needs to be found. And the more tokens there are in the context, the fiercer this competition becomes.
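A toy numeric example helps illustrate this competition. The sketch below applies a plain softmax to made-up attention scores: one "needle" key, a varying number of near-identical distractors, and a large amount of background filler. The numbers are invented purely for illustration and are not how Gemini's attention is actually parameterized.

```python
# Toy illustration of token competition under softmax attention.
# All scores are made up; the point is the trend, not the values.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

target_score = 8.0       # query-key score for the true "needle"
distractor_score = 7.5   # near-duplicate distractors score almost as high
background_score = 0.0   # unrelated filler tokens

for n_distractors in [0, 1, 10, 100]:
    scores = np.array([target_score]
                      + [distractor_score] * n_distractors
                      + [background_score] * 1000)
    attn = softmax(scores)
    print(f"{n_distractors:>3} distractors -> attention on target: {attn[0]:.3f}")

# The share of attention on the target shrinks as strong distractors multiply,
# and the effect worsens as the total number of tokens grows.
```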
Evaluation of Long-Context Ability
So how should long-context ability be evaluated? Initially, in the "needle in a haystack" test of the Gemini 1.5 technical report, the model was asked to find one specific piece of information among millions of tokens. With weak distractors, such as finding a specific number inside a Paul Graham essay, this kind of test is now solved well. The current research frontier is the "needle in a haystack" with strong distractors, which is much harder for the model, as well as multi-needle retrieval, where the model must retrieve several unrelated pieces of information from the long context, also a major challenge. Savinov also noted that evaluation needs to balance realism against measuring the core capability. Evaluations that chase real-world scenarios too hard, such as answering complex programming questions over a large codebase, may drift away from measuring long-context ability and end up testing other skills, such as coding, which can send the wrong optimization signal. Retrieval-based and holistic evaluations also have different characteristics. Single-needle retrieval can in principle be solved by RAG; what really shows the advantage of long context are tasks that require integrating information across the entire context, such as long-document summarization. RAG struggles with such tasks, but automatically evaluating them is also hard: the ROUGE metric used for summarization is imperfect, the task is highly subjective, and agreement between human evaluators can be low, so the optimization signal from such metrics is not strong. Savinov personally prefers to optimize against metrics with stronger signal that are less prone to being gamed.
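For developers who want to probe this themselves, here is a rough sketch of a multi-needle retrieval harness in the spirit of these tests. The ask_model() function is a hypothetical placeholder for whatever model call you use, and the needle/filler format is invented for illustration.

```python
# A rough multi-needle "needle in a haystack" harness sketch.
import random

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder: call your LLM of choice here."""
    raise NotImplementedError

def build_haystack(needles: dict[str, str], filler: str, n_filler: int) -> str:
    chunks = [filler] * n_filler
    # Insert each needle as a sentence at a random position in the filler.
    for key, value in needles.items():
        pos = random.randint(0, len(chunks))
        chunks.insert(pos, f"The secret value for {key} is {value}.")
    return " ".join(chunks)

def run_eval(needles: dict[str, str], filler: str, n_filler: int) -> float:
    haystack = build_haystack(needles, filler, n_filler)
    correct = 0
    for key, value in needles.items():
        prompt = (f"{haystack}\n\n"
                  f"What is the secret value for {key}? Answer with the value only.")
        if value in ask_model(prompt):
            correct += 1
    return correct / len(needles)  # recall over all needles
```

Strong distractors can be simulated by also inserting similar sentences with near-duplicate keys, which is exactly the harder variant Savinov describes.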
Relationship between Long Context and Reasoning Ability
So what is the relationship between long context and reasoning ability? If the context is lengthened, does the model's ability to predict the next token improve? This can be understood from two angles. On the one hand, a longer input context also improves the prediction of short answers; on the other hand, output tokens are essentially no different from input tokens. If the model feeds its own output back in as new input to form a chain of thought, this effectively extends the context. So in theory, strong long-context ability should help reasoning. The reasoning process usually requires generating a thinking trace rather than directly emitting a single-token answer, because the model's ability to make logical hops within the context through its attention layers is limited by the depth of the network. But if the model can use intermediate thinking steps, that is, its own output, as new input, it is no longer bound by network depth and can handle more complex tasks. This is equivalent to having a readable and writable memory. In that sense there is no fundamental difference between long-input and long-output ability. After pre-training, the model has no inherent limit on generating a large number of tokens: you can feed it 500,000 tokens and ask it to copy them, and it can actually do so. In the post-training stage, however, this long-output ability needs careful handling, mainly because of a special end-of-sequence token. If the sequences in the supervised fine-tuning (SFT) data are all relatively short, the model learns to emit that end token and stop within a short output length. This is essentially an alignment problem. Reasoning is just one kind of long-output task; translation is another, where the entire output may be long, not just the thinking trace. Proper alignment is needed to encourage the model to generate long outputs, and Savinov's team is actively researching and improving this capability.
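The "output becomes input" idea can be sketched as a simple loop: each generated reasoning step is appended to the context so later steps can attend to it. generate_step() and the FINAL ANSWER marker below are hypothetical placeholders, not anything prescribed by Gemini.

```python
# Sketch of treating model output as readable/writable memory:
# each step of the thinking trace is appended to the context.
def generate_step(context: str) -> str:
    """Hypothetical placeholder: one model call returning the next chunk of output."""
    raise NotImplementedError

def think_then_answer(question: str, max_steps: int = 8) -> str:
    context = f"Question: {question}\nThink step by step.\n"
    for _ in range(max_steps):
        step = generate_step(context)
        context += step  # the trace becomes part of the context for later steps
        if "FINAL ANSWER:" in step:  # stopping marker chosen by convention here
            break
    return context
```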
Best Practices for Developers
For developers, Savinov offered some best practices for using long-context technology. First, make good use of context caching. The first time you provide a long context to the model and ask a question, processing takes a long time and costs more. But if the second and subsequent questions are based on the same context, the context-caching feature can both speed up responses and reduce cost; some models already offer it. For files uploaded by users, such as documents, videos, or codebases, try to cache their context: processing is faster and the average price of input tokens drops significantly, reportedly by roughly a factor of four. Context caching is best suited to scenarios like "chat with your documents," "chat with PDFs," or "chat with my data," because their core property is that the original input context stays unchanged. If the input context changes on every request, caching will not help much. If the context really must change, it is best to only append content at the end, so the system can automatically match the cached prefix and only process the new part. The position of the question also matters: it should be placed after the context. If the question comes before the context, every new question invalidates the cache and no cost savings are possible (a rough sketch of this pattern appears below).

Savinov also suggested combining long context with RAG when dealing with billions of tokens of context; even when the context requirements are smaller, combining with RAG is beneficial for applications that need to retrieve many separate pieces of information. At the same time, avoid irrelevant context: do not fill the context with content that has nothing to do with the task. It not only raises cost but can also hurt the accuracy of multi-needle retrieval, because irrelevant content can become a distractor. Although the point of long context is to spare users time-consuming manual curation, at the current stage Savinov still recommends avoiding obviously useless input for the best quality and cost-effectiveness; as model quality improves and costs fall, users may not need to worry about this as much.

Finally, developers sometimes try to fine-tune a model on a specific knowledge base, such as a large enterprise's document library, hoping to improve performance. For long-context tasks, however, fine-tuning is not always the best choice. Fine-tuning is usually based on limited data that is unlikely to cover the complex scenarios and massive information involved in long context, and the process can introduce new biases that degrade performance on other tasks. If the goal is simply to inject or update specific knowledge, passing the information through in-context memory is often more flexible and efficient than fine-tuning. Only when the model genuinely needs to adapt deeply to a specific domain's complex tasks and data patterns is fine-tuning worth careful consideration, along with sufficient testing and evaluation.
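Here is the caching sketch referenced above. It assumes the google-generativeai Python SDK's caching interface; the exact module, model names, and field names may differ across SDK versions, so treat them as placeholders and check your SDK's documentation.

```python
# A rough context-caching sketch (SDK details and model name are placeholders).
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")

# Cache the large, unchanging prefix once: the document, codebase, video, etc.
big_document = open("manual.txt").read()
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",   # placeholder model name
    contents=[big_document],
    ttl=datetime.timedelta(minutes=30),
)

# Bind a model to the cached prefix and append only the question each time.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("Question: how do I reset the device?").text)
print(model.generate_content("Question: what does the error light mean?").text)

# Because the question is appended after the unchanged context, each follow-up
# reuses the cached prefix and only the new tokens are processed at full price.
```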
Development History and Future Outlook
Of course, long-context technology was not achieved overnight; behind it lies a history of challenges and breakthroughs. When Google's team began developing long-context technology, they set themselves extremely high goals from the start, aiming to build a model that could handle context lengths far beyond the industry level at the time. Before the release of Gemini 1.5 Pro, internal tests had already shown exciting results: when processing a 10-million-token context, the model could not only understand the information accurately but also reason and generate effectively on top of it. However, the Google team did not rush to launch a model with 10-million-token capability, because in practice running such a model consumes enormous computing resources, making it slow and expensive. They ultimately chose to set the released context window between 1 million and 2 million tokens, which keeps the model technically ahead while keeping speed and cost within manageable bounds.
Relationship with Agents
For today's highly popular agents, long-context technology is also closely intertwined with their development. Agents can be regarded as entities that autonomously perceive their environment, make decisions, and execute actions. Long-context ability gives agents a more powerful information-processing foundation, enabling them to understand and handle more complex task backgrounds and environmental information. In turn, the development of agents places higher demands on long-context technology, pushing researchers to keep improving it. For example, when interacting with users, agents need to make sound decisions and responses based on a large amount of contextual information, including conversation history, task goals, and background knowledge. Long context enables agents to better understand the relationships among these pieces of information and thus provide services that are more accurate and better aligned with user needs. At the same time, agents can use long context to automatically gather and integrate relevant information without requiring users to manually input large amounts of content. In an intelligent office scenario, for instance, an agent can automatically extract relevant information from multiple sources such as the enterprise knowledge base, email, and meeting notes to give users comprehensive decision support.
Future Development Trends
Looking ahead, Savinov believes long-context technology will see even more exciting developments over the next three years. In terms of quality, models will handle complex long-context information better, reducing the performance degradation caused by distractors and other factors, and achieving more accurate and coherent processing and generation. In terms of cost, algorithmic optimization and hardware upgrades will significantly reduce the cost of running long-context models, making them affordable for more developers and enterprises. As for expanding the context window, Savinov noted that today's million-token level is only the first step; the next step is ten million tokens, and eventually hundred-million-token contexts. Although that goal is full of challenges, DeepMind's researchers are exploring new technical paths, such as improving the attention mechanism and developing more efficient encoding methods, which are expected to push the context window further.
Conclusion
Well, that's the main content of this podcast. In general, long-context technology has shown strong strength and great potential in Google's Gemini series of models and has undoubtedly become one of the key technical fields of today's large models. I will also continue to pay attention to the development trends of this technology and provide you with the latest interpretations. Thank you for watching this video, and we'll see you next time.