Jul 15, 2025

拿到Gemini 1.5试用后，我问了三个以前不敢想的问题，它，独一无二

中文 (Chinese) English

经过几天等待，虽然没有收到Google官方邮件确认，但是我的“Google AI Studio”平台中model下多出了“Gemini 1.5 Pro”的选项，意味着，我可以开始尝试100万token上下文的“神奇”了。

我试了三个问题，都是以前不敢想象的问题。

问题一：上传一集电视剧，生成剧情简介。

这应该是第一个可以上传一个多小时视频的大模型。为了留点空间，我只上传了《疑犯追踪》（AI时代，这部当年“神剧”至今仍值得细细回味）第一季的第二集，时长大概是45分钟多一点。

也只问了一个问题：生成故事摘要，截图如下。

Story Summary Task

首先，44分钟不到的电视剧大概等于70万不到的token大小；

其次，故事概要几乎完全正确，虽然视频里是有字幕的，但是能把故事理解到这个程度，或许，我们再也不需要看那些“五分钟带我们看完一部电影”的视频了。

用“魔法对抗魔法”，然后，人或许可以回归更有意义的事情中去，这是我的愿望，我也坚信会发生。

问题二：上传19篇论文，提取知识图谱

这是第一个原生支持百万token上下文的模型，除了看长视频外，最擅长的场景是处理一批文档，建立跨文档的知识图谱。所以，我上传了19篇与图像和视频生成相关的论文，问题是：提取100个关键词，并生成知识图谱。

Knowledge Graph Task

19篇论文大概占用了50万个token，等于一次把50*0.75（一个token大约等于0.75个单词）个词输给了“大模型”，让它分析。

结果如下：

Knowledge Graph based on 100 Keywords and 19 AI Research Reports:

Based on the 100 keywords and the content of the 19 research reports, I have created a more detailed knowledge graph that captures the relationships between these concepts. Please note that this is still a high-level overview, and the actual knowledge graph could be even more complex with additional nodes and edges to represent more nuanced relationships.

Nodes:

Model Types:
- Diffusion Models: Latent Diffusion Models (LDMs), Denoising Diffusion Probabilistic Models (DDPMs), Cascaded Diffusion Models, Score-based Generative Models (SGMs)
- Autoregressive Models (ARMs): Transformer-based ARMs (e.g., DALL-E, CogVideo, Parti)
- Generative Adversarial Networks (GANs): StyleGAN, Long Video GAN (LVG)
- Other Models: Variational Autoencoders (VAEs), Flow-based Models, VQ-VAE, VQGAN, Masked Autoencoders, MaskGIT, C-ViViT
Applications:
- Image Generation: Text-to-Image Generation, Class-Conditional Image Generation, Unconditional Image Generation, Image Inpainting, Image Outpainting, Image Extrapolation, Image Manipulation, Super-Resolution, Layout-to-Image Synthesis
- Video Generation: Text-to-Video Generation, Image-to-Video Generation, Video Prediction, Frame Interpolation, Multi-view Generation, Novel View Synthesis, Personalized Video Generation, Visual Storytelling
- Robotics: Robotic Control, Planning, Autonomous Driving Simulation
Datasets:
- Image Datasets: ImageNet, Places2, MS-COCO, LAION-5B, JFT-300M
- Video Datasets: Kinetics, BAIR Robot Pushing, Moments in Time (MiT), WebVid-10M
Technical Details:
- Attention Mechanisms: Self-Attention, Multi-Head Attention, Cross-Attention, Window Attention, Spatial Window (SW), Spatiotemporal Window (STW), Causal Attention, Bidirectional Attention
- Other Techniques: Mask Scheduling Function, Iterative Decoding, Conditioning Augmentation, Classifier-free Guidance, Gradient Checkpointing, Positional Encoding, Sinusoidal Embedding, Layer Normalization, Adaptive Layer Norm (adaLN), adaLN-Zero, ResNet, EfficientNet, TokenLearner, Universal Sentence Encoder (USE), FiLM, VQ Objective, Commitment Loss, L2 Loss, Perceptual Loss (LIP), Video Perceptual Loss (LVP), Adversarial Loss (LAdv), StyleGAN, Cross-Entropy Loss, Causal Masking, Decomposed Action Space, PopArt Normalization, Multi-task Learning, Multi-Agent Learning, Fictitious Self-Play, Self Reward-Play, Exploration Distillation Loss, Principal Component Analysis (PCA), Spherical Interpolation, Text Diffs, Kohonen Network, Self-Organizing Maps
Metrics:
- Image Quality: Fréchet Inception Distance (FID), Inception Score (IS), Precision, Recall, Classification Accuracy Score (CAS)
- Video Quality: Fréchet Video Distance (FVD)
- Other Metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), Learned Perceptual Image Patch Similarity (LPIPS)

Edges:

Model Type - Application:
- Diffusion Models can be used for Text-to-Image Generation, Image-to-Image Translation, etc.
- LDMs can be used for High-Resolution Image Synthesis, Text-to-Video Generation, etc.
- ARMs can be used for Text-to-Image Generation, Text-to-Video Generation, etc.
- GANs can be used for Unconditional Image Generation, Video Generation, etc.
Model Type - Technical Detail:
- MaskGIT uses Bidirectional Transformers and Masked Visual Modeling.
- C-ViViT uses Causal Attention and Vector Quantization.
- LDMs use Denoising Autoencoders and operate in Latent Space.
Application - Dataset:
- Text-to-Image Generation is evaluated on MS-COCO.
- Video Prediction is evaluated on BAIR Robot Pushing and Kinetics.
Model Type - Metric:
- FID is used to evaluate the quality of Diffusion Models and ARMs.
- FVD is used to evaluate the quality of Video Generation models.

This knowledge graph provides a high-level overview of the key concepts and relationships explored in the 19 research reports. It can be used to understand the connections between different model types, applications, datasets, technical details, and metrics used in the field of generative AI for vision tasks. I hope this is a more helpful representation of the knowledge contained in the research reports!

首先，提取的关键词确实基本覆盖了上传的19篇论文，而且都给出了几乎准确的分类；

其次，可以生成一些知识图谱关系，但是量还不够，关系复杂度也还没达到我的期待。当然，姑且可以认为这是在模型测试环节，而不是生产环境，产品团队还需要很多优化的工作要进行。即便如此，这样的结果也能够充分证明百万token的能力。

问题三：上传一篇长文档，问一些问题。

上传当初GPT-4多模态模型发布时，微软发的研究论文《The Dawn of LMMs》（多模态大模型的黎明），全文166页，文件大小接近50MB（Claude3 Opus能够支持的单个文件大小是10MB）。上传后，使用了约10万个tokens。

LMMs Paper Upload

我问的第一个小问题，来自于原文中9.7节。

Reference Screenshot

回答如下：

Answer Screenshot

不仅精确找到了原文的位置，还根据GPT-4V的回答内容，进行了重新生成，答案也显示出Gemini 1.5完全理解这个问题。

定位细节后，再考察一下总结能力。

Summary Analysis

除了完整准确，还是完整准确。虽然，Gemini 1.5要正式商用，还需要一些时间去优化，但我已经迫不及待要把模型接入到“非敏感数据”的个人应用场景中。Sora、Claude3、Gemini 1.5，以及众人期待尚在路上的其他一些一定会引起轰动效果的模型，2024，AI全面进入生产领域的确定性越来越高。

彩蛋：看LeCun最新采访视频，回答问题

Yann LeCun接受采访的视频这两天一直在刷屏，我曾第一时间让Gemini 1.0 Ultra“看”过这段视频，并给出总结，说实话，这个功能虽然独一无二，但还是有很多限制：

只能是youtube视频链接；
模型其实是根据视频链接网页下附带的文字纪要进行的分析。

同时，针对本文第一个问题里上传的一集电视剧，视频里也是有字幕的。所以，仅就第一个问题而言，我们无法百分百确定模型是否是真的在“看”电视剧还是在“看”字幕（区别就在于是不是读懂人物，情节，同时听懂对白）。

所以，我另外做了个“彩蛋”：我转录了Lecun接受采访的视频，视频全程没有实时字幕，同时考虑到版权和模型处理能力，截取了其中半小时。

所以，视频全程基本上是这样的，只有表情和说话声音，没有任何字幕。半小时的视频约48万个token。如果考虑前面一集电视剧约45分钟时长，约70万token的话，可以几乎认定，token长度跟视频长度成正比，而与分辨率无关（电视剧是1080P的，这段采访视频我压缩到了480P）。

Video Context Check

上传后，我简单问了两个问题，第一个问题是：生成一个详细的摘要。回答如下：

Video Summary

虽然篇幅不长，但是要点很准确。

第二个问题，相对细节一点：LeCun为什么认为LLM不能实现AGI？

Detailed Video Q&A

毫无疑问，这是第一个“听得懂”声音的大模型（一些论文里有可以加入声音理解的多模态，但是仅限于论文介绍里；ChatGPT虽然可以接收语音输入，但是这是嵌入了Whisper模型，等于是一个语音输入法功能）。

至于答案，显然，是正确的。

如果我们遵守LeCun的定义，现在的Gemini 1.5显然还不能称为世界模型，但它显然是目前所有模型中最接近的那一个，在这一点上，也远超Sora：Sora是一个生成器，对吗？Sora不能理解声音，对吗？Sora不具备对话能力，对吗？……

当然，如果OpenAI把Sora的功能整合到下一个版本的GPT中（可能是5，也可能就是4.5，但都是下一代），再加上声音能力，它肯定可以跟Gemini 1.5全方位对比一下。

但是，现在，只有一个独一无二的模型，叫做，Gemini 1.5。