I made a mistake. I was inclined to believe that Sora didn't particularly understand the "physical world," and that training and generation based purely on 2D images and videos would cause it to lack 3D information.
However, since the weekend, a flood of information has challenged that assumption. I shifted from presuming the model "doesn't understand" the world to presuming it "does," and so I used Sora's official videos to try to build several 3D models.
First, I selected four representative videos, all characterized by relatively fixed scenes, slow camera movement, and subjects that were either not in close-up or changed very little.
Second, I converted each video into individual frames, keeping roughly one frame out of every three to five (e.g., for a 24 fps original, about 6-8 frames per second); a sketch of this step appears below.
Finally, I used my established 3D modeling workflow to reconstruct the "world" scenes corresponding to the videos.
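The post doesn't name the specific tools, so the sketch below is only an illustration of what steps two and three can look like with off-the-shelf software. OpenCV for frame extraction and COLMAP as the photogrammetry backend are my assumptions rather than the workflow actually used here, and the file names, paths, and sampling interval are placeholders.

```python
# Minimal sketch (not the author's exact workflow): sample every Nth frame
# with OpenCV, then run a standard COLMAP photogrammetry pipeline on them.
import subprocess
from pathlib import Path

import cv2  # pip install opencv-python

VIDEO = "sora_clip.mp4"   # hypothetical input file
FRAMES = Path("frames")   # extracted frames go here
SPARSE = Path("sparse")   # COLMAP reconstruction output
STEP = 4                  # keep 1 frame in every 4 (~6 fps for a 24 fps clip)

FRAMES.mkdir(exist_ok=True)
SPARSE.mkdir(exist_ok=True)

# Step 2: convert the video into individual frames at a reduced rate.
cap = cv2.VideoCapture(VIDEO)
index = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % STEP == 0:
        cv2.imwrite(str(FRAMES / f"{saved:05d}.jpg"), frame)
        saved += 1
    index += 1
cap.release()
print(f"saved {saved} frames")

# Step 3: a standard COLMAP pipeline (features -> matching -> mapping).
subprocess.run(["colmap", "feature_extractor",
                "--database_path", "db.db", "--image_path", str(FRAMES)], check=True)
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", "db.db"], check=True)
subprocess.run(["colmap", "mapper",
                "--database_path", "db.db", "--image_path", str(FRAMES),
                "--output_path", str(SPARSE)], check=True)

# Optionally export the sparse model as a point cloud for inspection.
subprocess.run(["colmap", "model_converter",
                "--input_path", str(SPARSE / "0"), "--output_path", "scene.ply",
                "--output_type", "PLY"], check=True)
```

With a largely static scene and enough overlapping views, a pipeline like this recovers camera poses plus a point cloud, which is what the screen recordings below are navigating.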
I will first present the original videos and 3D models for the four scenes, and then discuss the conclusions.
Scene 1: Coastal Town
The original video is 20 seconds long with a bird's-eye view.
After sampling: 150 images.
3D Model (Screen recording of dragging, zooming, and changing perspectives in the tool interface, same below).

Result: we reconstructed this coastal town. There are plenty of flaws in the details, but for someone like me who could never model a town by hand, it solved the one step I couldn't do myself.
Scene 2: Gold Rush Village
The scene in the original video is significantly larger than the previous one, and the camera moves very quickly. The video is 25 seconds long.

After sampling: 188 images.
3D Model; I wasn't expecting great results going in, but there were some pleasant surprises.

The geography and terrain of the entire village come through clearly, and the missing sections in the middle could easily be filled in with current game-engine tools. Many building structures are complete, and the unevenness caused by the fast camera movement can be corrected with current models. Perhaps a "Gold Rush" game could be playable within a few days.
Scene 3: Christmas Atmosphere Close-up
The original video is 17 seconds long, with a fixed scene, no moving objects, the camera rotating around the subject, and no change in focal length.

After sampling: 128 images.
3D Scene; unsurprisingly, the quality is clearly the highest.

A static scene with the camera orbiting the subject is the standard capture setup for photogrammetry, so the result is not surprising.
Scene 4: Cyberpunk
The original video is 20 seconds long. Although the robots are the main subjects, their heads turn and the scene itself changes.

After sampling: 149 images.
3D Model; the original video cuts between three viewpoints, so the reconstruction automatically split into three scenes.



Every scene has quite a few gaps, which is natural given how quickly things change. Still, if Sora were used to generate the scenes for a multi-act demo and the results were then polished with game-engine tools and ray tracing, perhaps the film industry's core competitiveness would remain the script, the actors, and the special effects. But for set design, costumes, and props... I hardly dare imagine.
Conclusion
All four selected official Sora videos were successfully reconstructed into 3D models, and under the current workflow the manual work is minimal. Each model has obvious gaps caused by image quality, rapid scene changes, object movement, and incomplete camera coverage, but these videos were never shot for 3D modeling. If scenes and camera movement can be controlled via prompts in the future, the quality should take a qualitative leap.
The output is "true 3D." As shown in the GIF below, you can measure distances (the scale can be preset) and angles, and the model can be imported into CAD software, Unity, Unreal Engine, and other game engines for editing. Work that used to take professional designers a long time in manual tools can now potentially be automated end to end. At the very least, this modeling exercise shows the approach is viable, and quality will only improve as the models advance.
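To make the "true 3D" point concrete, here is a small, hypothetical example (not from the original post) of doing the same kinds of operations programmatically. It assumes Open3D and a scene.ply point cloud like the one exported in the earlier sketch; the point indices and scale factor are placeholders.

```python
# Hypothetical example: measure distances/angles in a reconstructed point cloud
# and export a mesh for a game engine. Assumes Open3D and a "scene.ply" file.
import numpy as np
import open3d as o3d  # pip install open3d

pcd = o3d.io.read_point_cloud("scene.ply")
pts = np.asarray(pcd.points)

# Distance between two (illustrative) points; multiply by a preset scale factor,
# since photogrammetry output is only defined up to scale.
SCALE = 1.0  # metres per model unit, set from a known reference in the scene
a, b, c = pts[0], pts[100], pts[200]
print("distance a-b:", np.linalg.norm(a - b) * SCALE)

# Angle at point a formed by points b and c (scale-independent).
v1, v2 = b - a, c - a
cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print("angle at a (deg):", np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Mesh the cloud and export to OBJ, a format Unity/Unreal/CAD tools can read.
pcd.estimate_normals()
mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
o3d.io.write_triangle_mesh("scene.obj", mesh)
```

The exported OBJ is exactly the kind of asset a game engine or CAD tool can open for further editing.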

I believe Sora does not currently generate videos by first building 3D models, at least not within its model architecture. However, these reconstruction results clearly demonstrate the "3D consistency" described in Sora's technical report. I used the strictest, pixel-level photogrammetry pipeline: spatially inconsistent objects simply cannot be reconstructed. Although I was previously inclined to deny it, these results leave me no grounds to reject the "Sora understands the world" hypothesis; if anything, they confirm it.
Given time, a generator like Sora combined with appropriate prompts and 3D reconstruction models could build the entire world, provided there is enough computing power.
These 3D models can be imported directly into game engines, which are already widely used in film effects, digital twins, and other fields. If Sora is seen only as text-to-video, it might not replace related jobs in the short term, but if it can generate 3D like this, the replacement of those roles is imminent.
What's missing? Data. Sora is a man-made generator, or simulator, and we want it to generate what the human eye sees. One of the Apple Vision Pro's contributions is telling us roughly where that boundary lies: about 8K resolution. Reaching 8K is not just a matter of training on higher-resolution images; it requires many more high-quality images that capture fine detail and the relationship between the whole and its parts, which will in turn challenge the algorithms. Furthermore, it seems that only more data and larger models can make generated objects and motion follow physical laws more closely. In this sense, the value of the existing mass of data may be declining, while systematic collection or synthesis of data is becoming far more important. Everyone has a chance, but the required investment is staggering: at this stage, every doubling of capability might require a ten- or hundred-fold increase in investment.
I used to think of myself as fairly neutral, believing in the magic of these models while keeping my carbon-based self-esteem intact. Now I'm left with mixed feelings. When I told a friend about this finding, they said, "Maybe what the model sees is more than just three dimensions?"
Maybe?