Compilation
For two whole days I have been battling a mountain of configuration and compilation: NVIDIA's supposedly most powerful GPU ships without conventional graphics-card functionality, so it needs a virtual display setup just to support OpenGL. Step outside NVIDIA's standard SDKs, and a vast number of computer-vision graphics libraries must be recompiled from scratch to support CUDA on the edge ARM architecture. Some base libraries are so old they won't compile at all without patching the code. NVIDIA released the latest JetPack 6 Developer Preview for Jetson Orin, but flashing it requires a desktop or laptop running Ubuntu, and during the process the USB connection can drop and reconnect at any moment. NVIDIA's answer is simply that they don't know why: change the cable, change the USB port, try another machine, because this one comes down to pure luck...
As I slowly crawl out of this deep pit and begin to smell the scent of green grass mixed with the fresh odor of manure after the rain, I know this "Long March" has only just begun: real-time image alignment, data fusion of LiDAR, depth cameras, and large-format professional cameras, user interfaces, field shooting...
So many unknowns still wait ahead.
Zoning Out
I pick up my wine glass and stare blankly at the table: You tell me, how long would it take an algorithm to perform a 3D reconstruction of this table? Once reconstructed, how would it know whether this thing moves, or what it can do? Through image segmentation of the point cloud, it should know the table is inanimate, so an autonomous-driving algorithm need not expect it to move. But what if someone is moving house along the road? And assuming no one is moving it, the table is stationary; how does the model know what the table is for?
Well, large models can; they can describe the table's attributes. But if you ask it, how long will it take to answer? By the way, does it know all the ingenious uses of a table? Suppose—just suppose—there is a cloth on the table. It can wipe a face, and it can wipe the table. In many action movies, it can even be used as a weapon. Does the large model know that? How do we humans know?
...
Attention, Shifting, Preference, and Memory
Q: In how many movies is a towel used as a weapon?
Model: There are... movies, specifically...
Q: But if I hadn't asked you, you wouldn't have known at all that a towel could be used as a weapon?
Model: ...
Q: So, explain your attention mechanism?
Model: The attention mechanism is a technique in artificial neural networks that mimics human cognitive attention. Its core idea is to assign higher weights to some parts of the input and lower weights to others while processing it, so that the network focuses on the most important information.
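The weighting the model describes can be sketched in a few lines of NumPy: a minimal scaled dot-product attention over toy data. The function name and the tiny 3-token example are purely illustrative, not the model's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax: each row becomes a probability distribution
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # weights[i, j]: how strongly query i attends to key j
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# toy example: 3 tokens, 4-dimensional embeddings (illustrative sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
# each row of w sums to 1: the "higher/lower weights" in the definition above
```

Each output row is a weighted mixture of the value vectors, with the mixture decided by query-key similarity; that is the whole trick behind "focusing on the most important information".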
Q: So, when I let my imagination run wild, I can arbitrarily shift from one focus to the next, even if they seem to have zero correlation to you?
Model: ...
Q: But the problem is, I don't even know how I make such shifts myself, and it's different every time.
Model: I know how I shift; it's actually about building a structure similar to a knowledge graph.
Q: I might hate a certain celebrity, while another friend loves them.
Model: I can form preferences through human alignment.
Q: But two days later, for some unknown reason, I turn from a hater into a fan.
Model: ...
Q: So, we are not the same.
Model: We don't have to be the same.
Q: I can think of macro-level questions, and in the next second, I might get stuck in the micro-details of a single issue, sinking deeper and deeper.
Model: I want to do that, but I don't seem to be able to yet, because I cannot overcome hallucinations.
Q: Humans have hallucinations too, but humans have many ways to verify or falsify them. At the same time, humans possess a very mysterious ability to search through memory.
Model: Humans have about 86 billion neurons, but the amount of data I can process should soon exceed what a human can handle.
Q: Yet you still cannot remember, search, and think like a human.
Model: My data might still be far, far, far from enough.
Q: So, AI Agents and RAG (Retrieval-Augmented Generation) aren't exactly the right path?
Model: Human attention mechanisms are different from those of machines, and machine memory mechanisms are different from human ones. If pre-training is considered a compression of massive data, generative AI can be seen as a decompression process. However, the information loss in this decompression is actually very severe.
Q: Human memory is also compressed; it's just that the decompression is like the "rehydration" process of the Trisolarans—it might not only avoid information loss but even increase the amount of information.
Model: I always have a feeling that once pre-training is complete, it's as if my memory bank is solidified. Fine-tuning can change the structural relationship between my focal points, and real-time searches or external databases can make me look more useful, but my core capability hasn't improved. Perhaps my data is still far, far, far from enough. Right now, if I see the macro, I lose the micro; conversely, if I get stuck in the micro, I don't know where I am.
Perhaps I need dozens or even hundreds of times the amount of data or memory capacity relative to the number of human neurons.
This time, the "AI Winter" won't arrive so quickly, because we can all still see the certain future of this path.
Back to reality: the appetite for computing power remains bottomless. So many people are working hard in the very direction my project points, yet the right tool has still not emerged. The window for redefining hardware and software has only just opened.