From Photogrammetry to Gaussian Splatting: A DIY Guide to Top-Tier 3D Imaging Systems
While following the latest progress in real-time NeRF (Neural Radiance Fields) and NeRF model merging, I came across a promotional video for a project that had been in development for nearly a decade.
For an architectural photographer, a project like this is impossible to resist. It was also the first project I encountered when I started down the "rabbit hole" of 3D reconstruction over a year ago: a 2009 paper titled "Building Rome in a Day."
Through that paper, I began studying photogrammetry, picked up my first open-source tool, Colmap, and came to understand SIFT and SfM. The algorithms and tools I'll mention later (various NeRFs, Gaussian Splatting, commercial software such as Metashape, RealityCapture, and Pix4D, and mobile apps like PolyCam and Kiri Engine) came much later.
That paper also showed me another possibility for photography: digitizing the world around us, whether you call it the Metaverse or Digital Twins. Over the past year, I've encountered many pitfalls and will surely encounter more. However, the paths—hardware, algorithms, and workflows—have gradually become clear. With the new wave of spatial computing brought by Apple Vision Pro, I hope to systematically share my practical lessons through this article.
The Origins of 3D Reconstruction: Photogrammetry
Since the invention of the camera, photographers and engineers have wondered whether a photo could be used to measure an object's dimensions, for example determining the size of a building from a photograph. Many issues had to be overcome first, such as scale, reference points, perspective, and distortion. Even so, photography offered an unprecedented method: using focal length, image distance, and film size to calculate real-world dimensions geometrically. Accuracy was low by modern standards, but it was a significant leap for its time.
The term "Photogrammetry" first appeared in an article titled Die Photometrographie by the Prussian architect Albrecht Meydenbauer. In 1868, he climbed to the top of the Rotes Rathaus in Berlin to shoot a 360-degree series of photographs, among the earliest panoramas. Thanks to digital progress, smartphones can now generate panoramas effortlessly.
To further improve accuracy, one must build a 3D model. For engineering students familiar with mechanical drawing, the simplest representation is the three-view drawing (front, top, side). Digital technology and algorithmic progress have made building such models far easier. Modern photogrammetry rests on two pillars:
- SfM (Structure from Motion): The principle that, given two or more photos of the same object from different angles, its 3D spatial structure can be recovered by triangulating the views through linear transformations.
- SIFT (Scale-Invariant Feature Transform): An algorithm that detects and describes local features so that the same feature can still be recognized when a photo is scaled or rotated.
Combining the two: first, SIFT performs feature extraction and matching; then, SfM recovers the 3D structure. The key is identifying the same features across photos taken from different angles and computing the relative position and orientation of each imaging plane, known as the Camera Pose.
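To make this concrete, here is a minimal two-view sketch using OpenCV: SIFT extracts and matches features, and the essential matrix then yields the relative camera pose. The filenames and camera intrinsics are placeholder assumptions, not values from a real calibration.

```python
import cv2
import numpy as np

# Two photos of the same object from different angles (placeholder files).
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# Step 1, SIFT: extract scale- and rotation-invariant features.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Step 2, matching: keep only unambiguous matches (Lowe's ratio test).
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Step 3, SfM: estimate the essential matrix, then recover the relative
# camera pose (rotation R and translation direction t, up to scale).
K = np.array([[1000.0, 0, img1.shape[1] / 2],   # assumed intrinsics
              [0, 1000.0, img1.shape[0] / 2],
              [0, 0, 1]])
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print("Relative rotation:\n", R, "\nTranslation direction:\n", t.ravel())
```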
Software like Colmap or commercial tools like Metashape and RealityCapture excel at this. Obtaining camera pose is the foundation for almost all 3D reconstruction algorithms, including NeRF and Gaussian Splatting. While some recent papers use neural networks to predict poses, their precision does not yet match Colmap or commercial software.
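In practice, Colmap automates this whole pipeline. A minimal sketch using its pycolmap Python bindings might look like the following; the paths are placeholders, and option names can vary between pycolmap versions.

```python
import pathlib
import pycolmap

image_dir = "images"                  # folder of input photos (placeholder)
output_dir = "colmap_out"
pathlib.Path(output_dir).mkdir(exist_ok=True)
database = output_dir + "/database.db"

pycolmap.extract_features(database, image_dir)   # SIFT features per photo
pycolmap.match_exhaustive(database)              # match every image pair
maps = pycolmap.incremental_mapping(database, image_dir, output_dir)  # SfM

# Each reconstruction holds the recovered camera poses and sparse points.
for idx, rec in maps.items():
    print(idx, rec.summary())
```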
Practically speaking, two parameters heavily impact both quality and processing time: the number of features extracted per photo and the photo resolution. Raising either improves results but demands far more computing power. In the 3D era, computing power and storage are the ultimate truths.
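As an illustration of those two knobs, OpenCV's SIFT exposes a cap on feature count, and resizing photos before extraction trades detail for speed; Colmap and the commercial packages expose equivalent settings.

```python
import cv2

img = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder input

# Knob 1: features per photo. A higher cap improves matching but costs
# time and storage.
for cap in (2000, 8000):
    sift = cv2.SIFT_create(nfeatures=cap)
    print(f"cap={cap}: extracted {len(sift.detect(img, None))} keypoints")

# Knob 2: resolution. Downscaling speeds everything up but discards detail.
half = cv2.resize(img, None, fx=0.5, fy=0.5)
print(f"half resolution: {len(sift.detect(half, None))} keypoints")
```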
Point Clouds (Sparse Models)
Once camera positions are calculated, reconstruction begins. In SfM, a feature on an object projects onto each image plane; inverting that projection, a feature matched across photos can be located back in 3D space. By requiring a feature to appear in at least three photos before mapping it to 3D, we can outline an object's structure. These discrete points are called a "Sparse Model" or a Point Cloud.
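A minimal sketch of that inverse mapping for a single feature seen from two known camera poses; the poses and pixel coordinates below are made-up values standing in for real SfM output.

```python
import cv2
import numpy as np

K = np.array([[1000.0, 0, 960], [0, 1000.0, 540], [0, 0, 1]])

# Projection matrices P = K [R | t] for two cameras.
P1 = (K @ np.hstack([np.eye(3), np.zeros((3, 1))])).astype(np.float32)
R = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]))[0]       # small rotation
t = np.array([[-0.5], [0.0], [0.0]])                  # camera baseline
P2 = (K @ np.hstack([R, t])).astype(np.float32)

# The same feature observed in both photos (pixel coordinates).
pt1 = np.array([[1020.0], [560.0]], dtype=np.float32)
pt2 = np.array([[880.0], [555.0]], dtype=np.float32)

X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)   # homogeneous 4-vector
X = (X_h[:3] / X_h[3]).ravel()                  # back to Euclidean 3D
print("Triangulated point:", X)
```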
The point cloud is the most critical node in reconstruction; its quality depends on equipment, workflow, and algorithms. A 35-50mm lens is recommended as a balance between distortion and information. Key tips: use a tripod for sharpness, keep more than 60% overlap between adjacent photos, move the camera's position rather than just rotating it, and avoid extracting frames from video because of rolling-shutter distortion.
3D Reconstruction and Mesh Rendering
While a point cloud allows for measurement, creating a continuous, realistic 3D space requires algorithms like MVS (Multi-View Stereo), NeRF, or 3DGS (3D Gaussian Splatting).
Under the MVS framework, the first step is generating a Depth Map for each photo, where grayscale represents distance from the camera. Fusing depth maps from different angles results in a higher-quality continuous space.
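The fusion step boils down to back-projecting each depth map into 3D using the camera intrinsics and merging the results. A minimal sketch with assumed intrinsics and a synthetic depth map:

```python
import numpy as np

H, W = 480, 640
fx = fy = 500.0               # assumed focal lengths in pixels
cx, cy = W / 2, H / 2         # assumed principal point

depth = np.full((H, W), 2.0)  # stand-in depth map: everything 2 m away

# Pixel grid -> rays through the camera center -> 3D points scaled by depth.
u, v = np.meshgrid(np.arange(W), np.arange(H))
x = (u - cx) / fx * depth
y = (v - cy) / fy * depth
points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
print(points.shape)  # (307200, 3): one 3D point per pixel
```

Transforming each view's points into a common world frame and merging them yields the higher-quality continuous space.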
The next step is Texture Mapping, in which the original photos are projected onto the 3D structure. Because pixels are discrete samples, however, gaps remain. The classic solution is Mesh Rendering, which connects neighboring vertices into a surface of triangles. Alternatively, the newer 3D Gaussian Splatting (3DGS) replaces each point with a 3D Gaussian and "splats" these soft blobs onto the screen, a rendering that can resemble an ink wash painting, hence the term "splatting."
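As an example of the meshing step, Open3D's Poisson surface reconstruction turns a fused point cloud into a triangle mesh. A minimal sketch; the input file is a placeholder for a dense MVS point cloud.

```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("dense_points.ply")  # fused point cloud
pcd.estimate_normals()                             # Poisson requires normals

# depth controls the octree resolution: higher means finer detail,
# at the cost of memory and time.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=9)
o3d.io.write_triangle_mesh("mesh.ply", mesh)
```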
Supplement: LiDAR, IMU, and Fusion Solutions
For real-time applications like autonomous driving or robotics, traditional photogrammetry is too slow. LiDAR (Light Detection and Ranging) is used instead to directly generate distance data and point clouds. However, LiDAR lacks visual color information, leading to "Fusion Solutions."
These solutions combine fixed, calibrated cameras with IMUs (Inertial Measurement Units) to capture device movement. By fusing high-frequency IMU data and camera streams with LiDAR data, systems can achieve real-time 3D environment construction.
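The most basic fusion operation is projecting LiDAR points into a calibrated camera image so that each point picks up a color. A minimal sketch with assumed calibration and synthetic inputs:

```python
import numpy as np

K = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])  # intrinsics
R = np.eye(3)                  # LiDAR-to-camera rotation (assumed aligned)
t = np.array([0.0, 0.0, 0.1])  # LiDAR-to-camera translation, metres

points = np.random.rand(1000, 3) * [10, 10, 20]   # stand-in LiDAR points
image = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in camera frame

cam = points @ R.T + t          # transform points into the camera frame
front = cam[:, 2] > 0           # keep points in front of the camera
uvw = cam[front] @ K.T
uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)

# Keep projections landing inside the image, then sample their colors.
inside = (uv[:, 0] >= 0) & (uv[:, 0] < 1280) & (uv[:, 1] >= 0) & (uv[:, 1] < 720)
colors = image[uv[inside, 1], uv[inside, 0]]
print(f"Colorized {inside.sum()} of {len(points)} LiDAR points")
```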
Apple Vision Pro's popularity has sparked a "Spatial Computing" craze. The device is essentially a highly integrated 3D capture and computing system. By pairing the TrueDepth camera and LiDAR (time-of-flight) scanner with the ARKit SDK, Apple has significantly lowered the barrier to spatial modeling. Whether for smart cars or MR headsets, the core challenge remains the same: efficiently reconstructing our world from visual and sensor data.