Benchmark Results - E3D-Bench

E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models

1The University of Texas at Austin 2Brown University
3University of Central Florida 4NVIDIA Research 5Stanford University

Abstract

Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass and eliminating the need for precomputed camera parameters, which are often slow to obtain or simply unavailable. Since late 2023, the field has exploded with diverse variants. With the rapid proliferation of 3D GFMs, we ask:

Q1 Can GFMs serve as an effective and robust foundation for diverse 3D tasks and scenarios?
Q2 Can GFMs serve as an efficient foundation, especially for latency-constrained 3D applications?
In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks (sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, and novel view synthesis) and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial AI.

Effectiveness

3D Reconstruction. Each cell reports ACC ↓ / Comp ↓ / NC ↑. DTU is object-centric; 7-Scenes, NRGBD, ScanNet, and TUM-RGBD are indoor scenes.

Extremely sparse views:

| Method | DTU | 7-Scenes | NRGBD | ScanNet | TUM-RGBD |
| --- | --- | --- | --- | --- | --- |
| DUSt3R/LSM | 1.731 / 1.936 / 0.786 | 0.146 / 0.181 / 0.744 | 0.144 / 0.154 / 0.867 | 0.474 / 0.420 / 0.714 | 1.108 / 0.746 / 0.724 |
| MASt3R | 1.895 / 2.003 / 0.788 | 0.262 / 0.254 / 0.732 | 0.113 / 0.102 / 0.810 | 0.467 / 0.389 / 0.701 | 0.738 / 0.747 / 0.739 |
| Spann3R | 6.275 / 5.460 / 0.705 | 0.255 / 0.188 / 0.653 | 0.262 / 0.262 / 0.628 | 0.487 / 0.408 / 0.617 | 1.561 / 1.002 / 0.621 |
| FLARE | 3.406 / 3.950 / 0.491 | 0.152 / 0.154 / 0.704 | 0.060 / 0.056 / 0.839 | 0.357 / 0.302 / 0.561 | 0.515 / 0.486 / 0.677 |
| CUT3R | 6.885 / 5.022 / 0.727 | 0.118 / 0.142 / 0.717 | 0.104 / 0.078 / 0.828 | 0.260 / 0.238 / 0.692 | 0.587 / 0.553 / 0.683 |
| VGGT | 2.716 / 2.301 / 0.765 | 0.077 / 0.080 / 0.762 | 0.069 / 0.071 / 0.903 | 0.063 / 0.079 / 0.798 | 0.385 / 0.331 / 0.747 |
| Fast3R | 4.493 / 3.681 / 0.735 | 0.149 / 0.116 / 0.692 | 0.361 / 0.201 / 0.782 | 0.546 / 0.306 / 0.621 | 0.955 / 0.630 / 0.627 |
| MonST3R | 20.145 / 10.322 / 0.603 | 0.276 / 0.277 / 0.677 | 0.471 / 0.458 / 0.659 | 0.623 / 0.541 / 0.594 | 1.688 / 1.031 / 0.670 |

Dense views:

| Method | DTU | 7-Scenes | NRGBD | ScanNet | TUM-RGBD |
| --- | --- | --- | --- | --- | --- |
| DUSt3R/LSM | 1.284 / 1.349 / 0.720 | 0.022 / 0.029 / 0.709 | 0.035 / 0.024 / 0.838 | 0.026 / 0.022 / 0.784 | 0.620 / 0.474 / 0.718 |
| MASt3R | 1.374 / 1.409 / 0.723 | 0.025 / 0.028 / 0.697 | 0.043 / 0.042 / 0.809 | 0.035 / 0.020 / 0.757 | 0.209 / 0.211 / 0.708 |
| Spann3R | 6.505 / 3.110 / 0.668 | 0.176 / 0.087 / 0.599 | 0.343 / 0.073 / 0.661 | 0.262 / 0.118 / 0.606 | 0.635 / 0.930 / 0.662 |
| CUT3R | 4.710 / 2.413 / 0.699 | 0.025 / 0.028 / 0.665 | 0.076 / 0.029 / 0.782 | 0.042 / 0.030 / 0.693 | 0.740 / 0.595 / 0.665 |
| VGGT | 2.103 / 1.925 / 0.748 | 0.019 / 0.032 / 0.659 | 0.015 / 0.012 / 0.874 | 0.016 / 0.017 / 0.728 | 0.065 / 0.091 / 0.692 |
| Fast3R | 3.647 / 2.319 / 0.725 | 0.046 / 0.057 / 0.636 | 0.059 / 0.028 / 0.772 | 0.200 / 0.097 / 0.625 | 0.711 / 0.337 / 0.610 |
| MonST3R | 14.455 / 7.508 / 0.636 | 0.100 / 0.091 / 0.648 | 0.336 / 0.246 / 0.665 | 0.346 / 0.293 / 0.599 | 1.138 / 0.948 / 0.591 |
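For reference, a minimal sketch of how reconstruction metrics of this kind are typically computed: accuracy (ACC) is the mean distance from predicted points to their nearest ground-truth neighbors, completeness (Comp) is the reverse direction, and normal consistency (NC) averages the absolute cosine similarity between matched normals. Function names, the KD-tree implementation, and the bidirectional NC averaging are our assumptions; the benchmark's exact protocol (alignment, point sampling, thresholds) may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def reconstruction_metrics(pred_pts, gt_pts, pred_normals=None, gt_normals=None):
    """Illustrative ACC / Comp / NC between predicted (N,3) and ground-truth (M,3) points."""
    tree_gt, tree_pred = cKDTree(gt_pts), cKDTree(pred_pts)

    d_acc, nn_gt = tree_gt.query(pred_pts)      # accuracy: pred point -> nearest GT point
    d_comp, nn_pred = tree_pred.query(gt_pts)   # completeness: GT point -> nearest pred point
    metrics = {"ACC": d_acc.mean(), "Comp": d_comp.mean()}

    if pred_normals is not None and gt_normals is not None:
        # Normal consistency: absolute cosine similarity between matched normals,
        # averaged over both matching directions (an assumption of this sketch).
        nc_pred = np.abs(np.sum(pred_normals * gt_normals[nn_gt], axis=-1)).mean()
        nc_gt = np.abs(np.sum(gt_normals * pred_normals[nn_pred], axis=-1)).mean()
        metrics["NC"] = 0.5 * (nc_pred + nc_gt)
    return metrics
```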
Multi-View Pose Estimation. Each cell reports ATE ↓ / RPEtrans ↓ / RPErot ↓ (ULTRRA: RPEtrans ↓ / RPErot ↓ only). Domains: CO3Dv2 (in distribution), ScanNet & ADT & TUM-Dyn. (long sequence), KITTI Odometry (street driving), Bonn & Sintel & Rel10k (indoor-outdoor), ACID & Syndrone (drone), ULTRRA (air-ground). A dash marks settings that were not evaluated.

| Method | CO3Dv2 | ScanNet & ADT & TUM-Dyn. | KITTI Odometry | Bonn & Sintel & Rel10k | ACID & Syndrone | ULTRRA |
| --- | --- | --- | --- | --- | --- | --- |
| DUSt3R/LSM | 0.903 / 1.325 / 4.312 | 0.139 / 0.102 / 2.394 | 2.935 / 1.135 / 2.832 | 0.141 / 0.641 / 8.038 | 0.370 / 0.607 / 4.099 | 70.350 / 70.390 |
| MASt3R | 0.987 / 1.407 / 3.999 | 0.131 / 0.098 / 2.889 | 1.492 / 0.399 / 0.407 | 0.127 / 0.642 / 7.714 | 0.372 / 0.607 / 3.849 | 71.519 / 78.036 |
| Spann3R | 0.915 / 1.295 / 6.352 | 0.294 / 0.164 / 3.778 | 15.848 / 5.031 / 4.645 | 0.140 / 0.633 / 7.817 | 0.351 / 0.599 / 3.272 | 40.503 / 38.366 |
| CUT3R | 0.847 / 1.209 / 6.361 | 0.185 / 0.133 / 4.471 | 2.421 / 0.747 / 0.669 | 0.109 / 0.633 / 7.569 | 0.303 / 0.593 / 2.864 | 55.135 / 54.395 |
| VGGT | 1.639 / 8.702 / 71.350 | 0.654 / 0.425 / 30.787 | 5.012 / 3.546 / 3.885 | 0.062 / 0.111 / 0.592 | 0.300 / 0.462 / 0.818 | 53.688 / 110.521 |
| Fast3R | 0.698 / 1.035 / 4.352 | 0.499 / 0.391 / 23.739 | 22.109 / 7.573 / 7.366 | 0.136 / 0.636 / 8.700 | 0.378 / 0.637 / 3.653 | 51.149 / 54.150 |
| MonST3R | 2.456 / 3.327 / 23.458 | 0.448 / 0.286 / 12.817 | 2.426 / 0.782 / 0.949 | 0.118 / 0.632 / 6.666 | 0.320 / 0.568 / 2.167 | 70.388 / 77.325 |
| Align3R | 1.027 / 1.550 / 6.499 | 0.425 / 0.215 / 9.430 | 4.611 / 0.817 / 0.600 | 0.134 / 0.628 / 6.810 | 0.378 / 0.550 / 2.414 | 72.010 / 70.638 |
| Easi3R | 0.857 / 1.271 / 5.052 | 0.174 / 0.103 / 2.872 | 3.625 / 0.919 / 0.615 | 0.125 / 0.633 / 7.603 | 0.356 / 0.581 / 3.508 | 62.061 / 71.060 |
| Geo4D | 0.798 / 1.264 / 5.692 | 0.436 / 0.175 / 10.565 | 1.662 / 0.497 / 0.696 | 0.151 / 0.457 / 2.652 | 0.391 / 0.622 / 0.964 | – / – |
| Aether | – / – / – | 0.067 / 0.033 / 1.619 | 1.553 / 0.744 / 0.744 | – / – / – | – / – / – | – / – |
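The pose metrics above can be reproduced, up to protocol details, in a few lines of NumPy: ATE is the RMSE of camera positions after aligning the predicted trajectory to ground truth (here with a Umeyama similarity alignment, an assumption on our part), and RPE measures errors in the relative motion between frames. The sketch below gives ATE and a simplified, translation-only RPE; it is illustrative rather than the benchmark's exact evaluation code, which also reports a rotational RPE from full poses.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Similarity transform (s, R, t) that maps src (N,3) onto dst (N,3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # fix reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate_rmse(pred_pos, gt_pos):
    """ATE: RMSE of camera positions after similarity alignment."""
    s, R, t = umeyama_alignment(pred_pos, gt_pos)
    aligned = (s * (R @ pred_pos.T)).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt_pos) ** 2, axis=-1)))

def rpe_trans(pred_pos, gt_pos, delta=1):
    """Simplified, translation-only RPE over frame pairs (i, i + delta)."""
    d_pred = pred_pos[delta:] - pred_pos[:-delta]
    d_gt = gt_pos[delta:] - gt_pos[:-delta]
    return np.sqrt(np.mean(np.sum((d_pred - d_gt) ** 2, axis=-1)))
```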
Sparse-View Depth Estimation. Each cell reports AbsRel ↓ / δ<1.03 (%) ↑. "Normalized" rows align predicted depth to the ground-truth scale before scoring; "Metric" rows evaluate raw metric-scale predictions. Domains: DTU (object-centric), ScanNet (indoor scene), KITTI (outdoor scene), ETH3D and T&T (mixed scenes).

Normalized:

| Method | DTU | ScanNet | KITTI | ETH3D | T&T |
| --- | --- | --- | --- | --- | --- |
| Robust MVD | 2.490 / 80.056 | 7.468 / 35.651 | 9.419 / 30.505 | 9.302 / 42.909 | 6.379 / 58.409 |
| DUSt3R/LSM | 2.741 / 75.685 | 4.732 / 61.337 | 9.113 / 39.495 | 3.132 / 74.851 | 3.106 / 77.033 |
| MASt3R | 3.343 / 68.301 | 5.949 / 54.516 | 9.542 / 46.805 | 2.471 / 81.291 | 2.381 / 82.262 |
| Spann3R | 6.431 / 38.339 | 7.779 / 33.713 | 10.195 / 30.858 | 5.121 / 54.708 | 5.580 / 52.812 |
| CUT3R | 6.200 / 47.421 | 8.231 / 39.464 | 23.849 / 12.087 | 5.224 / 59.864 | 4.594 / 56.773 |
| VGGT | 1.085 / 94.305 | 4.386 / 64.968 | 9.436 / 41.309 | 1.782 / 86.337 | 2.075 / 85.174 |
| Fast3R | 3.940 / 62.120 | 6.271 / 50.283 | 13.390 / 26.734 | 4.692 / 62.663 | 4.423 / 64.873 |
| MonST3R | 5.346 / 67.977 | 5.557 / 53.309 | 10.191 / 40.274 | 3.368 / 72.624 | 3.289 / 72.491 |

Metric:

| Method | DTU | ScanNet | KITTI | ETH3D | T&T |
| --- | --- | --- | --- | --- | --- |
| Robust MVD | 2.242 / 84.574 | 8.016 / 35.924 | 10.846 / 25.534 | 10.944 / 35.526 | 6.982 / 60.643 |
| MASt3R | 84.904 / 0.000 | 93.584 / 0.000 | 99.069 / 0.000 | 97.021 / 0.000 | 98.234 / 0.000 |
| CUT3R | 84.904 / 0.000 | 93.584 / 0.000 | 99.069 / 0.000 | 97.022 / 0.000 | 98.234 / 0.000 |
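A minimal sketch of these depth metrics also illustrates the Normalized vs. Metric split: the Normalized rows rescale predictions to the ground-truth depth before scoring (median scaling is our assumption; the benchmark may use a least-squares alignment), while the Metric rows score raw predictions, which is why scale-ambiguous models collapse to near-zero inlier ratios there. The same function applies to the video depth results below with delta_thresh=1.25.

```python
import numpy as np

def depth_metrics(pred, gt, align_scale=True, delta_thresh=1.03):
    """AbsRel and delta-inlier ratio over valid ground-truth pixels.

    align_scale=True mimics the 'Normalized' rows (median scale alignment, our
    assumption); align_scale=False evaluates raw metric-scale predictions.
    """
    mask = (gt > 0) & np.isfinite(pred) & (pred > 0)
    pred, gt = pred[mask], gt[mask]
    if align_scale:
        pred = pred * np.median(gt) / np.median(pred)
    abs_rel = np.mean(np.abs(pred - gt) / gt)                 # AbsRel
    ratio = np.maximum(pred / gt, gt / pred)
    inliers = np.mean(ratio < delta_thresh)                   # delta < threshold
    return {"AbsRel": abs_rel, "delta_inlier": inliers}
```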
Video Depth Estimation. Each cell reports AbsRel ↓ / δ<1.25 (%) ↑. Domains: Bonn and TUM Dyn (indoor scene), KITTI (outdoor scene), PointOdyssey (large dynamic motion), Syndrone (drone scene), Sintel (mixed scene).

Normalized:

| Method | Bonn | TUM Dyn | KITTI | PointOdyssey | Syndrone | Sintel |
| --- | --- | --- | --- | --- | --- | --- |
| DepthAnyVideo | 0.515 / 25.3 | 0.184 / 84.6 | 0.074 / 95.3 | 0.417 / 61.7 | 0.299 / 83.1 | 0.455 / 47.9 |
| VideoDepthAnything | 0.268 / 48.3 | 1.101 / 89.0 | 0.060 / 98.2 | 0.283 / 70.3 | 0.138 / 92.5 | 1.691 / 45.4 |
| DepthCrafter | 0.107 / 88.3 | 0.159 / 79.5 | 0.120 / 86.2 | 0.144 / 81.3 | 0.380 / 87.5 | 0.354 / 58.2 |
| Marigold | 0.329 / 52.2 | 0.600 / 32.8 | 0.332 / 43.3 | 0.346 / 47.5 | 1.331 / 16.8 | 0.417 / 45.4 |
| DUSt3R/LSM | 0.174 / 83.5 | 0.187 / 79.2 | 0.124 / 84.9 | 0.168 / 77.8 | 0.063 / 96.9 | 0.475 / 59.1 |
| MASt3R | 0.160 / 81.5 | 0.162 / 83.1 | 0.082 / 93.2 | 0.150 / 79.3 | 0.046 / 97.5 | 0.374 / 63.9 |
| Spann3R | 0.205 / 77.4 | 0.204 / 70.6 | 0.449 / 49.1 | 0.303 / 58.4 | 0.241 / 74.5 | 0.587 / 43.3 |
| CUT3R | 0.068 / 95.0 | 0.108 / 84.7 | 0.104 / 89.9 | 0.095 / 88.4 | 0.111 / 89.5 | 0.466 / 56.0 |
| VGGT | 0.056 / 96.3 | 0.068 / 93.9 | 0.051 / 96.6 | 0.026 / 99.0 | 0.075 / 95.9 | 0.242 / 65.9 |
| Fast3R | 0.232 / 69.4 | 0.221 / 71.1 | 0.308 / 46.8 | 0.271 / 66.2 | 0.368 / 44.8 | 0.565 / 48.7 |
| MonST3R | 0.061 / 95.4 | 0.197 / 72.6 | 0.083 / 93.4 | 0.066 / 92.3 | 0.110 / 89.7 | 0.343 / 59.4 |
| Align3R | 0.062 / 96.8 | 0.107 / 90.1 | 0.105 / 89.2 | 0.077 / 93.3 | 0.097 / 92.9 | 0.237 / 69.0 |
| Easi3R | 0.061 / 95.8 | 0.192 / 76.9 | 0.150 / 76.2 | 0.143 / 82.1 | 0.095 / 94.0 | 0.323 / 53.9 |
| Geo4D | 0.060 / 97.8 | 0.096 / 93.2 | 0.086 / 93.8 | 0.082 / 93.0 | 0.105 / 93.1 | 0.205 / 73.2 |
| Aether | 0.582 / 61.2 | 0.192 / 80.6 | 0.065 / 96.2 | 0.123 / 87.9 | 0.145 / 91.1 | 0.343 / 69.4 |
| GeometryCrafter | 0.061 / 96.8 | 0.115 / 87.7 | 0.410 / 53.8 | 0.124 / 83.6 | 0.123 / 90.8 | 0.280 / 72.4 |

Metric:

| Method | Bonn | TUM Dyn | KITTI | PointOdyssey | Syndrone | Sintel |
| --- | --- | --- | --- | --- | --- | --- |
| MASt3R | 0.549 / 4.6 | 0.633 / 0.9 | 0.754 / 6.4 | 0.749 / 0.2 | 0.967 / 0.0 | 0.701 / 2.3 |
| CUT3R | 0.097 / 90.3 | 0.135 / 80.6 | 0.118 / 87.4 | 0.127 / 88.1 | 0.824 / 0.0 | 1.020 / 23.6 |
Novel View Synthesis. Each cell reports PSNR ↑ / SSIM ↑ / LPIPS ↓. Domains: DTU (object-centric), RealEstate10k and ScanNet++ (indoor scenes), ACID (drone scenes).

| Method | DTU | RealEstate10k | ScanNet++ | ACID |
| --- | --- | --- | --- | --- |
| LSM | 11.68 / 0.3294 / 0.5218 | 14.04 / 0.4388 / 0.4873 | 12.39 / 0.4596 / 0.5479 | 16.73 / 0.4562 / 0.4567 |
| NoPoSplat | 17.91 / 0.6306 / 0.2810 | 24.53 / 0.8450 / 0.1634 | 22.15 / 0.7988 / 0.2359 | 25.35 / 0.7774 / 0.1875 |
| FLARE | 17.01 / 0.5672 / 0.2901 | 22.15 / 0.7126 / 0.2363 | 23.19 / 0.8117 / 0.2201 | 22.44 / 0.6229 / 0.2818 |
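The novel-view-synthesis metrics can be computed with standard libraries; a hedged sketch is below. Which LPIPS backbone the benchmark uses is an assumption on our part, and a recent scikit-image is assumed for the channel_axis argument.

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="vgg")  # backbone choice ("vgg" vs. "alex") is an assumption

def nvs_metrics(pred, gt):
    """PSNR / SSIM / LPIPS for one rendered view; pred and gt are HxWx3 floats in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2.0 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```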

Inference Efficiency

Inference time and peak GPU memory versus number of input views. Each cell reports time (s, mean ± std) ↓ / GPU memory (GB) ↓. OOM = out of memory.

| Method | 2 views | 4 views | 8 views | 16 views | 32 views | 64 views | 128 views | 256 views |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DUSt3R | 0.35 ± 0.19 / 2.49 | 6.00 ± 0.30 / 2.6 | 13.96 ± 0.86 / 3.65 | 50.37 ± 2.28 / 8.38 | 196.81 ± 6.38 / 27.52 | OOM | OOM | OOM |
| MASt3R | 9.43 ± 0.28 / 2.61 | 14.63 ± 0.52 / 2.68 | 21.38 ± 2.26 / 2.78 | 42.28 ± 9.06 / 3.35 | 117.77 ± 40.83 / 6.87 | 392.23 ± 184.36 / 28.78 | OOM | OOM |
| Spann3R | 0.16 ± 0.12 / 2.79 | 0.28 ± 0.01 / 2.8 | 0.65 ± 0.00 / 2.81 | 1.38 ± 0.01 / 2.84 | 2.81 ± 0.07 / 2.89 | 5.51 ± 0.03 / 2.99 | 11.25 ± 0.16 / 3.19 | 23.64 ± 0.70 / 3.55 |
| CUT3R | 0.19 ± 0.07 / 3.33 | 0.26 ± 0.04 / 3.38 | 0.42 ± 0.03 / 3.48 | 0.78 ± 0.03 / 3.65 | 1.50 ± 0.03 / 4.28 | 3.12 ± 0.31 / 5.54 | 5.76 ± 0.12 / 11.68 | 11.65 ± 0.16 / 17.36 |
| VGGT | 0.32 ± 0.41 / 7.11 | 0.29 ± 0.40 / 7.72 | 0.24 ± 0.01 / 9.06 | 0.72 ± 0.49 / 10.29 | 2.35 ± 0.04 / 12.75 | 4.23 ± 0.07 / 17.66 | 11.76 ± 0.41 / 28.65 | 34.21 ± 2.51 / 50.92 |
| Fast3R | 0.13 ± 0.14 / 4.05 | 0.11 ± 0.03 / 4.26 | 0.15 ± 0.02 / 4.75 | 0.30 ± 0.01 / 5.8 | 0.69 ± 0.02 / 7.25 | 1.78 ± 0.03 / 8.43 | 5.13 ± 0.06 / 10.91 | 16.55 ± 0.12 / 15.75 |
| MonST3R | 0.32 ± 0.25 / 2.79 | 14.78 ± 0.52 / 4.8 | 18.77 ± 0.20 / 7.84 | 35.76 ± 0.35 / 8.9 | 73.19 ± 0.37 / 16.15 | 148.17 ± 0.99 / 32.99 | 605.83 ± 25.24 / 66.66 | OOM |
| Easi3R | 0.35 ± 0.19 / 2.49 | 17.35 ± 1.10 / 3.41 | 24.18 ± 0.76 / 4.15 | 60.12 ± 2.67 / 7.69 | 137.16 ± 10.86 / 15.96 | 273.78 ± 2.08 / 32.53 | 901.05 ± 5.29 / 65.68 | OOM |
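Numbers like these are typically gathered by timing a forward pass and reading PyTorch's peak-memory counters; the sketch below shows one way to do it. The model(views) call is a hypothetical stand-in for a GFM's inference interface, and the warm-up and averaging choices are ours, not necessarily the benchmark's.

```python
import time
import numpy as np
import torch

def profile_inference(model, views, n_runs=5, warmup=2):
    """Wall-clock time (s, mean ± std) and peak GPU memory (GB) of a forward pass."""
    torch.cuda.reset_peak_memory_stats()
    times = []
    with torch.no_grad():
        for i in range(warmup + n_runs):
            torch.cuda.synchronize()          # exclude previously queued async work
            start = time.perf_counter()
            model(views)                      # hypothetical GFM inference call
            torch.cuda.synchronize()          # wait for the GPU to finish
            if i >= warmup:                   # discard warm-up iterations
                times.append(time.perf_counter() - start)
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return np.mean(times), np.std(times), peak_gb
```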

Findings and Takeaways

What Is the Impact of Task Difficulty?

  • Multi-view geometry inference is inherently harder than pairwise (two-view) inference.
  • Directly predicting dense 3D scene representations is much more challenging than estimating individual 3D attributes like depth and camera poses.
  • Metric-scale depth estimation remains a key challenge for GFMs.
  • Joint prediction of multiple geometric attributes (e.g., pose, depth, matching) may underlie recent performance gains.

Takeaway 1: Current GFMs are promising but face significant challenges when learning from overly complex tasks. Recommendation: Carefully decomposing difficult tasks (e.g., jointly predicting geometry, pose, depth, and tracking) into simpler sub-problems can facilitate more effective learning, especially under limited 3D data.

Do GFMs Generalize Well on Different Data Domains?

  • GFMs struggle to generalize in domains with extreme data scarcity.

Takeaway 2: Diverse, high-quality data is critical for strong generalization. To improve robustness in underrepresented domains, GFMs must be trained on data that covers broader distributions and includes metric-scale annotations.

Hints for Model Architecture Design: ViT or Diffusion? Strong 2D Feature Extractor?

  • No single design (feed-forward ViT or diffusion-based) is universally superior.
  • Stronger 2D foundation models can significantly enhance 3D GFMs.

Takeaway 3: No single backbone (feed-forward ViT or diffusion) dominates; the architecture choice should align with task needs. Moreover, leveraging strong 2D feature extractors (e.g., DINO) substantially boosts 3D performance.

Are Current GFMs Ready for Real-Time Perception Systems?

  • Despite progress, GFMs still lack the efficiency required for real-time 3D applications.

Takeaway 4: As GFMs scale to handle more views and complex tasks, efficiency becomes as critical as accuracy for enabling real-time 3D perception.

Citation

@article{cong2025e3dbench,
  title={E3D-Bench: An End-to-End Benchmark for 3D Geometric Foundation Models},
  author={Cong, Wenyan and Liang, Yiqing and Zhang, Yancheng and Yang, Ziyi and Wang, Yan and Ivanovic, Boris and Pavone, Marco and Chen, Chen and Wang, Zhangyang and Fan, Zhiwen},
  journal={arXiv preprint arXiv:2506.01933},
  year={2025}
}