Benchmark Results - E3D-Bench

E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models

1The University of Texas at Austin 2Brown University
3University of Central Florida 4NVIDIA Research 5Stanford University

Abstract

Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass and eliminating the need for precomputed camera parameters, which are often slow to obtain or simply unavailable. Since late 2023, the field has exploded with diverse variants. With the rapid proliferation of 3D GFMs, we ask:

Q1 Can GFMs serve as an effective and robust foundation for diverse 3D tasks and scenarios?
Q2 Can GFMs serve as an efficient foundation, especially for latency-constrained 3D applications?
In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks (sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, and novel view synthesis) and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial AI.

Effectiveness

3D Reconstruction. Each cell reports ACC ↓ / Comp ↓ / NC ↑. DTU is object-centric; 7-Scenes, NRGBD, ScanNet, and TUM-RGBD are indoor scenes.

Extremely sparse views:

| Method | DTU | 7-Scenes | NRGBD | ScanNet | TUM-RGBD |
| --- | --- | --- | --- | --- | --- |
| DUSt3R/LSM | 1.731 / 1.936 / 0.786 | 0.146 / 0.181 / 0.744 | 0.144 / 0.154 / 0.867 | 0.474 / 0.420 / 0.714 | 1.108 / 0.746 / 0.724 |
| MASt3R | 1.895 / 2.003 / 0.788 | 0.262 / 0.254 / 0.732 | 0.113 / 0.102 / 0.810 | 0.467 / 0.389 / 0.701 | 0.738 / 0.747 / 0.739 |
| Spann3R | 6.275 / 5.460 / 0.705 | 0.255 / 0.188 / 0.653 | 0.262 / 0.262 / 0.628 | 0.487 / 0.408 / 0.617 | 1.561 / 1.002 / 0.621 |
| FLARE | 3.406 / 3.950 / 0.491 | 0.152 / 0.154 / 0.704 | 0.060 / 0.056 / 0.839 | 0.357 / 0.302 / 0.561 | 0.515 / 0.486 / 0.677 |
| CUT3R | 6.885 / 5.022 / 0.727 | 0.118 / 0.142 / 0.717 | 0.104 / 0.078 / 0.828 | 0.260 / 0.238 / 0.692 | 0.587 / 0.553 / 0.683 |
| VGGT | 2.716 / 2.301 / 0.765 | 0.077 / 0.080 / 0.762 | 0.069 / 0.071 / 0.903 | 0.063 / 0.079 / 0.798 | 0.385 / 0.331 / 0.747 |
| Fast3R | 4.493 / 3.681 / 0.735 | 0.149 / 0.116 / 0.692 | 0.361 / 0.201 / 0.782 | 0.546 / 0.306 / 0.621 | 0.955 / 0.630 / 0.627 |
| MonST3R | 20.145 / 10.322 / 0.603 | 0.276 / 0.277 / 0.677 | 0.471 / 0.458 / 0.659 | 0.623 / 0.541 / 0.594 | 1.688 / 1.031 / 0.670 |

Dense views:

| Method | DTU | 7-Scenes | NRGBD | ScanNet | TUM-RGBD |
| --- | --- | --- | --- | --- | --- |
| DUSt3R/LSM | 1.284 / 1.349 / 0.720 | 0.022 / 0.029 / 0.709 | 0.035 / 0.024 / 0.838 | 0.026 / 0.022 / 0.784 | 0.620 / 0.474 / 0.718 |
| MASt3R | 1.374 / 1.409 / 0.723 | 0.025 / 0.028 / 0.697 | 0.043 / 0.042 / 0.809 | 0.035 / 0.020 / 0.757 | 0.209 / 0.211 / 0.708 |
| Spann3R | 6.505 / 3.110 / 0.668 | 0.176 / 0.087 / 0.599 | 0.343 / 0.073 / 0.661 | 0.262 / 0.118 / 0.606 | 0.635 / 0.930 / 0.662 |
| CUT3R | 4.710 / 2.413 / 0.699 | 0.025 / 0.028 / 0.665 | 0.076 / 0.029 / 0.782 | 0.042 / 0.030 / 0.693 | 0.740 / 0.595 / 0.665 |
| VGGT | 2.103 / 1.925 / 0.748 | 0.019 / 0.032 / 0.659 | 0.015 / 0.012 / 0.874 | 0.016 / 0.017 / 0.728 | 0.065 / 0.091 / 0.692 |
| Fast3R | 3.647 / 2.319 / 0.725 | 0.046 / 0.057 / 0.636 | 0.059 / 0.028 / 0.772 | 0.200 / 0.097 / 0.625 | 0.711 / 0.337 / 0.610 |
| MonST3R | 14.455 / 7.508 / 0.636 | 0.100 / 0.091 / 0.648 | 0.336 / 0.246 / 0.665 | 0.346 / 0.293 / 0.599 | 1.138 / 0.948 / 0.591 |
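For reference, a minimal sketch of how reconstruction metrics of this kind are typically computed: accuracy (ACC) is the mean distance from predicted points to their nearest ground-truth neighbors, completeness (Comp) is the reverse direction, and normal consistency (NC) averages the absolute cosine similarity between matched normals. Function names, the KD-tree implementation, and the bidirectional NC averaging are our assumptions; the benchmark's exact protocol (alignment, point sampling, thresholds) may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def reconstruction_metrics(pred_pts, gt_pts, pred_normals=None, gt_normals=None):
    """Illustrative ACC / Comp / NC between predicted (N,3) and ground-truth (M,3) points."""
    tree_gt, tree_pred = cKDTree(gt_pts), cKDTree(pred_pts)

    d_acc, nn_gt = tree_gt.query(pred_pts)      # accuracy: pred point -> nearest GT point
    d_comp, nn_pred = tree_pred.query(gt_pts)   # completeness: GT point -> nearest pred point
    metrics = {"ACC": d_acc.mean(), "Comp": d_comp.mean()}

    if pred_normals is not None and gt_normals is not None:
        # Normal consistency: absolute cosine similarity between matched normals,
        # averaged over both matching directions (an assumption of this sketch).
        nc_pred = np.abs(np.sum(pred_normals * gt_normals[nn_gt], axis=-1)).mean()
        nc_gt = np.abs(np.sum(gt_normals * pred_normals[nn_pred], axis=-1)).mean()
        metrics["NC"] = 0.5 * (nc_pred + nc_gt)
    return metrics
```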
Multi-View Pose Estimation. Each cell reports ATE ↓ / RPEtrans ↓ / RPErot ↓ (ULTRRA: RPEtrans ↓ / RPErot ↓ only). Domains: CO3Dv2 (in distribution), ScanNet & ADT & TUM-Dyn. (long sequence), KITTI Odometry (street driving), Bonn & Sintel & Rel10k (indoor-outdoor), ACID & Syndrone (drone), ULTRRA (air-ground). A dash marks settings that were not evaluated.

| Method | CO3Dv2 | ScanNet & ADT & TUM-Dyn. | KITTI Odometry | Bonn & Sintel & Rel10k | ACID & Syndrone | ULTRRA |
| --- | --- | --- | --- | --- | --- | --- |
| DUSt3R/LSM | 0.903 / 1.325 / 4.312 | 0.139 / 0.102 / 2.394 | 2.935 / 1.135 / 2.832 | 0.141 / 0.641 / 8.038 | 0.370 / 0.607 / 4.099 | 70.350 / 70.390 |
| MASt3R | 0.987 / 1.407 / 3.999 | 0.131 / 0.098 / 2.889 | 1.492 / 0.399 / 0.407 | 0.127 / 0.642 / 7.714 | 0.372 / 0.607 / 3.849 | 71.519 / 78.036 |
| Spann3R | 0.915 / 1.295 / 6.352 | 0.294 / 0.164 / 3.778 | 15.848 / 5.031 / 4.645 | 0.140 / 0.633 / 7.817 | 0.351 / 0.599 / 3.272 | 40.503 / 38.366 |
| CUT3R | 0.847 / 1.209 / 6.361 | 0.185 / 0.133 / 4.471 | 2.421 / 0.747 / 0.669 | 0.109 / 0.633 / 7.569 | 0.303 / 0.593 / 2.864 | 55.135 / 54.395 |
| VGGT | 1.639 / 8.702 / 71.350 | 0.654 / 0.425 / 30.787 | 5.012 / 3.546 / 3.885 | 0.062 / 0.111 / 0.592 | 0.300 / 0.462 / 0.818 | 53.688 / 110.521 |
| Fast3R | 0.698 / 1.035 / 4.352 | 0.499 / 0.391 / 23.739 | 22.109 / 7.573 / 7.366 | 0.136 / 0.636 / 8.700 | 0.378 / 0.637 / 3.653 | 51.149 / 54.150 |
| MonST3R | 2.456 / 3.327 / 23.458 | 0.448 / 0.286 / 12.817 | 2.426 / 0.782 / 0.949 | 0.118 / 0.632 / 6.666 | 0.320 / 0.568 / 2.167 | 70.388 / 77.325 |
| Align3R | 1.027 / 1.550 / 6.499 | 0.425 / 0.215 / 9.430 | 4.611 / 0.817 / 0.600 | 0.134 / 0.628 / 6.810 | 0.378 / 0.550 / 2.414 | 72.010 / 70.638 |
| Easi3R | 0.857 / 1.271 / 5.052 | 0.174 / 0.103 / 2.872 | 3.625 / 0.919 / 0.615 | 0.125 / 0.633 / 7.603 | 0.356 / 0.581 / 3.508 | 62.061 / 71.060 |
| Geo4D | 0.798 / 1.264 / 5.692 | 0.436 / 0.175 / 10.565 | 1.662 / 0.497 / 0.696 | 0.151 / 0.457 / 2.652 | 0.391 / 0.622 / 0.964 | – / – |
| Aether | – / – / – | 0.067 / 0.033 / 1.619 | 1.553 / 0.744 / 0.744 | – / – / – | – / – / – | – / – |
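The pose metrics above can be reproduced, up to protocol details, in a few lines of NumPy: ATE is the RMSE of camera positions after aligning the predicted trajectory to ground truth (here with a Umeyama similarity alignment, an assumption on our part), and RPE measures errors in the relative motion between frames. The sketch below gives ATE and a simplified, translation-only RPE; it is illustrative rather than the benchmark's exact evaluation code, which also reports a rotational RPE from full poses.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Similarity transform (s, R, t) that maps src (N,3) onto dst (N,3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # fix reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate_rmse(pred_pos, gt_pos):
    """ATE: RMSE of camera positions after similarity alignment."""
    s, R, t = umeyama_alignment(pred_pos, gt_pos)
    aligned = (s * (R @ pred_pos.T)).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt_pos) ** 2, axis=-1)))

def rpe_trans(pred_pos, gt_pos, delta=1):
    """Simplified, translation-only RPE over frame pairs (i, i + delta)."""
    d_pred = pred_pos[delta:] - pred_pos[:-delta]
    d_gt = gt_pos[delta:] - gt_pos[:-delta]
    return np.sqrt(np.mean(np.sum((d_pred - d_gt) ** 2, axis=-1)))
```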
Sparse-View Depth Estimation. Each cell reports AbsRel ↓ / δ<1.03 (%) ↑. "Normalized" rows align predicted depth to the ground-truth scale before scoring; "Metric" rows evaluate raw metric-scale predictions. Domains: DTU (object-centric), ScanNet (indoor scene), KITTI (outdoor scene), ETH3D and T&T (mixed scenes).

Normalized:

| Method | DTU | ScanNet | KITTI | ETH3D | T&T |
| --- | --- | --- | --- | --- | --- |
| Robust MVD | 2.490 / 80.056 | 7.468 / 35.651 | 9.419 / 30.505 | 9.302 / 42.909 | 6.379 / 58.409 |
| DUSt3R/LSM | 2.741 / 75.685 | 4.732 / 61.337 | 9.113 / 39.495 | 3.132 / 74.851 | 3.106 / 77.033 |
| MASt3R | 3.343 / 68.301 | 5.949 / 54.516 | 9.542 / 46.805 | 2.471 / 81.291 | 2.381 / 82.262 |
| Spann3R | 6.431 / 38.339 | 7.779 / 33.713 | 10.195 / 30.858 | 5.121 / 54.708 | 5.580 / 52.812 |
| CUT3R | 6.200 / 47.421 | 8.231 / 39.464 | 23.849 / 12.087 | 5.224 / 59.864 | 4.594 / 56.773 |
| VGGT | 1.085 / 94.305 | 4.386 / 64.968 | 9.436 / 41.309 | 1.782 / 86.337 | 2.075 / 85.174 |
| Fast3R | 3.940 / 62.120 | 6.271 / 50.283 | 13.390 / 26.734 | 4.692 / 62.663 | 4.423 / 64.873 |
| MonST3R | 5.346 / 67.977 | 5.557 / 53.309 | 10.191 / 40.274 | 3.368 / 72.624 | 3.289 / 72.491 |

Metric:

| Method | DTU | ScanNet | KITTI | ETH3D | T&T |
| --- | --- | --- | --- | --- | --- |
| Robust MVD | 2.242 / 84.574 | 8.016 / 35.924 | 10.846 / 25.534 | 10.944 / 35.526 | 6.982 / 60.643 |
| MASt3R | 84.904 / 0.000 | 93.584 / 0.000 | 99.069 / 0.000 | 97.021 / 0.000 | 98.234 / 0.000 |
| CUT3R | 84.904 / 0.000 | 93.584 / 0.000 | 99.069 / 0.000 | 97.022 / 0.000 | 98.234 / 0.000 |
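A minimal sketch of these depth metrics also illustrates the Normalized vs. Metric split: the Normalized rows rescale predictions to the ground-truth depth before scoring (median scaling is our assumption; the benchmark may use a least-squares alignment), while the Metric rows score raw predictions, which is why scale-ambiguous models collapse to near-zero inlier ratios there. The same function applies to the video depth results below with delta_thresh=1.25.

```python
import numpy as np

def depth_metrics(pred, gt, align_scale=True, delta_thresh=1.03):
    """AbsRel and delta-inlier ratio over valid ground-truth pixels.

    align_scale=True mimics the 'Normalized' rows (median scale alignment, our
    assumption); align_scale=False evaluates raw metric-scale predictions.
    """
    mask = (gt > 0) & np.isfinite(pred) & (pred > 0)
    pred, gt = pred[mask], gt[mask]
    if align_scale:
        pred = pred * np.median(gt) / np.median(pred)
    abs_rel = np.mean(np.abs(pred - gt) / gt)                 # AbsRel
    ratio = np.maximum(pred / gt, gt / pred)
    inliers = np.mean(ratio < delta_thresh)                   # delta < threshold
    return {"AbsRel": abs_rel, "delta_inlier": inliers}
```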
Video Depth Estimation. Each cell reports AbsRel ↓ / δ<1.25 (%) ↑. Domains: Bonn and TUM Dyn (indoor scene), KITTI (outdoor scene), PointOdyssey (large dynamic motion), Syndrone (drone scene), Sintel (mixed scene).

Normalized:

| Method | Bonn | TUM Dyn | KITTI | PointOdyssey | Syndrone | Sintel |
| --- | --- | --- | --- | --- | --- | --- |
| DepthAnyVideo | 0.515 / 25.3 | 0.184 / 84.6 | 0.074 / 95.3 | 0.417 / 61.7 | 0.299 / 83.1 | 0.455 / 47.9 |
| VideoDepthAnything | 0.268 / 48.3 | 1.101 / 89.0 | 0.060 / 98.2 | 0.283 / 70.3 | 0.138 / 92.5 | 1.691 / 45.4 |
| DepthCrafter | 0.107 / 88.3 | 0.159 / 79.5 | 0.120 / 86.2 | 0.144 / 81.3 | 0.380 / 87.5 | 0.354 / 58.2 |
| Marigold | 0.329 / 52.2 | 0.600 / 32.8 | 0.332 / 43.3 | 0.346 / 47.5 | 1.331 / 16.8 | 0.417 / 45.4 |
| DUSt3R/LSM | 0.174 / 83.5 | 0.187 / 79.2 | 0.124 / 84.9 | 0.168 / 77.8 | 0.063 / 96.9 | 0.475 / 59.1 |
| MASt3R | 0.160 / 81.5 | 0.162 / 83.1 | 0.082 / 93.2 | 0.150 / 79.3 | 0.046 / 97.5 | 0.374 / 63.9 |
| Spann3R | 0.205 / 77.4 | 0.204 / 70.6 | 0.449 / 49.1 | 0.303 / 58.4 | 0.241 / 74.5 | 0.587 / 43.3 |
| CUT3R | 0.068 / 95.0 | 0.108 / 84.7 | 0.104 / 89.9 | 0.095 / 88.4 | 0.111 / 89.5 | 0.466 / 56.0 |
| VGGT | 0.056 / 96.3 | 0.068 / 93.9 | 0.051 / 96.6 | 0.026 / 99.0 | 0.075 / 95.9 | 0.242 / 65.9 |
| Fast3R | 0.232 / 69.4 | 0.221 / 71.1 | 0.308 / 46.8 | 0.271 / 66.2 | 0.368 / 44.8 | 0.565 / 48.7 |
| MonST3R | 0.061 / 95.4 | 0.197 / 72.6 | 0.083 / 93.4 | 0.066 / 92.3 | 0.110 / 89.7 | 0.343 / 59.4 |
| Align3R | 0.062 / 96.8 | 0.107 / 90.1 | 0.105 / 89.2 | 0.077 / 93.3 | 0.097 / 92.9 | 0.237 / 69.0 |
| Easi3R | 0.061 / 95.8 | 0.192 / 76.9 | 0.150 / 76.2 | 0.143 / 82.1 | 0.095 / 94.0 | 0.323 / 53.9 |
| Geo4D | 0.060 / 97.8 | 0.096 / 93.2 | 0.086 / 93.8 | 0.082 / 93.0 | 0.105 / 93.1 | 0.205 / 73.2 |
| Aether | 0.582 / 61.2 | 0.192 / 80.6 | 0.065 / 96.2 | 0.123 / 87.9 | 0.145 / 91.1 | 0.343 / 69.4 |
| GeometryCrafter | 0.061 / 96.8 | 0.115 / 87.7 | 0.410 / 53.8 | 0.124 / 83.6 | 0.123 / 90.8 | 0.280 / 72.4 |

Metric:

| Method | Bonn | TUM Dyn | KITTI | PointOdyssey | Syndrone | Sintel |
| --- | --- | --- | --- | --- | --- | --- |
| MASt3R | 0.549 / 4.6 | 0.633 / 0.9 | 0.754 / 6.4 | 0.749 / 0.2 | 0.967 / 0.0 | 0.701 / 2.3 |
| CUT3R | 0.097 / 90.3 | 0.135 / 80.6 | 0.118 / 87.4 | 0.127 / 88.1 | 0.824 / 0.0 | 1.020 / 23.6 |
Novel View Synthesis. Each cell reports PSNR ↑ / SSIM ↑ / LPIPS ↓. Domains: DTU (object-centric), RealEstate10k and ScanNet++ (indoor scenes), ACID (drone scenes).

| Method | DTU | RealEstate10k | ScanNet++ | ACID |
| --- | --- | --- | --- | --- |
| LSM | 11.68 / 0.3294 / 0.5218 | 14.04 / 0.4388 / 0.4873 | 12.39 / 0.4596 / 0.5479 | 16.73 / 0.4562 / 0.4567 |
| NoPoSplat | 17.91 / 0.6306 / 0.2810 | 24.53 / 0.8450 / 0.1634 | 22.15 / 0.7988 / 0.2359 | 25.35 / 0.7774 / 0.1875 |
| FLARE | 17.01 / 0.5672 / 0.2901 | 22.15 / 0.7126 / 0.2363 | 23.19 / 0.8117 / 0.2201 | 22.44 / 0.6229 / 0.2818 |
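The novel-view-synthesis metrics can be computed with standard libraries; a hedged sketch is below. Which LPIPS backbone the benchmark uses is an assumption on our part, and a recent scikit-image is assumed for the channel_axis argument.

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="vgg")  # backbone choice ("vgg" vs. "alex") is an assumption

def nvs_metrics(pred, gt):
    """PSNR / SSIM / LPIPS for one rendered view; pred and gt are HxWx3 floats in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2.0 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```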

Inference Efficiency

Inference time and peak GPU memory versus number of input views. Each cell reports time (s, mean ± std) ↓ / GPU memory (GB) ↓. OOM = out of memory.

| Method | 2 views | 4 views | 8 views | 16 views | 32 views | 64 views | 128 views | 256 views |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DUSt3R | 0.35 ± 0.19 / 2.49 | 6.00 ± 0.30 / 2.6 | 13.96 ± 0.86 / 3.65 | 50.37 ± 2.28 / 8.38 | 196.81 ± 6.38 / 27.52 | OOM | OOM | OOM |
| MASt3R | 9.43 ± 0.28 / 2.61 | 14.63 ± 0.52 / 2.68 | 21.38 ± 2.26 / 2.78 | 42.28 ± 9.06 / 3.35 | 117.77 ± 40.83 / 6.87 | 392.23 ± 184.36 / 28.78 | OOM | OOM |
| Spann3R | 0.16 ± 0.12 / 2.79 | 0.28 ± 0.01 / 2.8 | 0.65 ± 0.00 / 2.81 | 1.38 ± 0.01 / 2.84 | 2.81 ± 0.07 / 2.89 | 5.51 ± 0.03 / 2.99 | 11.25 ± 0.16 / 3.19 | 23.64 ± 0.70 / 3.55 |
| CUT3R | 0.19 ± 0.07 / 3.33 | 0.26 ± 0.04 / 3.38 | 0.42 ± 0.03 / 3.48 | 0.78 ± 0.03 / 3.65 | 1.50 ± 0.03 / 4.28 | 3.12 ± 0.31 / 5.54 | 5.76 ± 0.12 / 11.68 | 11.65 ± 0.16 / 17.36 |
| VGGT | 0.32 ± 0.41 / 7.11 | 0.29 ± 0.40 / 7.72 | 0.24 ± 0.01 / 9.06 | 0.72 ± 0.49 / 10.29 | 2.35 ± 0.04 / 12.75 | 4.23 ± 0.07 / 17.66 | 11.76 ± 0.41 / 28.65 | 34.21 ± 2.51 / 50.92 |
| Fast3R | 0.13 ± 0.14 / 4.05 | 0.11 ± 0.03 / 4.26 | 0.15 ± 0.02 / 4.75 | 0.30 ± 0.01 / 5.8 | 0.69 ± 0.02 / 7.25 | 1.78 ± 0.03 / 8.43 | 5.13 ± 0.06 / 10.91 | 16.55 ± 0.12 / 15.75 |
| MonST3R | 0.32 ± 0.25 / 2.79 | 14.78 ± 0.52 / 4.8 | 18.77 ± 0.20 / 7.84 | 35.76 ± 0.35 / 8.9 | 73.19 ± 0.37 / 16.15 | 148.17 ± 0.99 / 32.99 | 605.83 ± 25.24 / 66.66 | OOM |
| Easi3R | 0.35 ± 0.19 / 2.49 | 17.35 ± 1.10 / 3.41 | 24.18 ± 0.76 / 4.15 | 60.12 ± 2.67 / 7.69 | 137.16 ± 10.86 / 15.96 | 273.78 ± 2.08 / 32.53 | 901.05 ± 5.29 / 65.68 | OOM |
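Numbers like these are typically gathered by timing a forward pass and reading PyTorch's peak-memory counters; the sketch below shows one way to do it. The model(views) call is a hypothetical stand-in for a GFM's inference interface, and the warm-up and averaging choices are ours, not necessarily the benchmark's.

```python
import time
import numpy as np
import torch

def profile_inference(model, views, n_runs=5, warmup=2):
    """Wall-clock time (s, mean ± std) and peak GPU memory (GB) of a forward pass."""
    torch.cuda.reset_peak_memory_stats()
    times = []
    with torch.no_grad():
        for i in range(warmup + n_runs):
            torch.cuda.synchronize()          # exclude previously queued async work
            start = time.perf_counter()
            model(views)                      # hypothetical GFM inference call
            torch.cuda.synchronize()          # wait for the GPU to finish
            if i >= warmup:                   # discard warm-up iterations
                times.append(time.perf_counter() - start)
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return np.mean(times), np.std(times), peak_gb
```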

Findings and Takeaways

What Is the Impact of Task Difficulty?

  • Multi-view geometry inference is inherently harder than pairwise (two-view) inference.
  • Directly predicting dense 3D scene representations is much more challenging than estimating individual 3D attributes like depth and camera poses.
  • Metric-scale depth estimation remains a key challenge for GFMs.
  • Joint prediction of multiple geometric attributes (e.g., pose, depth, matching) may underlie recent performance gains.

Takeaway 1: Current GFMs are promising but face significant challenges when learning from overly complex tasks. Recommendation: Carefully decomposing difficult tasks (e.g., jointly predicting geometry, pose, depth, and tracking) into simpler sub-problems can facilitate more effective learning, especially under limited 3D data.

Do GFMs Generalize Well on Different Data Domains?

  • GFMs struggle to generalize in domains with extreme data scarcity.

Takeaway 2: Diverse, high-quality data is critical for strong generalization. To improve robustness in underrepresented domains, GFMs must be trained on data that covers broader distributions and includes metric-scale annotations.

Hints for Model Architecture Design: ViT or Diffusion? Strong 2D Feature Extractor?

  • No single design (feed-forward ViT or diffusion-based) is universally superior.
  • Stronger 2D foundation models can significantly enhance 3D GFMs.

Takeaway 3: No single backbone (feed-forward ViT or diffusion) dominates; the architecture choice should align with task needs. Moreover, leveraging strong 2D feature extractors (e.g., DINO) substantially boosts 3D performance.

Are Current GFMs Ready for Real-Time Perception Systems?

  • Despite progress, GFMs still lack the efficiency required for real-time 3D applications.

Takeaway 4: As GFMs scale to handle more views and complex tasks, efficiency becomes as critical as accuracy for enabling real-time 3D perception.

Citation

@article{cong2025e3dbench,
  title={E3D-Bench: An End-to-End Benchmark for 3D Geometric Foundation Models},
  author={Cong, Wenyan and Liang, Yiqing and Zhang, Yancheng and Yang, Ziyi and Wang, Yan and Ivanovic, Boris and Pavone, Marco and Chen, Chen and Wang, Zhangyang and Fan, Zhiwen},
  journal={arXiv preprint arXiv:2506.01933},
  year={2025}
}