DINO-3DRA | Jiayang Lu

MICCAI 2026

Author A · Author B · Author C

Anonymized Affiliations

We transfer frozen 2D VFM semantics into a 3D U-Net via slice-level embeddings, depth-coherence restoration, and calibrated residual fusion — achieving +13% Dice over nnU-Net with only 116K extra parameters and zero catastrophic failures on cross-dataset evaluation.

0.758

Aneurysm Dice

+13%

over nnU-Net

116K

Extra params

0%

Failure rate

Fig. 1. Overview of DINO-3DRA. (a) Frozen DINOv3-Small extracts per-slice semantic embeddings from 2D axial slices. (b) Embeddings are broadcast to multi-scale spatial supports, refined by Room-Lite Mixers for depth coherence, and injected into U-Net skip connections via Calibrated Fusion.

Abstract

Accurate aneurysm segmentation in 3D rotational angiography (3DRA) is hindered by extreme class imbalance, morphological similarity to vessels, and absent large-scale 3D pretraining. 2D vision foundation models encode dense structural priors from ~1.7 billion images, yet naïve slice-wise transfer fragments anatomical continuity and destabilises optimisation.

We propose DINO-3DRA, a dual-path framework that transfers semantics across dimensions by injecting frozen DINOv3 features into a 3D U-Net via Room-Lite spatial mixing and calibrated residual fusion. On multi-centre 3DRA data, DINO-3DRA achieves state-of-the-art aneurysm segmentation (Dice: 0.758; HD95: 2.75 mm; +13% over nnU-Net) with only 5.72M trainable parameters. Ablation studies confirm that gains arise from structured cross-dimensional transfer rather than loss design alone. In cross-dataset evaluation on two external datasets, DINO-3DRA produces no catastrophic failures (Dice < 0.5), compared with 11.1% and 3.3% for the baseline, indicating robust generalisation across heterogeneous imaging protocols.

The Challenge

Ambiguity. 3DRA produces near-binary contrast: bright vessel lumen on a dark background. On any single 2D slice, a vessel cross-section and an aneurysm can look identical, each a white disc on black. Distinguishing them requires observing how the cross-sectional shape changes across consecutive depth slices: a vessel keeps a roughly constant tubular cross-section, while an aneurysm bulges and then contracts.
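The depth argument above can be illustrated with a toy example. The sketch below is not part of the method: it builds two synthetic binary shapes (a straight tube and a ball, both hypothetical stand-ins for a vessel and an aneurysm) and compares the per-slice foreground area along the depth axis.

```python
import numpy as np

def slice_areas(volume):
    """Count foreground voxels in every axial slice of a (D, H, W) binary volume."""
    return volume.reshape(volume.shape[0], -1).sum(axis=1)

D = H = W = 32
z, y, x = np.mgrid[0:D, 0:H, 0:W]

# Toy vessel: a straight tube of radius 4 running along the depth axis.
tube = ((y - 16) ** 2 + (x - 16) ** 2) <= 4 ** 2
# Toy aneurysm: a ball of radius 8 centred in the volume.
ball = ((z - 16) ** 2 + (y - 16) ** 2 + (x - 16) ** 2) <= 8 ** 2

tube_profile = slice_areas(tube)
ball_profile = slice_areas(ball)

# On a single slice both shapes are filled discs; only the profile along
# depth tells them apart (constant for the tube, rising and falling for the ball).
print("tube area range:", tube_profile.min(), tube_profile.max())
print("ball area range:", ball_profile[ball_profile > 0].min(), ball_profile.max())
```

Real vessels curve and branch, so the tube profile is only approximately flat in practice; the point is that the discriminative signal lives along the depth axis.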

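The 2D–3D gap. The strongest 2D models (DINOv3) are trained on 1.7B images and carry rich structural priors, while no equivalent large-scale 3D pretraining exists. Naïvely injecting 2D features into a 3D model nevertheless fails: in our ablations, removing either Room-Lite or calibration pushes aneurysm Dice (0.385 and 0.492) below the plain U-Net baseline (0.711), worse than using no 2D features at all.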

Fig. 2. 3D rotational angiography sample. Left: ground truth label. Right: raw image.

Method

DINO-3DRA pairs a frozen 2D VFM with a trainable 3D U-Net. The frozen branch supplies semantic context (what kind of anatomy appears in each slice); the trainable branch supplies spatial localisation (where exactly structures lie). Two lightweight modules bridge the gap.

1 — Slice-Level Semantic Extraction. All 64 axial slices processed independently by frozen DINOv3-Small. 196 patch tokens globally averaged to a single 384-dim semantic embedding per slice. Projected via 1×1 convolutions and stacked into 3D volumes at three U-Net skip scales.

(B×64, 196, 384) → mean-pool → (B×64, 384) → (B, {24/48/96}, {64³/32³/16³})
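The shape flow above can be sketched in NumPy as follows. This is a minimal illustration, not the released implementation: the random `tokens` stand in for frozen DINOv3-Small outputs, the projection weights are untrained, and the way coarser scales reduce the 64 slices to 32 or 16 (averaging adjacent slices) is an assumption, since the page does not specify it. The function name `slice_embeddings_to_volume` is hypothetical.

```python
import numpy as np

def slice_embeddings_to_volume(tokens, weight, batch, depth, out_size):
    """Mean-pool patch tokens per slice, project, and broadcast to a 3D volume.

    tokens : (batch * depth, n_tokens, dim) patch tokens from the frozen encoder
    weight : (channels, dim) projection; applied per slice it is equivalent to
             a 1x1 convolution applied after broadcasting
    Returns a read-only view of shape (batch, channels, out_size, out_size, out_size).
    """
    emb = tokens.mean(axis=1)                      # (B*D, dim)
    emb = emb.reshape(batch, depth, -1)            # (B, D, dim)
    proj = np.einsum("bzd,cd->bcz", emb, weight)   # (B, C, D)
    # ASSUMPTION: coarser scales average groups of adjacent slices.
    factor = depth // out_size
    proj = proj.reshape(batch, -1, out_size, factor).mean(axis=-1)  # (B, C, out_size)
    # A slice embedding carries no in-plane position, so it is broadcast over H and W.
    shape = (batch, proj.shape[1], out_size, out_size, out_size)
    return np.broadcast_to(proj[:, :, :, None, None], shape)

rng = np.random.default_rng(0)
B, D = 1, 64
tokens = rng.standard_normal((B * D, 196, 384), dtype=np.float32)  # stand-in features

volumes = {}
for channels, size in [(24, 64), (48, 32), (96, 16)]:
    weight = rng.standard_normal((channels, 384), dtype=np.float32) * 0.05  # untrained
    volumes[size] = slice_embeddings_to_volume(tokens, weight, B, D, size)
    print(size, volumes[size].shape)
```

Every voxel within one slice receives the same vector at this stage, which is why a later module has to restore coherence along depth.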

2 — Room-Lite Depth Coherence. A learnable depth-positional bias B(z) followed by depthwise-separable 3D convolutions restores inter-slice continuity. Removing this module drops aneurysm Dice by 49% (0.758 → 0.385). 21K params total.

Y = X + GELU(GN(Conv₁ₓ₁(Conv₃ₓ₃(X + B(z)))))
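A minimal NumPy sketch of this update is given below, under stated assumptions: B(z) is taken to be a per-channel, per-depth bias of shape (C, D); the 3×3 convolution is read as a zero-padded depthwise 3×3×3 kernel; the 1×1 convolution is a pointwise channel mix; GroupNorm uses 4 groups with no affine parameters; and GELU uses the tanh approximation. None of these details are confirmed by the page, and all function names are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def group_norm(x, groups, eps=1e-5):
    # x: (B, C, D, H, W); normalise each channel group over its channels and voxels
    g = x.reshape(x.shape[0], groups, -1)
    g = (g - g.mean(axis=2, keepdims=True)) / np.sqrt(g.var(axis=2, keepdims=True) + eps)
    return g.reshape(x.shape)

def depthwise_conv3(x, kernel):
    # x: (B, C, D, H, W); kernel: (C, 3, 3, 3); one zero-padded filter per channel
    pad = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1), (1, 1)))
    d, h, w = x.shape[2:]
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            for k in range(3):
                tap = kernel[:, i, j, k].reshape(1, -1, 1, 1, 1)
                out += tap * pad[:, :, i:i + d, j:j + h, k:k + w]
    return out

def room_lite(x, depth_bias, dw_kernel, pw_weight, groups=4):
    # Y = X + GELU(GN(Conv1x1(Conv3x3(X + B(z)))))
    h = x + depth_bias[None, :, :, None, None]        # add B(z), shape (C, D)
    h = depthwise_conv3(h, dw_kernel)                 # mixes neighbouring slices
    h = np.einsum("oc,bcdhw->bodhw", pw_weight, h)    # pointwise channel mix
    return x + gelu(group_norm(h, groups))

rng = np.random.default_rng(0)
C, D, H, W = 8, 6, 6, 6
x = rng.standard_normal((1, C, D, H, W))
depth_bias = rng.standard_normal((C, D)) * 0.1
dw_kernel = rng.standard_normal((C, 3, 3, 3)) * 0.1
pw_weight = rng.standard_normal((C, C)) * 0.1

y = room_lite(x, depth_bias, dw_kernel, pw_weight)
y_identity = room_lite(x, depth_bias, dw_kernel, np.zeros((C, C)))
print(y.shape, np.allclose(y_identity, x))
```

With the pointwise weights set to zero the block reduces to the identity, which shows the residual form of the update: the mixer only adds a correction on top of X.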

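3 — Calibrated Residual Fusion. GN-normalised DINO features are injected as a learnable residual correction into the U-Net skips. Because the correction is added on top of the unchanged skip features F_U, the network can fall back to the plain U-Net path when the DINO signal is uninformative. Without calibration, aneurysm Dice falls to 0.492.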

F_fused = F_U + GELU(GN(W[F_U; GN(F_D)]))
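The fusion rule can be sketched in NumPy as below. This is an illustration under assumptions rather than the authors' code: W is modelled as a 1×1×1 convolution mapping the 2C concatenated channels back to C, F_D is assumed to have the same channel count as F_U, and GroupNorm uses 4 groups without affine parameters. The helper names are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def group_norm(x, groups, eps=1e-5):
    # x: (B, C, D, H, W); normalise each channel group over its channels and voxels
    g = x.reshape(x.shape[0], groups, -1)
    g = (g - g.mean(axis=2, keepdims=True)) / np.sqrt(g.var(axis=2, keepdims=True) + eps)
    return g.reshape(x.shape)

def calibrated_fusion(f_u, f_d, weight, groups=4):
    # F_fused = F_U + GELU(GN(W [F_U ; GN(F_D)]))
    f_d = group_norm(f_d, groups)                       # calibrate DINO feature scale
    cat = np.concatenate([f_u, f_d], axis=1)            # (B, 2C, D, H, W)
    mixed = np.einsum("oc,bcdhw->bodhw", weight, cat)   # W as a 1x1x1 conv, (C, 2C)
    return f_u + gelu(group_norm(mixed, groups))        # residual correction on F_U

rng = np.random.default_rng(0)
C, D, H, W = 8, 4, 4, 4
f_u = rng.standard_normal((1, C, D, H, W))
f_d = rng.standard_normal((1, C, D, H, W))
weight = rng.standard_normal((C, 2 * C)) * 0.1

fused = calibrated_fusion(f_u, f_d, weight)
fused_scaled = calibrated_fusion(f_u, 100.0 * f_d, weight)   # DINO features 100x larger
fused_zero = calibrated_fusion(f_u, f_d, np.zeros((C, 2 * C)))

print(fused.shape)
print(np.allclose(fused, fused_scaled, atol=1e-3))  # inner GN removes the scale mismatch
print(np.allclose(fused_zero, f_u))                 # zero W leaves the skip untouched
```

The two checks show the roles of the two ingredients in this sketch: the inner GN makes the result insensitive to the raw magnitude of the DINO features, and the residual form leaves F_U intact when the learned correction is zero.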

Results

Ablation Study

On aneurysm Dice, four of the five feature-injection variants fall below the plain U-Net baseline. Only the full pipeline (Room-Lite + Calibrated Fusion) surpasses it.

Configuration	Aneurysm Dice	Aneurysm HD95 (mm)	Vessel Dice	Vessel HD95 (mm)
Ours (Full)	0.758 ± 0.234	2.75	0.897 ± 0.033	3.74
Baseline U-Net	0.711 ± 0.240	4.20	0.873 ± 0.046	5.41
Random Features	0.628 ± 0.279	6.15	0.885 ± 0.039	4.41
DINOv2-small	0.590 ± 0.252	6.50	0.868 ± 0.050	4.30
w/o Calibration	0.492 ± 0.301	10.12	0.857 ± 0.045	6.71
w/o Room-Lite	0.385 ± 0.310	22.08	0.857 ± 0.043	5.64
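The relative changes implied by the aneurysm Dice column can be recomputed directly from the table. The short script below only uses the mean values listed above; the helper name `relative_change` is introduced here for illustration. nnU-Net is not listed in this table, so the +13% headline figure is not recomputed.

```python
aneurysm_dice = {
    "Ours (Full)": 0.758,
    "Baseline U-Net": 0.711,
    "Random Features": 0.628,
    "DINOv2-small": 0.590,
    "w/o Calibration": 0.492,
    "w/o Room-Lite": 0.385,
}

def relative_change(value, reference):
    """Signed relative change of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference

baseline = aneurysm_dice["Baseline U-Net"]
full = aneurysm_dice["Ours (Full)"]

# Variants (excluding the baseline itself) that score below the plain U-Net.
below_baseline = [name for name, dice in aneurysm_dice.items()
                  if name != "Baseline U-Net" and dice < baseline]

for name, dice in aneurysm_dice.items():
    print(f"{name:16s} vs baseline: {relative_change(dice, baseline):+6.1f}%  "
          f"vs full: {relative_change(dice, full):+6.1f}%")
print("below baseline:", below_baseline)
```

The 49% drop quoted for removing Room-Lite is measured relative to the full model (0.758 to 0.385), not relative to the baseline.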


Fig. 4. Interactive vessel confidence maps across ablation variants. Ground truth (semi-transparent blue), predictions (red; darker = higher confidence). Full DINO-3DRA produces homogeneous confidence across the aneurysm dome, recognising aneurysms as pathological vessel dilations rather than isolated anomalies.

Cross-Dataset Generalisation

Evaluated on two external datasets without fine-tuning. DINO-3DRA eliminates all catastrophic failures (Dice < 0.5).
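For reference, the failure criterion can be written out as a short sketch: Dice is computed per case on binary masks, and the failure rate is the fraction of cases with Dice below 0.5. The function names and the toy inputs are illustrative only; the small `eps` term, which makes two empty masks score 1, is an assumption about edge-case handling.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def failure_rate(dice_values, threshold=0.5):
    """Fraction of cases whose Dice falls below the catastrophic-failure threshold."""
    dice_values = np.asarray(dice_values, dtype=float)
    return float((dice_values < threshold).mean())

# Toy masks: 1 overlapping voxel, 2 predicted, 1 in the ground truth.
example_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
# Toy per-case scores: one of four cases is below 0.5.
example_rate = failure_rate([0.9, 0.4, 0.7, 0.8])
print(example_dice, example_rate)
```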

11.1% → 0%

Failures on CADA

3.3% → 0%

Failures on SHINY-ICARUS

75.8%

HD95 reduction on SHINY-ICARUS (15.52 → 3.76)

Dataset	Model	Dice	Jaccard	Vol. Sim.	HD95 (mm)	Failures
CADA	Baseline	0.711 ± 0.186	0.577 ± 0.186	0.813 ± 0.128	5.16	11.1%
CADA	DINO-3DRA	0.739 ± 0.093	0.594 ± 0.118	0.807 ± 0.109	3.58	0%
SHINY-ICARUS	Baseline	0.718 ± 0.147	0.574 ± 0.129	0.808 ± 0.089	15.52	3.3%
SHINY-ICARUS	DINO-3DRA	0.793 ± 0.049	0.660 ± 0.065	0.812 ± 0.056	3.76	0%
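The HD95 column reports a 95th-percentile Hausdorff distance. A minimal sketch of one common definition follows, assuming the inputs are surface point coordinates already scaled to millimetres and taking the larger of the two directed 95th percentiles. Other conventions pool both directions before taking the percentile, and the page does not state which one is used. The brute-force pairwise distance matrix is only practical for small point sets.

```python
import numpy as np

def hd95(points_a, points_b):
    """95th-percentile symmetric Hausdorff distance between two point sets.

    points_a, points_b : (N, 3) and (M, 3) arrays of surface coordinates in mm.
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    a_to_b = dists.min(axis=1)   # nearest distance from each point in A to B
    b_to_a = dists.min(axis=0)   # nearest distance from each point in B to A
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))

# Toy example: two parallel rows of points, 3 mm apart.
line_a = np.array([[0.0, j, 0.0] for j in range(10)])
line_b = line_a + np.array([3.0, 0.0, 0.0])
print(hd95(line_a, line_a), hd95(line_a, line_b))
```

Using the 95th percentile instead of the maximum makes the metric less sensitive to a few outlier voxels, which is why a large HD95 such as the 15.52 baseline value on SHINY-ICARUS suggests substantial boundary error rather than a single stray voxel.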

Fig. 5. Cross-dataset validation on CADA (left) and SHINY-ICARUS (right). Blue: vessel. Red: aneurysm. Columns: ground truth, baseline, DINO-3DRA.

BibTeX

@inproceedings{dino3dra2026,
  title     = {DINO-3DRA: Leveraging 2D Foundation Model
               Semantics for 3D Cerebral Aneurysm Segmentation},
  author    = {Anonymized Authors},
  booktitle = {International Conference on Medical Image Computing
               and Computer-Assisted Intervention (MICCAI)},
  year      = {2026}
}