DINO-3DRA

Leveraging 2D Foundation Model Semantics for 3D Cerebral Aneurysm Segmentation

MICCAI 2026

Author A  ·  Author B  ·  Author C

Anonymized Affiliations

We transfer frozen 2D vision foundation model (VFM) semantics into a 3D U-Net via slice-level embeddings, depth-coherence restoration, and calibrated residual fusion, achieving +13% Dice over nnU-Net with only 116K extra parameters and zero catastrophic failures in cross-dataset evaluation.

0.758 Aneurysm Dice  ·  +13% over nnU-Net  ·  116K extra params  ·  0% failure rate

Fig. 1. Overview of DINO-3DRA. (a) Frozen DINOv3-Small extracts per-slice semantic embeddings from 2D axial slices. (b) Embeddings are broadcast to multi-scale spatial supports, refined by Room-Lite Mixers for depth coherence, and injected into U-Net skip connections via Calibrated Fusion.

Abstract

Accurate aneurysm segmentation in 3D rotational angiography (3DRA) is hindered by extreme class imbalance, the morphological similarity of aneurysms to vessels, and the absence of large-scale 3D pretraining. 2D vision foundation models encode dense structural priors from ~1.7 billion images, yet naïve slice-wise transfer fragments anatomical continuity and destabilises optimisation.

We propose DINO-3DRA, a dual-path framework achieving effective cross-dimensional semantic transfer by injecting frozen DINOv3 features into a 3D U-Net via Room-Lite spatial mixing and calibrated residual fusion. On multi-centre 3DRA data, DINO-3DRA achieves state-of-the-art aneurysm segmentation (Dice: 0.758; HD95: 2.75 mm; +13% over nnU-Net) with only 5.72M trainable parameters. Ablation studies confirm that gains arise from structured cross-dimensional transfer rather than loss design alone. Cross-dataset evaluation eliminates all catastrophic failure cases, demonstrating robust generalisation across heterogeneous imaging protocols.


The Challenge

Ambiguity. 3DRA produces near-binary contrast — bright vessel lumen on dark background. On any single 2D slice, both a vessel cross-section and an aneurysm appear identical: a white circle on black. Distinguishing them requires observing how the cross-sectional shape changes across consecutive depth slices — a vessel maintains a consistent tube, while an aneurysm bulges and contracts.

The 2D–3D gap. The best 2D models (DINOv3) are trained on roughly 1.7B images and carry rich structural priors. No equivalent large-scale 3D pretraining exists. Yet naïvely injecting 2D features into a 3D model fails: in our ablations it performs worse than using no 2D features at all.

Fig. 2. 3D rotational angiography sample. Left: ground truth label. Right: raw image.

Method

DINO-3DRA pairs a frozen 2D VFM with a trainable 3D U-Net. The frozen branch indicates what kind of anatomy each slice contains; the trainable branch localises where exactly the structures are. Two lightweight modules bridge the gap.

1 — Slice-Level Semantic Extraction. All 64 axial slices are processed independently by the frozen DINOv3-Small. The 196 patch tokens of each slice are globally averaged into a single 384-dim semantic embedding. These embeddings are projected via 1×1 convolutions and stacked into 3D volumes at the three U-Net skip scales.

(B×64, 196, 384) → mean-pool → (B×64, 384) → (B, {24/48/96}, {64³/32³/16³})
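
The shape flow above can be traced with a minimal NumPy sketch. Random arrays stand in for the frozen DINOv3-Small tokens, the projection weights are hypothetical, and the way depth is reduced from 64 to 32 and 16 slices (averaging adjacent slices) is an assumption, since the page does not specify it.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, N, C = 1, 64, 196, 384  # batch, axial slices, patch tokens, embedding dim

# Random stand-in for the frozen DINOv3-Small patch tokens (assumption).
tokens = rng.standard_normal((B * D, N, C), dtype=np.float32)

# Global average over the 196 patch tokens: one 384-dim vector per slice.
slice_emb = tokens.mean(axis=1).reshape(B, D, C)  # (B, 64, 384)

def project_and_broadcast(emb, weight, size):
    """Map (B, D, C_in) slice embeddings to a (B, C_out, size, size, size) volume."""
    b, d, c_in = emb.shape
    # Assumption: depth is reduced by averaging groups of adjacent slices.
    emb = emb.reshape(b, size, d // size, c_in).mean(axis=2)  # (B, size, C_in)
    # A 1x1 convolution applied to a per-slice vector is a linear projection.
    proj = emb @ weight.T                                      # (B, size, C_out)
    vol = proj.transpose(0, 2, 1)[:, :, :, None, None]         # (B, C_out, size, 1, 1)
    # Every in-plane position of a slice receives the same embedding.
    return np.broadcast_to(vol, (b, weight.shape[0], size, size, size))

scales = [(24, 64), (48, 32), (96, 16)]  # (channels, cube side) per skip scale
volumes = []
for c_out, size in scales:
    # Hypothetical projection weights; the real ones are learned.
    w = rng.standard_normal((c_out, C), dtype=np.float32) * 0.02
    volumes.append(project_and_broadcast(slice_emb, w, size))

for v in volumes:
    print(v.shape)
```

Because each slice contributes a single vector, the resulting volumes are constant within every axial plane and vary only along depth. This is why a separate depth-coherence step is needed before fusion.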

2 — Room-Lite Depth Coherence. A learnable depth-positional bias B(z) followed by depthwise-separable 3D convolutions restores inter-slice continuity. Without this module, aneurysm Dice drops by 49% relative to the full model (0.758 → 0.385). The module adds 21K parameters in total.

Y = X + GELU(GN(Conv₁ₓ₁(Conv₃ₓ₃(X + B(z)))))
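
A minimal NumPy sketch of the mixer equation above, for illustration only. It assumes a depthwise 3×3×3 convolution followed by a pointwise 1×1×1 convolution, a bias B(z) of shape (C, D), GroupNorm without affine terms, the tanh approximation of GELU, and no convolution biases. None of these details are confirmed by the page, and all function names are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def group_norm(x, groups, eps=1e-5):
    # x: (C, D, H, W); normalise each group of channels to zero mean, unit variance
    g = x.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(x.shape)

def depthwise_conv3(x, k):
    # x: (C, D, H, W), k: (C, 3, 3, 3); zero padding, one 3x3x3 filter per channel
    _, D, H, W = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dz in range(3):
        for dy in range(3):
            for dx in range(3):
                out += k[:, dz, dy, dx, None, None, None] * p[:, dz:dz + D, dy:dy + H, dx:dx + W]
    return out

def pointwise_conv(x, w):
    # w: (C_out, C_in); a 1x1x1 convolution that mixes channels only
    return np.einsum("oc,cdhw->odhw", w, x)

def room_lite_mixer(x, depth_bias, k_dw, w_pw, groups=4):
    # Y = X + GELU(GN(Conv1x1(Conv3x3(X + B(z)))))
    h = x + depth_bias[:, :, None, None]  # B(z) broadcast over H and W
    h = pointwise_conv(depthwise_conv3(h, k_dw), w_pw)
    return x + gelu(group_norm(h, groups))

def mixer_param_count(channels, depth):
    # depthwise 27*C + pointwise C*C + depth bias C*D (assumed layout, no biases)
    return 27 * channels + channels * channels + channels * depth

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 5, 5))
y = room_lite_mixer(
    x,
    depth_bias=rng.standard_normal((8, 4)) * 0.1,
    k_dw=rng.standard_normal((8, 3, 3, 3)) * 0.1,
    w_pw=rng.standard_normal((8, 8)) * 0.1,
)
total = sum(mixer_param_count(c, d) for c, d in [(24, 64), (48, 32), (96, 16)])
print(y.shape, total)
```

Under these assumed shapes the three mixers total 21,240 weights, which is in line with the quoted 21K. Treat this as a consistency check on the sketch rather than a confirmation of the actual layer layout.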

3 — Calibrated Residual Fusion. GN-normalised DINO features are injected as a learnable residual correction into the U-Net skip connections. The residual bypass preserves the original U-Net features F_U, so the network can fall back to them when the DINO correction is uninformative.

F_fused = F_U + GELU(GN(W[F_U; GN(F_D)]))
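
The fusion equation can be sketched in NumPy as follows. This is an illustrative reading, assuming W is a 1×1×1 convolution mapping 2C concatenated channels back to C, GroupNorm without affine terms, and the tanh approximation of GELU. The function names are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def group_norm(x, groups, eps=1e-5):
    # x: (C, D, H, W); normalise each group of channels to zero mean, unit variance
    g = x.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(x.shape)

def calibrated_fusion(f_u, f_d, w, groups=4):
    # F_fused = F_U + GELU(GN(W [F_U ; GN(F_D)]))
    f_d = group_norm(f_d, groups)                 # calibrate the DINO feature scale
    cat = np.concatenate([f_u, f_d], axis=0)      # (2C, D, H, W)
    mixed = np.einsum("oc,cdhw->odhw", w, cat)    # 1x1x1 conv: 2C -> C channels
    return f_u + gelu(group_norm(mixed, groups))  # residual correction on F_U

rng = np.random.default_rng(0)
C = 8
f_u = rng.standard_normal((C, 4, 4, 4))           # U-Net skip features (toy size)
f_d = rng.standard_normal((C, 4, 4, 4))           # DINO-derived features (toy size)
w = rng.standard_normal((C, 2 * C)) * 0.1         # hypothetical fusion weights

fused = calibrated_fusion(f_u, f_d, w)
fused_rescaled = calibrated_fusion(f_u, 1000.0 * f_d, w)
fused_zero_w = calibrated_fusion(f_u, f_d, np.zeros((C, 2 * C)))
print(fused.shape)
```

Two properties are visible in this sketch. With W set to zero the output equals F_U exactly, which is the fallback the residual bypass provides. Rescaling F_D by a large factor leaves the output essentially unchanged, because the inner GroupNorm removes the scale mismatch before mixing.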

Results

Ablation Study

Four of the five DINO variants fall below the plain U-Net baseline in aneurysm Dice. Only the full pipeline, combining Room-Lite and Calibrated Fusion, surpasses it.

| Configuration | Aneurysm Dice | Aneurysm HD95 (mm) | Vessel Dice | Vessel HD95 (mm) |
|---|---|---|---|---|
| Ours (Full) | 0.758 ± 0.234 | 2.75 | 0.897 ± 0.033 | 3.74 |
| Baseline U-Net | 0.711 ± 0.240 | 4.20 | 0.873 ± 0.046 | 5.41 |
| Random Features | 0.628 ± 0.279 | 6.15 | 0.885 ± 0.039 | 4.41 |
| DINOv2-small | 0.590 ± 0.252 | 6.50 | 0.868 ± 0.050 | 4.30 |
| w/o Calibration | 0.492 ± 0.301 | 10.12 | 0.857 ± 0.045 | 6.71 |
| w/o Room-Lite | 0.385 ± 0.310 | 22.08 | 0.857 ± 0.043 | 5.64 |
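
The relative drops implied by the aneurysm Dice column can be recomputed directly from the table. The short sketch below uses only the mean values reported above; the helper name is hypothetical.

```python
full_dice = 0.758      # Ours (Full), aneurysm Dice
baseline_dice = 0.711  # Baseline U-Net, aneurysm Dice

# Mean aneurysm Dice of the other DINO variants, copied from the table.
variants = {
    "Random Features": 0.628,
    "DINOv2-small": 0.590,
    "w/o Calibration": 0.492,
    "w/o Room-Lite": 0.385,
}

def relative_change(value, reference):
    """Signed relative change of `value` with respect to `reference`."""
    return (value - reference) / reference

for name, dice in variants.items():
    print(f"{name:16s} {100 * relative_change(dice, full_dice):+.1f}% vs full model")

below_baseline = [name for name, dice in variants.items() if dice < baseline_dice]
room_lite_drop = relative_change(variants["w/o Room-Lite"], full_dice)
```

Removing Room-Lite gives a relative change of about −49% against the full model, matching the figure quoted in the Method section, and all four listed variants land below the baseline.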


Fig. 4. Interactive vessel confidence maps across ablation variants. Ground truth (semi-transparent blue), predictions (red; darker = higher confidence). Full DINO-3DRA produces homogeneous confidence across the aneurysm dome, recognising aneurysms as pathological vessel dilations rather than isolated anomalies.

Cross-Dataset Generalisation

The model is evaluated on two external datasets, CADA and SHINY-ICARUS, without fine-tuning. DINO-3DRA eliminates all catastrophic failures (Dice < 0.5).

11.1% → 0% failures on CADA  ·  3.3% → 0% failures on SHINY-ICARUS  ·  75.8% HD95 reduction (SHINY-ICARUS)

| Dataset | Model | Dice | Jaccard | Vol. Sim. | HD95 (mm) | Failures |
|---|---|---|---|---|---|---|
| CADA | Baseline | 0.711 ± 0.186 | 0.577 ± 0.186 | 0.813 ± 0.128 | 5.16 | 11.1% |
| CADA | DINO-3DRA | 0.739 ± 0.093 | 0.594 ± 0.118 | 0.807 ± 0.109 | 3.58 | 0% |
| SHINY-ICARUS | Baseline | 0.718 ± 0.147 | 0.574 ± 0.129 | 0.808 ± 0.089 | 15.52 | 3.3% |
| SHINY-ICARUS | DINO-3DRA | 0.793 ± 0.049 | 0.660 ± 0.065 | 0.812 ± 0.056 | 3.76 | 0% |

Fig. 5. Cross-dataset validation on CADA (left) and SHINY-ICARUS (right). Blue: vessel. Red: aneurysm. Columns: ground truth, baseline, DINO-3DRA.
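
For readers reproducing the summary numbers, the failure rate and the HD95 reduction can be computed as sketched below. The HD95 values are the SHINY-ICARUS means from the table. The per-case Dice list is purely hypothetical and only illustrates the Dice < 0.5 failure criterion; it is not data from the study.

```python
def failure_rate(dice_scores, threshold=0.5):
    """Fraction of cases whose Dice falls below the catastrophic-failure threshold."""
    failures = sum(1 for d in dice_scores if d < threshold)
    return failures / len(dice_scores)

def relative_reduction(before, after):
    """Relative reduction from `before` to `after` (positive means improvement)."""
    return (before - after) / before

# SHINY-ICARUS mean HD95 in mm: baseline vs DINO-3DRA (from the table above).
hd95_reduction = relative_reduction(15.52, 3.76)
print(f"HD95 reduction: {100 * hd95_reduction:.1f}%")

# Hypothetical per-case Dice scores, for illustration only.
example_scores = [0.81, 0.77, 0.42, 0.69, 0.74, 0.80, 0.63, 0.79, 0.72, 0.76]
example_rate = failure_rate(example_scores)
print(f"Example failure rate: {100 * example_rate:.1f}%")
```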

BibTeX

@inproceedings{dino3dra2026,
  title     = {DINO-3DRA: Leveraging 2D Foundation Model
               Semantics for 3D Cerebral Aneurysm Segmentation},
  author    = {Anonymized Authors},
  booktitle = {International Conference on Medical Image Computing
               and Computer-Assisted Intervention (MICCAI)},
  year      = {2026}
}