NVIDIA AI Releases C-RADIOv4: A Single Vision Backbone Distilled from SigLIP2, DINOv3, and SAM3 for Classification, Dense Prediction, and Segmentation Workloads at Scale

How do you combine SigLIP2, DINOv3, and SAM3 into a single vision backbone without sacrificing compactness or any individual capability? NVIDIA's C-RADIOv4 answers this by distilling three strong teacher models, SigLIP2-g-384, DINOv3-7B, and SAM3, into a single student encoder. It extends the AM-RADIO and RADIOv2.5 line, keeping roughly the same compute cost while improving dense prediction quality, resolution stability, and downstream compatibility with SAM3.
The main idea is simple. Instead of choosing between a vision-language model, a self-supervised dense model, and a segmentation model, C-RADIOv4 tries to serve all three roles simultaneously with a single backbone.

Agglomerative distillation in RADIO
The RADIO family uses agglomerative distillation: a single ViT-style student is trained to match both the dense feature maps and the summary tokens of several different teachers.
Previous RADIO models used DFN CLIP, DINOv2, and SAM as teachers. They already supported multi-resolution training but exhibited 'mode switching', where the representation changed qualitatively as the input resolution changed. Later work such as PHI-S, RADIOv2.5, and FeatSharp improved and standardized multi-resolution handling, but the teacher set remained limited.
C-RADIOv4 upgrades the teachers:
- SigLIP2-g-384 for strong image-text alignment
- DINOv3-7B for high quality self-supervised dense features
- SAM3 for segmentation features and compatibility with the SAM3 decoder
The student is trained so that its dense features match DINOv3 and SAM3, while its summary tokens match SigLIP2 and DINOv3. This yields a single encoder that supports classification, retrieval, dense prediction, and segmentation.
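To make the objective concrete, here is a minimal PyTorch sketch of agglomerative multi-teacher distillation under the teacher/student split described above. The `student(image, head=...)` interface and the loss choices (MSE for dense maps, cosine for summaries) are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teachers, image):
    """teachers: dict name -> frozen teacher returning (summary [B, D], dense [B, N, D])."""
    loss = torch.zeros((), device=image.device)
    for name, teacher in teachers.items():
        with torch.no_grad():
            t_summary, t_dense = teacher(image)
        # Hypothetical API: one lightweight adaptor head per teacher maps the
        # shared student trunk into that teacher's embedding space.
        s_summary, s_dense = student(image, head=name)
        if name in ("dinov3", "sam3"):      # dense features follow DINOv3 and SAM3
            loss = loss + F.mse_loss(s_dense, t_dense)
        if name in ("siglip2", "dinov3"):   # summary tokens follow SigLIP2 and DINOv3
            loss = loss + (1 - F.cosine_similarity(s_summary, t_summary, dim=-1)).mean()
    return loss
```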
Stochastic multi-resolution training
C-RADIOv4 uses stochastic multi-resolution training over a small set of fixed resolutions.
Input sizes are sampled from two partitions:
- Low resolution: {128, 192, 224, 256, 384, 432}
- High resolution: {512, 768, 1024, 1152}
SigLIP2 runs natively at 384 px. Its features are upsampled by a factor of 3 with FeatSharp to match the 1152 px SAM3 features. The SAM3 teacher runs with mosaic augmentation at 1152 × 1152.
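As a rough illustration, the sampling scheme can be sketched in a few lines of Python. The partition probability `p_high` and the per-batch resizing are assumptions, since only the two resolution sets are specified.

```python
import random

LOW_RES = [128, 192, 224, 256, 384, 432]
HIGH_RES = [512, 768, 1024, 1152]

def sample_resolution(p_high: float = 0.5) -> int:
    """Pick a training resolution from one of the two fixed partitions."""
    pool = HIGH_RES if random.random() < p_high else LOW_RES
    return random.choice(pool)

# Each batch is resized to one sampled resolution, so the student sees the
# full {128..1152} px range during training rather than a single fixed size.
```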
This design smooths the performance curve across resolutions and improves low-resolution behavior. For example, on the ADE20k linear probe, C-RADIOv4-H reaches around:
- 55.20 mIoU at 512 px
- 57.02 mIoU at 1024 px
- 57.72 mIoU at 1536 px
The scaling trend closely tracks that of DINOv3-7B while using roughly an order of magnitude fewer parameters.
Shift-equivariant losses and MESA to remove teacher noise
Distilling large foundation models often copies their artifacts, not just their useful structure. SigLIP2 exhibits boundary noise patterns, and ViTDet-style models can show window-boundary artifacts. Direct feature regression can force the student to reproduce those patterns.
C-RADIOv4 introduces two shift-equivariant mechanisms to suppress that noise:
- Shift equivariant dense loss: The teacher and student see independently shifted crops of the same image. Before the squared error is computed, the features are aligned using the known shift, and the loss uses only the overlapping region (see the sketch after this list). Because the student never sees exactly the same positions as the teacher, it cannot simply memorize a fixed noise pattern and is forced to model input-dependent structure instead.
- Shift equivariant MESA: C-RADIOv4 also applies MESA-style regularization between the online network and its EMA copy. Again, the student and its EMA teacher see different crops, the features are aligned dynamically, and the loss is applied after layer normalization. This promotes a smoother loss landscape and better robustness without changing the overall feature structure.
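A minimal sketch of the shift equivariant dense loss, assuming features on a regular token grid and an integer token-level shift between the two crops; the actual alignment and normalization in C-RADIOv4 may differ.

```python
import torch
import torch.nn.functional as F

def shift_equivariant_dense_loss(s_feat, t_feat, dy: int, dx: int):
    """s_feat, t_feat: [B, H, W, D] dense features from shifted crops.

    The student crop is offset by (dy, dx) tokens relative to the teacher
    crop, so student position (y, x) corresponds to teacher position
    (y + dy, x + dx). The loss uses only the overlap, so a teacher's fixed
    positional noise cannot simply be memorized.
    """
    B, H, W, D = s_feat.shape
    ys, yt = max(0, -dy), max(0, dy)
    xs, xt = max(0, -dx), max(0, dx)
    h, w = H - abs(dy), W - abs(dx)
    s_overlap = s_feat[:, ys:ys + h, xs:xs + w]
    t_overlap = t_feat[:, yt:yt + h, xt:xt + w]
    return F.mse_loss(s_overlap, t_overlap)
```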
In addition, training uses DAMP, which applies random multiplicative perturbations to the weights. This further improves robustness to corruptions and small distribution shifts.
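A rough sketch of DAMP-style weight perturbation, assuming multiplicative Gaussian noise applied before the forward pass and undone afterwards; the noise scale `sigma` and the save/restore scheme are placeholders, not the paper's exact recipe.

```python
import torch

@torch.no_grad()
def damp_perturb(model, sigma: float = 0.1):
    """Save originals, then multiply each weight by (1 + eps), eps ~ N(0, sigma^2)."""
    originals = [p.detach().clone() for p in model.parameters()]
    for p in model.parameters():
        p.mul_(1 + torch.randn_like(p) * sigma)
    return originals

@torch.no_grad()
def damp_restore(model, originals):
    """Undo the perturbation once the gradient step has been taken."""
    for p, orig in zip(model.parameters(), originals):
        p.copy_(orig)
```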
Balancing teachers with a dispersion-aware summary loss
The summary loss in previous RADIO models used the cosine distance between student and teacher embeddings. Cosine distance removes magnitude but ignores how widely a teacher's embeddings spread in angle. Some teachers, such as SigLIP2, produce embeddings concentrated in a small cone, while the DINOv3 variants produce more diffuse embeddings.
With raw cosine distance, teachers with a wide angular spread contribute more loss and dominate optimization. In practice, DINOv3 tends to overshadow SigLIP2 early in training.
C-RADIOv4 replaces this with an angular-dispersion normalized loss: the squared angle between the student and teacher embeddings is divided by the teacher's angular dispersion. The measured dispersions are about 0.694 for SigLIP2-g-384, versus about 2.12 and 2.19 for DINOv3-H+ and DINOv3-7B. Normalizing by these values moderates the diffuse teachers' influence and preserves both language alignment and dense semantics.
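The normalized summary loss can be sketched as follows; treating the reported dispersion figures as fixed per-teacher constants is an assumption about how the normalization is applied in practice.

```python
import torch
import torch.nn.functional as F

# Reported angular dispersions per teacher (from the article's figures).
DISPERSION = {"siglip2_g_384": 0.694, "dinov3_hplus": 2.12, "dinov3_7b": 2.19}

def normalized_summary_loss(s_emb, t_emb, teacher: str):
    """Squared angle between embeddings, scaled by the teacher's dispersion,
    so diffuse teachers (DINOv3) no longer dominate compact ones (SigLIP2)."""
    cos = F.cosine_similarity(s_emb, t_emb, dim=-1).clamp(-1 + 1e-6, 1 - 1e-6)
    angle = torch.acos(cos)
    return (angle ** 2 / DISPERSION[teacher]).mean()
```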
Results: classification, dense prediction, and Probe3D
On ImageNet-1k zero-shot classification, C-RADIOv4-H reaches approximately 83.09% top-1 accuracy. It matches or improves on RADIOv2.5-H and C-RADIOv3-H at all resolutions, with the best performance around 1024 px.
On k-NN classification, C-RADIOv4-H improves on RADIOv2.5 and C-RADIOv3, and matches or exceeds DINOv3 starting at about 256 px. DINOv3 peaks around 192–256 px and then degrades, while C-RADIOv4 stays stable or improves at higher resolutions.
The dense and 3D metrics show the intended gains. On ADE20k, PASCAL VOC, NAVI, and SPair, the C-RADIOv4-H and SO400M variants outperform previous RADIO models and compete with DINOv3-7B on dense benchmarks. For C-RADIOv4-H, the reported scores are:
- ADE20k: 55.20 mIoU
- VOC: 87.24 mIoU
- NAVI: 63.44
- Average: 60.57


On Probe3D, covering depth, surface normals, NAVI, and SPair, C-RADIOv4-H again achieves the best NAVI and SPair scores in the RADIO family. Depth and surface normal metrics are close to those of C-RADIOv3-H, with small differences in either direction rather than a uniform improvement.
Integration with SAM3 and ViTDet mode for deployment
C-RADIOv4 is designed as a drop-in replacement for the Perception Encoder backbone in SAM3. The SAM3 decoder and memory components remain unchanged, and a reference implementation is provided via a SAM3 fork. Qualitative examples show that segmentation behavior is preserved for both text prompts such as “shoe”, “helmet”, “bicycle”, “spectator” and box prompts, and in some reported cases the C-RADIOv4-based SAM3 resolves failure cases of the original encoder.
For deployment, C-RADIOv4 offers an optional ViTDet mode: most transformer blocks use windowed attention, while a few use global attention. Supported window sizes range from 6 × 6 to 32 × 32 tokens, subject to divisibility constraints from the patch size and image resolution. On an A100, the SO400M model with a window size of at least 12 is faster than the SAM3 ViT-L+ encoder across input sizes, and the Large model with a window size of 8 comes close.
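For intuition, here is a generic ViTDet-style window partition in PyTorch; C-RADIOv4's actual block layout, including which layers keep global attention, comes from the released configuration rather than this sketch.

```python
import torch

def window_partition(x, ws: int):
    """[B, H, W, D] -> [B * (H//ws) * (W//ws), ws*ws, D].

    H and W must be divisible by ws, which is why the valid window sizes
    depend on the patch size and the input image resolution."""
    B, H, W, D = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, D)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, D)

# Attention then runs independently inside each (ws x ws) token window, so
# per-token cost scales with ws^2 instead of with the full token count.
```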
This makes C-RADIOv4 an attractive backbone for dense, high-resolution applications where full global attention in every layer is too expensive.
Key Takeaways
- One unified backbone: C-RADIOv4 distills SigLIP2-g-384, DINOv3-7B, and SAM3 into a single ViT-style encoder that supports classification, retrieval, dense prediction, and segmentation.
- Resolution behavior: Stochastic multi-resolution training over {128…1152} px, plus FeatSharp upsampling for SigLIP2, stabilizes performance across resolutions and tracks DINOv3-7B's scaling with far fewer parameters.
- Noise suppression via shift equivariance: The shift equivariant dense loss and shift equivariant MESA prevent the student from copying teacher boundary and window artifacts, focusing learning on input-dependent semantics.
- Balanced multi-teacher distillation: The angular-dispersion normalized summary loss balances the contributions of SigLIP2 and DINOv3, preserving both text alignment and dense representation quality.
- Ready for SAM3 and ViTDet deployment: C-RADIOv4 can directly replace the SAM3 Perception Encoder, provides a windowed ViTDet mode for faster high-resolution inference, and is released under the NVIDIA Open Model License.
Check out the Paper, Repo, Model-1, and Model-2.




