Technology & AI

The Netflix AI Team Just Open Sourced VOID: An AI Model That Erases Objects From Videos – Physics and All

Video editing has always had a dirty secret: removing an object from a frame is easy; making the scene look like it was never there is brutally difficult. Take out the man with the guitar, and you’re left with a floating instrument that defies gravity. Hollywood VFX teams spend weeks fixing this type of problem. A team of researchers from Netflix and INSAIT, Sofia University ‘St. Kliment Ohridski,’ has released VOID (Video Object and Interaction Removal), a model that can do it automatically.

VOID removes objects from videos along with all of their interactions in the scene — not just secondary effects like shadows and reflections, but physical consequences, like a held object falling when the person holding it is removed.

What Problem Does VOID Really Solve?

Standard video inpainting models – the kind used in most editing tools today – are trained to fill in the pixels where the object used to be. They are very sophisticated painters. What they don’t do is reason about cause and effect: if I remove an actor holding a prop, what should happen to that prop?

Existing methods for video object removal focus primarily on inpainting the content ‘behind’ the object and correcting surface-level artifacts such as shadows and reflections. But when the removed object has physically important interactions, such as supporting or colliding with other objects, current models fail to account for them and produce implausible results.

VOID is built on top of CogVideoX and fine-tuned for object removal with interaction-aware mask conditioning. The key innovation is how the model understands the scene – not just ‘which pixels should I fill?’ but ‘what is physically plausible after this object disappears?’

A canonical example from the paper: if the person holding a guitar is removed, VOID also removes the person’s support of the guitar – causing it to fall naturally. That is no small thing. The model has to understand that the guitar was supported by the man, and that removing the man means gravity takes over.

And unlike much previous work, VOID was tested head-to-head against real competitors. Experiments on both synthetic and real data show that the method preserves consistent scene dynamics after object removal better than prior video object removal methods, including ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Gen-Omnimatte.

Architecture: CogVideoX Under the Hood

VOID is built on CogVideoX-Fun-V1.5-5b-InP — a model from Alibaba PAI — fine-tuned for interaction-aware object removal with quadmask conditioning. CogVideoX is a 3D-Transformer-based video generation model. Think of it as a video version of Stable Diffusion – a diffusion model that operates on a temporal sequence of frames instead of a single image. The base model (CogVideoX-Fun-V1.5-5b-InP) is released by Alibaba PAI on Hugging Face, and developers need to download it separately before using VOID.

Details of the fine-tuned architecture: a CogVideoX 3D Transformer with 5B parameters that takes the video, the quadmask, and a text description of the scene after removal as input, works at a native resolution of 384×672, processes up to 197 frames, uses a DDIM scheduler, and supports quantized inference for lower memory use.
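For readers who prefer code to prose, the specs above can be summarized as a small configuration object. This is a hypothetical sketch for orientation only – the field names are mine, not the repo’s actual configuration keys:

```python
from dataclasses import dataclass

# Illustrative summary of the fine-tuned setup described above.
# Field names are invented for clarity; consult the VOID repo for
# the real configuration format.
@dataclass(frozen=True)
class VoidConfig:
    base_model: str = "CogVideoX-Fun-V1.5-5b-InP"  # Alibaba PAI base model
    parameters: str = "5B"                          # transformer size
    resolution: tuple = (384, 672)                  # native (height, width)
    max_frames: int = 197                           # longest supported clip
    scheduler: str = "DDIM"                         # diffusion scheduler
    inputs: tuple = ("video", "quadmask", "text_prompt")

cfg = VoidConfig()
```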

The quadmask is arguably the most interesting technical contribution here. Rather than a binary mask (remove this pixel / keep this pixel), a quadmask is a 4-valued mask that distinguishes the main object to be removed, overlapping regions, affected regions (e.g., objects that will fall once their support is removed), and the background to be preserved.

Basically, each pixel in the mask gets one of four values: 0 (the main object to be removed), 63 (overlap between the core and affected regions), 127 (interaction-affected region — objects that will move or change because of the removal), and 255 (background, keep as is). This gives the model a structured semantic map of what is happening in the scene, not just where the object is.
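Constructing such a mask from per-region segmentations is straightforward. Here is a minimal sketch with NumPy, assuming the four values listed above; the helper name and its inputs are my own, not part of the VOID codebase:

```python
import numpy as np

# The four quadmask codes described in the paper.
CORE, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255

def build_quadmask(core, affected, shape):
    """Combine boolean region masks into a single uint8 quadmask.

    core     -- boolean array marking the object to remove
    affected -- boolean array marking regions that will change
                (e.g. a prop that falls once its support is gone)
    """
    mask = np.full(shape, BACKGROUND, dtype=np.uint8)
    mask[affected] = AFFECTED            # will move/change after removal
    mask[core] = CORE                    # the object being erased
    mask[core & affected] = OVERLAP      # pixels belonging to both regions
    return mask

# Toy 4x4 frame: the person fills the left half, the prop the top row.
core = np.zeros((4, 4), dtype=bool); core[:, :2] = True
affected = np.zeros((4, 4), dtype=bool); affected[0, :] = True
qm = build_quadmask(core, affected, (4, 4))
```

In a real pipeline the `core` and `affected` regions would come from segmentation masks per frame, producing one quadmask per frame of the clip.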

A Two-Pass Inference Pipeline

VOID uses two transformer checkpoints, trained in sequence. You can run inference with Pass 1 alone or combine both passes for maximum temporal consistency.

Pass 1 (void_pass1.safetensors) is the core inpainting model and is sufficient for most videos. Pass 2 serves a specific purpose: correcting a known failure mode. If the output shows object deformation – a known weakness of video diffusion models – the optional second pass re-runs the prediction initialized from flow-warped latents derived from the first pass, which stabilizes object shapes in the newly inpainted regions.

It’s worth understanding the distinction: Pass 2 isn’t for long clips – it is primarily a shape-stability correction. If the diffusion model produces objects that gradually warp or degrade from frame to frame (a well-documented artifact in video diffusion), Pass 2 uses optical flow to warp the latents from Pass 1 and feeds them in as the initialization of a second diffusion run, which keeps the shape of inpainted objects consistent frame by frame.
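The control flow of that second pass can be sketched as follows. This is a conceptual toy, not the repo’s actual API: the flow warp is a stand-in integer shift, and the denoiser is a stub, but the structure – each frame’s diffusion re-initialized from the previous frame’s flow-warped result – mirrors the idea described above:

```python
import numpy as np

def warp_with_flow(latent, flow):
    # Stand-in for optical-flow warping: an integer-pixel shift.
    dy, dx = flow
    return np.roll(latent, shift=(dy, dx), axis=(0, 1))

def run_pass2(pass1_latents, flows, denoise_fn):
    """Re-denoise each frame, initialized from the previous frame's
    flow-warped latent, so object shapes stay consistent over time."""
    out = [pass1_latents[0]]                   # first frame is kept as-is
    for latent, flow in zip(pass1_latents[1:], flows):
        init = warp_with_flow(out[-1], flow)   # temporal prior from Pass 1
        out.append(denoise_fn(init, latent))   # second diffusion run (stubbed)
    return out

# Toy usage: an identity "denoiser" that simply trusts the warped prior.
frames = [np.full((4, 4), float(i)) for i in range(3)]
flows = [(0, 1), (0, 1)]
result = run_pass2(frames, flows, denoise_fn=lambda init, lat: init)
```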

How to Create Training Data

This is where things get really interesting. Training a model to understand physical interactions requires paired videos – the same scene with and without an object, where the physics plays out correctly in both. Paired real-world data at this scale does not exist, so the team built it synthetically.

The training used paired synthetic videos created from two sources: HUMOTO — human-object interactions rendered in Blender with physics simulation — and Kubric — object-object interactions using Google Scanned Objects.

HUMOTO uses motion-capture data for human-object interactions. The core mechanic is Blender re-simulation: a scene is set up with a person and objects and rendered with the person present, then the person is removed from the simulation and the physics is re-run from the same starting point. The result is a physically correct counterfactual – objects that were held or supported now fall, as they should. Kubric, developed by Google Research, applies the same idea to object-object collisions. Together they produce a dataset of paired videos where the physics is correct by construction rather than approximated by human annotators.
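The re-simulation idea is easy to illustrate with toy physics. This sketch is purely conceptual – it is not Blender’s or Kubric’s API – but it shows how running the same simulation twice, once with support and once without, yields a ground-truth before/after pair:

```python
# Toy 1D gravity simulation: an object held at some height either
# stays put (supported by the person) or falls once the person is
# removed and physics is re-run from the same initial state.
G, DT = 9.8, 0.1  # gravity (m/s^2) and timestep (s)

def simulate(height, supported, steps):
    """Return the object's height at each step; it falls only if unsupported."""
    traj, velocity = [height], 0.0
    for _ in range(steps):
        if not supported:
            velocity += G * DT
            height = max(0.0, height - velocity * DT)
        traj.append(height)
    return traj

with_person = simulate(height=1.0, supported=True, steps=5)      # "before" clip
without_person = simulate(height=1.0, supported=False, steps=5)  # "after" clip
```

The rendered frames of both runs form one training pair: the model sees the “before” clip plus a mask and must predict the “after” clip, physics included.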

Key Takeaways

  • VOID goes beyond pixel inpainting. Unlike existing video inpainting tools that only fix visual artifacts like shadows and reflections, VOID models the physical consequences of removal — if you remove the person holding an object, the object naturally falls in the output video.
  • The quadmask is the core innovation. Instead of a simple binary remove/keep mask, VOID uses a 4-valued quadmask (values 0, 63, 127, 255) that encodes not only what to remove but which adjacent regions will be physically affected — giving the diffusion model a structured understanding of the scene to work from.
  • The two-pass design targets a real failure mode. Pass 1 handles most videos; Pass 2 exists to correct shape-deformation artifacts — a well-known weakness of video diffusion models — by using flow-warped latents from Pass 1 to initialize a second diffusion run.
  • Synthetic paired data made training possible. Since paired real-world object-removal video data does not exist at scale, the research team built it using Blender physics re-simulation (HUMOTO) and Google’s Kubric framework, generating ground-truth before/after video pairs where the physics is accurate.

Check out the Paper, Model Weights, and Repo.


Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
