Meta FAIR Releases NeuralSet: A Python Package for Neuro-AI Supporting fMRI, M/EEG, Spikes, and HuggingFace Embeddings

Researchers at Meta’s FAIR lab have released NeuralSet, a Python framework designed to eliminate one of the most persistent bottlenecks in Neuro-AI research: the painful, fragmented process of getting brain data into a deep learning pipeline.

The Problem: Neuroscience Data is Stuck in the Pre-Deep Learning Era
Neuroscience already has excellent, battle-tested software. Tools like MNE-Python, EEGLAB, FieldTrip, Brainstorm, Nilearn, and fMRIPrep are the gold standard for signal processing across electrophysiology and neuroimaging. The problem is that these tools predate the deep learning era: they rely on eager loading, assume that datasets fit in RAM, and lack native bridges for temporally aligning neural time series with high-dimensional embeddings from modern AI frameworks like HuggingFace Transformers.
The result? Researchers spend enormous effort building ad hoc pipelines full of manual data wrangling, manual caching, and backend configuration – just to pair brain signals with, say, GPT-2 embeddings of a transcript for a single experiment. As public datasets on platforms like OpenNeuro reach terabyte scale, and experimental protocols increasingly include continuous speech and video stimuli, this infrastructure gap is no longer just a nuisance – it’s a barrier to science.

What NeuralSet Actually Does
The central design principle of NeuralSet is the separation of structure from data. Instead of pre-loading raw signals, NeuralSet represents the logical structure of an experiment as lightweight, event-driven metadata – kept completely separate from the memory- and compute-intensive extraction of the actual signals. The framework is organized around five core abstractions: Events, Extractors, Segments, per-segment sample data, and the top-level Study.
Essentially, everything in the experiment – an fMRI run, a word spoken during the task, a video stimulus – is modeled as an Event: a lightweight Python record defined by a type, a start time, a duration, and a timeline (a unique identifier for the recording session it belongs to). A Study object gathers all events in a dataset into a single pandas DataFrame. Importantly, NeuralSet supports BIDS-compliant datasets, although it is not limited to them. Because the DataFrame contains only lightweight metadata – not raw signals – developers can filter, analyze, and recombine large datasets using standard pandas operations without loading a single byte of raw data into memory.
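The pattern is easy to picture without NeuralSet itself. The sketch below (field names are illustrative, not NeuralSet’s actual schema) shows events as lightweight metadata records gathered into a pandas DataFrame, then filtered with ordinary pandas operations while the raw signals stay on disk:

```python
import pandas as pd

# Hypothetical event records: only metadata (type, start, duration,
# timeline), never the raw signal itself.
events = [
    {"type": "fmri_run", "start": 0.0,  "duration": 600.0, "timeline": "sub-01_ses-01"},
    {"type": "word",     "start": 12.3, "duration": 0.4,   "timeline": "sub-01_ses-01"},
    {"type": "word",     "start": 12.9, "duration": 0.5,   "timeline": "sub-01_ses-01"},
    {"type": "video",    "start": 0.0,  "duration": 120.0, "timeline": "sub-01_ses-01"},
]
df = pd.DataFrame(events)

# Standard pandas filtering: word events in the first minute, selected
# without touching a single byte of signal data on disk.
words = df[(df["type"] == "word") & (df["start"] < 60.0)]
print(len(words))  # 2
```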
Transform functions can then be chained to enrich or filter events – for example, annotating words with their sentence context, assigning cross-validation splits, or slicing long audio and video events into short segments. Multiple transforms can be composed with a Chain, which yields a single, reproducible preprocessing pipeline.
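A minimal sketch of this composition idea, assuming each transform maps an events DataFrame to a new one (the function and class names here are ours, not NeuralSet’s API):

```python
import pandas as pd

def keep_words(events: pd.DataFrame) -> pd.DataFrame:
    # Filter transform: drop everything that is not a word event.
    return events[events["type"] == "word"].reset_index(drop=True)

def assign_cv_split(events: pd.DataFrame) -> pd.DataFrame:
    # Enrichment transform: tag events with a cross-validation split.
    out = events.copy()
    out["split"] = ["train" if i % 5 else "test" for i in range(len(out))]
    return out

class Chain:
    """Compose transforms into one reproducible pipeline."""
    def __init__(self, *steps):
        self.steps = steps
    def __call__(self, events):
        for step in self.steps:
            events = step(events)
        return events

pipeline = Chain(keep_words, assign_cv_split)
events = pd.DataFrame({"type": ["word", "sound", "word"], "start": [0.1, 0.5, 0.9]})
result = pipeline(events)
print(list(result["split"]))  # ['test', 'train']
```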


When it’s time to actually touch the data, NeuralSet uses Extractors to bridge the gap between the metadata layer and the numerical arrays that machine learning models require. For neural recordings, NeuralSet wraps the preprocessing stacks of domain-specific libraries directly: an FmriExtractor delegates to Nilearn for signal cleaning, smoothing, and surface- or atlas-based projection, while MegExtractor and EegExtractor delegate to MNE-Python for filtering, re-referencing, and resampling. The same unified interface covers iEEG, fNIRS, EMG, and spike recordings – switching modalities only requires changing a configuration parameter, not rewriting the pipeline.
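One common way to get this “change one parameter, keep the pipeline” behavior is a registry that maps a modality string to an extractor class. The sketch below is our illustration of that pattern, not NeuralSet’s actual code; the class names and backend labels are assumptions:

```python
# Hypothetical extractor classes; in the real library these would wrap
# Nilearn, MNE-Python, etc. behind a shared interface.
class FmriExtractor:
    backend = "nilearn"

class EegExtractor:
    backend = "mne"

class SpikeExtractor:
    backend = "spikes"  # placeholder label for a spike-sorting backend

EXTRACTORS = {"fmri": FmriExtractor, "eeg": EegExtractor, "spikes": SpikeExtractor}

def build_extractor(config: dict):
    # Swapping modalities = changing one config key; the rest of the
    # pipeline sees the same extractor interface either way.
    return EXTRACTORS[config["modality"]]()

extractor = build_extractor({"modality": "eeg"})
print(extractor.backend)  # mne
```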
For experimental stimuli, NeuralSet offers native integration with the HuggingFace ecosystem. A HuggingFaceImage extractor can embed stimulus frames using DINOv2 or CLIP; analogous extractors exist for audio (Wav2Vec, Whisper), text (GPT-2, LLaMA), and video (VideoMAE). Importantly, NeuralSet can expand a static embedding – say, one vector per image – into a time series at an arbitrary sampling frequency, so that stimulus representations stay temporally aligned with the neural recording.
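The expansion step is conceptually simple: repeat the stimulus vector for as many samples as the stimulus lasts at the target rate. A NumPy sketch (our illustration, with made-up function names and an arbitrary embedding dimension):

```python
import numpy as np

def expand_embedding(vector, duration_s, sfreq):
    """Tile one static embedding vector into a (n_samples, dim) time
    series at sampling frequency `sfreq`, so it can be aligned
    sample-for-sample with a neural recording."""
    n_samples = int(round(duration_s * sfreq))
    return np.tile(vector, (n_samples, 1))

clip_embedding = np.random.randn(768)              # e.g. one vector per image
series = expand_embedding(clip_embedding, 2.0, 100.0)  # 2 s at 100 Hz
print(series.shape)  # (200, 768)
```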
Extractors follow a three-stage extraction model: validation (parameters are checked at build time), preparation (results are precomputed and cached for all events), and extraction (lazy retrieval from the cache during model training). Expensive computations – such as running a large language model over every word in a corpus – are therefore performed once and reused across experiments. The output for a single segment is a dictionary of tensors keyed by extractor name, paired with the corresponding segment metadata.
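A minimal sketch of that validate / prepare / extract lifecycle, assuming a dictionary cache and a toy stand-in for the expensive computation (method and class names are ours):

```python
class CachedExtractor:
    def __init__(self, scale):
        # Stage 1: validate configuration at build time.
        if scale <= 0:
            raise ValueError("scale must be positive")
        self.scale = scale
        self._cache = {}

    def prepare(self, event_ids):
        # Stage 2: precompute and cache results for all events, once.
        for event_id in event_ids:
            self._cache[event_id] = self._expensive(event_id)

    def extract(self, event_id):
        # Stage 3: lazy, cheap cache lookup during model training.
        return self._cache[event_id]

    def _expensive(self, event_id):
        # Stand-in for e.g. a large language model forward pass.
        return event_id * self.scale

ex = CachedExtractor(scale=2.0)
ex.prepare([1, 2, 3])
print(ex.extract(2))  # 4.0
```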
Segmenter, DataLoader, and Cluster-Ready Infrastructure
A Segmenter slices the events DataFrame into Segments – contiguous temporal windows representing single training examples – either on a sliding-window grid or anchored to specific trigger events such as image or word onsets. The resulting SegmentDataset is a standard PyTorch Dataset, directly compatible with DataLoader, PyTorch Lightning, or any PyTorch-based training framework.
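A sliding-window dataset is easy to sketch. A PyTorch map-style Dataset only needs `__len__` and `__getitem__`, so the class below (names and parameters ours, not NeuralSet’s) would be usable with `torch.utils.data.DataLoader` even though it never imports torch:

```python
import numpy as np

class SlidingWindowDataset:
    """Slice a recording into overlapping windows; each window is one
    training example."""
    def __init__(self, signal, window, stride):
        self.signal, self.window, self.stride = signal, window, stride

    def __len__(self):
        return 1 + (len(self.signal) - self.window) // self.stride

    def __getitem__(self, i):
        start = i * self.stride
        return self.signal[start:start + self.window]

signal = np.arange(100, dtype=np.float32)  # stand-in for a 100-sample recording
ds = SlidingWindowDataset(signal, window=20, stride=10)
print(len(ds), ds[1][0])  # 9 10.0
```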
NeuralSet is built on the exca package, which handles deterministic, hash-based caching, computational provenance, and hardware-agnostic execution. Changing a single preprocessing parameter invalidates only the affected downstream caches, leaving independent branches untouched. Full traceability is preserved: any processed tensor can be traced back to the exact version of the raw data and the specific preprocessing sequence that produced it. Researchers can prototype on a single subject on a laptop, then scale to 100 subjects on a SLURM-based HPC cluster by changing a single configuration flag – no infrastructure-specific code required.
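The core idea behind this kind of caching can be sketched with the standard library (exca’s actual mechanics differ; this only illustrates the principle): derive the cache key from a hash of the full configuration, so any changed parameter yields a new key while unrelated configurations keep their old caches.

```python
import hashlib
import json

def cache_key(config: dict) -> str:
    # Serialize the config deterministically, then hash it; the digest
    # becomes the cache directory / file name for this exact pipeline.
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

a = cache_key({"sfreq": 100, "highpass": 0.5})
b = cache_key({"sfreq": 100, "highpass": 1.0})
print(a != b)  # True: changing one parameter invalidates only this branch
```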
NeuralSet uses Pydantic to enforce strict schema validation on every configurable object – Extractors, Segmenters, and Transforms are all Pydantic BaseModel subclasses. A misconfigured parameter (for example, a negative filter frequency or an invalid BIDS directory path) raises a clear error immediately, before any job is submitted, rather than causing a failure hours into processing.
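The fail-fast pattern is independent of the library. The stand-in sketch below uses a dataclass with `__post_init__` to show the same behavior Pydantic provides (the config class and field names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class FilterConfig:
    highpass_hz: float
    lowpass_hz: float

    def __post_init__(self):
        # Reject impossible configurations at construction time,
        # before any compute job is ever submitted.
        if self.highpass_hz < 0 or self.lowpass_hz < 0:
            raise ValueError("filter frequencies must be non-negative")
        if self.highpass_hz >= self.lowpass_hz:
            raise ValueError("highpass must be below lowpass")

try:
    FilterConfig(highpass_hz=-1.0, lowpass_hz=40.0)
except ValueError as err:
    print(err)  # filter frequencies must be non-negative
```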
How It Stacks Up Against Existing Tools
In the accompanying paper, the team presents a detailed comparison of NeuralSet against 18 neuroscience software packages across neural modalities (fMRI, EEG, MEG, iEEG, spikes, and more), stimulus modalities (image, video, audio, text), and infrastructure features (Python support, memory mapping, batching, caching, and distributed extraction). NeuralSet is the only package in the comparison with full support across all categories.
Key Takeaways
- NeuralSet unifies brain data and AI in one pipeline. Meta FAIR researchers built NeuralSet to bridge the gap between multimodal neural recordings (fMRI, M/EEG, spikes) and modern deep learning frameworks, delivering a single PyTorch-ready DataLoader for both.
- Structure–data separation eliminates memory constraints. NeuralSet keeps lightweight event metadata separate from heavy signal extraction, so researchers can filter and analyze terabyte-scale datasets without loading a single byte of raw data into RAM.
- Switching recording modalities means changing one configuration parameter. The unified Extractor interface wraps MNE-Python, Nilearn, and HuggingFace models – covering fMRI, EEG, MEG, iEEG, fNIRS, EMG, spikes, text, audio, and video – with no pipeline rewriting required.
- Pydantic validation and hash-based caching prevent wasted compute. Configuration errors are caught before any job starts, and the caching system ensures expensive computations like LLM embeddings are performed once and reused across experiments.
- The same code runs on a laptop or a SLURM cluster. NeuralSet’s hardware-agnostic backend, powered by the exca package, lets researchers scale from local prototyping to high-performance cluster runs by updating a single configuration flag.
Check out the Paper and GitHub page.



