Meet OpenMythos: An Open-Source PyTorch Reconstruction of Claude Mythos, Where 770M Parameters Perform Like a 1.3B Transformer

Anthropic has never published a technical paper on Claude Mythos. That hasn't stopped the research community from theorizing. A new open-source project called OpenMythos, released on GitHub by Kye Gomez, attempts something ambitious: a first-principles theoretical reconstruction of what Claude Mythos might be, built entirely in PyTorch and grounded in peer-reviewed research.
The project is not a leaked model or a distillation. It's a hypothesis expressed in code, and the hypothesis is specific enough to be falsifiable, which is what makes it interesting.
The Central Claim: Claude Mythos Is a Recurrent-Depth Transformer
OpenMythos proposes that Claude Mythos belongs to a class of architectures called Recurrent-Depth Transformers (RDTs), also referred to in the literature as Looped Transformers. The design differs significantly from a conventional transformer stack.
In a conventional transformer – GPT, LLaMA, Mistral – the model passes the input through a sequence of distinct layers, one after another, each with its own independent weights. More capability usually means more layers and more parameters. In a Recurrent-Depth Transformer, a fixed set of weights is applied repeatedly, T times in a loop, within a single forward pass. The same weights fire over and over. Reasoning depth becomes a function not of how many parameters are stored, but of how many iterations are run at inference time.
Think of it less like reading a book and more like refining a draft: the model returns to the same computational block again and again, improving its internal representation with each pass.
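The weight-sharing idea can be sketched in a few lines of PyTorch. This is an illustrative toy, not the OpenMythos code: a stack of T distinct layers versus one shared block reused T times, with `nn.Linear` standing in for a full transformer block.

```python
import torch
import torch.nn as nn

d_model, T = 64, 16

# Conventional stack: T independent layers -> T times the weights.
stack = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(T)])

# Recurrent depth: one shared block reused T times -> weights of a single layer.
shared = nn.Linear(d_model, d_model)

x = torch.randn(8, d_model)
h_stack, h_loop = x, x
for t in range(T):
    h_stack = torch.tanh(stack[t](h_stack))  # different weights each step
    h_loop = torch.tanh(shared(h_loop))      # same weights every step

n_stack = sum(p.numel() for p in stack.parameters())
n_loop = sum(p.numel() for p in shared.parameters())
print(n_stack // n_loop)  # the stack stores T times as many weights
```

Both paths apply 16 nonlinear transformations, but the looped version stores one-sixteenth of the parameters; in an RDT the loop count can also be changed at inference time, which the fixed stack cannot do.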
How the Architecture Is Built
OpenMythos realizes this as a three-part structure: Prelude → Recurrent Block → Coda. The Prelude and Coda are standard transformer layers that each run once. The Recurrent Block is the computational core, executed T = 16 times.
At each loop step t, the hidden state is updated with the rule:

ht+1 = A·ht + B·e + Transformer(ht, e)

Here, ht is the hidden state after loop iteration t, and e is the input as encoded by the Prelude, re-injected at every step. The re-injection is deliberate: without it, the hidden state would drift away from the original input signal over deep loops. The learned matrices A and B govern how much of the previous hidden state and of the encoded input is carried forward at each step.
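The update rule can be sketched directly. This is a minimal toy, assuming a small MLP as a stand-in for the full transformer block; the names `A`, `B`, and `f` mirror the equation above and are our own.

```python
import torch
import torch.nn as nn

d = 32
A = nn.Linear(d, d, bias=False)   # learned carry on the hidden state
B = nn.Linear(d, d, bias=False)   # learned re-injection of the encoded input
f = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, d))  # stand-in
                                  # for Transformer(h, e)

def recur(h, e, T=16):
    # h_{t+1} = A h_t + B e + f(h_t, e), applied T times with shared weights
    for _ in range(T):
        h = A(h) + B(e) + f(torch.cat([h, e], dim=-1))
    return h

e = torch.randn(4, d)             # Prelude output, re-injected every step
h = torch.zeros(4, d)             # initial hidden state
out = recur(h, e)                 # would be handed to the Coda for decoding
print(out.shape)
```

Note that `e` appears inside every iteration: dropping the `B(e)` term and the `e` argument to `f` is exactly the drift failure the text describes.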
The FFN inside the Recurrent Block is not a standard feedforward layer. OpenMythos replaces it with a Mixture-of-Experts layer following the DeepSeekMoE design: a large pool of fine-grained routed experts, of which only a top-K subset is activated per token, alongside shared experts that are always active and absorb general cross-domain patterns. Crucially, the router can select a different subset of experts at each loop depth, which means each iteration computes something different despite sharing the same weights. The MoE provides breadth across domains; the loop provides depth of reasoning.
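A toy version of that routing pattern looks like this. It is a sketch of the DeepSeekMoE idea, not the OpenMythos implementation: a shared expert that always fires, plus top-k routed experts weighted by the router's probabilities (the per-token Python loop is for clarity, not speed).

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d=32, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
             for _ in range(n_experts)])
        # Shared expert: always active, absorbs cross-domain patterns.
        self.shared = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):                         # x: (tokens, d)
        gate = self.router(x).softmax(dim=-1)     # routing probabilities
        topv, topi = gate.topk(self.k, dim=-1)    # top-k experts per token
        rows = []
        for t in range(x.size(0)):                # per-token loop for clarity
            y = self.shared(x[t])                 # shared expert always on
            for slot in range(self.k):
                e = topi[t, slot].item()
                y = y + topv[t, slot] * self.experts[e](x[t])
            rows.append(y)
        return torch.stack(rows)

out = TinyMoE()(torch.randn(5, 32))
print(out.shape)
```

Because the router sees the current hidden state, feeding it the state from successive loop iterations naturally selects different expert subsets at different depths, which is the property the text highlights.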
Attention is Multi-head Latent Attention (MLA) from DeepSeek-V2, which caches a low-rank compressed latent rather than full key/value projections, yielding a 10–20× reduction in KV-cache memory at production scale.
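The memory arithmetic behind latent-KV caching is easy to see in a sketch. This is a rough illustration of the idea (toy dimensions of our own choosing, not DeepSeek-V2's actual projection shapes): cache only a low-rank latent and reconstruct K and V from it on demand.

```python
import torch
import torch.nn as nn

d_model, d_latent, n_ctx = 256, 16, 128

down = nn.Linear(d_model, d_latent, bias=False)  # compress once, cache this
up_k = nn.Linear(d_latent, d_model, bias=False)  # expand to keys at use time
up_v = nn.Linear(d_latent, d_model, bias=False)  # expand to values

x = torch.randn(n_ctx, d_model)
cache = down(x)                                  # (n_ctx, d_latent) is cached
k, v = up_k(cache), up_v(cache)                  # full K/V are never stored

full_kv = 2 * n_ctx * d_model                    # floats for a full KV cache
latent_kv = n_ctx * d_latent                     # floats for the latent cache
print(full_kv / latent_kv)                       # 32.0x smaller in this toy
```

The compression ratio is just 2·d_model / d_latent, so the savings are set by how small the latent can be made without hurting quality.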
Reasoning in Continuous Latent Space
One of the most important properties of this architecture is that reasoning happens entirely in continuous latent space. There is no intermediate token output between loop steps: the model does not generate text mid-thought and then read it back in. This is structurally different from chain-of-thought prompting, where reasoning is emitted as a sequence of tokens, and it has been formally analyzed in both Saunshi et al. (2025) and COCONUT (2024).
Saunshi et al. (2025) show formally that each loop iteration of an RDT can emulate one step of chain-of-thought reasoning, but over real-valued vectors instead of discrete tokens. A continuous hidden thought can also superpose multiple candidate next steps simultaneously, enabling something close to a breadth-first search over reasoning paths within a single forward pass.
This also explains the length-generalization advantage. A standard transformer trained on 5-hop reasoning chains fails when tested on 10-hop chains: it has no way to extend its depth beyond what it saw during training. A Recurrent-Depth Transformer handles this naturally: running more loops at inference time extends the reasoning chain without retraining. Harder problems get more compute; easy ones exit early.
Solving the Stability Problem
Training looped models is notoriously unstable. The hidden state ht can grow without bound across iterations – a failure mode called residual explosion. OpenMythos addresses this with a Linear Time-Invariant (LTI) injection constraint borrowed from the Parcae architecture (Prairie et al., 2026): the spectral radius of A, written ρ(A), is forced below 1 by construction, ensuring stability regardless of learning rate or gradient noise.
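One simple way to enforce ρ(A) < 1 by construction is to rescale A by its spectral norm. This is our own minimal parametrization for illustration, not the Parcae scheme: since the spectral radius is bounded by the largest singular value, capping that value below 1 guarantees the iteration h ← A·h contracts.

```python
import torch

def constrain(W, alpha=0.9):
    # Rescale so the largest singular value is alpha < 1; the spectral
    # radius rho(A) is bounded by it, so rho(A) < 1 by construction.
    sigma = torch.linalg.matrix_norm(W, ord=2)
    return alpha * W / sigma

A = constrain(torch.randn(32, 32))
rho = torch.linalg.eigvals(A).abs().max()   # spectral radius of A
print(bool(rho < 1))                        # True: stable by construction

h = torch.randn(32)
for _ in range(100):                        # repeated application of A
    h = A @ h
print(bool(h.norm() < 1.0))                 # True: iterates shrink, no explosion
```

With the cap at 0.9, a hundred iterations shrink the state by a factor of at least 0.9¹⁰⁰ ≈ 2.7e-5, the opposite of residual explosion; in training one would apply such a constraint inside the forward pass (e.g. as a parametrization) rather than once.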
A second failure mode sits at the other extreme: beyond a certain loop depth, extra iterations degrade predictions – the hidden state overshoots the solution and drifts into noise. This is the 'overthinking' problem. An Adaptive Computation Time (ACT) halting head addresses it with a learned per-position score that dynamically decides when to stop looping. Positions that are hard to process receive more iterations; tokens that have already converged exit early.
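The halting mechanism can be sketched as follows. This is a toy version of the ACT-style idea, not the OpenMythos halting head: a learned head emits a halt probability each iteration, and a position stops updating once its accumulated halt mass crosses a threshold.

```python
import torch
import torch.nn as nn

d, T_max, threshold = 32, 16, 0.99
block = nn.Linear(d, d)                     # stand-in for the recurrent block
halt_head = nn.Linear(d, 1)                 # learned per-position halt score

h = torch.randn(4, d)                       # 4 positions
cum = torch.zeros(4)                        # accumulated halt probability
steps = torch.zeros(4, dtype=torch.long)    # loop depth actually used
for t in range(T_max):
    active = cum < threshold                # positions still "thinking"
    if not active.any():
        break                               # everyone has halted
    # Only active positions get another iteration; halted ones are frozen.
    h = torch.where(active[:, None], torch.tanh(block(h)), h)
    p = torch.sigmoid(halt_head(h)).squeeze(-1)
    cum = torch.where(active, cum + p, cum)
    steps = steps + active.long()
print(steps)                                # per-position loop depth
```

Each position ends up with its own effective depth between 1 and T_max, which is exactly the "hard tokens think longer, easy tokens exit early" behavior described above.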
Finally, depth-wise LoRA adapters introduce a small rank-r correction at each iteration depth, giving each loop step slightly different behavior without adding many parameters – bridging the gap between fully tied and fully independent layer weights.
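The depth-wise adapter idea can be sketched like this. This is our own toy version, not the OpenMythos code: one shared weight W plus a rank-r correction that differs per loop step t, so the parameter cost of untying the depths stays small.

```python
import torch
import torch.nn as nn

d, r, T = 512, 2, 16
W = nn.Linear(d, d, bias=False)                 # shared across all loop steps
lora_down = nn.ModuleList([nn.Linear(d, r, bias=False) for _ in range(T)])
lora_up = nn.ModuleList([nn.Linear(r, d, bias=False) for _ in range(T)])

def step(h, t):
    # Shared transform plus a depth-specific rank-r correction.
    return W(h) + lora_up[t](lora_down[t](h))

h = torch.randn(2, d)
for t in range(T):                              # each depth behaves differently
    h = torch.tanh(step(h, t))

shared_params = sum(p.numel() for p in W.parameters())
lora_params = sum(p.numel() for m in (lora_down, lora_up)
                  for p in m.parameters())
print(lora_params / shared_params)              # small relative overhead
```

All 16 depth-specific adapters together cost an eighth of the single shared matrix here (2·d·r·T vs. d² floats), which is the sense in which the scheme sits between fully tied and fully untied weights.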
Why the Parameter Efficiency Matters
The Parcae paper (Prairie et al., 2026) provides the strongest basis for the efficiency claim. At 770M parameters, the RDT matches a conventional 1.3B transformer trained on the same data – equivalent quality at roughly half the parameters. Both recurrence count and token count follow power laws with consistent exponents across scales, establishing the first predictable scaling laws for recurrent-depth training.
The implication matters: capability scales with inference-time computation, not just stored parameters. That reframes one of the central questions in the scaling debate – the right axis may not be parameter count at training time, but loop depth at inference time.
What OpenMythos Offers
OpenMythos provides four usable research artifacts: a fully configurable PyTorch implementation of the RDT hypothesis with the MoE FFN and Multi-head Latent Attention; LTI-constrained recurrent injection for stable training from the first step; depth-aware LoRA adapters that enable per-iteration behavior specialization; and a reproducible baseline for studying looped-transformer efficiency and inference-time reasoning depth.
Whether Mythos is an RDT or not, OpenMythos gives the research community something tangible and testable – a working implementation of an architecture class the literature has consistently suggested is understudied, and one that may represent a very different path to capable AI than simply training ever-larger models.



