A deep dive into FNet, FFT-based mixing, and why the future of AI might belong to fixed-structure models that don’t even try to learn what they can encode.
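For readers who want to see what "FFT-based mixing" boils down to, here is a minimal NumPy sketch of an FNet-style mixing step (the function name and toy shapes are my own illustration, not taken from the paper):

```python
import numpy as np

def fourier_mixing(x: np.ndarray) -> np.ndarray:
    """FNet-style token mixing: a 2D FFT over the sequence and hidden
    dimensions, keeping only the real part. No learned parameters."""
    # x has shape (seq_len, hidden_dim)
    return np.fft.fft2(x).real

# Toy usage: mix an 8-token sequence of 16-dim embeddings.
x = np.random.randn(8, 16)
print(fourier_mixing(x).shape)  # (8, 16), same shape as the input
```

That one parameter-free transform is the entire replacement for self-attention, which is where the speed advantage comes from.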
Fascinating read—and a compelling case for FNet’s role in the future of efficient inference. But I wonder if what we’re witnessing isn’t a replacement of self-attention so much as its redistribution across a tiered architecture. In constraint-rich edge environments, fixed-structure models like FNet shine as high-speed intake layers—but their strength is also their boundary. The absence of adaptive context binding may limit resilience when asymmetry strikes.
There’s a case to be made for coupling these encoders with recursive agents downstream: systems that selectively intervene when novelty, ambiguity, or deviation from encoded priors emerges. In that sense, FNet isn’t the brain—it’s the nervous system. What’s missing is the cortex that knows when to listen harder.
Curious if you’ve considered this kind of hybrid layering in your vision of the edge.
This is great thinking. I was thinking more from a low-resource perspective, but the hybrid might be great too.
Thanks, Devansh—this resonated. You’re right to point out the resource angle, but I keep circling something more structural. Some architectures lock early—others seem to wait for constraint to declare itself before they commit attention. That delay feels key.
There’s a broader scaffolding I’ve been developing—called VIECAF—that models how constraint asymmetries guide inference velocity through tiered systems. It’s still taking shape, but edge models like FNet feel like part of that story: fast, fixed, and efficient—yet lacking the recursive tail that lets novelty propagate significance.
Not ready to open the whole architecture just yet, but grateful for the spark. I think you’re orbiting near it.
This is a great article. Good work. I love this idea. I spent most of my PhD in the frequency domain, so to speak, and this fractal structure of language is a very cool thing to think about. I'm seeing that this works out of the box by exploiting some of this natural structure in language, and that getting better performance means achieving higher-order corrections via whatever tricks. It's reminding me somehow of the esoteric renormalization group, which I spent a bit of time thinking about. Exciting stuff.
Exciting stuff here.
Brilliant article, as always.
I believe that (F)FTs and FFNs have a huge role to play in optimizing Transformers and LLMs.
It's likely that as research goes deeper, we will find increasingly higher-performance optimizations to the existing architecture.
Of course, edge computing, local LLMs, robotics, and IoT products will enjoy much greater improvements in performance with far less compute.
I am pumped and excited by that thought!
Amazing article, Devansh.
You're my favorite AI writer, not just on Substack, but on the entire Internet.
The way you really make complex concepts simple is incredible and noteworthy.
Incredible because it educates your large audience without boring or losing them.
Noteworthy because you are helping the entire Internet to understand these concepts without getting lost in the technical details.
Great work!
Thank you.
Very exciting things up ahead.
I strongly suspect that you cannot solve multiple-needles-in-a-haystack or do in-context learning in general without quadratic cost over the sequence-length dimension. The authors of Mamba even have a 50+ page paper where they prove that state-space models are incapable of doing proper in-context learning. The same probably applies to Hyena, which also uses FFT for long convolutions.
In essence, the future is hybrid, and you cannot create a good model without 2+ full attention layers.
For performance, hybrid is key. However, not everything is going to need max performance.
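To make that concrete, here is a small PyTorch sketch of one possible hybrid stack, with cheap FFT mixing in most layers and full self-attention only in the last two (the class names and the 12-layer/2-attention split are illustrative assumptions, not taken from any specific paper):

```python
import torch
import torch.nn as nn

class FourierMixing(nn.Module):
    """Parameter-free FNet-style token mixing (2D FFT, keep the real part)."""
    def forward(self, x):                          # x: (batch, seq, hidden)
        return torch.fft.fft2(x, dim=(-2, -1)).real

class HybridEncoder(nn.Module):
    """Hypothetical hybrid: Fourier mixing in most layers, full
    self-attention only in the last `num_attention` layers."""
    def __init__(self, num_layers=12, num_attention=2, hidden=256, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(num_layers):
            if i >= num_layers - num_attention:
                self.blocks.append(nn.MultiheadAttention(hidden, heads, batch_first=True))
            else:
                self.blocks.append(FourierMixing())

    def forward(self, x):
        for block in self.blocks:
            if isinstance(block, nn.MultiheadAttention):
                x, _ = block(x, x, x)              # quadratic in sequence length
            else:
                x = block(x)                       # O(n log n) mixing
        return x

# Toy usage: batch of 2 sequences, 128 tokens, 256-dim embeddings.
out = HybridEncoder()(torch.randn(2, 128, 256))
print(out.shape)                                   # torch.Size([2, 128, 256])
```

Only the two attention blocks pay the quadratic cost in sequence length; everything else mixes tokens in O(n log n).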
The imaginary part may still retain some ordering information; it would be a shame to lose it! Of course, it's also possible that my "assumptions" as an arrogant human being are at work, haha!
Very true. I think the imaginary part is worth integrating.
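One hedged way to keep it, as a minimal sketch (my own illustration, not something the FNet paper does), is to stack the real and imaginary parts and let a later layer decide what to keep:

```python
import numpy as np

def fourier_mixing_keep_imag(x: np.ndarray) -> np.ndarray:
    """Variant of FNet-style mixing that keeps both the real and the
    imaginary parts of the spectrum instead of discarding the latter."""
    spectrum = np.fft.fft2(x)                      # complex, shape (seq, hidden)
    return np.concatenate([spectrum.real, spectrum.imag], axis=-1)

x = np.random.randn(8, 16)
print(fourier_mixing_keep_imag(x).shape)           # (8, 32)
```

The obvious cost is that the width doubles, so a learned projection back to the original hidden size would be needed before the next block.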
I like this. Do you think that translating this into the FFT domain is analogous to the higher-level representation of 'concepts' or 'features' that Anthropic has been doing work on? I feel like encoding full concepts instead of tokens is the holy grail of language understanding and real artificial intelligence.
It's probably worth noting in the article that the FNet paper was released in May 2021. I am not aware of any significant follow-up literature, and the authors seem to have moved on to focus on work that uses traditional attention-based transformer architectures.
In the areas I'm familiar with, optimized implementations of BERT like MobileBERT are still the de facto standard for on-device text classification tasks. MobileBERT is generally "fast enough" to execute even on a mobile CPU using SIMD instructions like ARM Neon, which avoids the time and complexity required to interact with the GPU.
For text classification tasks, the context length (n in the runtime complexity table) is small enough not to be very significant. A typical configuration would be n=512 tokens. This is not exclusively a performance optimization - applying supervised classification to very large blocks of text just doesn't work very well, even if you use a huge model.
For generative tasks, you'd typically want a decoder-only model, rather than an encoder-only model like FNet. This paper does a good job describing why the FFT blocks don't work with a decoder-only architecture: https://arxiv.org/abs/2107.10932. They propose a nice adaptation to make it work, but I'm not aware of any serious follow-up research by major labs.
For the current state of the art in generative on-device models, I would take a look at the Gemma 3 technical report: https://arxiv.org/abs/2503.19786. In particular, they use a 5:1 ratio of local attention layers (1024 token context window) to global attention layers (128k token context window). This results in low values of "n" for most layers of the model, while maintaining enough long-context capability for reasonable performance.
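To make the 5:1 idea concrete, here is a tiny sketch of such a layer schedule (the function name, layer count, and window sizes here are illustrative; see the Gemma 3 report for the actual configuration):

```python
def layer_schedule(num_layers: int = 30, ratio: int = 5,
                   local_window: int = 1024, global_window: int = 128_000):
    """Illustrative local:global attention schedule in the spirit of
    Gemma 3: every (ratio + 1)-th layer is global, the rest are local."""
    schedule = []
    for i in range(num_layers):
        if (i + 1) % (ratio + 1) == 0:
            schedule.append(("global", global_window))
        else:
            schedule.append(("local", local_window))
    return schedule

# For most layers, n is capped at 1024, so attention cost stays small.
print(layer_schedule()[:7])
# [('local', 1024), ('local', 1024), ('local', 1024), ('local', 1024),
#  ('local', 1024), ('global', 128000), ('local', 1024)]
```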
The MobileNetV4 paper is also good for understanding some of the practical trade-offs in adapting neural network designs to run well on diverse mobile hardware: https://arxiv.org/abs/2404.10518.