A deep dive into FNet, FFT-based mixing, and why the future of AI might belong to fixed-structure models that don’t even try to learn what they can encode.
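For readers who want to see what "FFT-based mixing" boils down to, here is a minimal NumPy sketch of an FNet-style mixing step (the function name and toy shapes are my own illustration, not taken from the paper):

```python
import numpy as np

def fourier_mixing(x: np.ndarray) -> np.ndarray:
    """FNet-style token mixing: a 2D FFT over the sequence and hidden
    dimensions, keeping only the real part. No learned parameters."""
    # x has shape (seq_len, hidden_dim)
    return np.fft.fft2(x).real

# Toy usage: mix an 8-token sequence of 16-dim embeddings.
x = np.random.randn(8, 16)
print(fourier_mixing(x).shape)  # (8, 16), same shape as the input
```

That one parameter-free transform is the entire replacement for self-attention, which is where the speed advantage comes from.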
Fascinating read—and a compelling case for FNet’s role in the future of efficient inference. But I wonder if what we’re witnessing isn’t a replacement of self-attention so much as its redistribution across a tiered architecture. In constraint-rich edge environments, fixed-structure models like FNet shine as high-speed intake layers—but their strength is also their boundary. The absence of adaptive context binding may limit resilience when asymmetry strikes.
There’s a case to be made for coupling these encoders with recursive agents downstream: systems that selectively intervene when novelty, ambiguity, or deviation from encoded priors emerges. In that sense, FNet isn’t the brain—it’s the nervous system. What’s missing is the cortex that knows when to listen harder.
Curious if you’ve considered this kind of hybrid layering in your vision of the edge.
This is great thinking. I was thinking more from a low-resource perspective, but the hybrid might be great too.
Thanks, Devansh—this resonated. You’re right to point out the resource angle, but I keep circling something more structural. Some architectures lock early—others seem to wait for constraint to declare itself before they commit attention. That delay feels key.
There’s a broader scaffolding I’ve been developing—called VIECAF—that models how constraint asymmetries guide inference velocity through tiered systems. It’s still taking shape, but edge models like FNet feel like part of that story: fast, fixed, and efficient—yet lacking the recursive tail that lets novelty propagate significance.
Not ready to open the whole architecture just yet, but grateful for the spark. I think you’re orbiting near it.
This is a great article. Good work. I love this idea. I spent most of my PhD in the frequency domain, so to speak, and this fractal structure of language is a very cool thing to think about. I'm seeing that this works out of the box by exploiting some of this natural structure in language, and that getting better performance means achieving higher-order corrections via whatever tricks. It's reminding me somehow of the esoteric renormalization group, which I spent a bit of time thinking about. Exciting stuff.
Exciting stuff here.
Brilliant article, as always.
I believe that (F)FTs and FFNs have a huge role to play in optimizing Transformers and LLMs.
It's likely that as research goes deeper, we will find increasingly higher-performance optimizations to the existing architecture.
Of course, edge computing, local LLMs, robotics, and IoT products will enjoy much greater improvements in performance with far less compute.
I am pumped and excited by that thought!
Amazing article, Devansh.
You're my favorite AI writer, not just on Substack, but on the entire Internet.
The way you really make complex concepts simple is incredible and noteworthy.
Incredible because it educates your large audience without boring or losing them.
Noteworthy because you are helping the entire Internet to understand these concepts without getting lost in the technical details.
Great work!
Thank you.
Very exciting things up ahead.
I strongly suspect that you cannot solve multiple-needles-in-a-haystack or do in-context learning in general without quadratic cost over the sequence-length dimension. The authors of Mamba even have a 50+ page paper where they prove that state-space models are incapable of doing proper in-context learning. The same probably applies to Hyena, which also uses FFT for long convolutions.
In essence, the future is hybrid, and you cannot create a good model without 2+ full attention layers.
For performance, hybrid is key. However, not everything is going to need max performance.
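To make that concrete, here is a small PyTorch sketch of one possible hybrid stack, with cheap FFT mixing in most layers and full self-attention only in the last two (the class names and the 12-layer/2-attention split are illustrative assumptions, not taken from any specific paper):

```python
import torch
import torch.nn as nn

class FourierMixing(nn.Module):
    """Parameter-free FNet-style token mixing (2D FFT, keep the real part)."""
    def forward(self, x):                          # x: (batch, seq, hidden)
        return torch.fft.fft2(x, dim=(-2, -1)).real

class HybridEncoder(nn.Module):
    """Hypothetical hybrid: Fourier mixing in most layers, full
    self-attention only in the last `num_attention` layers."""
    def __init__(self, num_layers=12, num_attention=2, hidden=256, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(num_layers):
            if i >= num_layers - num_attention:
                self.blocks.append(nn.MultiheadAttention(hidden, heads, batch_first=True))
            else:
                self.blocks.append(FourierMixing())

    def forward(self, x):
        for block in self.blocks:
            if isinstance(block, nn.MultiheadAttention):
                x, _ = block(x, x, x)              # quadratic in sequence length
            else:
                x = block(x)                       # O(n log n) mixing
        return x

# Toy usage: batch of 2 sequences, 128 tokens, 256-dim embeddings.
out = HybridEncoder()(torch.randn(2, 128, 256))
print(out.shape)                                   # torch.Size([2, 128, 256])
```

Only the two attention blocks pay the quadratic cost in sequence length; everything else mixes tokens in O(n log n).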
The imaginary part may still retain some ordering information; it would be a shame to lose it! Of course, it's also possible that my "assumptions" as an arrogant human being are at work, haha!
Very true. I think the imaginary part is worth integrating.
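One hedged way to keep it, as a minimal sketch (my own illustration, not something the FNet paper does), is to stack the real and imaginary parts and let a later layer decide what to keep:

```python
import numpy as np

def fourier_mixing_keep_imag(x: np.ndarray) -> np.ndarray:
    """Variant of FNet-style mixing that keeps both the real and the
    imaginary parts of the spectrum instead of discarding the latter."""
    spectrum = np.fft.fft2(x)                      # complex, shape (seq, hidden)
    return np.concatenate([spectrum.real, spectrum.imag], axis=-1)

x = np.random.randn(8, 16)
print(fourier_mixing_keep_imag(x).shape)           # (8, 32)
```

The obvious cost is that the width doubles, so a learned projection back to the original hidden size would be needed before the next block.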
I like this. Do you think that translating this into the FFT domain is analogous to the higher-level representation of 'concepts' or 'features' that Anthropic has been doing work on? I feel like encoding full concepts instead of tokens is the holy grail of language understanding and real artificial intelligence.
It's probably worth noting in the article that the FNet paper was released in May 2021. I am not aware of any significant follow-up literature, and the authors seem to have moved on to focus on work that uses traditional attention-based transformer architectures.
In the areas I'm familiar with, optimized implementations of BERT like MobileBERT are still the de facto standard for on-device text classification tasks. MobileBERT is generally "fast enough" to execute even on a mobile CPU using SIMD instructions like ARM Neon, which avoids the time and complexity required to interact with the GPU.
For text classification tasks, the context length (n in the runtime complexity table) is small enough not to be very significant. A typical configuration would be n=512 tokens. This is not exclusively a performance optimization - applying supervised classification to very large blocks of text just doesn't work very well, even if you use a huge model.
For generative tasks, you'd typically want a decoder-only model, rather than an encoder-only model like FNet. This paper does a good job describing why the FFT blocks don't work with a decoder-only architecture: https://arxiv.org/abs/2107.10932. They propose a nice adaptation to make it work, but I'm not aware of any serious follow-up research by major labs.
For the current state of the art in generative on-device models, I would take a look at the Gemma 3 technical report: https://arxiv.org/abs/2503.19786. In particular, they use a 5:1 ratio of local attention layers (1024 token context window) to global attention layers (128k token context window). This results in low values of "n" for most layers of the model, while maintaining enough long-context capability for reasonable performance.
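To make the 5:1 idea concrete, here is a tiny sketch of such a layer schedule (the function name, layer count, and window sizes here are illustrative; see the Gemma 3 report for the actual configuration):

```python
def layer_schedule(num_layers: int = 30, ratio: int = 5,
                   local_window: int = 1024, global_window: int = 128_000):
    """Illustrative local:global attention schedule in the spirit of
    Gemma 3: every (ratio + 1)-th layer is global, the rest are local."""
    schedule = []
    for i in range(num_layers):
        if (i + 1) % (ratio + 1) == 0:
            schedule.append(("global", global_window))
        else:
            schedule.append(("local", local_window))
    return schedule

# For most layers, n is capped at 1024, so attention cost stays small.
print(layer_schedule()[:7])
# [('local', 1024), ('local', 1024), ('local', 1024), ('local', 1024),
#  ('local', 1024), ('global', 128000), ('local', 1024)]
```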
The MobileNetV4 paper is also good for understanding some of the practical trade-offs in adapting neural network designs to run well on diverse mobile hardware: https://arxiv.org/abs/2404.10518.