This is a great article, thanks for the honest eval. I've noticed a significant difference in the orchestration layer at you.com, which seems to have built its business model on trying to nail the orchestration. It's not perfect, but it will help the masses use the agentic approach in chat-based interactions (solving most of what o1 claimed to do).
Would love to hear your eval of the agents/orchestration at you.com if possible.
I'll have to take a look there. What I saw was a bit more rudimentary, but they should have evolved by now.
I agree. I came at this from a slightly different direction recently here - https://medium.com/@mrsirsh/7-days-of-agent-framework-anatomy-from-first-principles-day-1-d54d5fb6d0a3. I was playing with building a simple agent framework from scratch to test a few ideas, and as part of this I added some basic wrappers and compared those models, but on a more specific task via the APIs. My assessment was the same in terms of ranking. 4o is fairly solid on things relating to planning and tool use, and Claude performs well. Gemini is awful, confabulating among other let-downs. Mini is reliable for well-structured cases.
That's an excellent article (shared it in my recs). If you're interested, you should come aggregate your learnings and do a guest post.
Thank you Devansh for taking the time to check it out! And yes, I would be interested in doing a guest post on this sometime, that would be awesome.
I'm excited to see it.