7 Comments
Sep 24 · edited Sep 24 · Liked by Devansh

Great analysis, Devansh. I appreciate your time and thoroughness. I still stand by my original analysis.

1. I worked with what OpenAI presented; I can't read between the lines. The model mentioned in the launch announcement was (and still is) o1-preview, so that's the model I tested.

2. I didn’t say “deliberately cherry-picking,” but I did use “cherry-picking” because after just 10 or 20 tries, it’s clear the model produces a range of results, not a single outcome. In fact, my very first run with o1-preview yielded a different result than the one shown in the OpenAI blog. I'm surprised OpenAI failed to mention such an important consideration.

3. What bothers me is that the results for two models, GPT-4o and o1-preview, were put side by side to demonstrate that o1-preview hallucinates less. While that might be true on average, there are cases with non-zero probability where GPT-4o could actually be more accurate than o1-preview for medical diagnosis.

4. I highlighted other examples from medical diagnostics in my article where o1-preview produces a distribution of outcomes if you run the model enough times. Many of those probabilities, though not explicitly stated, are significant. The sketch below shows the kind of repeated-run test I mean.
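Surfacing that distribution takes nothing more than a loop over the same prompt. A minimal sketch in Python, assuming the official openai client; the prompt, run count, and naive answer parser are all placeholders:

```python
from collections import Counter
from openai import OpenAI  # official openai-python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "..."  # the medical vignette under test (elided here)
N_RUNS = 20

def extract_diagnosis(answer: str) -> str:
    # Crude stand-in: a real harness would parse the top diagnosis
    # out of the model's free-text answer.
    return answer.strip().splitlines()[0]

counts = Counter()
for _ in range(N_RUNS):
    resp = client.chat.completions.create(
        model="o1-preview",  # the model under test
        messages=[{"role": "user", "content": PROMPT}],
    )
    counts[extract_diagnosis(resp.choices[0].message.content)] += 1

# Report the empirical distribution, not a single hand-picked run.
for diagnosis, n in counts.most_common():
    print(f"{diagnosis}: {n}/{N_RUNS} ({n / N_RUNS:.0%})")
```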

author

Your analysis is absolutely spot-on. I agree with you on all points.

The reason I switched from "cherry-picking" to "lack of testing" is simply that I've seen KBG pop up often enough that it's completely plausible they only tested it on the preview once or twice, and KBG happened both times (I've been told that o1 was tested 21 times on this prompt and produced KBG all 21 times). So whoever wrote the blog made a faulty assumption that this held for the preview as well. I don't want to assume malice where I can think of (extreme) incompetence.

One thing I would stress is that even though I think this was an outcome of negligence rather than bad faith, we still agree on the major points:

o1 isn't suitable for diagnosis as it is.

Doctors should be given information on actual distributions (not LLM-generated ones) so that they can make more informed decisions.

OAI has been misleading in its communications and should change their documents. If they don't change their blog post ASAP and acknowledge their mistakes, then it will be clear that they're bad-faith actors.


Always fascinating, relevant, and consequential content, Devansh. I believe extreme diligence and accountability are unquestionably in order for these often 'obscure' generative machines, especially in this early, rapid growth and experimental phase of foundation models.

That being said, I also fear the risk of pursuing an abstract ideal benchmark that ignores (rather than measures against) the very flawed reality of human performance.

You mention 'zero-fault fields', but few of these are deterministic or perfectly served by experts, especially when unaided by algorithms. Here's an excerpt from 'Noise' by Daniel Kahneman, Olivier Sibony, and Cass Sunstein:

"In medicine, between-person noise, or interrater reliability, is usually measured by the kappa statistic. The higher the kappa, the less noise. A kappa value of 1 reflects perfect agreement; a value of 0 reflects exactly as much agreement as you would expect between monkeys throwing darts onto a list of possible diagnoses. In some domains of medical diagnosis, reliability as measured by this coefficient has been found to be “slight” or “poor,” which means that noise is very high. It is often found to be “fair,” which is of course better but which also indicates significant noise. On the important question of which drug-drug interactions are clinically significant, generalist physicians, reviewing one hundred randomly selected drug-drug interactions, showed “poor agreement.”" ...

“These cases of interpersonal noise dominate the existing research, but there are also findings of occasion noise. Radiologists sometimes offer a different view when assessing the same image again and thus disagree with themselves (albeit less often than they disagree with others).”...

“In short, doctors are significantly more likely to order cancer screenings early in the morning than late in the afternoon. In a large sample, the order rates of breast and colon screening tests were highest at 8 a.m., at 63.7%. They decreased throughout the morning to 48.7% at 11 a.m. They increased to 56.2% at noon—and then decreased to 47.8% at 5 p.m. It follows that patients with appointment times later in the day were less likely to receive guideline-recommended cancer screening.”

Poor diagnostic interrater reliability has been documented in the assessment of heart disease, endometriosis, tuberculosis, melanoma, breast cancer, etc.
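To give a feel for the kappa numbers in that passage, here's a toy sketch using scikit-learn's cohen_kappa_score (the labels are invented):

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: two physicians labeling the same 10 cases.
rater_a = ["melanoma", "benign", "benign", "melanoma", "benign",
           "melanoma", "benign", "benign", "melanoma", "benign"]
rater_b = ["melanoma", "benign", "melanoma", "melanoma", "benign",
           "benign", "benign", "benign", "melanoma", "melanoma"]

# kappa = (observed agreement - chance agreement) / (1 - chance agreement):
# 1.0 is perfect agreement; 0.0 is no better than monkeys throwing darts.
print(cohen_kappa_score(rater_a, rater_b))  # 0.4 -- "fair" agreement
```

The two raters agree on 7 of 10 cases, but chance alone would predict 5, so kappa lands at a noisy 0.4.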

Now, please consider that Florida's Surgeon General is telling older Floridians and others at the highest risk from COVID-19 to avoid most booster shots, saying they are potentially dangerous, against the professional consensus. What if o1 issued such a recommendation? (I could quote dozens of such cases in Florida alone; you get the point.)

Again, this is not to suggest we let these models go unchecked, but we should probably adopt the 'Exsupero Ursus' (outrun the bear) principle and consider their outperformance of the average human a great achievement and starting point.

Please keep up the great work!

author

A phenomenal comment, and why I love talking to you. Thank you for this.

When I say "zero-fault", I use it as a description for a field where mistakes are costly and high levels of transparency are non-negotiable, not literally. That being said, I'm not a risk expert like you, so my thinking is always going to lean more into simplistic binaries. I do acknowledge the importance of the "outrun the bear" principle, but I'd like to get there in a way that minimizes human suffering: in a way where we have clearly defined parameters for judgement, acceptable losses, and so on. That's something I don't see being implemented anywhere.

If you're up for it, I would love to have you come on here and talk about how you might frame our thinking as "thinking in bets", so to speak. It's an exceptionally important topic, one I've been wanting to write about but haven't known how to approach.

Sep 24 · Liked by Devansh

Thank you for the follow-up, Devansh; always interesting.

I'm interested in the repeatability and accuracy of these models; they deliver their output confidently instead of giving you different options. My experience when trying to use ML algorithms in engineering is that you receive stares from the classical control and safety disciplines; they would never go with something they can't understand. Without a proof of why a decision was made, no ML/AI model will be accepted for deployment as a control system without a human in the loop (personally, I think even with a human it'll be difficult).

I was particularly interested in the approach of AlphaGeometry, which you covered back in January: a (symbolic) reasoning engine paired with an LLM. I think this approach is more similar to how our brains work in general, where we have one (or more) reasoning engines running on what we're seeing, and conscious thought deciding what to do about it (and how to communicate our thoughts where applicable).
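Very roughly, I picture that pairing as a propose-and-verify loop. A sketch, where llm_propose and symbolic_engine are hypothetical stand-ins rather than AlphaGeometry's actual interfaces:

```python
def neurosymbolic_solve(problem, llm_propose, symbolic_engine, max_steps=10):
    """AlphaGeometry-style loop: the symbolic engine deduces everything
    it can; when it stalls, the LLM proposes a creative step (e.g. an
    auxiliary construction) to unblock it."""
    state = problem
    for _ in range(max_steps):
        state, solved = symbolic_engine(state)  # exhaustive, checkable deduction
        if solved:
            return state                        # proof found and verifiable
        state = llm_propose(state)              # intuition the engine lacks
    return None                                 # no proof within budget
```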

From past research projects, there are multiple ML algorithms that are accurate at diagnosis within small, well-trained scopes; an LLM could interrogate as many of them as are applicable at once and aggregate the answers.

author

This is the way to go: LLMs as an orchestration layer, specialized models for any precision tasks. I have been preaching this for so long.
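A minimal sketch of that pattern, where every specialist, diagnosis, and confidence number is a made-up placeholder:

```python
from typing import Callable, NamedTuple

class Finding(NamedTuple):
    diagnosis: str
    confidence: float

# Hypothetical narrow specialists: small, well-validated models,
# each trained on a tight scope.
SPECIALISTS: dict[str, Callable[[str], Finding]] = {
    "cardiology": lambda case: Finding("aortic stenosis", 0.71),
    "radiology":  lambda case: Finding("no acute finding", 0.88),
}

def llm_select(case: str) -> list[str]:
    """Stand-in for the LLM's routing step: decide which specialist
    models apply to this case. Stubbed here for illustration."""
    return ["cardiology", "radiology"]

def orchestrate(case: str) -> list[Finding]:
    """Fan the case out to every applicable specialist and return the
    aggregated findings, sorted by confidence, so the clinician sees
    a distribution rather than a single answer."""
    findings = [SPECIALISTS[name](case) for name in llm_select(case)]
    return sorted(findings, key=lambda f: f.confidence, reverse=True)

print(orchestrate("68-year-old with exertional syncope ..."))
```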


I read the whole article; it's very important. The most important part concerns medical diagnosis: once we get a final diagnosis, we can go forward with the final treatment protocol.

It’s very useful.

In January or February 2025, the 12th International Health Dialogue conference will be held in India, placing a spotlight on Health Equity, Sustainability, and AI Advancement. Presenting this article there would be very helpful and useful for the standard of care of patients.

Regards
