Discussion about this post

Filippo Marino

Always fascinating, relevant, and consequential content, Devansh. I believe extreme diligence and accountability are unquestionably in order for these often 'obscure' generative machines, especially in this early, rapidly growing, experimental phase of foundation models.

That said, I also see a risk in pursuing an abstract, ideal benchmark that ignores (rather than measures against) the very flawed reality of human performance.

You mention 'zero-fault fields', but few of these are deterministic or perfectly served by experts, especially when unaided by algorithms. Here's a passage from 'Noise' by Daniel Kahneman, Olivier Sibony, and Cass Sunstein:

"In medicine, between-person noise, or interrater reliability, is usually measured by the kappa statistic. The higher the kappa, the less noise. A kappa value of 1 reflects perfect agreement; a value of 0 reflects exactly as much agreement as you would expect between monkeys throwing darts onto a list of possible diagnoses. In some domains of medical diagnosis, reliability as measured by this coefficient has been found to be “slight” or “poor,” which means that noise is very high. It is often found to be “fair,” which is of course better but which also indicates significant noise. On the important question of which drug-drug interactions are clinically significant, generalist physicians, reviewing one hundred randomly selected drug-drug interactions, showed “poor agreement.”" ...

“These cases of interpersonal noise dominate the existing research, but there are also findings of occasion noise. Radiologists sometimes offer a different view when assessing the same image again and thus disagree with themselves (albeit less often than they disagree with others).”...

“In short, doctors are significantly more likely to order cancer screenings early in the morning than late in the afternoon. In a large sample, the order rates of breast and colon screening tests were highest at 8 a.m., at 63.7%. They decreased throughout the morning to 48.7% at 11 a.m. They increased to 56.2% at noon—and then decreased to 47.8% at 5 p.m. It follows that patients with appointment times later in the day were less likely to receive guideline-recommended cancer screening.”

Poor diagnostic interrater reliability has been documented in the assessment of heart disease, endometriosis, tuberculosis, melanoma, breast cancer, etc.
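To make the quoted statistic concrete, here is a minimal sketch of Cohen's kappa for two raters. The ten physician labels are invented for illustration; the studies cited in 'Noise' use real diagnostic data and, in some cases, multi-rater variants of the statistic:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Observed agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both raters independently pick the same label.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical reads of the same ten cases by two physicians.
doc_1 = ["benign", "malignant", "benign", "benign", "malignant",
         "benign", "malignant", "benign", "benign", "benign"]
doc_2 = ["benign", "malignant", "malignant", "benign", "malignant",
         "benign", "benign", "benign", "malignant", "benign"]
print(f"kappa = {cohens_kappa(doc_1, doc_2):.2f}")  # 0.29: 'fair', i.e. noisy
```

The result, roughly 0.29, sits in the 'fair' band the excerpt mentions: better than dart-throwing monkeys, but nowhere near the agreement you would demand of a deterministic system.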

Now, please consider that Florida's Surgeon General is telling older Floridians and others at the highest risk from COVID-19 to avoid most booster shots, claiming, against the professional consensus, that they are potentially dangerous. What if o1 issued such a recommendation? (I could quote dozens of such cases in Florida alone; you get the point.)

Again, this is not to suggest we let these models go unchecked, but we should probably adopt the 'Exsupero Ursus' (outrun the bear) principle and consider their outperformance of the average human a great achievement and a starting point.

Please keep up the great work!

James

Thank you for the follow-up, Devansh; always interesting.

I'm interested in the repeatability and accuracy of the models: they are confident in their output instead of giving you different options. My experience when trying to use ML algorithms in engineering is that you get stares from the classical control and safety disciplines; they would never go with something they can't understand. Without a proof behind why a decision was made, no ML/AI model will be accepted for deployment as a control system without a human in the loop (personally, I think it will be difficult even with one). Repeatability, at least, is measurable, as in the sketch below.
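To be concrete about repeatability: a minimal sketch that measures how often a model agrees with its own modal answer on a fixed input. The toy model here is a hypothetical stand-in, not any real control or diagnostic system:

```python
import random
from collections import Counter

def self_agreement(model, prompt, runs=20):
    """Fraction of runs that match the modal answer; 1.0 = perfectly repeatable."""
    answers = [model(prompt) for _ in range(runs)]
    _, modal_count = Counter(answers).most_common(1)[0]
    return modal_count / runs

# Toy stochastic model: gives the same recommendation only ~80% of the time.
def toy_model(prompt):
    return "open relief valve" if random.random() < 0.8 else "raise the setpoint"

random.seed(0)
print(f"self-agreement: {self_agreement(toy_model, 'pressure spike in line 3'):.0%}")
```

Anything short of 100% on an identical input is exactly the behaviour that makes the safety disciplines walk away.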

I was particularly interested in the AlphaGeometry approach you covered back in January: a (symbolic) reasoning engine paired with an LLM. I think this is closer to how our brains work in general, where we have one (or more) reasoning engines running on what we're seeing, and conscious thought deciding what to do about it (and how to communicate our thoughts where applicable).
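A rough sketch of how I picture that pairing, with every function a hypothetical stub rather than DeepMind's actual implementation: the symbolic engine deduces facts to a fixed point, and the LLM is consulted only when the engine gets stuck and needs an auxiliary construction:

```python
def symbolic_closure(facts, rules):
    """Apply deduction rules until no new facts appear (a fixed point)."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for new_fact in rule(facts):
                if new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts

def prove(premises, goal, rules, llm_propose, max_constructions=5):
    """Alternate symbolic deduction with LLM-proposed auxiliary constructions."""
    facts = symbolic_closure(set(premises), rules)
    for _ in range(max_constructions):
        if goal in facts:
            return True
        # The engine is stuck: ask the language model for one auxiliary
        # construction (in geometry, e.g. "add midpoint M of AB"), then re-deduce.
        facts.add(llm_propose(facts, goal))
        facts = symbolic_closure(facts, rules)
    return goal in facts

# Tiny demo with one transitivity rule over "x=y" facts; the lambda stands
# in for the LLM and is never called here because deduction alone suffices.
def transitivity(facts):
    pairs = [fact.split("=") for fact in facts]
    return {f"{a}={d}" for a, b in pairs for c, d in pairs if b == c and a != d}

print(prove({"A=B", "B=C"}, "A=C", [transitivity], lambda facts, goal: "C=C"))
```

The appeal for safety arguments is that every fact in the final derivation is traceable to a rule application, even though an opaque model suggested where to look.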

From past research projects, I know of multiple ML algorithms that are accurate at diagnosis within small, well-trained scopes; an LLM could interrogate as many of them as are applicable at once and aggregate the answers, along the lines of the sketch below.
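A minimal sketch of that aggregation idea, assuming hypothetical narrow models with keyword scopes; in practice the LLM would do the routing and the weighing of confidences:

```python
from dataclasses import dataclass

@dataclass
class NarrowModel:
    """Stand-in for a narrowly scoped, well-validated diagnostic classifier."""
    name: str
    scope: set        # keywords the model was trained to handle
    verdicts: dict    # toy lookup: keyword -> (diagnosis, confidence)

    def applicable(self, case_keywords):
        return bool(self.scope & case_keywords)

    def predict(self, case_keywords):
        for keyword in case_keywords & self.scope:
            if keyword in self.verdicts:
                return self.verdicts[keyword]
        return ("inconclusive", 0.0)

def aggregate(case_keywords, models):
    """Query every in-scope model and keep the most confident diagnosis."""
    answers = [m.predict(case_keywords) for m in models if m.applicable(case_keywords)]
    return max(answers, key=lambda a: a[1], default=("no applicable model", 0.0))

derm = NarrowModel("derm-net", {"lesion", "mole"}, {"mole": ("melanoma screen", 0.9)})
cardio = NarrowModel("ecg-net", {"arrhythmia"}, {"arrhythmia": ("afib", 0.8)})
print(aggregate({"mole", "fatigue"}, [derm, cardio]))  # ('melanoma screen', 0.9)
```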
