Discussion about this post

Ben P

Whether the sample size is number of physicians or number of vignettes depends on which one they're claiming they can generalize across. Certainly we would also have been unimpressed by a study with 6 physicians and 100 vignettes.

In this case, I agree they can't claim to have statistically generalized across diagnosis scenarios. I don't have a problem with treating scenario as a random effect; they aren't gonna get a good variance estimate, but that still seems like a defensible modeling choice. If they could make a persuasive argument that these 6 vignettes had desirable validity properties, they'd be doing no worse than education researchers do when they give everyone in a study the same assessment. For those studies, an assumption is made about the properties of the assessment, and the statistical inference is done on the population of students. Couldn't this reasoning be applied here?
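
Concretely, that modeling choice would look something like the sketch below (hypothetical column names and data file; not the study's actual analysis):

```python
# Minimal sketch of "vignette as a random effect", assuming a long-format
# table with hypothetical columns: score, arm (AI vs. physician), vignette.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ratings.csv")  # hypothetical data file

# Random intercept per vignette. With only 6 vignettes, the vignette
# variance component is estimated from 6 realizations, so it will be
# noisy -- the "won't get a good variance estimate" point above.
model = smf.mixedlm("score ~ arm", data=df, groups=df["vignette"])
result = model.fit()
print(result.summary())
```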

To be clear, I don't buy the author's claims at all, for the reasons you gave. I'm just not seeing how they made such a huge statistical error.

Sudeep Bansal, MD, MS

Agree with this take on AI in healthcare.

Good doctors are like prompt engineers, but for human beings. Taking a good medical history is a skill.

"Physicians in training" is also not a good substitute for "trained physicians."

Furthermore, there is a difference when a standardized clinical vignette (i.e., a complete, validated medical history) is given to ChatGPT by a trained prompt engineer. When all the appropriate questions have already been asked, it is easier to generate a differential diagnosis, i.e., a probability for each candidate diagnosis, which is exactly what computers are good at (and, as pointed out, the article measured reasoning, not diagnostic accuracy).
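
The "probability for each candidate diagnosis" point is just Bayes' rule applied across a candidate list; a toy sketch with made-up numbers (illustrative only, not clinical values):

```python
# Toy Bayes' rule example: rank a small differential by posterior
# probability given one finding. All numbers are hypothetical.
priors = {"pneumonia": 0.02, "pulmonary embolism": 0.005, "bronchitis": 0.05}
# P(finding | diagnosis) for a single hypothetical finding
likelihoods = {"pneumonia": 0.30, "pulmonary embolism": 0.60, "bronchitis": 0.10}

unnormalized = {dx: priors[dx] * likelihoods[dx] for dx in priors}
total = sum(unnormalized.values())
posteriors = {dx: p / total for dx, p in unnormalized.items()}

for dx, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{dx}: {p:.2f}")
```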

However, garbage in garbage out still applies.

I do remain optimistic about the future of AI in medicine, but we are nowhere close.
