Whether the relevant sample size is the number of physicians or the number of vignettes depends on which one they're claiming they can generalize across. Certainly we would also have been unimpressed by a study with 6 physicians and 100 vignettes.
In this case, I agree they can't claim to have statistically generalized across diagnosis scenarios. I don't have a problem with treating scenario as a random effect; they aren't going to get a good variance estimate, but it still seems like a defensible modeling choice (a rough sketch of that specification appears below). If they could make a persuasive argument that these 6 vignettes have desirable validity properties, then they'd be doing no worse than education researchers do when they give everyone in a study the same assessment. In those studies, an assumption is made about the properties of the assessment, and the statistical inference is done on the population of students. Couldn't the same reasoning be applied here?
To be clear, I don't buy the authors' claims at all, for the reasons you gave. I'm just not seeing how they made such a huge statistical error.
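For what it's worth, here is a minimal sketch of the "scenario as a random effect" specification being discussed, on entirely made-up data; the column names, effect sizes, and participant counts are hypothetical, and this is not the authors' actual model or code.

```python
# Hypothetical illustration: score ~ study arm, with a random intercept per
# vignette. With only 6 vignette levels, the between-vignette variance
# component ("Group Var" in the summary) is estimated from just 6 group-level
# draws, which is why that variance estimate is so shaky.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
vignettes = [f"case_{i}" for i in range(1, 7)]             # only 6 vignettes
case_effect = {v: rng.normal(0, 8) for v in vignettes}     # made-up between-case spread

rows = []
for arm in ["physician", "physician_plus_llm"]:
    for v in vignettes:
        for _ in range(25):                                # made-up participants per cell
            base = 81 if arm == "physician_plus_llm" else 76
            rows.append({"arm": arm, "vignette": v,
                         "score": base + case_effect[v] + rng.normal(0, 10)})
df = pd.DataFrame(rows)

# Fixed effect for arm, random intercept for vignette.
fit = smf.mixedlm("score ~ arm", df, groups=df["vignette"]).fit()
print(fit.summary())
```

The toy numbers only make the structural point: the vignette-level variance component, which drives the width of any reported intervals, is inferred from six group means.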
I wouldn't have cared if very severe limitations had been acknowledged. But sadly, people are more interested in clicks and attention than in contributing meaningfully.
(sorry for the typos, on the phone app and can't edit)
Agree with this take on AI in healthcare.
Good doctors are like prompt engineers, but for human beings. Taking a good medical history is a skill.
"Physicians in training" is also not a good substitute for "trained physicians."
Furthermore, there is a difference when a standardized clinical vignette (i.e., a complete, validated medical history) is fed to ChatGPT with a trained prompt engineer asking the questions. When all the appropriate questions have already been asked and answered, it is easier to generate a differential diagnosis, i.e., the probability of each candidate diagnosis, which is exactly what computers are good at (and, as pointed out, the article measured reasoning, not diagnostic accuracy). A toy sketch of that arithmetic follows this comment.
However, garbage in garbage out still applies.
I do remain optimistic about the future of AI in medicine, but we are nowhere close.
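To make the "computers are good at that part" point concrete, here is a toy Bayes' rule calculation over a made-up differential. The diagnoses, prevalences, and likelihoods are all invented for illustration:

```python
# Toy differential diagnosis via Bayes' rule: P(diagnosis | findings) is
# proportional to P(findings | diagnosis) * P(diagnosis). All numbers invented.
priors = {"diagnosis_A": 0.05, "diagnosis_B": 0.02, "diagnosis_C": 0.93}        # hypothetical prevalences
likelihoods = {"diagnosis_A": 0.60, "diagnosis_B": 0.40, "diagnosis_C": 0.01}   # P(findings | dx), made up

unnormalized = {dx: priors[dx] * likelihoods[dx] for dx in priors}
total = sum(unnormalized.values())
posterior = {dx: p / total for dx, p in unnormalized.items()}

for dx, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{dx}: {p:.2f}")   # ranked differential; garbage findings in, garbage ranking out
```

The arithmetic is the easy part for a machine; eliciting the right findings in the first place is the skill the comment above is pointing at.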
thank you for this
This vivisection is excellent and needed. I have been a reviewer (and a published academic author/co-author) of several such research initiatives. While I professionally and personally believe there is nothing 'wrong' or improper about using 'anecdata,' the kind of 'generalizations' presented in this 'study' and, yikes, in this silly New York Times 'story' are indeed inherently misleading and demonstrably untrue. The Times as a credible source and resource on matters technical is beyond rescue and hope; JAMA needs to do the easy and obvious thing: publish a summary and synthesis of the criticisms the reviewers contributed in their analysis of the submission. NB: if said 'criticisms' and 'critiques' are less rigorous than what devanash has contributed, THAT IS AN ALARM BELL/RED FLAG/WARNING about editorial, professional, and publication quality and reliability. This is not hard. That said, I've seen peer/cohort review make papers and reports transformatively better, and, alas, I've seen it make a perfectly decent paper unreadably useless. What we have here is something worse: mischaracterization, misrepresentation, and misunderstanding of BOTH the evidence and the claims. My bet: ChatGPT and Claude could have done a better job.
Thank you
Thanks for this detailed breakdown.
To my mind, this kind of misinformation is dangerous. In fact, I would go so far as to say it is criminal.
The fact that JAMA did not follow its own standards for submissions is shocking, and whoever is responsible in this case should be taken to task.
As for the New York Times, I'm sorry to say it is not the first time a "narrative" has been published that tells only the story needed to sell papers while omitting the more important facts about the subject.
It's sad that these institutions have no integrity.
There are a lot of poorly written articles being published in medical journals. Unfortunately, journals go after citations to increase their impact factor rather than the quality of the articles!
Yeah, even if the citation is negative, which creates weird incentives.
Thanks for your post and article. You’ve raised awareness and highlighted some important “caution” flags, which is a great contribution to public education about AI! Having a health reporter publish this article points to the larger issue: one can be an expert in health reporting, but weighing in on an AI study in healthcare is a very different matter. I’ll be able to read the JAMA study this weekend. One thing to keep in mind is the scientific paradigm that the study was designed in alignment with. Sample sizes in quantitative and qualitative data analysis cannot be held to the same standards, as there is no inference in qualitative analyses, nor are there “sample sizes”. Thanks again!
Kathryn: I respect your argument. However, with all due respect, the authors are using a random effects model to assess uncertainty around their estimates, but the number of cases they rely on for that is at most 6 (actually 4-5 per participant, all drawn from the same fixed set). So what kind of random effects are we expecting with that number of observations, especially for the LLM accuracy estimate? Think of the LLM as a single participant: we’re using just 6 data points to evaluate how that participant performs against two other groups of participants. I wouldn’t even know how to construct meaningful confidence intervals around such estimates, but the authors apparently went ahead and reported numbers that are purely theoretical and clearly far too optimistic.
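As a rough illustration of that last point, here is what a naive 95% interval looks like when the "LLM participant" contributes only six case-level scores. The per-case numbers below are invented to average roughly 92% and are not the study's data; this is a sketch, not the authors' random effects analysis.

```python
# Naive t-based 95% CI on six hypothetical per-case scores. With only 5 degrees
# of freedom, the t multiplier alone is ~2.57, and every point of case-to-case
# spread widens the interval further.
import numpy as np
from scipy import stats

scores = np.array([100, 96, 95, 92, 88, 81])   # made-up per-case percentages, mean = 92
n = len(scores)
mean = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)          # ~2.571
print(f"mean = {mean:.1f}%, 95% CI = ({mean - t_crit * sem:.1f}%, {mean + t_crit * sem:.1f}%)")
```

Even with these fairly tight made-up scores, the interval runs from roughly 85% to 99%; a random effects model dresses the same six observations in more machinery, but it cannot add information that isn't there.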
Great article!
Super super interesting
I really appreciate this breakdown of the New York Times article! It's concerning how the study's small sample size and misinterpretation of results were overlooked.
I follow your newsletter and have valued your insights since first subscribing last year.
However, while it is clear you don’t like this study, that alone is not sufficient grounds to say it shouldn’t have been published by JAMA or reported by The NY Times.
In addition to being a Chief AI Officer myself, I am a published author and have served as a peer reviewer for multiple data science and medical journals for over 20 years. I am currently on the Editorial Board of Taylor and Francis’ Current Medical Research and Opinion, where I serve as an AI expert and regularly peer review papers on all aspects of AI (machine learning, machine vision, deep learning, NLP, generative AI, etc.).
I read the JAMA paper after it was published, and though I didn’t see what the authors first submitted, nothing gave me pause or any concern, which is clearly not the case for you.
With regard to the NYT, which is currently embroiled in a lawsuit with OpenAI (whose Forum I am part of and have spoken at), their responsibility is to share their opinion (what others might call “news”) and sell papers. They have done both here.
I fail to see why you are outraged.
Hi Matt,
My concern is twofold. One, the study has technical limitations: small sample size, unclear metrics, etc. This is where I think JAMA either shouldn't have published it or should have made those limitations clear.
Two, there is the lack of research done by the NYT. While it is their job to sell news, I think they also have a duty as human beings not to actively spread harm. This is incredibly harmful.
I agree with your point on the NYT, but not JAMA. The n that is relevant is not the number of physicians in the study.
Matt: Let me respectfully disagree. The more important sample size for this study is n=6 (number of cases), not 50 (number of physicians). This is a critical distinction. The authors aimed to evaluate how well human physicians reason through complex cases and how ChatGPT could enhance that reasoning. They used medical case vignettes to explore this question, hoping that a sample of 50 physicians would provide a diverse enough pool to estimate uncertainty for the physician group and the physician + LLM group.
To illustrate why the variety and sample size of cases matter so much, let’s push the study’s setup to its extreme: imagine there were only one case instead of 6 (in fact, the physicians in this study worked with only 4-5 cases each). Now imagine 1,000 physicians participated in the study, all fresh out of residency. This isn’t far-fetched given that the median experience of the 50 physicians in the study was just 3 years of practice.
Do you see the issues with this sample? With only one case in the sample, there’s a significant risk that the physicians had already encountered either that specific case or a very similar one. Additionally, with participants straight out of residency, trained in a standardized medical curriculum, and lacking diverse real-world experience, there’s a high likelihood they would all approach the case in a similar way, either all solving it very well or all solving it very poorly.
But my biggest concern isn’t even with the physicians’ group. It’s with the LLM standalone group and the claim of 92% accuracy. The authors only gave ChatGPT 6 cases to solve and then used a random effects model to estimate the confidence intervals around that result. With such a small (minuscule, really) sample, confidence intervals are likely to be way too optimistic.
The sample is not only tiny but also provides no reliable basis for confidence in the estimates. For statistical reasons alone, JAMA should never have approved this article.
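To put a number on how little six cases can pin down, here is a quick exact (Clopper-Pearson) binomial interval. It deliberately simplifies things by treating case-level performance as a correct/incorrect count rather than a rubric score, purely for illustration:

```python
# Clopper-Pearson (exact) 95% interval for a proportion based on 6 trials.
# Even a hypothetical perfect 6/6 is still consistent with a true per-case
# success rate of only about 54%.
from scipy.stats import beta

k, n, alpha = 6, 6, 0.05
lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
print(f"{k}/{n} -> 95% exact CI: ({lower:.2f}, {upper:.2f})")   # ~ (0.54, 1.00)
```

Whether the six cases are modeled as binomial trials or as rubric scores, the underlying problem is the same: six observations cannot support a tight interval around anything.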
✨✨✨