NYT's "AI Outperforms Doctors" Story Is Wrong [Guest]
The study cited by the New York Times has severe flaws and should never have been published in the first place.
Hey, it’s Devansh 👋👋
Our chocolate milk cult has a lot of experts and prominent figures doing cool things. In the series Guests, I will invite these experts to come in and share their insights on various topics that they have studied/worked on. If you or someone you know has interesting ideas in Tech, AI, or any other fields, I would love to have you come on here and share your knowledge.
I put a lot of effort into creating work that is informative, useful, and independent from undue influence. If you’d like to support my writing, please consider becoming a paid subscriber to this newsletter. Doing so helps me put more effort into writing/research, reach more people, and supports my crippling chocolate milk addiction. Help me democratize the most important ideas in AI Research and Engineering to over 100K readers weekly. Many companies have a learning budget that you can expense this newsletter to. You can use the following for an email template to request reimbursement for your subscription.
PS- We follow a “pay what you can” model, which allows you to support within your means, and support my mission of providing high-quality technical education to everyone for less than the price of a cup of coffee. Check out this post for more details and to find a plan that works for you.
Over the last few weeks, my social media feeds have been filled with the New York Times story claiming that a peer-reviewed study showed ChatGPT alone could outperform physicians in medical diagnosis. This led to a lot of sensationalism and hype, as everyone (once again) started predicting the end of doctors.
The whole thing unfortunately has a few problems:
The study that NYT based its claims on has a sample size of 6 (yes, six)
It does not actually claim what the article pretends it does. The study measured diagnostic reasoning, not the ability to diagnose accurately. And there’s a huge problem with the way the outputs are evaluated.
The study scores outputs with a weighted formula of different factors on an 18-point scale but never justifies the weights or the formula. These evals have no concrete explanation and come across as hand-waving.
The physicians are not trained to use the tools, which severely reduces their potential effectiveness.
This study does not accurately reflect the nature of medical diagnosis/the work of an actual healthcare professional.
The following guest post by the brilliant
will explain the problems with the study/NYT coverage, and a larger problem with the use of stats in Medicine, in a lot more detail. This is a very important conversation in one of the highest-impact use cases for AI, and I’m sure you will find a lot of use for it.

Disclaimer: The opinions expressed here are solely my own and do not represent those of any sponsors, advertisers, or supervisors. Oh, wait—I don’t have any. 😂
Shame on JAMA for publishing this study, and even more shame on The New York Times for amplifying it. The reason the scientific community largely ignored this study—published four weeks ago—until The New York Times gave it a platform is simple: the study isn’t scientifically valid.
Let’s start with the basics.
Initially, I planned to include this critique as part of Part 3 in my comprehensive Substack series, “The ‘🤖machines will replace doctors👩🏽⚕️’ debate turns 70!” However, this study demands a standalone analysis or at least some immediate clarification. Why? Because too many smart people have been misled into believing this study has merit. The New York Times jumping the gun only made things worse. Furthermore, their coverage failed to accurately convey the study’s actual findings.
If you’re new here, I recommend you start with Part 1 of my series, which explores the early algorithms from 70, 60, 50, and 40 years ago—all attempts to replace clinicians’ work that ultimately fell short.
Here’s the TL;DR:
1. The Notorious JAMA Study: A Sample Size of 6 😊
2. How to Address Small Sample Bias: My Recommendations
3. NYT — A Broken Telephone
4. Even with o1: AI Still Falls Short in Medical Reasoning
5. Additional Problems with the JAMA Study Beyond Small Sample Size Bias
5.1 Problems with Study Design and Methodology
5.2 Problems with Results and Interpretation
5.3 Problems with Real-World Applicability
6. Silver Linings
7. The Bigger Publishing Crisis: Fake Data and Clickbait Studies
8. Conclusion
1. The Notorious JAMA Study: A Sample Size of 6 😊
A study published in JAMA on October 28, 2024, by Stanford researchers reported findings from a randomized clinical trial involving 50 physicians. The trial concluded that incorporating a Large Language Model (LLM) alongside conventional resources (referred to as “physicians + LLM”) did not significantly improve physicians' diagnostic reasoning performance. However, the LLM alone outperformed both physicians and the combination of physicians + LLM by a statistically significant margin of 16 to 18 percentage points. Specifically, the LLM achieved 92% accuracy in diagnostic reasoning (not diagnosis, as erroneously reported by The New York Times), compared to 74% for physicians and 76% for physicians + LLM.
Note: “Diagnostic reasoning” refers to the cognitive process by which a clinician gathers, interprets, and integrates information to arrive at a differential diagnosis, considering various possibilities and weighing supporting and opposing evidence. “Diagnosis”, on the other hand, is the final conclusion reached after this reasoning process, identifying the specific condition or disease affecting a patient. While diagnostic reasoning encompasses the dynamic and iterative steps involved in evaluating symptoms, tests, and patient history, diagnosis represents the static end point of this analysis. The JAMA study under consideration attempts to highlight how tools like LLMs can influence diagnostic reasoning without necessarily changing the quality or accuracy of the diagnosis itself.
Here’s a critical detail that was conspicuously omitted from the journal abstract:
The study evaluated performance on only 6(!) hand-picked medical cases (vignettes).
Note: A “vignette” refers to a structured and concise description of a clinical case, which is designed to simulate real-life scenarios for educational or research purposes. These vignettes typically include patient history, physical examination findings, and laboratory results. In the JAMA article under consideration, vignettes are designed to test diagnostic reasoning by presenting complex and varied medical situations, avoiding overly simplistic or excessively rare cases.
How did this study pass the JAMA editorial review process?
I dedicate an entire chapter in my recent paper to discussing this peculiar phenomenon: standards for statistical power that would be deemed laughable in fields like statistics or economics are readily accepted in medicine. This is baffling, especially given that human health and lives are at stake. Can someone please explain this anomaly?
In my paper, I highlighted an example of a peer-reviewed study that used 28 vignettes to evaluate which digital symptom checker was superior—a sample size I thought was absolutely absurd. Yet, here we are with a study based on 6! And they were hand-picked, no less.
A sample size of 6 is as good as a rumor, from a statistical standpoint.
What’s interesting is that the original NEJM dataset contained 105 vignettes. Yet, apparently because the 50 physicians involved each gave the authors only one hour of their time, the authors cherry-picked just 6 of the 105 vignettes. 6? Really?
I’ll let the authors defend themselves, but this feels fishy. I don’t care if the rationale for selecting these 6 was the “greatest ever” — they dropped 94% of the sample!
There’s likely a reason the authors avoided using the full dataset. Maybe the majority of the vignettes were “too easy,” and they thought those wouldn’t show meaningful differences. Or maybe 94% of the vignettes were “too messy” to deal with. Whatever the real reason, readers deserve transparency. Frankly, the paper offers none. Tossing out 99 out of 105 vignettes without a clear explanation leaves a glaring question mark that the authors failed to address.
The authors claim they selected these 6 vignettes based on “case selection guidelines,” which prioritized a broad range of pathological settings, avoided overly simplistic cases, and excluded exceedingly rare cases. Fine. But then these hand-picked vignettes were further “refined” by replacing Latin medical terms (like livedo reticularis) with layman-friendly phrases (like “purple, red, lacy rash”).
What the heck is going on here? The lack of clarity and the sheer level of curation make this whole process feel... off.
2. How to Address Small Sample Bias: My Recommendations
Now, let me switch to my peer reviewer hat and pose a critical question:
Why not utilize the full sample of 105 vignettes? While the constraint of assigning a maximum of 6 vignettes per physician may not be ideal, it’s still possible to leverage the entire dataset using good ol’ statistics. Two straightforward methods come to mind:
Random Sampling with Overlap: Randomly and uniformly draw 6 vignettes for each doctor from the full sample of 105. With 50 doctors and only 105 vignettes, overlaps are inevitable—and actually desirable. Overlapping assignments allow comparisons between diagnoses for the same vignettes across different physicians, providing a foundation for robust analysis.
Balanced Incomplete Block Design (BIBD): Alternatively, implement a BIBD or a similar design to ensure balanced vignette assignments. Each vignette is reviewed by an equal number of doctors (e.g., 2 or 3), while each doctor reviews no more than 6 vignettes, adhering to the constraint. This design ensures overlap between doctors’ vignette assignments, facilitating comparisons and enabling statistical adjustments for vignette difficulty. Mixed-effects models or Analysis of Variance (ANOVA) could then be used to account for variability attributable to both doctors and vignettes.
These techniques—and likely others—would allow researchers to retain the diversity of the full sample while improving the statistical rigor of the analysis.
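For the sake of concreteness, here is a minimal Python sketch of both assignment schemes under the study’s constraints (50 physicians, at most 6 vignettes each, 105 vignettes in the pool). The function names and the greedy balancing shortcut are my own; a formal BIBD would require a proper combinatorial design, but even this rough version spreads the cases so that every vignette is reviewed by roughly 2–3 doctors:

```python
import random
from collections import Counter

VIGNETTES = list(range(105))  # the full NEJM pool
N_DOCTORS = 50
PER_DOCTOR = 6                # the study's one-hour-per-physician constraint

def random_with_overlap(seed=0):
    """Each doctor gets 6 vignettes drawn uniformly from all 105.

    Overlaps across doctors are expected and useful: they let us
    compare different physicians on the same case.
    """
    rng = random.Random(seed)
    return {d: rng.sample(VIGNETTES, PER_DOCTOR) for d in range(N_DOCTORS)}

def balanced_assignment(seed=0):
    """Greedy balancing: always hand out the least-used vignettes,
    so every case ends up reviewed by roughly the same number of
    doctors (50 * 6 / 105 ≈ 2.9 reviews per vignette).
    """
    rng = random.Random(seed)
    usage = Counter({v: 0 for v in VIGNETTES})
    assignment = {}
    for d in range(N_DOCTORS):
        # sort by current usage (random tie-breaking) and take the 6 least-used
        order = sorted(VIGNETTES, key=lambda v: (usage[v], rng.random()))
        chosen = order[:PER_DOCTOR]
        for v in chosen:
            usage[v] += 1
        assignment[d] = chosen
    return assignment

if __name__ == "__main__":
    counts = Counter(v for vs in balanced_assignment().values() for v in vs)
    print("reviews per vignette:", sorted(set(counts.values())))  # e.g. [2, 3]
```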
That said, the authors likely had reasons for discarding 94% of the vignettes. It would be helpful to understand their rationale.
More importantly, using the entire sample of 105 vignettes is crucial for a fair comparison with the LLM. The LLM was run 3 times per vignette to increase the number of observations to 18, but these were drawn from the same limited sample of 6 vignettes! This small (let’s be real—minuscule) sample size compromises the validity of any comparison and undermines the potential insights that could have been drawn from a broader, more representative sample.
It’s important to note that the authors of the JAMA study “try” (I’ll explain the quotation marks below) to account for “the uncertainty from the number of cases” and the repeat prompting of the LLM by using a random effects model. This leads them to the faulty conclusion that statistical estimates already address the small (or more accurately, minuscule) sample bias.
Here’s my reaction:
Any freshman majoring in statistics could tell you it’s an absolute taboo to apply a random effects model to a dataset with just 6 cases.
There is nothing random about a sample size of 6. (And in this study, that number turned out to be even lower—only 4–5 cases per physician.) Even if those 6 cases were spread across 50 physicians (or even 500 physicians), the confidence intervals would still be way too optimistic. Why? Because you kept your sample size absurdly small. No statistical estimation model, no matter how sophisticated, can estimate the variability of your results when there’s practically no data to “vary” against.
There’s no magical statistical bullet for an absolute lack of data.
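To see why, here is a small, self-contained simulation (all numbers are assumptions I made up for illustration, not values from the study): we pretend case-level scores vary across a population of vignettes, repeatedly draw only 6 of them, and watch how unstable the estimated mean and between-case variability are. No modeling sophistication recovers what those 6 draws simply don’t contain:

```python
import numpy as np

rng = np.random.default_rng(42)

TRUE_MEAN = 0.75      # assumed average score across the case population
CASE_SD = 0.15        # assumed between-case variability
N_CASES = 6           # the study's sample of vignettes

est_means, est_sds = [], []
for _ in range(10_000):
    # one simulated "study": 6 case-level mean scores from the wider population
    cases = rng.normal(TRUE_MEAN, CASE_SD, size=N_CASES)
    est_means.append(cases.mean())
    est_sds.append(cases.std(ddof=1))  # the between-case SD a random effects model must estimate

est_means, est_sds = np.array(est_means), np.array(est_sds)
print(f"estimated mean ranges (2.5th-97.5th pct): "
      f"{np.percentile(est_means, 2.5):.2f} to {np.percentile(est_means, 97.5):.2f}")
print(f"estimated between-case SD ranges: "
      f"{np.percentile(est_sds, 2.5):.2f} to {np.percentile(est_sds, 97.5):.2f}")
# With only 6 cases, the between-case SD estimate is all over the place,
# so any confidence interval built on it inherits that instability—no matter
# how many physicians you add.
```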
3. NYT — A Broken Telephone
The authors of the JAMA study blurred the line between diagnostic reasoning and the actual diagnosis, leaving room for misinterpretation. Naturally, The New York Times ran with the story without clarifying this nuance. Like a classic game of broken telephone, the distinction vanished—reasoning was magically equated to diagnosis, and voilà, AI was crowned a better diagnostician than doctors.
In reality, the accuracy score for the vignette cases overwhelmingly measured reasoning (78%), not the final diagnosis. The weight of the actual final diagnosis was a mere 2 out of 18 (11.1%) or 2 out of 19 (10.5%) of the total score, depending on the vignette.
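As a quick sanity check on those proportions (assuming, as described above, a rubric where the final diagnosis carries 2 points of an 18- or 19-point total):

```python
# Share of the total case score carried by the final diagnosis,
# under the 18- and 19-point rubrics mentioned above.
for total_points in (18, 19):
    share = 2 / total_points
    print(f"final diagnosis = {share:.1%} of a {total_points}-point case")
# -> 11.1% and 10.5%; everything else scores the reasoning around it.
```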
Of course, the NYT piece caught fire, spreading to every corner of Earth—and apparently Mars—where even Elon Musk joined the chorus. Like most humans (myself included), Musk doesn’t mind quoting “enemy media” when it confirms his biases. 😉
4. Even with o1: AI Still Falls Short in Medical Reasoning
My colleague
and I, along with many other researchers, have extensively documented the flaws in o1 and its exaggerated claims of medical diagnostic supremacy. We’ve written pieces like “Selling the New o1 ‘Strawberry’ to Healthcare Is a Crime, No Matter What OpenAI Says🍓” and “A Follow-Up on o1’s Medical Capabilities + Major Concerns About Its Utility in Medical Diagnosis.” So, when I came across this JAMA study showcasing a model from two generations prior, my reaction was, “I better check this out.”
Here’s the thing about AI reasoning:
ChatGPT is getting really good at reasoning. In fact, it reasoned so convincingly that the Earth is flat, I’m starting to have my doubts. 😉
5. Additional Problems with the JAMA Study Beyond Small Sample Size Bias
I don’t want to turn this into a never-ending article, but here’s a brief overview of other potential issues with the JAMA study. Some were acknowledged by the authors themselves, while others were flagged by experts like Graham Walker, MD, Jung Hoon Son, MD, and Dr. Terence Tan:
5.1 Problems with Study Design and Methodology
1️⃣ No Benchmark is No Better Than a Poor Benchmark:
In my numerous critiques, including my review of the Med-Gemini model, one recurring issue I’ve highlighted is the widespread problems with traditional medical benchmarks like MedQA/USMLE and MMLU. As I recall, this was a central critique in at least 4 other studies claiming AI dominance over physicians + LLM, a point emphasized by
. Now, this JAMA study doesn’t use any benchmark whatsoever, and I don’t think that’s a good thing. 😉 The way performance was rated—particularly how scores were assigned for Reasoning, Diagnosis, and Additional Steps—and how those factors were weighted seems highly subjective.

2️⃣ Subjectivity in Evaluation:
The grading relied on blinded expert consensus but included subjective elements like reasoning quality and appropriateness of supporting factors, which could introduce evaluator bias.
3️⃣ Physician Selection Bias and Case Selection Bias:
Physician Bias: The JAMA study included only physicians with a median of 3 years of practice (range 2 to 8), not representing the broader U.S. physician population, which averages over 20 years of practice. (According to the Journal of Medical Regulation, the mean age of licensed physicians in the U.S. is 51.9 years, suggesting that licensed physicians have been practicing for at least 20 years, on average.)
Case Bias: The hand-picked clinical vignettes were intentionally complex and atypical, not reflecting a realistic caseload that includes a mix of simple, common, and rare cases.
4️⃣ Vignette Misclassification Bias:
Although the authors claimed to select the six vignettes based on “case selection guidelines” aiming for a broad range of pathological settings while avoiding overly simplistic and exceedingly rare cases, they further refined these hand-picked vignettes by replacing Latin medical terms with layman-friendly phrases. For example, “livedo reticularis” was changed to “purple, red, lacy rash.” This alteration could impact the diagnostic process by simplifying medical terminology that physicians typically use, potentially skewing the results.
5️⃣ (Mis)Use of Standardized Vignettes and Textual Data:
The reliance on structured, text-based vignettes does not correspond to real-life clinical scenarios, which involve patient interaction, physical examination, and continuous data collection. LLMs depend entirely on textual input and cannot incorporate visual, physical, and contextual cues essential for accurate diagnosis.
6️⃣ Challenges in Real-World Data Collection:
According to Dr. Jung Hoon Son, the JAMA study overlooks the complexities of real-world data gathering, such as validating lab and imaging results and dealing with incomplete or messy patient data. Errors and artifacts often complicate diagnostics, which are streamlined in controlled vignettes.
7️⃣ Time Constraints and Allocation:
Physicians were given a fixed amount of time to complete cases, which may not reflect real-world diagnostic timelines.
8️⃣ Physician Training and Familiarity with LLMs:
Physicians were not trained in using LLMs, potentially affecting their ability to integrate these tools effectively. Experts like Dr. Graham Walker note a systemic gap in LLM literacy among clinicians, impacting the JAMA study’s applicability.
5.2 Problems with Results and Interpretation
9️⃣ Negligible Improvement with LLM Assistance:
No statistically significant improvement was observed in diagnostic reasoning scores when physicians used the LLM, questioning the utility of these tools as adjuncts in their current form.
🔟 Over-Reliance on Knowledge Regurgitation:
According to Dr. Jung Hoon Son, the LLM’s performance relies on regurgitating encyclopedic information. While effective in structured prompts, it lacks the ability to gather and synthesize new information in real-time, limiting its applicability to actual patient care scenarios.
1️⃣1️⃣ Misalignment with Clinical Decision-Making:
In his summary, Dr. Terence Tan calls the JAMA study a “great big nothing,” pointing out that the LLM’s structured diagnostic approach does not align with the fluid and ambiguous nature of real-world clinical decision-making, which often requires contextual judgment and adaptability.
1️⃣2️⃣ Media Oversimplification and Overhype:
Media narratives have oversimplified the study’s findings, suggesting that “text-prompts + LLM = diagnosis,” overlooking the complexities of diagnostic reasoning. Experts like Dr. Graham Walker and Dr. Terence Tan caution against overhyping the results without considering practical limitations.
5.3 Problems with Real-World Applicability
1️⃣3️⃣ Integration and Training Gaps:
Health systems may struggle to integrate LLMs effectively without adequate physician training or workflow redesign. The generational gap and lack of familiarity with Generative AI (GenAI) tools among clinicians pose additional challenges.
1️⃣4️⃣ Liability, Ethical, and Economic Concerns:
Deploying LLMs independently raises ethical and liability issues. LLMs lack the ability to contextualize diagnostics within resource constraints and ethical principles like Occam’s Razor, critical aspects of modern healthcare. As I always say, Doctors Go to Jail. Engineers Don’t.
1️⃣5️⃣ Risk of Overdiagnosis in Simple Cases:
LLMs might overanalyze straightforward cases, leading to unnecessary testing or patient anxiety.
1️⃣6️⃣ Dependence on Pre-Structured Data:
Dr. Jung Hoon Son points out that by benefiting from pre-collected, neatly structured case reports, LLMs sidestep the bottleneck of clinical workflows—effective collection and interpretation of incomplete or messy patient data.
6. Silver Linings
While the JAMA study didn’t convince me of the validity of its results, hypothetically, I wouldn’t be surprised if AI is becoming better than doctors. And that’s exactly what I’ll be exploring in fine detail in Part 3 of this series, coming in the next few weeks.
There’s no denying that an advanced LLM can handle complex cases with a level of thoroughness that might reduce the risk of missing critical details.
In the real world, asking a physician to handle 6 complex cases in just one hour—spending only 10–12 minutes per case—seems almost impossible. Balancing speed with accuracy, and, in the case of LLM assistance, figuring out how to interrogate the model effectively, doesn’t reflect standard practice—yet. However, in fast-paced environments like the ER, such a scenario could be more plausible—and perhaps this is where AI should shine brightest.
Here’s the good news: The JAMA study relied on the ChatGPT-4 model, which became outdated on May 14, 2024—two generations ago. ChatGPT-4o is significantly better than ChatGPT-4, and for reasoning (the crux of this study, as I mentioned earlier), o1 is state-of-the-art (SOTA) today. Built specifically for reasoning, it’s designed to surpass the capabilities of its predecessors—though, as I mentioned earlier, it is far from perfect.
7. The Bigger Publishing Crisis: Fake Data and Clickbait Studies
There’s a much bigger problem at play here. Publishing is the only industry where the suppliers of the core product—authors—are penalized with massive manuscript submission fees instead of being rewarded for their work. But the publishers are oligopolies, and they know the power they hold. As a researcher, if you want to advance your career, you have no choice but to play their game. So they exploit the system: raising submission fees, banning authors from submitting to multiple journals simultaneously, restricting researchers from sharing their own work during peer review, refusing to pay reviewers (because “prestige” is supposed to be compensation enough), and imposing countless other absurd policies.
This is the crux of the antitrust lawsuit recently filed in the U.S. District Court for the Eastern District of New York against six of the largest academic publishers.
These unfair practices create a perverse feedback loop. Reviewers, unpaid and undervalued, lose motivation. They stop taking their role seriously because, after all, there are no consequences for doing a bad job. Mistakes slip through. Editors, often lacking the specific expertise needed to evaluate highly specialized studies, fail to catch these errors. The result? An increasing number of papers with questionable or outright “fake” findings make it to publication. This, in turn, incentivizes authors to cut corners. Why spend two years and $100,000 rigorously testing and validating a study when you can produce a publishable result in two months for $1,000 with minimal effort or scrutiny?
Enter the mass media, which amplifies the problem. Journalists, chasing clicks and virality, don’t—or can’t—take the time to verify a study’s validity. Integrity takes a back seat to sensationalism because what drives advertising revenue isn’t accuracy, it’s engagement. Clickbait headlines and flashy findings rule the day.
It’s a wicked chain reaction.
This dysfunction is part of why we’re not seeing comprehensive AI studies in medicine. The journal review process takes months, but AI advances at lightning speed. By the time a study is published, it’s often outdated. And when people like me point out its irrelevance, researchers are left asking themselves: “Why bother?” The result? A proliferation of superficial, “fly-by-night” reports. For anyone with a shred of academic integrity, this is profoundly discouraging.
Clickbait and flashy findings also fuel data manipulation and outright fraud. Take the recent Harvard Business School debacle, which escalated into a full-blown fraud investigation. The rewards for authors with attention-grabbing results are enormous: consulting gigs, lucrative speaking fees, and cushy academic salaries. Starting pay at business schools can hit $240,000 a year—double what many campus psychology departments offer.
As Dr. Jeffrey Funk aptly observes:
“This is really an indictment of social science research because this explanation is applicable to many social science research disciplines. The fact is that flashy conclusions, including hyping AI, can lead to consulting gigs and speaker’s fees while crazy theories, easy data manipulation, and “more cheating and hype leading to more cheating and hype” are common in many research disciplines.”
The incentives are broken, and the fallout isn’t just academic—it’s societal.
8. Conclusion
Let’s get our sh*t together in academia. Why are we acting like teenagers chasing likes and crafting clickbait headlines? Have we forgotten what scientific integrity means?
This isn’t about whether AI is better than doctors. For all I know, the next time I have a proctology exam, an AI might tell me to “just relax.” 😉
What this is about is getting the facts straight.
Sure, maybe the JAMA study results will eventually hold water. But validity requires a foundation of facts, not a sample size of 6. Six isn’t science. It’s a footnote in a brainstorming session.
We need a scientifically robust study that compels the clinical community to say, “You know what? That’s solid research. Great job.” Until then, this debate will continue to swirl aimlessly.
Oh, and as I’ve been saying for years: even if AI achieves 99.999% accuracy in medicine—whatever that means—it still wouldn’t matter until we define liability and responsibility for its use in healthcare, both within and outside the boundaries of “standard of care.” Precision is pointless if we haven’t sorted out who takes the fall when things go wrong.
And let’s not gloss over the larger issue here: the rise of fake publications and clickbait studies. It’s become a stain on academia, turning science into a popularity contest instead of a pursuit of truth.
We need to bring ethics and professionalism back to science. That’s the only path to meaningful progress.
I invite the authors of the JAMA study to join the conversation. Let’s have a productive, fact-based discussion. And if I’ve crossed a line or botched my review of the study, I’ll own up to it. At the end of the day, I’m just a guy from a third-world country, sitting in a cold, smelly basement, trying to make sense of it all. What do I know, right? 😉
All right, that’s a wrap.
Don’t miss Part 3 of this series—a super deep dive and probably the most comprehensive breakdown of Why AI Will Replace Doctors—coming in the next few weeks. Bad news for doctors. Great news for engineers. Stay tuned.
Acknowledgment: This article owes its clarity, polish, and sound judgement to the invaluable contributions of Rachel Tornheim. Thank you, Rachel!
I provide various consulting and advisory services. If you‘d like to explore how we can work together, reach out to me through any of my socials over here or reply to this email.
I put a lot of work into writing this newsletter. To do so, I rely on you for support. If a few more people choose to become paid subscribers, the Chocolate Milk Cult can continue to provide high-quality and accessible education and opportunities to anyone who needs it. If you think this mission is worth contributing to, please consider a premium subscription. You can do so for less than the cost of a Netflix Subscription (pay what you want here).
If you liked this article and wish to share it, please refer to the following guidelines.
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow. You can share your testimonials over here.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium. : https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
Whether the sample size is number of physicians or number of vignettes depends on which one they're claiming they can generalize across. Certainly we would also have been unimpressed by a study with 6 physicians and 100 vignettes.
In this case, I agree they can't claim to have statistically generalized across diagnosis scenarios. I don't have a problem with treating scenario as a random effect; they aren't gonna get a good variance estimate, but that still seems like a defensible modeling choice. If they could make a persuasive argument that these 6 vignettes have desirable validity properties, then they'd be doing no worse than education researchers do when they give everyone in a study the same assessment. For those studies, an assumption is made about the properties of the assessment, and the statistical inference is done on the population of students. Couldn't that reasoning be applied here?
To be clear, I don't buy the authors' claims at all, for the reasons you gave. I'm just not seeing how they made such a huge statistical error.
Agree with this take on AI in healthcare.
Good doctors are like prompt engineers, but for human beings. Taking a good medical history is a skill.
"Physicians in training" is also not a good substitute for "trained physicians."
Furthermore, there is a difference when a standardized clinical vignette (i.e., a complete, validated medical history) is used with ChatGPT and a trained prompt engineer asking the questions. When you have all the appropriate questions, it is easier to generate a differential diagnosis, i.e., the probability of each candidate diagnosis—which computers are good at (and, as pointed out, the article measured reasoning, not diagnosis).
However, garbage in garbage out still applies.
I do remain optimistic about the future of AI in medicine, but we are nowhere close.