Although artificial intelligence is entering healthcare with great promise, clinical AI tools are prone to bias and underperformance in the real world at every stage from inception to deployment, including data acquisition, dataset labeling or annotation, algorithm training, and validation. These biases can reinforce existing disparities in diagnosis and treatment.
To explore the extent to which biases are identified in the FDA review process, we reviewed virtually all healthcare AI products approved between 1997 and October 2022. Our audit of data submitted to the FDA to clear clinical AI products for the market reveals major flaws in how this technology is regulated.
The FDA approved 521 AI products between 1997 and October 2022: 500 under the 510(k) pathway, meaning the new algorithm is substantially equivalent to an existing technology; 18 under the de novo route, meaning the algorithm has no existing predicate but comes with controls that make it safe; and three through premarket approval. Since the FDA publishes summaries only for the first two pathways, we analyzed the rigor of the submission data underlying 518 approvals to understand how well the submissions accounted for how bias may enter the picture.
In submissions to the FDA, companies are typically asked to share performance data that demonstrates the effectiveness of their AI product. One of the major challenges for the industry is that the 510(k) process is far from formulaic, and the FDA’s ambiguous position must be deciphered on a case-by-case basis. The agency has historically not explicitly asked for supporting datasets; in fact, there are products with 510(k) approval for which no data was offered on potential sources of bias.
We see four areas where bias can enter an algorithm used in medicine. This is based on best practices in computer science for training any type of algorithm, and on the realization that it is important to consider the degree of medical training possessed by the people creating or translating the raw data into something that can train an algorithm (the data annotators, in AI parlance). These four areas that can skew the performance of any clinical algorithm – patient cohorts, medical devices, clinical sites, and the annotators themselves – are not routinely considered (see table below).
Percentages of 518 FDA-approved AI products that submitted data covering sources of bias

| Source of bias | Aggregated reports | Stratified reports |
| --- | --- | --- |
| Patient cohort | less than 2% performed multi-race/gender validation | less than 1% of approvals included performance figures by gender and race |
| Medical device | 8% performed multi-vendor validation | less than 2% reported performance figures across manufacturers |
| Clinical site | less than 2% performed multi-site validation | less than 1% of approvals included performance figures across sites |
| Annotators | less than 2% reported annotator/reader profiles | less than 1% reported performance figures across annotators/readers |
Aggregate performance means a vendor reports that it tested across different variables but offers only a single pooled performance figure, not the performance for each variable. Stratified performance offers more information: the vendor reports performance separately for each variable (cohort, device, or other variable).
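The difference between the two reporting styles can be illustrated with a minimal sketch. The numbers and cohort names below are invented for demonstration only; they show how an aggregate figure can mask a large gap between patient cohorts:

```python
# Illustration (with made-up data): aggregate performance can hide
# subgroup underperformance that stratified reporting would expose.

def accuracy(pairs):
    """Fraction of (prediction, label) pairs that agree."""
    return sum(pred == label for pred, label in pairs) / len(pairs)

# (prediction, label) pairs keyed by a hypothetical patient cohort
results = {
    "cohort_A": [(1, 1)] * 90 + [(0, 1)] * 10,   # 90% accurate
    "cohort_B": [(1, 1)] * 60 + [(0, 1)] * 40,   # 60% accurate
}

# Aggregated report: one pooled number across all patients
all_pairs = [p for pairs in results.values() for p in pairs]
print(f"aggregate accuracy: {accuracy(all_pairs):.2f}")   # 0.75

# Stratified report: one number per cohort, exposing the gap
for cohort, pairs in results.items():
    print(f"{cohort}: {accuracy(pairs):.2f}")
```

A submission reporting only the 75% aggregate figure would look acceptable while performing markedly worse for one cohort, which is precisely the failure mode stratified reporting catches.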
An AI clinical product submitted with data that confirms its effectiveness across these dimensions is the rare exception, not the rule.
A proposal for basic submission criteria
We propose new mandatory transparency minimums that must be met before the FDA reviews an algorithm. These cover the sites and patient populations behind the datasets; performance measures across patient cohorts, including race, age, gender, and comorbidities; and the different devices the AI will run on. This granularity should be provided for both training and validation datasets. Results on the reproducibility of an algorithm under conceptually identical conditions, using externally validated patient cohorts, must also be provided.
It is also important to know who is labeling the data and with what tools. Basic qualification and demographic information about the annotators – are they certified doctors, medical students, certified foreign doctors, or non-medical professionals employed by a private data-labeling company? – should also be included as part of a submission.
Proposing a baseline performance standard is a deeply complex undertaking. The intended use of each algorithm determines the threshold level of performance needed – high-risk situations require a higher standard of performance – and is therefore difficult to generalize. As the industry strives to better understand performance standards, AI developers need to be transparent about the assumptions made in the data.
Beyond Recommendations: Technology Platforms and Wider Industry Conversations
It takes up to 15 years to develop a drug, five years to develop a medical device, and, in our experience, six months to develop an algorithm, which is designed to go through many iterations not only in those six months but across its entire life cycle. In other words, algorithm development falls far short of the rigorous traceability and auditability required of drug and medical device development.
If an AI tool is to be used in decision-making processes, it must be held to standards similar to those for physicians, who go through not only initial training and certification but also continuing education, recertification, and quality assurance for the duration of their medical practice.
The Coalition for Health AI (CHAI) recommendations raise awareness of the issue of clinical AI bias and effectiveness, but technology is needed to apply them. Identifying and overcoming the four buckets of bias requires a platform approach with visibility and rigor at scale – thousands of algorithms are piling up at the FDA for review – that can compare and contrast submissions against predicates as well as evaluate de novo applications on their own merits. Spreadsheet reports will not help with version control of data, models, and annotations.
What might this approach look like? Consider the progression of software design. In the 1980s, it took considerable expertise to create a graphical user interface (the visual representation of software), and it was a solitary, siloed experience. Today, platforms like Figma encapsulate the expertise needed to code an interface and, just as importantly, connect the ecosystem of stakeholders so everyone sees and understands what’s going on.
Clinicians and regulators shouldn’t be expected to learn how to code, but rather be given a platform that makes it easy to open up, inspect, and test the various ingredients that make up an algorithm. It should be easy to evaluate algorithmic performance using local data and to retrain onsite if needed.
CHAI calls for looking inside the black box that is AI through some kind of metadata nutrition label that lists essential facts, so clinicians can make informed decisions about using a particular algorithm without being experts in machine learning. This can make it easy to know what to watch for, but it doesn’t account for the inherent evolution – or devolution – of an algorithm. Physicians need more than a glimpse of how it worked when it was first developed: they need continued human intervention supplemented by automated controls even after a product reaches the market. A Figma-like platform should make it easy for humans to review performance manually. The platform could also automate some of this by comparing doctors’ diagnoses to what the algorithm predicts.
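The automated control described above can be sketched in a few lines. This is a simplified illustration, not any platform's actual implementation: the function names, sample data, and the 85% agreement threshold are all assumptions chosen for the example:

```python
# Sketch of automated post-deployment monitoring: compare physician
# diagnoses to algorithm predictions and flag the model for human
# review when agreement drops below a threshold (hypothetical value).

def agreement_rate(physician, model):
    """Fraction of cases where the algorithm matches the physician."""
    return sum(p == m for p, m in zip(physician, model)) / len(physician)

def check_drift(physician, model, threshold=0.85):
    """Return (agreement rate, whether to flag for human review)."""
    rate = agreement_rate(physician, model)
    return rate, rate < threshold

# Invented binary diagnoses for ten recent cases
physician = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
model     = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

rate, flagged = check_drift(physician, model)
print(f"agreement {rate:.2f}, flagged for review: {flagged}")
```

A real platform would run such checks continuously over rolling windows of cases, but the core idea is the same: the physician's judgment serves as an ongoing reference signal against which the deployed algorithm is audited.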
In technical terms, what we are describing is a machine learning operations (MLOps) platform. Platforms in other areas, such as Snowflake, have shown the power of this approach and how it works in practice.
Finally, this discussion of biases in clinical AI tools needs to encompass not only large tech companies and elite academic medical centers, but also community and rural hospitals, veterans hospitals, startups, groups advocating for underrepresented communities, health professional associations, as well as international FDA counterparts.
No voice is more important than the others. All stakeholders must work together to forge the fairness, safety and effectiveness of clinical AI. The first step towards this goal is to improve transparency and approval standards.
Enes Hosgor is the founder and CEO of Gesund, a company that promotes fairness, safety and transparency in clinical AI. Oguz Akin is a radiologist and director of body MRI at Memorial Sloan Kettering in New York and a professor of radiology at Weill Cornell Medical College.