
Posted on October 26, 2020 at 6:48 PM

by Jonathan H. Chen MD, PhD and Abraham Verghese MD, MACP

The post originally appeared as an editorial in the November 2020 issue of the American Journal of Bioethics.

Clinical medicine is an inexact science. In situations of uncertainty, we often ask an experienced colleague for a second opinion. But what if one could effectively call upon the experience of thousands? This might seem counterintuitive—too many cooks and “consultant creep” can spoil the broth. Yet Condorcet’s jury theorem, a centuries-old mathematical result, explains why we entrust juries rather than single judges to decide guilt or innocence, and why we prefer voting democracies to dictators. It is also why we are increasingly willing to entrust machine learning (ML) algorithms, which learn by mass example from large databases, to help us in the care of the sick.
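Condorcet’s result can be stated concretely: if each of n independent judges is correct with probability p greater than one half, the probability that a majority of them is correct approaches certainty as n grows. A minimal sketch in Python (the function name and the numbers are ours, purely for illustration):

```python
from math import comb

def majority_correct(n: int, p: float) -> float:
    """Probability that a majority of n independent judges,
    each individually correct with probability p, reaches
    the right answer (n odd, so no ties)."""
    smallest_majority = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(smallest_majority, n + 1)
    )

# One modestly reliable judge vs. growing juries of equally
# reliable, *independent* judges: accuracy climbs from 0.6
# toward certainty as the jury grows.
for n in (1, 11, 101):
    print(n, majority_correct(n, 0.6))
```

The caveat that matters most for ML is the independence assumption: judges (or training examples) that share the same systematic bias do not converge on the truth, only on the bias.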

The coming together of advances in artificial intelligence (AI) and ML, exponentially increasing computing capability, and vast and growing repositories of digital data has brought us to a moment when machines can defeat Grand Masters in chess and the best players in Go and poker; moreover, they can play all day and all night without a hint of fatigue. Such technologies have transformed industries and are pervasive in our everyday lives, filling our social media feeds with the perfect clickbait, while online product recommenders seem to know what we want before we even knew to ask. In health care, ML applications are now emerging with the potential to automatically diagnose medical images and drive medical decision making (Kumar et al. 2020).

The potential of such systems to improve quality, efficiency, and access in healthcare is great. Yet when automated processes drive decisions that affect human beings in other spheres of life, be it loan applications or bail decisions, algorithms that learn from historical data have not always reflected the good intentions of those who deployed them. Instead, they have exacerbated problems such as racial profiling. Such experiences make us cautious about deploying ML in potentially life-and-death medical decisions.

When Microsoft deployed an artificially intelligent chatbot, Tay, onto Twitter to demonstrate a computer having naturalistic conversations with people, the bot went from friendly banter such as “humans are super cool” to overtly disturbing comments ranging from “I f@#%&*# hate feminists and they should all die and burn in hell” to “Bush did 9/11 and Hitler would have done a better job…” in less than 24 hours! Microsoft quickly shut down the bot and issued an apology, yet no one had programmed the bot to take on a racist, misogynistic persona. Tay’s algorithms were not built from rules and logic; they were designed to learn by example, from data, observing and incorporating the natural speech patterns of Twitter users so that Tay could “become smarter” and “more human.” Tay simply picked up and magnified the sort of rhetoric that is, alas, all too common.

In Tay’s next iteration, Zo, her designers explicitly programmed her to be blind, deaf, and dumb to race, religion, gender, and any other politically sensitive subject. Yet even innocuous comments like, “They have good falafel in the Middle East,” triggered Zo to prudishly shut down the conversation, refusing to engage in any “political topics.”

When Amazon sought to develop an ML algorithm to automatically screen the resumes of job candidates and prioritize those likely to be interviewed and hired, it had to scrap the project when the algorithm learned to be overtly biased against women. What the algorithm likely did was learn to emulate existing unconscious hiring practices. Arguably, a great value of the algorithm was to make plain the existing structural biases in the underlying system, which could then be addressed.

With the rising costs of healthcare, health systems and insurance companies are turning to ML to predict which of their patients are at high risk for expensive medical care, the stated intention being to preemptively deploy extra resources to keep such patients healthy and out of the hospital. But a key study found that one insurer’s algorithm was more likely to offer help to white patients than to equally sick black patients. State regulators subsequently investigated the insurer for unethical conduct and bias in the use of such technology. Once again, it is likely there was no malicious intent and that the algorithm designers had no particular interest in race. They may even have sought a “color-blind” system by not allowing patient race to be an input to the algorithm. These are not algorithm problems, but data problems. The data that algorithms are trained on are often nothing more than a record of human history—a record of inequitable distribution of, and access to, healthcare resources.
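The mechanism at work here, training on past healthcare costs as a proxy for healthcare need, can be illustrated with a toy simulation. Every number below is invented for illustration: two groups are equally sick, but one historically incurs lower costs at the same level of need, so a “color-blind” program that enrolls the highest-cost patients under-serves that group without race ever appearing as an input.

```python
import random

random.seed(0)

def simulate(n=10_000, access_gap=0.6):
    """Toy cohort: groups A and B have identical distributions of
    true medical need, but access barriers mean group B historically
    incurs only a fraction (access_gap) of the cost that group A
    incurs at the same level of need."""
    patients = []
    for _ in range(n):
        group = random.choice("AB")
        need = random.gauss(0, 1)            # true sickness, same for both groups
        cost = need + random.gauss(0, 0.5)   # cost tracks need...
        if group == "B":
            cost *= access_gap               # ...but is suppressed for group B
        patients.append((group, need, cost))
    return patients

patients = simulate()
# A "color-blind" program: enroll the 10% of patients with the highest
# cost (standing in for a model trained to predict cost, which would
# learn this same ranking). Race is never an input, yet group B, with
# identical need, is under-enrolled.
top = sorted(patients, key=lambda p: p[2], reverse=True)[: len(patients) // 10]
share_b = sum(1 for g, _, _ in top if g == "B") / len(top)
print(f"Group B share of enrollment: {share_b:.0%} (vs ~50% of equal need)")
```

Removing the sensitive attribute from the inputs does nothing here, because the bias lives in the label being predicted, not in any single feature.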

These examples illustrate how ML algorithms and automation have an amplifying effect. They will make us better at doing, whatever it is we are already doing. If that’s helping us make good medical diagnoses, then we’re glad to have the assistance. But if one of the things we’re doing, even subconsciously, is perpetuating social biases, then such data-driven algorithms may insidiously (and even innocently) exacerbate existing ethical flaws in society.

Beam and Kohane (2020), in a paper on AI and ML in healthcare, note that “…algorithmic decision-making tools come with no guarantees of fairness, equitability, or even veracity.” While this is true, we must also realize that clinicians (being human) can hardly provide a guarantee of fairness, equitability, or even veracity. The behavior of machine learning systems simply mirrors our behavior—going by what we actually do, not by what we say we do, think we do, or aspire to do.
With AI Healthcare Applications (HCA) rapidly being approved by the FDA for select diagnostic uses, some even allowed to function autonomously, clinicians who remain ultimately responsible for the patient have cause to worry. A machine learning algorithm may rapidly adapt to incoming streams of data, getting “better” at the assigned task, but if we cannot easily audit and examine just how it did so—the “black box” problem—who can be medicolegally responsible for final clinical decisions?

In this issue of the journal, Char et al. offer a framework, a means to approach the ethical concerns about ML HCA at every step: from the initial premise, to the diversity of the team building the instrument, to the validity of the data sets used to train the algorithms, to eventual deployment. Their effort represents a useful first step in what is a rapidly growing area of study. Among other things, they put forward the importance of explainability and auditability, particularly in the development and debugging phases of such systems. Yet the invisible nature of the “black box” of a machine learning system makes that recommendation challenging. Moreover, we must remind ourselves that many of the tools and modalities we use in medicine are just as invisible or opaque. Do we see or know how acetaminophen or anesthesia works, or how the machine generating the number for the serum sodium or creatinine works?

Nevertheless, the framework Char et al. offer is valuable: an epistemological tool for ethicists and a guide for developers creating or deploying new AI systems in clinical care. Clearly we must begin with the certainty that any newly invented ML HCA has the potential for unintended consequences; what works in a closed test system can easily deviate when exposed to a different or larger population. The unintended consequences of AI and ML deserve their own acronym, one that should be heard as often as its better-known siblings, particularly when AI and ML are used to drive clinical decisions. Unintended Consequences (UC) are the known unknowns—we know they will be there, but not what shape they will take. Yet by anticipating them, we might build better systems: get better at preventing them, monitoring for them, mitigating their effects, and correcting them.


Kumar, A., et al. 2020. OrderRex clinical user testing: A randomized trial of recommender system decision support on simulated cases. Journal of the American Medical Informatics Association.
