An AI-based health system in Copenhagen, which helps identify cardiac arrests during emergency calls, suffers from 10 ethical issues, according to an ethics self-assessment of the system based on the EU High-Level Expert Group on AI's Ethics Guidelines for Trustworthy AI. In the future, all AI systems should be assessed by independent experts before they are deployed, according to the assessment paper, which also gives five recommendations for improving the system. Update, December 2021: a comment from Corti appears below.
An independent team of philosophers, policy makers, social scientists, and technical, legal, and medical experts from all over the world, led by Professor Roberto V. Zicari of Arcada University of Applied Sciences in Helsinki, conducted the assessment of the AI-based health system in Copenhagen. It was a self-assessment carried out together with the key stakeholders, namely the medical doctors of the Emergency Medical Services Copenhagen and the Department of Clinical Medicine, University of Copenhagen, Denmark.
The main contribution of the paper is to demonstrate how to use the EU Trustworthy AI guidelines in practice in the healthcare domain. For the assessment, the team used Z-Inspection®, a process for assessing trustworthy AI, to identify specific challenges and potential ethical trade-offs that arise when AI is used in practice. The assessment is therefore not a certification: it neither gives the system a green light nor concludes that it is unethical; it highlights the ethical issues and gives recommendations.
The case study was an AI system in use in Copenhagen, Denmark. The system uses machine learning as a supportive tool to recognize cardiac arrest in emergency calls by listening to the calls and the patterns of conversation. The AI system was designed, trained, and tested using the archive of audio files of emergency calls provided by Emergency Medical Services Copenhagen from the year 2014. The prime aim of the AI system, which was introduced in the fall of 2020, is to assist medical dispatchers answering 112 emergency calls by helping them detect out-of-hospital cardiac arrest (OHCA) early during the calls, and thereby possibly saving lives.
The research questions of the paper were: Is the AI system trustworthy? Is the use of this AI system trustworthy?
The team defined three major phases.
1. The Set-Up phase starts by verifying that no conflict of interest, direct or indirect, exists between the independent experts and the primary stakeholders of the use case. This phase continues by creating a multi-disciplinary assessment team composed of a diverse range of experts.
2. The Assess Phase is composed of four tasks:
A. The creation and analysis of Socio-Technical Scenarios for the AI system under assessment.
B. A list of ethical, technical, and legal “issues” is identified and described using an open vocabulary.
C. To reach consolidation, such “issues” are then mapped to some of the four ethical principles and the seven requirements defined in the EU framework for trustworthy AI.
D. Verification of claims is performed. A number of iterations of the four tasks may be necessary in order to arrive at a final consolidated rubric of issues mapped into the trustworthy AI framework.
3. The Resolve phase could be called 'ethical maintenance' and is about monitoring that the AI system continues to fulfill the requirements over time.
Ten ethical issues arose in the assessment:
- It is unclear whether the dispatcher should be advised or controlled by the AI, and it is unclear how the ultimate decision is made. The system was not accompanied by a clear definition of its intended use.
- To what extent is the caller’s personally identifying information protected, and who has access to information about the caller? Although the AI system follows GDPR standards, there was no description of how the data will be used and stored, for how long it will be kept before disposal, and what form(s) of anonymization will be maintained.
- There was no formal ethical review or community consultation process to address the ethical implications for trial patients, as is common in comparable studies reviewed by institutional review boards in the United States and the United Kingdom.
- The training data is likely not sufficient to account for relevant differences in languages, accents, and voice patterns, potentially generating unfair outcomes. There is likely empirical bias, since the tool was developed in a predominantly white Danish patient group. It is unclear how the tool would perform for patients of different accents, ages, sexes, and other specific subgroups.
- The algorithm did not appear to reduce the effectiveness of emergency dispatchers, but it also did not significantly improve it. The algorithm, in general, has a higher sensitivity but also leads to more false positives (a short illustrative example of this trade-off follows the list below).
- Lack of explainability. The system’s outputs cannot be interpreted, which leads to challenges when the dispatcher and the tool disagree. This lack of transparency, together with the limited training of the users, may have contributed to the noted lack of trust among the dispatchers.
- The data was not adequately protected against potential cyber-attacks. In particular, since the model is not interpretable, it is hard to determine its resistance to adversarial attack scenarios or the influence of factors such as age, gender, accent, and type of bystander.
- The AI system did not significantly improve the dispatchers’ ability to recognize cardiac arrests. AI should improve medical practice rather than disrupt it or make it more complicated.
- The trials conducted did not include a diverse group of patients or dispatchers.
- It is unclear whether the Danish authorities and the involved ethics committees assessed the safety of the tool sufficiently.
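To make the sensitivity versus false-positive trade-off mentioned above concrete, here is a minimal worked example in Python. All counts are purely hypothetical and do not come from the Copenhagen study or the paper; the sketch only shows how a tool can catch more true cardiac arrests while also flagging more calls that are not cardiac arrests.

```python
# Purely hypothetical counts per 10,000 calls containing 100 true cardiac arrests.
# None of these numbers come from the study; they only illustrate the trade-off.

def rates(tp, fn, fp, tn):
    """Return (sensitivity, false positive rate) for one confusion matrix."""
    sensitivity = tp / (tp + fn)          # share of true cardiac arrests that were flagged
    false_positive_rate = fp / (fp + tn)  # share of non-arrest calls that were flagged anyway
    return sensitivity, false_positive_rate

dispatcher_only = rates(tp=73, fn=27, fp=100, tn=9800)
dispatcher_plus_ai = rates(tp=85, fn=15, fp=250, tn=9650)

print("dispatcher only: sensitivity=%.2f, FPR=%.3f" % dispatcher_only)
print("dispatcher + AI: sensitivity=%.2f, FPR=%.3f" % dispatcher_plus_ai)
# The second configuration has a higher sensitivity (0.85 vs 0.73)
# but also a higher false positive rate (0.025 vs 0.010).
```

Whether such a trade-off is acceptable depends on the operational cost of false alarms for the dispatchers, which is part of what the assessment questions.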
The paper comes with five recommendations covering the ethical issues above: 1) Use a model for explainability, 2) Use data sets that are built to represent the whole population and thus avoid bias, 3) Involve stakeholders, 4) Deploy a better protocol for establishing what does or does not influence the accuracy, and 5) Assess the legal aspects of the AI system.
“We are very grateful that the medical doctors at the Emergency Medical Services Copenhagen decided to work with us in order to learn the implications of the use of their AI system and to improve it in the future. They were very collaborative and it was a really interesting experience”, said Roberto V. Zicari.
An important lesson from this use case is that there should be some requirement that independent experts can assess the system before its deployment.
Comment from Corti, December 2021
General comment:
- The machine learning technology performed better in the prospective trial than in the retrospective trial.
- The user experience was not tested properly to ensure that the end-user responded to the findings of the machine learning.
- Thus, it is important to note what we are testing for during a trial such as this. It is also important to accurately state where the trial was a success and where it was a failure.
To point out the paper’s mistakes, we had one of our PhD students look at it:
Usage of the term language model
There is general confusion with regard to the term language model. A list of quotes using the term is included below. For most of the technology description (“The technology used”, pp. 7-8), the term seems to be mistaken for the automatic speech recognition (ASR) model. To clarify, the ASR model takes as input the audio after applying a short-term Fourier transformation and a few other preprocessing steps (e.g., mel-scaling, feature bin normalization, etc.) and produces human-readable text. The text is then fed to a noisy-channel model, as correctly depicted in Figure 3, which incorporates a Danish language model in order to do spell correction. Thus, the language model is just one component of the noisy-channel spell-correction model. The first two quotes below are wrong, whereas the latter two are somewhat inaccurate. I have proposed some corrections to highlight the misunderstanding, but it might also be necessary to add more details for an accurate description. In contrast to the text in the section “The technology used”, Figure 3 provides a fairly accurate high-level overview of the model pipeline (a small illustrative sketch of such a pipeline follows the corrected sentences below).

Sentences from the paper with wrong or misleading use of the term language model:
- “Also, at that time (2018), no Danish language model was readily available.”, p. 7
- “They used a language model for translating the audio to text based on a convolutional deep neural network (LeCun et al., 1989).”, p. 7
- “The text output of the language model was then fed to a classifier that predicted whether a cardiac arrest was happening or not (Figure 3).”, p. 7
- “Using a Danish language model means that calls in other languages were interpreted in a way that the cardiac arrest model could not work with (i.e., trying to understand Danish words from English speech).”, p. 8
Corrected sentences:
- “Also, at that time (2018), no Danish automatic speech recognition (ASR) model was readily available.”, p. 7
- “They used an ASR model for translating the audio to text based on a convolutional deep neural network (LeCun et al., 1989).”, p. 7
- “The text output of the spell correction model was then fed to a classifier that predicted whether a cardiac arrest was happening or not (Figure 3).”, p. 7
- “Using a Danish ASR and Danish language model means that calls in other languages were interpreted in a way that the cardiac arrest model could not work with (i.e., trying to understand Danish words from English speech).”, p. 8
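To make the terminological distinction above concrete, the following is a minimal, purely illustrative Python sketch of such a pipeline. All function names and placeholder outputs are invented for illustration and do not describe Corti's proprietary implementation; the sketch only shows where the ASR model, the noisy-channel spell correction (which contains the Danish language model), and the downstream cardiac arrest classifier sit relative to one another.

```python
import numpy as np

# Hypothetical pipeline sketch: audio -> spectrogram features -> ASR transcript
# -> noisy-channel spell correction (language model inside) -> OHCA classifier.
# All components below are stand-ins, not Corti's actual models.

def preprocess(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Short-term Fourier transform plus simple per-feature-bin normalization."""
    window, hop = int(0.025 * sample_rate), int(0.010 * sample_rate)  # 25 ms / 10 ms frames
    frames = [audio[i:i + window] for i in range(0, len(audio) - window, hop)]
    spectrogram = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(window), axis=1))
    return (spectrogram - spectrogram.mean(0)) / (spectrogram.std(0) + 1e-8)

def asr_model(features: np.ndarray) -> str:
    """Stand-in for the Danish ASR model: features in, raw transcript out."""
    return "der er en bevidstløs person her"  # placeholder transcript

def spell_correct(transcript: str, language_model_score) -> str:
    """Noisy-channel idea: choose the candidate w that maximizes P(w) * P(observed | w).
    This sketch keeps only the language-model term P(w) and a single candidate."""
    candidates = [transcript]  # a real system would generate edit-distance candidates
    return max(candidates, key=language_model_score)

def ohca_classifier(text: str) -> float:
    """Stand-in classifier returning a cardiac-arrest probability for the corrected text."""
    return 0.5  # placeholder probability

def predict(audio, sample_rate, language_model_score):
    features = preprocess(audio, sample_rate)
    corrected_text = spell_correct(asr_model(features), language_model_score)
    return ohca_classifier(corrected_text)

# Example call with a toy language-model score and one second of silence:
toy_language_model = lambda w: -len(w)
print(predict(np.zeros(16000), 16000, toy_language_model))
```

In this layout the language model only scores candidate corrections inside the spell-correction step; it is not the component that converts audio into text, which is the distinction the quotes above blur.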
Explainability with regard to English calls

There is no reason to discuss the model’s ability to handle any language other than Danish. If the model accurately transcribes English, or any other language for that matter, this is only to be expected to the extent that the transcribed words are commonly adopted by Danish speakers in informal speech. If the model turns out to make correct OHCA predictions on English calls, this should be attributed to chance. The model is designed for the Danish language only and should not be held accountable for its lack of explainability on English calls.

Sentence indicating that the model lacks explainability for English calls:
- “In many cases, the model understood the calls anyways, but in some cases not. So far, there is no explanation why some calls were seemingly not understood.”, p. 8
Another of our PhD students writes: The citation of research papers produced by Corti and published at peer-reviewed venues is somewhat flawed. In total, there are three citations of two research papers. Two of the citations, which refer to the paper by Havtorn et al. (2020), seemingly miscredit it as being closely related to the cardiac arrest detection system, while in fact it is only related to a limited extent, mainly in terms of the employed ASR system. While the authors make it clear in other paragraphs of the paper that the implementation and algorithmic details of the AI system are unknown to them, in these specific citations they seem to be making poorly supported and inaccurate assumptions about the inner workings of the system. The third citation refers, in a general sense, to the part of Corti’s work that exists in the public domain. While this citation is correct, it is rather incomplete and omits several peer-reviewed research papers published by Corti in the past. We elaborate our critique in relation to the individual citations below.

“The AI system was applied directly on the audio stream where the only processing made was a short-term Fourier transformation (Havtorn et al., 2020), hence no explicit feature selection was made.” – p. 7
- While it is true that the AI system, specifically the ASR, was applied directly on the audio stream and uses a Fourier transformation without explicit feature selection, the cited paper is only related to the cardiac arrest detection system to a very limited extent, mainly in terms of the employed ASR system.
“There is no explanation of how the ML makes its predictions. The company that developed the AI system has some of their work in the open domain (Maaløe et al., 2019; Havtorn et al., 2020). However, the exact details on the ML system used for this use case are not publicly available.” – p. 8
- While this citation is correct, the cited papers only constitute a rather small part of the work by Corti that exists in the public domain. Additional work in the public domain that could be mentioned includes the following peer-reviewed papers:
- On the Inductive Bias of Word-Character-Level Multi-Task Learning for Speech Recognition (NeurIPS IRASL workshop 2018)
- Do End-to-End Speech Recognition Models Care About Context? (InterSpeech 2020)
- Hierarchical VAEs Know What They Don’t Know (ICML 2021) (published at the same time as the paper)
- On Scaling Contrastive Representations for Low-Resource Speech Recognition (ICASSP 2021) (published at the same time as the paper)
“The general principles used for this AI system are documented in the study by (Havtorn et al., 2020). The paper describes the AI model implemented for this use case. However, the paper presents the model trained using different data sets and therefore the results are not representative for this use case. The details of the implementation of the AI system for this case are proprietary, and therefore not known to our team.” – p. 9
- While some of the principles used in the cited paper overlap with the use case described in the paper at hand, i.e. mainly the ASR system and the processing of the audio signal, the cited paper does not deal with cardiac arrest detection but rather presents a proof of concept for a different use case, namely a real-time question tracker that can be used to identify the locations of questions asked in an audio signal. Hence, the reason that the results are not representative for the cardiac arrest use case is not the use of a different data set, but rather the use case itself.
Lars Maaløe, Chief Technology Officer, Corti