Risk is closely linked to the area of use. The consequences of the inherent challenges with large language models, such as hallucinations, bias, and lack of transparency (chapter 3.1), can vary from insignificant to critical, depending on how and where the models are used. Some challenges arise at the moment of use, while others emerge over time and with increased use.
Some risk areas can have consequences for patient safety and quality in the health services (Figure 2): the dialogue between human and machine, the dependency that can arise with extensive use of AI tools, and privacy in the handling of sensitive data. Other risks are more overarching and relate to the complexity of, and interplay between, various regulations; unintended illegal use of large language models; discrimination against individuals or groups based on biases in the models; environment and sustainability; and where or with whom responsibility lies if errors occur.

In the course of the work on this report, it has become clear that there is particular uncertainty associated with the development and use of generative AI in high-risk areas, such as diagnostics and treatment. Among other things, WHO questions whether large language models will achieve sufficient accuracy to justify the costs and resources associated with developing, and safely and effectively implementing, such tools in given areas of health care [56].
The US Food and Drug Administration (FDA) has assessed medical devices that incorporate generative artificial intelligence (AI). It points to several challenges and risks, such as hallucination, lack of transparency, continuous learning, challenges with scientific documentation, and the need for new evaluation methods (see the fact box on the FDA report below) [57]. We are not aware of any similar assessments published by the EU.
FDA assessments of medical devices with generative AI
The FDA has assessed medical devices that use generative artificial intelligence (AI) and points to several challenges and risks:
- Hallucinations: Generative AI can produce incorrect content ("hallucinations"). For example, a device designed to summarize a conversation between a patient and healthcare personnel may inadvertently generate a false diagnosis that was never discussed.
- Lack of transparency: Devices that use foundation models developed by third parties often come with limited access to information about the models' architecture, training methods, and datasets. This can make it difficult for manufacturers to ensure quality and safety.
- Continuous learning: Generative models can either be static or change continuously ("continuous learning"). Continuous learning is associated with uncertainty about the model's performance over time, and as of November 2024 the FDA had not approved any medical device that uses continuous learning.
- Challenges with scientific documentation: It can be difficult to determine what type of valid scientific documentation ("evidence") the FDA should request in order to assess a device's safety and effectiveness throughout its lifecycle.
- Challenges with the FDA's classification system: Generative AI can introduce new or different risks that challenge the FDA's current classification system for medical devices. How a device is classified affects which regulatory measures are needed to ensure that it is safe and effective.
The FDA also highlights challenges related to the evaluation and testing of generative models:
- Pre-market evaluation: Large language models are highly complex and can give different responses to small changes in wording or prompts. It is not possible to test all possible prompts before launch. Moreover, lack of transparency and the potential for unforeseen responses can make it particularly difficult to evaluate such systems before they reach the market. The FDA points to the importance of plans and methods for monitoring after a device is put into use (post-market monitoring).
- Need for new evaluation methods: Current methods for quantitative performance evaluation may be insufficient to ensure safe use of generative AI. New, qualitative evaluation methods can help characterize a model's autonomy, transparency, and explainability. Which evaluation methods are required will vary with the product's specific area of use and design.
The FDA recommends that manufacturers of medical devices consider the following questions, compared with a non-generative alternative:
- Will the inclusion of generative AI increase the device's risk class?
- Could a product with generative AI provide misinformation and thereby pose a risk to public health?
- Is generative AI appropriate for the intended area of use?
Inadequate competence in dialogue with language models
Querying in natural language, also called prompting, makes it easy for users with different competence and backgrounds to use large language models. However, the responses can vary with the exact wording of the question, so how questions are formulated is far from irrelevant [58].
Certain phrasings, imprecise queries, and insufficient description of context can yield incorrect or inaccurate responses.
Prompt and system prompt
Prompt: A text-based instruction sent to a language model (like ChatGPT) to get a response. A prompt is the user's question or request, and the quality of a prompt directly affects the relevance and quality of the response.
System prompt: An overarching instruction that defines the language model's role, limitations, and behavior. A system prompt is usually not visible to the end user but controls how the model interprets and responds to all user prompts. A system prompt can, for example, define that the model should act as a health professional assistant, respond concisely, and always refer to professional guidelines.
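To make the distinction concrete, the sketch below shows how a system prompt and a user prompt are typically combined in a single call to a chat-based language model. It assumes an OpenAI-compatible Python client; the model name and the example prompts are illustrative placeholders, not recommendations for clinical use.

```python
# Minimal sketch: combining a system prompt and a user prompt in one chat API call.
# Assumes an OpenAI-compatible client; model name and prompts are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

system_prompt = (
    "You are an assistant for healthcare personnel. "
    "Answer concisely and refer to professional guidelines where relevant."
)
user_prompt = "Summarize the main contraindications mentioned in the note below: ..."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},  # defines the model's role and behavior
        {"role": "user", "content": user_prompt},      # the end user's actual request
    ],
)
print(response.choices[0].message.content)
```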
Knowledge about how to interact with language models, along with practical training, can however improve the quality of their responses. For example, one study showed that structured prompts gave better results than unstructured ones [59], and another that two-step reasoning (first asking for a summary, then for a diagnosis) gave better diagnostic accuracy than simple prompting [60]. One study demonstrated that more advanced models (such as GPT-4) handled complex prompts well, while weaker models (such as GPT-3.5) performed worse with such prompts on clinical reasoning tasks [61]. Another study found that large language models had limited utility for clinicians as decision support tools, but the authors suggested greater potential if users are trained to phrase good prompts [62].
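As an illustration of the two-step approach referred to above (first a summary, then an assessment), the sketch below chains two calls so that the second prompt builds on the output of the first. It assumes the same OpenAI-compatible client as in the previous sketch; the prompts and the case text are hypothetical.

```python
# Illustrative sketch of two-step prompting: ask for a structured summary first,
# then ask for an assessment based on that summary.
# Assumes an OpenAI-compatible client; all prompts and the case text are hypothetical.
from openai import OpenAI

client = OpenAI()

def ask(messages):
    """Send a list of chat messages and return the model's text response."""
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

case_text = "Free-text description of a fictitious patient case ..."

# Step 1: structured summary of the case
summary = ask([
    {"role": "system", "content": "You are a clinical assistant. Answer concisely."},
    {"role": "user", "content": f"Summarize the key findings in this case:\n{case_text}"},
])

# Step 2: reasoning based on the summary from step 1
assessment = ask([
    {"role": "system", "content": "You are a clinical assistant. Answer concisely."},
    {"role": "user", "content": f"Based on this summary, list possible differential diagnoses:\n{summary}"},
])
print(assessment)
```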
Reduced professional knowledge and skills
Increasing use of AI and large language models in health and care services can lead to gradual dependence on the tools. Healthcare personnel and patients may come to rely on the technology, which in the long term can affect both the ability to make independent assessments and acquired skills.
Confirmation bias (algorithmic appreciation) means that users unconsciously give more weight to information that confirms their existing beliefs [63]. Both the phrasing of a prompt and the interpretation of a language model's response can be influenced by confirmation bias. This can be challenging when AI tools are used in health and care services, as healthcare personnel may be more inclined to accept responses from a language model if they align with prior expectations [64]. Consequently, there is a risk that incorrect responses go undetected, especially if the model presents information with great confidence, with or without source references. Training in critical use of AI tools will be crucial to reducing this risk.
Automation bias is excessive trust in AI tools, which can impair professional integrity and quality. As AI systems improve and give correct answers more frequently, it becomes harder for healthcare personnel to catch the occasional wrong answer. This can lead to users not questioning the output or consulting other sources of information, because they assume the AI system is correct.
Deskilling. Several sources, including WHO and the National Health Service (NHS) in England, point to deskilling as a risk of extensive use of AI systems in health services [65][66]. Loss of skills or knowledge, or weakened decision-making ability, can occur when skills are not in regular use, for example when tasks are left to machines, including language models. The result can be that healthcare personnel do not check or challenge a decision proposed by a model. They may also become unable to perform certain types of tasks or procedures when the model is unavailable, for example during network failures or security breaches.
An international survey among healthcare personnel shows that a large proportion are concerned that use of generative AI may weaken critical thinking and lead to increased dependence on AI in clinical decisions [67]. The risk of skill loss also applies to the next generation of healthcare personnel, who will increasingly encounter AI tools as part of education, training, and changes in clinical practice [68]. Growing experience with the use of large language models will lay the foundation for new research that may eventually change our understanding of this risk.
Reduced privacy and anonymous information
There is a risk related to compliance with privacy requirements when large language models are in active use. If sensitive personal data is entered into a prompt to a language model that stores data and/or learns continuously, for example in a web application, there is a risk that the sensitive data is used to train and update the model, and that it ends up in the wrong hands.
One way to secure language models that handle sensitive personal data is to host and run the model in a private cloud service or locally (on-premises). Settings that ensure encryption of data in use, and controlled (or disabled) data storage, are other ways to ensure that input and output data are not exposed.
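How a locally hosted model is reached will depend on the product, but many local serving tools expose an OpenAI-compatible endpoint. The sketch below assumes such a setup (for example vLLM or Ollama) and is only meant to illustrate that prompts and responses can stay within the organization's own infrastructure; the URL, model name, and prompt are placeholders.

```python
# Minimal sketch: calling a locally hosted language model instead of a public cloud service.
# Assumes a local server exposing an OpenAI-compatible API (e.g. vLLM or Ollama);
# the URL, model name, and prompt are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local endpoint; traffic stays on the organization's network
    api_key="not-needed-locally",         # placeholder; local servers often ignore the key
)

response = client.chat.completions.create(
    model="local-clinical-model",  # hypothetical name of the locally deployed model
    messages=[{"role": "user", "content": "Summarize this (already de-identified) note: ..."}],
)
print(response.choices[0].message.content)
```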
A common technique for complying with privacy requirements is to only share and process anonymous information. The risk to privacy is reduced if personal data is anonymized because the data subjects can no longer be recognized in the dataset. Anonymous information cannot be linked to an individual and is therefore not personal data regulated by the General Data Protection Regulation (GDPR).
Personal data is considered anonymized when it has been handled or processed in such a way that it can no longer be linked to an identified or identifiable natural person. In assessing whether information is anonymous, one must consider whether it is possible to trace the information back to the individuals it relates to.
Datasets to be anonymized must be processed so as to prevent re-identification (recovering a person's identity from anonymized data) or reverse identification (re-identification, often achieved by combining the data with other data sources). Whether the information in a dataset can be considered genuinely anonymous after the anonymization process depends on a comprehensive, risk-based assessment that rests with the data controller. True anonymization can be difficult to achieve for certain datasets, for example very extensive datasets with many variables.
With today's access to large amounts of data and powerful analysis technology, it is possible to a greater extent than before to re-identify individuals from information in a dataset [69]. It is important to assess this risk when sharing information.
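One simple, illustrative way to probe re-identification risk is a basic k-anonymity check: counting how many records share each combination of quasi-identifiers. The sketch below does this with pandas on a small, fictitious dataset; the column names and threshold are hypothetical, and a real assessment by the data controller would be far more comprehensive.

```python
# Illustrative k-anonymity check: how many records share each combination of
# quasi-identifiers? Small groups indicate elevated re-identification risk.
# The dataset, column names, and threshold are fictitious.
import pandas as pd

df = pd.DataFrame({
    "birth_year": [1954, 1954, 1987, 1987, 2001],
    "postal_area": ["0150", "0150", "5003", "5003", "9000"],
    "diagnosis_group": ["A", "A", "B", "B", "C"],
})

quasi_identifiers = ["birth_year", "postal_area"]
k = 5  # example threshold: every combination should occur at least k times

group_sizes = df.groupby(quasi_identifiers).size()
risky = group_sizes[group_sizes < k]

print(f"{len(risky)} of {len(group_sizes)} quasi-identifier combinations occur fewer than {k} times")
```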
Complex regulations
Development and use of artificial intelligence, including large language models, in health services must always take place within the framework of applicable law. Users (deployers) as well as developers and providers of AI systems intended for use in health and care services in Norway must comply with a range of laws and regulations, both general and sector-specific. Sector-specific rules include Norwegian health legislation and the EU regulation on medical devices. General rules include the EU's General Data Protection Regulation, copyright law, and equality and anti-discrimination law. In addition, the AI Act applies; it has entered into force in the EU and is to be implemented in Norwegian law as soon as possible [70].
The relationship between the various, and sometimes overlapping, regulations is complex. This complexity can lead to divergent understandings and interpretations of the rules, to processes taking time, or to the scope for action within the regulations not being fully utilized.
Uncertainty about the legal scope for action within applicable law can lead to non-compliance. Businesses, healthcare personnel, or citizens may end up using AI systems with large language models in violation of regulatory requirements, or conversely adopting overly restrictive interpretations that do not fully utilize the legal scope for action. For example, there may be uncertainty about whether an AI system should be classified as a medical device and, if so, which risk class it belongs to [71].
The risk class under the medical device regulations determines the risk class a tool receives under the AI Act, and is thus decisive for which requirements apply to, among other things, quality assurance and documentation under both sets of rules. AI systems classified as medical devices often [72] are also classified as high-risk systems under the AI Act [73]. Uncertainty about AI Act requirements therefore arises for AI tools where it is unclear whether they fall under the definition of a medical device, or which device class applies. Examples include chatbots used in therapy and applications that use generative AI to draft patient records [74].
Unintended illegal use
General-purpose large language models can be used for a wide range of tasks, including by healthcare personnel and patients. Because the technology is easily accessible and simple to use, there is a risk that AI systems are used for purposes they were not intended or regulated for. Some examples are given below.
Illegal use of large language models in health and care services. If healthcare personnel are to use an AI tool for medical purposes, the regulations on the handling of medical devices require them to use CE-marked devices [75]. Without access to quality-assured AI tools (such as CE-marked products), healthcare personnel may use large language models for purposes they were not developed for, making such use potentially illegal. Impatience, ignorance, or a lack of clear guidelines can lead healthcare personnel to use language models for tasks that are not in line with the intended area(s) of use defined by the manufacturers or providers of the AI tools.
Irresponsible use of AI tools by citizens. Citizens and patients have easy access to AI tools marketed as lifestyle tools, and to general-purpose AI models without a specific medical purpose. Citizens may nevertheless use these tools for, for example, self-diagnosis or medical advice. This carries a risk of incorrect or misleading information, and of that information being misinterpreted. Such misinformation can affect health decisions and, in the worst case, threaten patient safety if the person is not in contact with the health services [76].
There is also a risk related to unlawful use of health information for training large language models. Section 29 of the Health Personnel Act allows dispensation from the duty of confidentiality so that sensitive information can be used to train large language models for specified purposes [77]. If the model is later used for other tasks, meaning that the health information is used for a purpose other than the one covered by the dispensation decision, a legal basis for the new purpose is required. If this is not ensured, health information may be used for new purposes without a sufficient legal basis.
Risk of discrimination
Unrepresentative training data for large language models can introduce biases that create a risk of discrimination in health and care services, a concern also highlighted by WHO in its report [78]. If such biases affect someone's right to benefits or access to health services, this can constitute unlawful discrimination. AI systems that may pose this risk fall under the definition of high-risk AI systems in the AI Act [79]. Under the AI Act, both public and private users (deployers) of high-risk AI systems must in such cases perform a Fundamental Rights Impact Assessment (FRIA). This is an assessment of how an AI system can affect human rights, at both the individual and group level [80]. If the use of an AI system reveals a risk of discrimination against a particular group of patients, one measure could be to offer this group an alternative without AI.
Environment and sustainability
Development, testing, and use of large language models often require high energy consumption and leave a significant environmental footprint. Energy consumption is linked to pre-training and post-training of models [81], and to their use [82]. The environmental impact also includes high water consumption for cooling large data centers and the extraction of rare minerals needed for the hardware used to train models [83][84].
The health and care sector is estimated to account for 4-5% of global greenhouse gas emissions, and both procurement and use of equipment and information technology contribute to emissions. The Norwegian Directorate of Health has published a roadmap containing concrete measures for how health and care services can become more environmentally friendly [85]. AI is highlighted as a technology that, used wisely, can facilitate a more sustainable health service, despite its considerable resource demands.
Where relevant, climate and environmental considerations must be weighted at 30% in new public procurements [86]. In practice, however, it can be difficult to get an overview of the climate footprint when procuring AI solutions. Suppliers often provide little information about energy consumption and carbon emissions, and there is no common standard for how such calculations should be performed. Under Article 53 of the AI Act, providers of large language models to be used in the EU are required to include information on computational resources and energy consumption in their technical documentation. The Transparency chapter of the voluntary Code of Practice provides guidance and a template for meeting this requirement [87].
As an alternative to large language models, small models are also being developed that require significantly less energy and fewer resources, both for training and in use [88]. Using techniques such as model distillation [89] and targeted fine-tuning, these models are performing increasingly well and can be a good alternative for solving a variety of tasks [90].
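Model distillation, mentioned above, trains a small "student" model to imitate the output distribution of a larger "teacher" model. The sketch below shows the core of a typical distillation loss in PyTorch; the temperature, weighting, and random tensors are illustrative only and do not describe any particular product.

```python
# Minimal sketch of a knowledge-distillation loss: the small student model is trained
# to match the softened output distribution of a larger teacher model.
# Model definitions and real data are omitted; temperature and weighting are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine soft-target loss (teacher imitation) with ordinary cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with random tensors standing in for real model outputs
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```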
Scientific evidence and responsibility
Implementing AI tools that have consequences for patients' medical treatment requires caution. The consequences of using such tools are comparable to those of new medications or other medical devices. New tests and evaluation areas, as well as studies and documentation, will need to be developed for large language models used in health care [91][92]. For example, WHO recommends conducting randomized clinical trials to demonstrate that an AI system intended for clinical practice performs better than the alternatives, rather than only testing in a laboratory or other controlled environments [93]. Without a sound scientific foundation, proper training, and correct implementation, the burden of responsibility will be unclear if errors occur and healthcare personnel have relied on responses from AI tools. Such situations can have major ethical and legal consequences. Note that the leadership of each organization is responsible for facilitating correct implementation, training, and organizational measures.
The European Commission recognized the challenges related to liability when using artificial intelligence and proposed a separate AI Liability Directive in 2022 [94]. The aim of the directive was to provide greater legal certainty for both patients and healthcare personnel should damage occur as a result of using AI tools. In February 2025 the proposal was withdrawn due to lack of agreement among member states, but the issue remains just as relevant.
[66] https://digital-transformation.hee.nhs.uk/building-a-digital-workforce/dart-ed/horizon-scanning/developing-healthcare-workers-confidence-in-ai/chapter-4-workforce-transformation/the-risk-of-deskilling
[67] https://assets.ctfassets.net/o78em1y1w4i4/kWTSca6VXZ54DBhAIYxJU/386d36dc0c03c4fa8de0365bbb2043e1/Insights_clinician_key_findings_toward_ai.pdf
[69] https://www.datatilsynet.no/rettigheter-og-plikter/virksomhetenes-plikter/informasjonssikkerhet-internkontroll/hvordan-anonymisere-personopplysninger/
[72] Medical devices of class IIa or higher according to the MDR
[77] https://www.helsedirektoratet.no/rundskriv/regelverket-for-utvikling-av-kunstig-intelligens/regler-for-de-ulike-prosjekttypene/reglene-for-utvikling-og-bruk-av-klinisk-beslutningsstotteverktoy
[78] https://www.who.int/publications/i/item/9789240084759 p.xi, Table 2, p.21
[81] https://hbr.org/2023/07/how-to-make-generative-ai-greener (training GPT-4 is estimated to have generated around 300 tons of CO₂ emissions, compared with an estimated 5 tons annually for a human)
[82] According to the ecological impact calculator available on Hugging Face, one query to GPT-4o with a relatively short response (400 tokens) is estimated to consume 35.1 Wh of electricity, approximately seven times more than charging a mobile phone. https://huggingface.co/spaces/genai-impact/ecologits-calculator (visited February 2025)
[85] https://www.helsedirektoratet.no/rapporter/veikart-mot-en-baerekraftig-lavutslipps-og-klimatilpasset-helse-og-omsorgstjeneste
[86] https://anskaffelser.no/verktoy/veiledere/veileder-til-regler-om-klima-og-miljohensyn-i-offentlige-anskaffelser
[87] https://artificialintelligenceact.eu/annex/11/ and https://code-of-practice.ai/?section=transparency#model-documentation-form