The potential of and expectations for large language models in health and care services are high, but the risks are also significant. It is crucial that AI systems with large language models used in health and care services are trustworthy, and this report identifies five measures that will contribute to this.
The US National Institute of Standards and Technology (NIST) characterizes trustworthy AI as: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed [4]. The European Union published ethical guidelines in 2019 stating that a trustworthy AI system has three components that should be fulfilled throughout the system's lifecycle: (1) it should be lawful, complying with all applicable laws and regulations, (2) it should be ethical, ensuring adherence to ethical principles and values, and (3) it should be robust, from both a technical and a social perspective, since even with good intentions AI systems can cause unintentional harm [5].
Objectives
The objective of this report is threefold:
- To describe and clarify risks associated with the use of generative AI and large language models in health and care services
- To describe how large language models can be adapted to the Norwegian health and care services, as one of several possible risk-reducing measures
- To propose measures that will contribute to making large language models well adapted to conditions in the Norwegian health and care sector
The report is intended as a basis for discussions about the prioritization of, and responsibility for, relevant measures that will contribute to this.
Target audience
The main target audience for the report is the Ministry of Health and Care Services, the AI Advisory board for the health and care sector, the health and care sector, collaboration partners (such as the National Library of Norway and R&D institutions), and other authorities relevant for initiating the recommended measures.
The report will also be of interest to those working with generative AI and health.
Scope
This report concerns generative artificial intelligence, mainly large language models. Large multimodal models are also briefly covered.
In May 2024, the Norwegian Directorate of Health published the knowledge base "Large language models in health and care services" [6]. That document describes how such models can contribute to improving health services, their potential benefits and challenges, and how large language models can be adapted for use in health and care services. The present report does not revisit that opportunity space; instead, it elaborates on the challenges and on how language models can be adapted to the health services.
In January 2025, the Norwegian Directorate of Health published "Report on quality assurance: Use of artificial intelligence in health and care services" [7]. That report covers, among other things, risk management and quality assurance related to the implementation and use of AI in general. The present report goes deeper into risks and quality assurance related to generative AI models, and especially large language models.
The Norwegian Directorate of Health develops AI services that use large language models, including Helsesvar and "Enklere tilgang til informasjon" (ETI). Privacy issues related to these projects are addressed in the Data Protection Authority's regulatory sandbox and in a subsequent report (which had not been published when this report was completed) [8]. The present report also addresses privacy-related risks but does not go into detail on specific use cases.
Method
The report is based on a combination of interviews and literature studies of Norwegian and international research articles and relevant reports.
We have interviewed professionals with experience of, or in-depth knowledge about, research on, development of, and use of large language models in Norway. The interviews were conducted from October 2024 to February 2025. The following institutions have been involved in the work: NTNU, NorwAI, National Library of Norway, Sørlandet Hospital, University of Oslo, Helse Vest ICT, a general practice clinic, University of Bergen, Simula Center, OsloMet, Norwegian Digitalisation Agency, Center for Patient-centered Artificial Intelligence (SPKI), and the Language Council of Norway.
Key terms
The definition of an AI system according to the AI Act, and as we use the term in this report, is: "...a machine-based system that is designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment, and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments" [9].
An AI system can take various forms: a standalone application on a PC or mobile device, a web service, or a part of a professional system or a health record system.
An AI system is a medical device under the Act on medical devices if it is intended to be used on humans for the purpose of contributing to diagnosis, prevention, monitoring, prediction, prognosis, treatment, or alleviation of disease [10][11].
An AI system can consist of one or more AI models, in addition to improvements such as knowledge grounding.
An algorithm is a recipe for how something is done; put simply, it consists of a set of instructions that transform input data into output data. A machine learning algorithm is the recipe the system uses to build an AI model of reality, based on the data it is fed (the training data). This model can then be used to make decisions about new incoming data [12].
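As a minimal illustration of the difference between the recipe (the learning algorithm) and the resulting model, the Python sketch below fits a model to a handful of made-up training examples and then uses it on new input. The data, the referral scenario, and the choice of the scikit-learn library are illustrative assumptions on our part, not part of the cited definition.

```python
# Minimal sketch (illustrative assumptions, not from the cited source):
# the machine learning algorithm is the "recipe"; running it on training data
# produces a model, which can then make decisions about new input data.
from sklearn.linear_model import LogisticRegression

# Made-up training data: each row is [symptom score, age],
# each label is the known outcome (0 = routine, 1 = urgent referral).
X_train = [[2, 34], [7, 61], [1, 25], [8, 70], [5, 49], [3, 41]]
y_train = [0, 1, 0, 1, 1, 0]

# The learning algorithm builds a model from the training data.
model = LogisticRegression()
model.fit(X_train, y_train)

# The model is then used on new, unseen input data.
new_case = [[6, 55]]
print(model.predict(new_case))        # predicted class for the new case
print(model.predict_proba(new_case))  # predicted probability for each class
```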
Foundation models are large neural networks trained on large general datasets that may consist of text, images, audio, etc. Such models can form the basis for a range of different AI solutions [13].
Generative AI encompasses solutions that primarily generate or produce new material, such as text or images [14]. Generative AI models differ from non-generative AI models in the following ways:
- Generative AI models create new content, such as text, images, or music, based on their training data. For example, a generative model can create new images of cats based on previous cat images.
- Non-generative AI models, including classification models, distinguish between different classes or predict probabilities for specific outcomes. They focus on finding boundaries between categories in the data. A classification model can, for example, determine whether a given image contains a cat or not.
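The difference in the kind of output can be shown with a deliberately trivial sketch. The two functions below are toy stand-ins that we have made up for illustration; they are not real AI models or any existing API, but they show that a classification model returns a label from a fixed set of classes, while a generative model produces new content.

```python
# Toy stand-ins (made up for illustration; not real AI models or any existing API).

def classify_referral(text: str) -> str:
    """Non-generative (classification) model: maps the input to one of a fixed set of classes."""
    return "urgent" if "acute" in text.lower() else "routine"

def generate_reply(text: str) -> str:
    """Generative model: produces new content (here, new text) based on the input."""
    return f"Thank you for the referral. Based on the description '{text}', we suggest the following next steps: ..."

referral = "Patient with acute chest pain, requests rapid assessment."
print(classify_referral(referral))  # output is a class label: 'urgent'
print(generate_reply(referral))     # output is newly generated text
```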
Large language models (LLMs) are generative AI models. Large language models, like the well-known GPT models, are a form of foundation model trained on enormous amounts of text to predict what comes after a word or syllable in a given context. Language models do not store the text they are trained on. They "know" nothing about the world, and they do not browse websites to find facts. They only know language. Nevertheless, it is easy to believe that the models think, or know something, because they are so good at writing text that we humans experience as meaningful [15].
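The core mechanism, predicting what comes next from the preceding context, can be illustrated with a deliberately simplified sketch. Real large language models use transformer neural networks trained on billions of words; the toy bigram counter below, with made-up example text, only shows the principle that the model learns from text alone which word tends to follow which.

```python
# Toy illustration of next-word prediction (not how real LLMs are implemented):
# count, from text alone, which word tends to follow which, and use those
# counts to predict the most likely next word given the previous word.
from collections import Counter, defaultdict

training_text = (
    "the patient was admitted to the ward and "
    "the patient was discharged from the ward"
)

# "Training": count word-to-next-word transitions in the training text.
next_word_counts = defaultdict(Counter)
words = training_text.split()
for current_word, following_word in zip(words, words[1:]):
    next_word_counts[current_word][following_word] += 1

def predict_next(word: str) -> str:
    """Return the most frequent next word seen after 'word' in the training text."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("patient"))  # -> 'was' (the word that followed 'patient' in training)
print(predict_next("ward"))     # -> 'and'
```

Because such a model has only seen statistics of word sequences, it has no notion of whether its output is factually correct, which is exactly the point above that language models "only know language".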
More recently, large language models have been extended to handle other modalities, such as images, audio, and video. These are called multimodal AI models, but the term "LLM" is still often used. This is due to several factors:
- The core of the model is linguistic: even though they can process multiple modalities, the text interface is still the primary way users interact with the model.
- Underlying technology: the models build on architectures developed for language processing (such as transformers), and much of the training occurs on large text datasets.
- General professional terminology: within the AI field, the term "LLM" is still used broadly, as most multimodal models spring from LLMs that have been extended to interpret other data types.
In this report, we use the same definition of risk as in the AI Act: "Risk means the combination of the probability of an occurrence of harm and the severity of that harm" [16].
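The AI Act describes risk as a combination of probability and severity without prescribing a formula. A common simplified way to operationalize the combination in risk assessments, used here purely as an illustration and not as the AI Act's wording, is as the product of the two factors:

$$\text{risk} = \text{probability of harm} \times \text{severity of harm}$$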
Other terms used in the report are described in the AI fact sheet "AI Terms" [17].
[5] https://op.europa.eu/en/publication-detail/-/publication/d3988569-0434-11ea-8c1f-01aa75ed71a1/language-en
[7] https://www.helsedirektoratet.no/rapporter/report-on-quality-assurance-use-of-artificial-intelligence-in-health-and-care-services
[8] https://www.datatilsynet.no/regelverk-og-verktoy/sandkasse-for-kunstig-intelligens/pagaende-prosjekter2/helsedirektoratet/
[9] Article 3 in the AI Act https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=OJ:L_202401689
[12] https://media.wpd.digital/teknologiradet/uploads/2022/02/Kunstig-intelligens-i-klinikken.pdf p.16
[13] https://www.regjeringen.no/contentassets/c499c3b6c93740bd989c43d886f65924/en-gb/pdfs/digitaliseringsstrategi_eng.pdf p.69
[14] https://www.regjeringen.no/contentassets/c499c3b6c93740bd989c43d886f65924/en-gb/pdfs/digitaliseringsstrategi_eng.pdf p.69