Large language models are very complex, and the generated responses are based on statistical patterns in training data rather than actual understanding of content. Errors, biases, and lack of transparency can have consequences for patient safety and quality of health and care services. This chapter describes some of the underlying causes that come from training and characteristics of large language models.

Poor data quality and bias
Training large language models requires enormous amounts of data collected from sources such as books, journals, websites, and social media. Good sources can also be patient records and medical images.
The quality of the data will affect the performance of the models. Some factors that affect data quality are:
- Incomplete or deficient data. Dropout of enrolled patients in clinical studies can lead to incomplete data. Lack of registered information leads to incomplete and deficient data [21]. This can be because the data has not been collected systematically. Patient records may also lack information due to varying practices in data registration [22]. Another reason may be that privacy considerations can result in omission of data.
- Misinformation refers to information that is false or misleading [23]. Disinformation is a systematic and deliberate form of misinformation [24]. Small amounts of misinformation in training data have been shown to lead to misleading responses from medical AI models [25]. Large language models may be trained on data from the internet and/or third-party data sources of varying quality. These sources may contain misinformation that can be difficult to detect.
AI models can perpetuate and amplify biases found in training data, as well as contribute to reinforcing biases that affect human assessments and judgment [26][27]. Below are some examples of types of biases in data that are important to be aware of, especially in the health field:
- Overrepresentation of historical data compared to newer data makes the responses from models not updated according to current practices and research. In medicine and health, new knowledge should be weighted heavily compared to historical practice.
- Selection bias in, for example, publication of scientific articles. Positive findings and data are published more often than negative findings [28]. Clinical studies with significant results are reported more often than negative studies [29]. This means that the texts and publications used to train large language models are systematically biased toward positive findings and studies that confirm hypotheses, while negative results are underrepresented. When reporting intervention studies, for example, benefits are often emphasized more than disadvantages.
- Lack of local context. Training data for large, general language models has a strong overrepresentation of text and information in English. In some cases, models are trained only on English text. Norwegian content from the internet will in any case constitute a very small part of training data [30][31]. Watson Oncology is an example of how lack of local context can affect a model's performance [32]. When the model was tested in Denmark, it had considerably worse performance than previously reported. Danish clinicians pointed to both major differences between Danish and American treatment guidelines for cancer, as well as poorer quality of data from clinical studies conducted in the USA as possible causes.
- Skewed or incomplete data for certain groups and minorities. Not all groups and minorities are equally represented in the data used to train large language models [33]. Historically, for example, women have been underrepresented in clinical studies [34]. People with disabilities or minority backgrounds may also be underrepresented in clinical studies. This may be due to deliberate exclusion from studies, or practical barriers such as transport, economy, or other obstacles to participation [35][36.] This is also relevant internally in Norway where there may be less available data from minorities and the Sami indigenous population [37].
Lack of transparency, explainability, and interpretability ("black-box" problem)
Large language models can function as "black boxes," where neither users nor developers can fully explain how a specific response (output) from an AI model is generated. Lack of transparency, explainability, and interpretability can contribute to this "black box" problem [38]. These are distinct characteristics that build on each other.
Transparency refers to what happened in the model. Indicators that contribute to transparency include available information about:
- Data used for training
- Source code
- Training methods
- Evaluation
- Performance measures
The Foundation Model Transparency Index is based on 100 different indicators of transparency related to training, the model itself, or use of the model [39]. For example, Mistral 7B, which can be used by third parties without restrictions, has an index of 55%, meaning that there is no information available for 45 of the 100 indicators (including data sources) [40].
Explainability refers to how an AI system works and what mechanisms govern the decisions it makes. With many millions or billions of parameters and complex neural networks, large language models can operate as black boxes, even with openness around both data and source code [41].
Interpretability refers to why the model arrived at a response [42]. This can be showing users what instructions were given to arrive at the answer, which current AI systems rarely show. A method that contributes to better interpretability is reasoning (see AI fact sheet Intelligence enhancement and control) [43].
Large language models operate probabilistically and generate the most likely response based on training data, but often without clear understanding of when responses are correct. Lack of explainability and interpretability can, for example, make it difficult for clinicians to assess whether recommendations from language models are reliable, because the underlying mechanisms and data are opaque [44].
Methods for grounding models in facts, making reasoning visible, or showing which parameters are weighted (see AI fact sheet Intelligence enhancement and control [45]) can make it easier for healthcare personnel to assess the quality of the generated response [46] At the same time, use of grounding methods can reduce attention to the model's limitations that still require critical evaluation of outputs [47].
The lack of explainability and interpretability also makes it difficult to detect when a model generates incorrect or fabricated responses, also known as hallucinations.
Hallucination
Because language models do not have built-in understanding of knowledge and truth, but operate based on probability calculations, they can generate incorrect, incomplete, inaccurate, or misleading information with great confidence. This phenomenon, known as hallucination, means that the model can present information that looks credible but is either wrong, taken out of context, or has no basis in training data.
The probability of hallucination increases if training data has low quality or lacks representation, for example if important medical perspectives or updated guidelines are missing. Even models trained on high-quality data can hallucinate, as this is an inherent limitation of large language models [48].
Hallucination can be divided into two categories:
- Fact-hallucination, where the model generates false or fabricated claims
- Context-hallucination, where information is misinterpreted or placed in an incorrect context
In addition, omission of critical information can pose a risk, particularly when strict accuracy is required such as in health contexts.
Temperature setting (see fact box about temperature), post-training, and methods for knowledge grounding can reduce the occurrence of hallucination (see chapter 4.2).
Temperature (degree of variation)
In the field of AI, temperature is a parameter that controls how random the generated text should be. Temperature can be adjusted in the selection phase, where the model chooses the next word based on calculated probabilities. Low temperature (for example 0.1-0.3) makes the model more often choose the most probable words, giving more predictable and conservative responses. High temperature (for example 0.8-1.0) increases the chance that it chooses more unexpected or less probable words, giving more creative and varied responses. For medical applications, choosing lower temperature would ensure more precise and consistent responses.
Example (generated by ChatGPT-4o): Question: "What are common causes of headache?"
- Low temperature (0.2) gives a concise, accurate, and objective response: "Common causes of headache include stress, dehydration, muscle tension, sleep deprivation, and migraine. Some medical conditions, such as high blood pressure or sinusitis, can also cause headache."
- High temperature (0.8) gives a more varied, informal, and descriptive response: "Headache can be caused by everything from stress and dehydration to poor sleep or tense neck muscles. Some get headaches from strong smells or loud sounds, while others may experience it as a symptom of underlying diseases such as sinusitis or migraine. In rare cases, headache can be a sign of more serious conditions."
Degradation of model performance over time
Large language models are dynamic and influenced by both external and internal factors over time. Three challenges related to degradation of model performance are model drift, catastrophic forgetting, and model collapse. These phenomena can reduce the model's ability to provide accurate, relevant, and reliable responses, which is critical in health contexts.
Model drift describes how a model's performance can weaken over time, either because it encounters new data that differs from the training foundation, or because the relationship between input and output changes [49]. Model drift can lead to AI systems giving outdated or incorrect recommendations or information. This is relevant in the health field, where patient populations, treatment methods, and medical guidelines constantly evolve. Some underlying causes are:
- Data drift which occurs when the distribution of input data the model encounters in daily use differs from the data used to train and optimize the model. For example, a model developed for a younger patient population may give less precise recommendations if used instead on an older patient group.
- Concept drift which happens when the relationship between input and output changes over time. For example, if a new treatment or lifestyle change proves to reduce the relationship between age and diabetes risk, the model will give outdated or misleading recommendations because it still relies on the old relationship.
- Outdated information. It is resource-intensive to train large language models, and they will not always be trained on the newest available information. This can lead to models giving outdated responses that affect model performance in medicine and health [50][51].
Catastrophic forgetting (catastrophic interference) occurs when a model loses or degrades previously learned knowledge as a result of training on new data. For example, a large language model may lose precision for medical diagnostics if it is post-trained on general text.
Catastrophic Forgetting
Causes of catastrophic forgetting include:
- updating all weights in the model during re-training
- sequential learning where old data is no longer available
- overfitting to new data, displacing previous learning
There are many strategies for continual learning over time that can counteract catastrophic forgetting. These include selective weight freezing to preserve critical knowledge, experience replay techniques that mix historical data with new training data, or architecture-based solutions such as progressive neural networks where new tasks are handled in separate layers. These methods can be implemented individually or in combination to achieve more robust learning over time.
Source: https://link.springer.com/article/10.1007/s11063-024-11709-7
Model collapse is a long-term challenge that can occur when an increasingly large proportion of training data is generated by language models rather than humans. This can lead to models learning from previous models instead of from genuine human knowledge, which gradually weakens the quality and diversity of training data [52]. Model collapse is not just about synthetic data, but also about loss of model complexity and variation. When models learn from each other, subtle patterns, nuances, and extreme values in the data can disappear. If this leads to gradual degradation of the model's ability to capture complex relationships and produce varied content, it can result in irreversible errors or defects. This risk is not acute but may become a challenge in the future if large amounts of future data are synthetically generated.
Vulnerabilities in cybersecurity and information security
Cybersecurity must be part the supply chain assessment of a language model. Traditional cybersecurity measures such as access control, logging, and testing are still relevant but must be adapted to the unique vulnerabilities of language models. Risk assessments should additionally include the model's access levels, data handling, and protection against input manipulation to ensure safe solutions in the health and care sector.
Large language models differ from traditional software in their complexity and lack of transparency (see under "Lack of transsparancy, explainability, and interpretability ("black-box" problem) in this chapter), which creates particular cybersecurity challenges. Unlike traditional software, where source code can be reviewed for security assessment, much of the functionality of language models is based on enormous datasets that can be difficult to verify (see under "Poor data quality and bias" in this chapter). Often the foundation model itself is a "black box," and the supply chain is complex with various actors for model development, data, and software. This makes it challenging to detect vulnerabilities related to, for example, poisoned training data or backdoors in the model.
Language models have broad areas of use and can be used in unforeseen ways. If they have access to documents or systems, they can inadvertently expose sensitive information. An example is Copilot, which has access to all data the user has access to, which places great demands on control routines [53]. In the health sector, where integrity and confidentiality are crucial, this is a particular challenge.
A known risk is prompt injection, where attackers manipulate the model's responses using specially designed queries to the system. This way, an attacker can, for example, bypass security mechanisms or reveal training data. Indirect prompt injection, where malicious input comes from external documents or websites, is particularly concerning as it can affect the response without the user becoming suspicious.
Large language models require significant resources, making them vulnerable to Distributed Denial of Service attacks (DDoS) through many or resource-intensive queries.
The Open Worldwide Application Security Project (OWASP) [54] and MITRE [55] point to a range of specific security risks with language models and update them regularly in pace with technological development.
[21] https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-missing-data-confirmatory-clinical-trials_en.pdf
[23] https://tenk.faktisk.no/ordbok Norwegian term: Feilinformasjon
[24] https://tenk.faktisk.no/ordbok Norwegian term: Desinformasjon
[31] URLs for the .no domain constituted 0.3% of all URLs in the web-crawl for Common Crawl February 2025: https://commoncrawl.github.io/cc-crawl-statistics/plots/tld/latestcrawl.html
[37] https://uit.no/research/sshf-no?p_document_id=674134&Baseurl=/research/#region_752662 The Centre for Sami Health Research (SSHF) was established in 2001 because of lacking knowledge about health and living conditions of the Sami population.
[19] Center for Research on Foundation Models (CRFM) at Stanford: https://crfm.stanford.edu/fmti/paper.pdf
[47] ed3fea9033a80fea1376299fa7863f4a-Paper-Conference.pdf