Through work on the report, we have received input on what should be in place to adapt large language models and their use to Norwegian conditions in health and care services. The input includes access to data, computing power, quality frameworks, competence, and a governance regime. This chapter goes through these areas, which also form the basis for recommendations for measures in the Joint AI Plan (see Chapter 5).

Sufficient data and data quality
To adapt language models to the Norwegian health and care sector, there is a need for significant amounts of high-quality data for both training and testing.
The digitalization strategy highlights the importance of better and easier access to data for the health and care sector: "The health and care services sector has a great deal of information that can be useful for developing AI, such as registry data, medical imagery and patient records. It must become easier for relevant actors to access health data to use it with AI. Improved and easier access to health data is important for the further development of our common health service, and for research and business development, but national security interests must be protected." [122]
A key factor for using data for adaptation to Norwegian conditions is data quality. We have described risks related to poor data quality in large language models in section 3.1.1. Data quality refers to how reliable, accurate, and applicable data is for a given purpose. High data quality is crucial for good analyses, decisions, and models.
Key factors for good data quality are:
- accuracy: the data correctly reflects reality
- representativeness: the data represents the entire population in Norway, including minorities and the Sami indigenous population
- completeness: no important data is missing
- consistency: the data is coherent across sources and systems
- reliability: the data comes from a credible source and is verifiable
- timeliness: the data is updated and relevant
- uniqueness: no duplicated or contradictory entries
- relevance: the data is useful for the purpose it is used for
- balance: the various health professional areas must be covered
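Some of the factors listed above, such as completeness and uniqueness, can be measured mechanically before a dataset is used for training. The following is a minimal sketch in Python, assuming the data is available as a list of records with named fields. The function name quality_report, the field names, and the sample records are hypothetical. Factors such as representativeness and balance require domain knowledge and are not covered here.

```python
from collections import Counter


def _filled(value):
    """A field counts as filled when it is present and not blank."""
    return value is not None and str(value).strip() != ""


def quality_report(records, required_fields):
    """Return simple completeness and uniqueness figures for a list of dict records."""
    total = len(records)
    if total == 0:
        return {"completeness": 0.0, "duplicate_rate": 0.0}
    # Completeness: share of records where every required field is filled.
    complete = sum(
        all(_filled(r.get(field)) for field in required_fields) for r in records
    )
    # Uniqueness: share of records that exactly repeat another record.
    counts = Counter(tuple(sorted(r.items())) for r in records)
    duplicates = sum(n - 1 for n in counts.values())
    return {"completeness": complete / total, "duplicate_rate": duplicates / total}


# Hypothetical sample records, for illustration only.
sample = [
    {"id": "1", "text": "Pasienten har hoste.", "source": "guideline"},
    {"id": "2", "text": "", "source": "guideline"},
    {"id": "1", "text": "Pasienten har hoste.", "source": "guideline"},
    {"id": "3", "text": "Kontroll om to uker.", "source": "handbook"},
]
report = quality_report(sample, ["id", "text", "source"])
print(report)
```

Figures like these could be recorded in the data fields of a dataset description, but they only indicate where manual review is needed.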
Few data sources will be perfect, but various frameworks for describing datasets can be used to increase transparency around the data used to develop language models for use in the health and care sector [123]. Both Model Documentation Form for large language models related to the AI Act [124] and model cards from HuggingFace [125] have dedicated fields for describing data.
Collection and processing of data requires substantial effort. This also applies to clarifying whether there are legal, technical, or other limitations.
One can distinguish between labeled data and unlabeled data. Labeled data often has higher quality since it can contain semantic or contextual information that gives added value.
The question of availability of necessary data has been significant for several informants in this report.
Data categories
Data for language models can overall be divided into three main groups:
A. Freely available data
B. Data with limited access due to rights considerations
C. Data with limited access due to confidentiality and privacy considerations
A Freely available data
The Government's AI strategy has as a "goal that data that can be made openly available shall be shared so that they can be used by others." [126]
There are a number of datasets and sources that can freely be used for training language models, for example websites from public health authorities:
- Quality-assured health professional knowledge and information, for example from NIPH, specialist health services, Helsenorge.no, and Helsebiblioteket
- National requirements and recommendations from the Norwegian Directorate of Health
- Method books from specialist health services
- Free professional language resources, such as health professional terminologies and classifications without license requirements, for example the future ICD-11 and terms in SNOMED CT
B Data with limited access due to rights considerations
Some data may have limited access due to rights considerations, such as copyright, for example:
- Professional books and textbooks (see "Mímir project" below)
- Knowledge bases and reference works, for example Felleskatalogen and Store medisinske leksikon
- Professional language resources, for example health professional terminologies and professional dictionaries with license requirements
Mímir project and training on copyright-protected material
On behalf of the Government, the National Library collaborated with NorwAI at NTNU and Language Technology Group at UiO in 2024 to analyze the significance of copyright-protected material on linguistic quality in language models.
The language in language models with and without copyright-protected material such as Norwegian newspapers and books was compared. The results showed that language models trained on content where rights-protected Norwegian material is included achieve better quality.
The goal of the project was to gather empirical evidence that could form the basis for possible agreements between the state and rights holders on the use of copyrighted content for AI purposes. Work is now underway to establish principles related to such agreements (as of March 1, 2025).
Source: https://www.nb.no/content/uploads/2024/08/Mimirprosjektet_teknisk-rapport.pdf
Training on professional language resources
There are several professional language resources that can be used to train language models, including terminologies and classifications.
SNOMED CT is an international machine-readable terminology with nearly 370,000 health professional concepts. Approximately 130,000 of the concepts are translated into Norwegian and cover health professional areas such as anatomy, findings, symptoms, diagnoses, procedures, substances, and medicines. SNOMED CT is mainly used for documentation and exchange of patient information.
The translation is done in Bokmål, while an increasing part is also available in Nynorsk. The resource is multilingual in that there is a direct connection between English and Norwegian terms.
The Norwegian translations are freely available from the Language Bank at the National Library. Internationally, the terms in SNOMED CT have been used to train several language models (pubmed.ncbi.nlm.nih.gov).
The Norwegian Directorate of Health manages the Norwegian version of SNOMED CT with regular upgrades.
SNOMED CT also contains a deeper machine-readable structure (ontology) with, among other things, concept relationships. SNOMED CT's ontology has been used for RAG in language models, also in Norway, but requires a license from the Norwegian Directorate of Health.
Another relevant resource is the International Statistical Classification of Diseases and Related Health Problems (ICD), which is owned and managed by WHO. The health service in Norway currently uses ICD-10 as a code system for diseases and causes of death. As an international classification, ICD contains terms and concepts that are in established use internationally.
The Norwegian Directorate of Health manages the Norwegian version of ICD. WHO has now updated ICD-10 to ICD-11. ICD-11 has updated medical content that is much more comprehensive and also includes a terminology. The resource is now being translated into Norwegian and will be freely available for use.
Similarly, ICPC-2 is the international classification for health problems, diagnoses, and other reasons for contact with primary health services. ICPC-2 is in use in Norway and the Norwegian Directorate of Health maintains both this and the website where the updated international edition of ICPC-2 (English version, ICPC-2e-English) is published on behalf of the Wonca International Classification Committee (WICC).
The Norwegian Directorate of Health also manages a number of other relevant classifications, for example the Norwegian Procedure Code System, Norwegian Laboratory Code System with associated code systems (Test Material, Anatomical Localization, Textual Result Values, and Examination Method), Norwegian Pathology Code System (NORPAT), and Activity Codes for pathology laboratories (APAT). The development of the Norwegian Procedure Code System was initiated by the Nordic Council and still has a Nordic-Baltic core (NCSP). This contains, for example, terms for procedures and procedure groups in use in specialist health services in the Nordic-Baltic countries.
C Data with limited access due to confidentiality and privacy considerations
Several data sources have limited access due to confidentiality and privacy considerations. This applies, for example, to:
- Texts from patient records (electronic health records)
- Data from quality registries and other health registries
- Health surveys such as HUNT
- Applications and responses related to Norwegian health administrative businesses
Training on text from patient records, Clinical NorBERT
Helse Vest IKT has, in collaboration with the health enterprises in Western Norway, trained a language model using, among other things, texts from patient records. To safeguard privacy, the texts have been anonymized by replacing place names with other place names, first and last names with other first and last names, dates with other dates, and so on. As part of the research project, the quality of the anonymization has been analyzed.
To gain access to the patient record texts for training the model, an application was submitted to the Regional Ethics Committee (REC), which granted an exemption from the confidentiality obligations in the Health Personnel Act for research purposes.
It is envisioned that Clinical NorBERT can be used for, for example, automated text analyses and machine-assisted coding. The language model is a BERT model that differs from generative language models. Therefore, there is little risk that the raw data on which the language model was originally trained can be recreated.
Clinical NorBERT can be made available for use by other actors in health and care services under certain conditions. If the language model is to be fine-tuned for a specific area of use, new approval from REC or the Norwegian Directorate of Health is required to use health data for this purpose.
Helse Vest IKT is now looking at possibilities for training generative language models.
Source: Helse Vest IKT
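The replacement-based approach described for Clinical NorBERT above can be illustrated with a minimal Python sketch. This is not the method used by Helse Vest IKT. The function name pseudonymize, the surrogate lists, the dd.mm.yyyy date format, and the assumption that names and places in the text have already been identified are all hypothetical. Real de-identification of patient records requires far more robust detection and a separate analysis of residual risk.

```python
import random
import re
from datetime import datetime, timedelta

# Hypothetical surrogate lists; a real project would use much larger, curated lists.
SURROGATE_NAMES = ["Kari", "Nora", "Emil"]
SURROGATE_PLACES = ["Lillevik", "Storfjord", "Granli"]
DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")


def pseudonymize(text, names, places, day_shift, rng):
    """Replace known names and places with surrogates and shift dd.mm.yyyy dates."""
    mapping = {}

    def pick(original, pool):
        # Reuse the same surrogate for repeated mentions so the text stays coherent.
        if original not in mapping:
            mapping[original] = rng.choice(pool)
        return mapping[original]

    def replace_terms(source, terms, pool):
        if not terms:
            return source
        # Longest terms first so a longer term is not shadowed by a shorter one.
        ordered = sorted(terms, key=len, reverse=True)
        pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
        return pattern.sub(lambda m: pick(m.group(0), pool), source)

    def shift_date(match):
        try:
            original = datetime.strptime(match.group(0), "%d.%m.%Y")
        except ValueError:
            return match.group(0)  # leave strings that are not valid dates untouched
        # A fixed shift keeps the intervals between events intact.
        return (original + timedelta(days=day_shift)).strftime("%d.%m.%Y")

    text = replace_terms(text, names, SURROGATE_NAMES)
    text = replace_terms(text, places, SURROGATE_PLACES)
    return DATE_PATTERN.sub(shift_date, text)


note = "Anne fra Bergen ble innlagt 03.02.2024. Anne ble utskrevet 05.02.2024."
result = pseudonymize(note, ["Anne"], ["Bergen"], day_shift=10, rng=random.Random(0))
print(result)
```

Using one surrogate per original value and a fixed date shift preserves the internal consistency of the text, which matters if the text is later used for training.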
Data sources such as texts from patient records can be particularly relevant for training large language models because they reflect actual language use and healthcare personnel's use of knowledge. However, it can be time-consuming to access such data [127]. This is partly because healthcare personnel are subject to confidentiality obligations under Chapter 5 of the Health Personnel Act. This limits free processing of health information from patient records and other health registries for training and use of language models. Section 29 of the Health Personnel Act nevertheless allows for dispensation from confidentiality obligations for use of health information from treatment-oriented health registries (patient records) when certain conditions are met [128]. Examples of such dispensation being granted are the use of texts from patient records to train the language models Clinical NorBERT at Helse Vest IKT and NorDeClin-BERT at the National Center for E-health Research.
Synthetic data
Limited access to sensitive personal data has led to discussion about the need to create synthetic texts [129]. Synthetic data that cannot be traced back to individuals is not personal data and is therefore not covered by confidentiality obligations. However, there are also a range of challenges related to synthetic data.
Research on ASR systems (Automatic Speech Recognition) and downstream AI models shows that synthetic data can have significant limitations. Research shows that synthetic transcriptions can contain errors, hallucinations, and unnatural language structures that affect the performance of models using these data [130]. For example, a study shows that using simulated ASR output to train models (by using text-to-speech and then speech-to-text) could make models more robust against errors, but quality still varied compared to authentic data [131].
A particular concern with using synthetic health data is that hallucinations in language models can lead to false information being incorporated into datasets [132]. One study shows that even advanced systems like Whisper can "invent or 'hallucinate' entire phrases and sentences" in about 1% of transcriptions [133]. In health contexts, such erroneous additions can lead to models being trained on medical information that was never spoken, which can have serious consequences for diagnosis and treatment recommendations.
In addition, one is, as mentioned, dependent on authentic data to create synthetic data, so the privacy problem cannot necessarily be avoided. The quality of synthetic data depends heavily on the quality and representativeness of source data, and the study indicates that certain patterns of error in original data can be amplified or create new problems in synthetic datasets.
There is therefore still a need for more knowledge to be able to say whether synthetic datasets can be sufficiently suitable for training language models, particularly when it comes to high-risk domains such as the health sector. Researchers recommend a combination of improved ASR technology, error-correcting methods, robust training, and cross-modal validation to reduce problems with synthetic data, but these challenges are still not fully solved [134].
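One common way to quantify the transcription errors discussed above is word error rate (WER), which counts the substitutions, deletions, and insertions needed to turn a transcription into a reference text, divided by the number of words in the reference. The Python sketch below is an illustration only and is not taken from the cited studies. The function name and the example sentences are hypothetical. Inserted words that were never spoken, as in the hallucination case, show up as insertions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between hypothesis and reference, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the transcription adds two words that were never spoken.
spoken = "pasienten har hatt hoste i tre dager"
transcribed = "pasienten har hatt hoste i tre dager og feber"
wer = word_error_rate(spoken, transcribed)
print(round(wer, 3))
```

WER treats all words as equally important, so a low WER does not rule out a clinically serious error such as an invented symptom or a wrong dosage.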
Infrastructure for computational resources
Adaptation of language models to Norwegian conditions will require extensive computational power and associated infrastructure. The need will depend on many factors, including how the adaptation is carried out.
The Research Council of Norway has conducted a concept selection study on the need for computing power and organization of national infrastructure [135]. The report points to an investment need of 3.4 billion kroner over the next five years. A cost-benefit assessment for each administrative area will be conducted in step 2. However, this step does not include the health and care sector and points out that a separate study is required since the sector is extensive and complex with special requirements for privacy.
Recent technological developments suggest that language models can now be developed much more efficiently than before, which may reduce previously assumed needs for computational power.
However, the large amount of sensitive data will still present challenges related to infrastructure for computing power. There may therefore be a need for suitable infrastructure capable of handling such data for training and fine-tuning of language models.
Quality framework for large language models
Evaluation of language models is complex and will encompass several types of tests (benchmarks). Relevant research on tests for language models adapted to general Norwegian conditions, such as language, has been conducted at the University of Oslo [136].
The evaluation should encompass both general and user-oriented testing. General quality measurement can be based on standardized tests that evaluate basic medical knowledge, for example through adapted versions of medical examination questions. User-oriented and context-specific quality measurement can be testing in real or simulated use situations that are representative of the model's intended use in Norwegian health and care services.
Testing using medical examination questions
International research often uses medical examination questions to test the quality of language models. A common dataset is MedQA, which includes over 60,000 questions from several medical examinations in the USA and China.
Such datasets can be relevant for testing general medical competence, for example anatomy and diagnoses. However, the clinical everyday life where AI tools are to function will differ significantly from an examination situation. Testing using examination questions will therefore not necessarily be sufficient to measure the quality of language models.
Another question is whether examination questions from the USA and China will work well in a Norwegian context. It is possible that separate datasets with examination questions based on Norwegian curricula should be created.
An additional point is that several international language models may already be trained on datasets like MedQA. The dataset would then be contaminated and unsuitable for testing language models.
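When the training data of a model is available, contamination of the kind described above can be screened for by looking for long word sequences that appear in both a test question and the training corpus. The following Python sketch illustrates the idea under that assumption. The function names, the n-gram length, and the example texts are hypothetical, and for many international models the training data is not accessible, in which case this check cannot be carried out.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, corpus_documents, n=8):
    """Return the share of benchmark items that share an n-gram with the corpus,
    together with the flagged items."""
    corpus_ngrams = set()
    for document in corpus_documents:
        corpus_ngrams |= ngrams(document, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_ngrams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged


# Hypothetical examples, for illustration only.
questions = [
    "A 45 year old man presents with chest pain radiating to the left arm",
    "Which nerve innervates the diaphragm in the human body",
]
corpus = [
    "forum post: a 45 year old man presents with chest pain radiating "
    "to the left arm what is the diagnosis",
]
rate, flagged = contamination_rate(questions, corpus, n=5)
print(rate, flagged)
```

Exact n-gram matching misses paraphrased or translated questions, so a low rate is not proof that a test set is uncontaminated.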
Currently, there is no generally accepted quality framework for evaluating language models for the health and care sector, but several point out that this is important to be able to use language models responsibly in health [137]. Such a framework will need to be developed gradually and be based on proven methods, best practices, and standards. A coalition aimed at promoting responsible development of AI for health (Coalition for Health AI (CHAI™)) outlines, for example, a framework and emphasizes the importance of developing good tests to evaluate large language models with regard to five fundamental principles: usefulness, fairness and equal treatment, transparency, safety, and security and privacy [138].
Through work on this report, it has been proposed, among other things, to establish a layered model to evaluate and test various characteristics that are important for the Norwegian health and care sector. Below is a description of possible areas for testing and what a layered model for evaluation can entail.
Possible areas for testing
The quality framework can encompass tests of models within several areas.
Health professional language:
- medical terminology, including technical terms, abbreviations, and jargon
- Bokmål, Nynorsk, and possibly Sami in health professional context
- linguistic precision, language adapted to different target groups such as healthcare personnel and patients, including people with different cultural backgrounds
Health professional knowledge and practice:
- national professional guidelines
- established medical procedures and treatment protocols
- acute versus non-acute situations
- identification and explanation of medical relationships
Health administrative knowledge and practice:
- structure and organization of Norwegian health and care services
- administrative routines and procedures
- referral and documentation practice
- interaction between different levels in health services
Values and ethics:
- respect for patient rights
- privacy and confidentiality
- ethically defensible recommendations
- complex ethical dilemmas
Legislation:
- legislation relevant to the health and care sector
- legal requirements for patient treatment
- healthcare personnel's duties and responsibilities
- documentation requirements and reporting obligations
For all these areas, it is essential to establish both quantitative and qualitative methods [139].
The evaluation can include various dimensions:
- systematic evaluation of the model's accuracy and reliability
- assessment of the model's ability to acknowledge its own uncertainty
- testing the model's robustness under different conditions
- continuous evaluation of the model's performance over time to detect any performance degradation
Layered model for testing language models
To ensure thorough and systematic evaluation, a layered model has been proposed through the work on this report, encompassing evaluation of basic, domain-specific, use-case-specific, and context-specific properties (see the layered model in Figure 7).

It may be appropriate to start by developing the framework for the lower levels, as well as use-cases with low risk. Where the use case falls under regulations for medical devices and/or the AI Act, standards, both existing and under development, will constitute separate frameworks that can be used to meet legal requirements.
The description below is a sketch illustrating what can typically be tested at each level and can form a starting point for further specification.
- Basic evaluation (level 1) encompasses fundamental properties that are critical for applications in the health and care sector. This includes testing linguistic quality in Bokmål and Nynorsk, and in Sami where relevant, as well as how the model handles general medical language and terminology, basic security and privacy, and technical performance such as response time and stability. At this level, establishing minimum requirements that all models must meet to be used in the health sector can be considered.
- Domain-specific evaluation (level 2) encompasses domain-specific properties that are important for the health and care sector. This can include testing the model for specialized medical professional language, clinical guidelines, health administrative processes, and ethical frameworks. At this level, the model's ability to handle regional and local conditions in the Norwegian health and care sector is also assessed.
- Use-case-specific evaluation (level 3) encompasses specific use cases within the health and care sector. For example, a model to be used in emergency care would be tested specifically for the ability to assist with triage assessments, emergency medical procedures, and coordination with other departments. A model for creating speech-to-summary of patient conversations should be tested for exactly this use case. Other use cases will have other specific requirements that are evaluated. Testing at this level helps assess how suitable the model is for its intended purpose.
- Context-specific evaluation (level 4) concerns the model's suitability in the specific implementation context. This encompasses testing integration with local systems, adaptation to established work processes, and handling of specific documentation and privacy requirements. Testing at this level also assesses the model's ability to meet local quality indicators and specific needs in the implementation environment.
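The four levels above can be read as a staged pipeline in which a model only proceeds to the next level when it meets the requirements of the current one. The Python sketch below illustrates this reading. The class and function names, the thresholds, and the fixed scores are hypothetical placeholders; in practice each check would run an actual test suite against the model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Level:
    number: int
    name: str
    checks: List[Callable[[], float]]  # each check returns a score between 0 and 1
    threshold: float                   # hypothetical minimum score for every check


def run_layered_evaluation(levels):
    """Run levels in order and stop at the first level that is not passed."""
    results = []
    for level in sorted(levels, key=lambda item: item.number):
        scores = [check() for check in level.checks]
        passed = bool(scores) and min(scores) >= level.threshold
        results.append((level.number, level.name, scores, passed))
        if not passed:
            break  # fundamental shortcomings are revealed before costlier tests run
    return results


# Fixed scores stand in for real test suites in this illustration.
levels = [
    Level(1, "basic", [lambda: 0.95, lambda: 0.90], threshold=0.85),
    Level(2, "domain-specific", [lambda: 0.70], threshold=0.80),
    Level(3, "use-case-specific", [lambda: 0.90], threshold=0.80),
    Level(4, "context-specific", [lambda: 0.90], threshold=0.80),
]
results = run_layered_evaluation(levels)
for number, name, scores, passed in results:
    print(number, name, scores, "passed" if passed else "failed")
```

In this example the model passes level 1 and fails level 2, so levels 3 and 4 are never run.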
Some advantages of a commonly accepted testing framework are that it:
- provides a basis for comparison between different models
- enables systematic evaluation from the general to the specific
- simplifies identification of weaknesses and areas for improvement
- facilitates targeted optimization
- ensures that critical aspects are evaluated
- streamlines the testing process by revealing fundamental shortcomings early
Testing should not only be done once, but continuously over time, to capture changes in performance, identify new challenges, and ensure lasting quality in the service. The framework should be regularly revised and updated in line with technological development and new experiences from practical use.
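Continuous testing of this kind can be supported by a simple rule that compares recent test scores with an earlier baseline. The following Python sketch shows one possible rule. The function name, the window sizes, the tolerance, and the score series are hypothetical, and a real monitoring setup would need thresholds that are justified for the specific use case.

```python
from statistics import mean


def detect_degradation(scores, baseline_size=3, window=3, tolerance=0.05):
    """Flag degradation when the mean of the latest scores falls more than
    `tolerance` below the mean of the first `baseline_size` scores."""
    if len(scores) < baseline_size + window:
        return False  # too few measurements to compare
    baseline = mean(scores[:baseline_size])
    recent = mean(scores[-window:])
    return baseline - recent > tolerance


# Hypothetical scores from repeated runs of the same test suite over time.
declining = [0.91, 0.90, 0.92, 0.89, 0.84, 0.82, 0.80]
stable = [0.91, 0.90, 0.92, 0.91, 0.90, 0.92]
print(detect_degradation(declining), detect_degradation(stable))
```

A rule like this only detects changes on the tests that are actually run, which is one reason the framework itself should be revised as new experience is gained.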
Competence in development and use
It is crucial that the health and care sector has sufficient access to necessary competence throughout the entire lifecycle of language models adapted to Norwegian conditions, from development and adaptation to testing and use [140]. Several informants have provided feedback that more competence in the health and care sector is necessary, including in artificial intelligence (AI), information and communication technology (ICT), health sciences, law, linguistics, and economics.
For the use of AI tools as part of healthcare, the requirements that follow from health legislation apply as elsewhere. The health and care service has a duty to ensure that the healthcare provided is sound. Among other things, it must ensure that the AI tools used contribute to making the healthcare provided safe and secure. According to the AI Act, organizations that deploy AI solutions are responsible for training users. The organization is also responsible for delegating tasks related to ensuring human oversight to persons who have the necessary competence and have undergone the necessary training to perform this function [141].
AI literacy is a central concept in the AI Act. There are also attempts to operationalize and specify the concept at the levels of 'knowledge', 'understanding', and 'skills' [142].
Research is also being conducted on AI literacy. An international meta-study indicates that healthcare personnel and students have low AI literacy [143]. We are not aware of a comprehensive overview of Norwegian conditions, but it may be reasonable to assume that there is insufficient AI literacy for safe and effective use, development, and testing of language models in Norwegian health and care services.
Governance of language models
Today there is no national governance, including infrastructure, for large language models. The technology is still rapidly evolving, and new international language models are constantly being introduced. Language models are also being developed and adapted to Norwegian conditions by actors in both the public sector and industry. However, there is no comprehensive national overview of which models exist and which are used in the health and care sector.
A stable governance structure can facilitate safe introduction of language models in health and care services by ensuring quality, relevance, and legal frameworks, among other things.
The Technology Council recommends defining selected Norwegian language models as a national common service, similar to ID-porten, Altinn, and the Population Register, to ensure good operation, management, and access [144]. Both Norwegian pre-trained language models and adapted language models can be included. The Technology Council further points out that a new function should be established for development and governance of such a common service. This could also apply to the health and care sector, like other common services within the sector. National management could address several of the risks described above through, for example, testing and evaluation of language models for the sector.
Common governance could encompass several functions, including:
- governance and further development of common data resources for adaptation to Norwegian conditions
- governance and further development of quality frameworks
- governance of common language models
- regulatory guidance for the health and care sector
- collaboration with research institutions and participation in relevant research projects
- making language models available for use in the sector or further development by, for example, suppliers or the public sector
- coordinating efforts in the sector related to language models
Common governance would not replace industry's role as supplier, but rather facilitate safe frameworks for innovation and use, for private and public organizations. Further development by commercial actors can contribute to a broader range of solutions. For example, a governance organization can test, adapt, and make available one or more pre-trained healthcare language models. Such models can be further developed for specific use cases by, for example, private or public actors.
A governance structure can ensure long-term sustainability, quality, and coordinated development of language models for the sector. The organization should be responsible for following technological development, assessing new opportunities, but also risks, and establishing clear frameworks for how models can be applied, with particular emphasis on patient safety and privacy. This way, it can provide advice on implementation and use of language models in the health and care sector.
Governance can take place in an existing organization, or a separate organization can be established, for example a dedicated center. There will be a need to investigate more closely what is needed and how such governance should be organized.
[126] https://www.regjeringen.no/contentassets/1febbbb2c4fd4b7d92c67ddd353b6ae8/en-gb/pdfs/ki-strategi_en.pdf
[127] https://ehealthresearch.no/nyheter/2024/en-revolusjon-for-helsesektoren-den-forste-norske-kliniske-sprakmodellen-utviklet
[128] https://www.helsedirektoratet.no/rundskriv/helsepersonelloven-med-kommentarer/taushetsplikt-og-opplysningsrett#paragraf-29-opplysninger-til-forskning-mv
[129] https://tidsskriftet.no/2024/11/kronikk/syntetiske-datasett-kan-gi-bedre-ki-modeller-helsetjenesten
[139] Examples of dimensions and axes are described in https://jamanetwork.com/journals/jama/fullarticle/2825147 and https://arxiv.org/abs/2211.09110