Training on text from patient records, Clinical NorBERT

Helse Vest IKT has, in collaboration with the health enterprises in Western Norway, trained a language model on data that includes text from patient records. To safeguard privacy, the texts were anonymized by replacing identifying details with substitutes of the same kind: place names with other place names, first and last names with other first and last names, dates with other dates, and so on. As part of the research project, the quality of the anonymization has been analyzed.
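
The replacement approach described above can be illustrated with a minimal Python sketch. It is not Helse Vest IKT's actual method, which the source does not describe in detail. The surrogate lists, the function names, the `(start, end, label)` span format, and the choice to replace dates by shifting them a random number of days are all assumptions made for illustration. The step that finds names and places in the text is assumed to exist upstream and is not shown.

```python
import random
import re
from datetime import date, timedelta

# Hypothetical surrogate pools; a real system would need far larger ones.
SURROGATE_FIRST_NAMES = ["Kari", "Ola", "Ingrid", "Lars"]
SURROGATE_LAST_NAMES = ["Hansen", "Moen", "Lie", "Dahl"]
SURROGATE_PLACES = ["Lillevik", "Storfjord", "Granli"]

# Assumed date format: DD.MM.YYYY
DATE_PATTERN = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def replace_entities(text, spans, rng):
    """Replace tagged spans with surrogates of the same type.

    `spans` is a list of (start, end, label) tuples that do not overlap,
    assumed to come from an upstream entity-recognition step.
    """
    pools = {
        "FIRST_NAME": SURROGATE_FIRST_NAMES,
        "LAST_NAME": SURROGATE_LAST_NAMES,
        "PLACE": SURROGATE_PLACES,
    }
    mapping = {}  # keeps repeated mentions consistent within one text
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        original = text[start:end]
        key = (label, original)
        if key not in mapping:
            candidates = [c for c in pools[label] if c != original]
            mapping[key] = rng.choice(candidates)
        pieces.append(text[cursor:start])
        pieces.append(mapping[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def shift_dates(text, rng, max_days=30):
    """Replace every DD.MM.YYYY date by shifting it a fixed, nonzero offset."""
    offset = rng.choice([d for d in range(-max_days, max_days + 1) if d != 0])

    def _shift(match):
        day, month, year = (int(g) for g in match.groups())
        try:
            shifted = date(year, month, day) + timedelta(days=offset)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid calendar date; leave as-is
        return shifted.strftime("%d.%m.%Y")

    return DATE_PATTERN.sub(_shift, text)


# Usage with an invented example sentence (not real patient data).
rng = random.Random(0)
note = "Anne Olsen from Bergen was admitted 03.02.2021."
spans = [(0, 4, "FIRST_NAME"), (5, 10, "LAST_NAME"), (16, 22, "PLACE")]
print(shift_dates(replace_entities(note, spans, rng), rng))
```

Using one offset for all dates in a text preserves the intervals between events, and the per-text mapping keeps a repeated name replaced by the same surrogate. Both are design choices of this sketch rather than details taken from the source.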

To gain access to the patient record texts for training the model, an exemption from the duty of confidentiality for research purposes under the Health Personnel Act was applied for and granted by the Regional Ethics Committee (REC).

It is envisioned that Clinical NorBERT can be used for tasks such as automated text analysis and machine-assisted coding. The language model is a BERT model, which differs from generative language models in that it is not built to generate free text. There is therefore little risk that the raw data on which the language model was originally trained can be recreated.

Clinical NorBERT can be made available for use by other actors in health and care services under certain conditions. If the language model is to be fine-tuned for a specific area of use, new approval from REC or the Norwegian Directorate of Health is required to use health data for this purpose.

Helse Vest IKT is now looking at possibilities for training generative language models.

Source: Helse Vest IKT