Testing using medical examination questions

International research often uses medical examination questions to test the quality of language models. A common dataset is MedQA, which includes over 60,000 questions from several medical examinations in the USA and China.

Such datasets can be relevant for testing general medical competence, for example anatomy and diagnoses. However, the everyday clinical settings in which AI tools are meant to operate differ significantly from an examination setting. Testing with examination questions is therefore not necessarily sufficient to measure the quality of language models.

Another question is whether examination questions from the USA and China will work well in a Norwegian context. It may be necessary to create separate datasets of examination questions based on the Norwegian curriculum.

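An additional point is that several international language models may already have been trained on datasets like MedQA. In that case the dataset is contaminated: a model may score well because it has seen the questions before, not because it has the underlying competence, which makes the dataset unsuitable for testing that model.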
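One common way to look for this kind of contamination is to check how much of each test question appears word for word in a sample of the model's training text. The following is a minimal sketch of such a check, not an established procedure. It assumes that a sample of the training text is available at all, which is often not the case for proprietary models. The function names, the n-gram length of 8 words, and the threshold of 0.5 are illustrative assumptions.

```python
import re


def ngrams(text, n=8):
    """Return the set of word n-grams in text, lowercased, punctuation ignored."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(question, corpus_ngrams, n=8):
    """Fraction of the question's n-grams that also occur in the corpus sample."""
    question_ngrams = ngrams(question, n)
    if not question_ngrams:
        # Question is shorter than n words: nothing to compare.
        return 0.0
    return len(question_ngrams & corpus_ngrams) / len(question_ngrams)


def flag_contaminated(questions, corpus_docs, n=8, threshold=0.5):
    """Return the questions whose overlap with the corpus sample reaches the threshold.

    n and threshold are illustrative defaults (assumptions), not validated values.
    """
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [q for q in questions if overlap_ratio(q, corpus_ngrams, n) >= threshold]
```

A low overlap does not prove that a question is unseen, since paraphrased or translated copies would not be caught by exact n-gram matching. A check like this can therefore only flag likely contamination, not rule it out.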