QingNangTCM:一种面向中医领域的参数高效微调大语言模型

QingNangTCM: a parameter-efficient fine-tuning large language model for traditional Chinese medicine

  • 摘要:
    目的 针对通用大语言模型在中医专业问答与临床推理中存在领域知识不足、专业对齐程度有限等问题,构建一种面向中医应用场景的专用大语言模型 QingNangTCM。
    方法 构建了一个包含10万条样本的中医领域语料库QnTCM_Dataset,该语料库在整合 ShenNong_TCM_Dataset 和 SymMap v2.0 的基础上,引入检索增强生成与角色驱动生成策略进行数据扩展,覆盖中医诊断问答、处方建议及中药知识等核心任务。以 GLM-4-9B-Chat 为基座模型,采用 P-Tuning v2 方法进行参数高效微调,得到 QingNangTCM 模型。本研究建立了多维评测体系,从准确性、覆盖性、一致性、安全性、专业性与流畅性等方面进行综合评估,采用双语评估替补(BLEU)、面向召回的要点评估替补(ROUGE)、显式排序的翻译评估指标(METEOR)等自动指标,并结合基于专家校验的 LLM-as-a-Judge 评测方法。同时设计症状分析、疾病诊疗、中药查询和失败案例四类模拟临床场景开展定性分析,并与 GLM-4-9B-Chat、DeepSeek-V2、HuatuoGPT-II(7B)及 GLM-4-9B-Chat(freeze-tuning)模型进行对比。
    结果 QingNangTCM 在 BLEU-1/2/3/4(0.425/0.298/0.137/0.064)、ROUGE-1/2(0.368/0.157)及 METEOR(0.218)指标上均取得最优表现,在准确性、覆盖性与一致性维度上的归一化综合性能达到 0.900。尽管其 ROUGE-L 指标(0.299)略低于 HuatuoGPT-II(7B)(0.351),但在专家验证的专业性与安全性胜率评估中分别达到 86% 和 73%。定性分析显示,该模型能够较好遵循“症状−证候−病机−治法”的中医诊疗推理链条,但在处理罕见中药及复杂证候组合时仍存在一定误判与幻觉现象。
    结论 通过将中医领域语料构建与参数高效的提示微调方法相结合,可增强大语言模型在中医相关任务中的推理与领域适配能力。相关工作为中医知识的数字化与智能化提供了一种技术框架,对辅助中医诊疗与教育具有一定的应用价值。


    Abstract:
    Objective To develop QingNangTCM, a specialized large language model (LLM) tailored for expert-level traditional Chinese medicine (TCM) question answering and clinical reasoning, addressing the limited domain knowledge and specialized alignment of general-purpose LLMs.
    Methods We constructed QnTCM_Dataset, a corpus of 100 000 entries, by integrating data from ShenNong_TCM_Dataset and SymMap v2.0, and synthesizing additional samples via retrieval-augmented generation (RAG) and persona-driven generation. The dataset comprehensively covers diagnostic inquiries, prescriptions, and herbal knowledge. Utilizing P-Tuning v2, we fine-tuned the GLM-4-9B-Chat backbone to develop QingNangTCM. A multi-dimensional evaluation framework, assessing accuracy, coverage, consistency, safety, professionalism, and fluency, was established using metrics such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), and LLM-as-a-Judge with expert review. Qualitative analysis was conducted across four simulated clinical scenarios: symptom analysis, disease treatment, herb inquiry, and failure cases. Baseline models included GLM-4-9B-Chat, DeepSeek-V2, HuatuoGPT-II (7B), and GLM-4-9B-Chat (freeze-tuning).
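The automatic metrics listed above can be sketched with a minimal, stdlib-only implementation. This is an illustrative simplification, not the paper's evaluation pipeline: it assumes whitespace tokenization, a single reference per candidate, and no smoothing for zero n-gram counts.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: geometric mean of clipped n-gram
    precisions up to max_n, times a brevity penalty (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    bp = (1.0 if len(candidate) > len(reference)
          else math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)
```

In practice, evaluations of this kind use standard toolkits (e.g. sacrebleu or the rouge-score package) rather than hand-rolled metrics; the sketch only shows what the reported numbers measure.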
    Results QingNangTCM achieved the highest scores in BLEU-1/2/3/4 (0.425/0.298/0.137/0.064), ROUGE-1/2 (0.368/0.157), and METEOR (0.218), with a normalized composite score of 0.900 across the accuracy, coverage, and consistency dimensions. Although its ROUGE-L score (0.299) was lower than that of HuatuoGPT-II (7B) (0.351), it achieved expert-validated win rates of 86% for professionalism and 73% for safety. Qualitative analysis showed that the model largely follows the “symptom-syndrome-pathogenesis-treatment” reasoning chain, though occasional misclassifications and hallucinations persisted when handling rare medicinal herbs and complex syndrome combinations.
    Conclusion Combining domain-specific corpus construction with parameter-efficient prompt tuning enhances the reasoning behavior and domain adaptation of LLMs for TCM-related tasks. This work provides a technical framework for the digital organization and intelligent utilization of TCM knowledge, with potential value for supporting diagnostic reasoning and medical education.
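As a rough illustration of the parameter-efficient fine-tuning described in the Methods, the sketch below uses the Hugging Face peft library. P-Tuning v2 trains continuous prompts injected at every transformer layer, a deep-prompt scheme peft exposes as prefix tuning; the model id, prompt length, and task type here are illustrative assumptions, not the paper's exact configuration.

```python
# Configuration sketch, assuming the transformers and peft libraries.
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

# Load the backbone (the public GLM-4-9B-Chat checkpoint on Hugging Face).
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat", trust_remote_code=True
)

# Deep continuous prompts prepended at each layer; only these are trained.
config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,  # generative QA over the TCM corpus
    num_virtual_tokens=128,        # assumed prompt length
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # backbone weights stay frozen
```

Because only the prompt parameters receive gradients, this setup fits a 9B backbone on far less GPU memory than full fine-tuning, which is the practical appeal of the approach summarized above.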

