QingNangTCM:一种面向中医领域的参数高效微调大语言模型

QingNangTCM: a parameter-efficient fine-tuning large language model for traditional Chinese medicine

  • 摘要:
    目的 针对通用大语言模型在中医专业问答与临床推理中存在领域知识不足、专业对齐程度有限等问题,构建一种面向中医应用场景的专用大语言模型 QingNangTCM。
    方法 构建了一个包含10万条样本的中医领域语料库QnTCM_Dataset,该语料库在整合 ShenNong_TCM_Dataset 和 SymMap v2.0 的基础上,引入检索增强生成与角色驱动生成策略进行数据扩展,覆盖中医诊断问答、处方建议及中药知识等核心任务。以 GLM-4-9B-Chat 为基座模型,采用 P-Tuning v2 方法进行参数高效微调,得到 QingNangTCM 模型。本研究建立了多维评测体系,从准确性、覆盖性、一致性、安全性、专业性与流畅性等方面进行综合评估,采用双语评估替补(BLEU)、面向召回的要点评估替补(ROUGE)、显式排序的翻译评估指标(METEOR)等自动指标,并结合基于专家校验的 LLM-as-a-Judge 评测方法。同时设计症状分析、疾病诊疗、中药查询和失败案例四类模拟临床场景开展定性分析,并与 GLM-4-9B-Chat、DeepSeek-V2、HuatuoGPT-II(7B)及 GLM-4-9B-Chat(freeze-tuning)模型进行对比。
    结果 QingNangTCM 在 BLEU-1/2/3/4(0.425/0.298/0.137/0.064)、ROUGE-1/2(0.368/0.157)及 METEOR(0.218)指标上均取得最优表现,在准确性、覆盖性与一致性维度上的归一化综合性能达到 0.900。尽管其 ROUGE-L 指标(0.299)略低于 HuatuoGPT-II(7B)(0.351),但在专家验证的专业性与安全性胜率评估中分别达到 86% 和 73%。定性分析显示,该模型能够较好遵循“症状−证候−病机−治法”的中医诊疗推理链条,但在处理罕见中药及复杂证候组合时仍存在一定误判与幻觉现象。
    结论 通过将中医领域语料构建与参数高效的提示微调方法相结合,可增强大语言模型在中医相关任务中的推理与领域适配能力。相关工作为中医知识的数字化与智能化提供了一种技术框架,对辅助中医诊疗与教育具有一定的应用价值。


    Abstract:
    Objective To develop QingNangTCM, a specialized large language model (LLM) tailored for expert-level traditional Chinese medicine (TCM) question answering and clinical reasoning, addressing the limited domain knowledge and specialized alignment of general-purpose LLMs.
    Methods We constructed QnTCM_Dataset, a corpus of 100 000 entries, by integrating data from ShenNong_TCM_Dataset and SymMap v2.0, and synthesizing additional samples via retrieval-augmented generation (RAG) and persona-driven generation. The dataset comprehensively covers diagnostic inquiries, prescriptions, and herbal knowledge. Utilizing P-Tuning v2, we fine-tuned the GLM-4-9B-Chat backbone to develop QingNangTCM. A multi-dimensional evaluation framework, assessing accuracy, coverage, consistency, safety, professionalism, and fluency, was established using metrics such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), metric for evaluation of translation with explicit ordering (METEOR), and LLM-as-a-Judge with expert review. Qualitative analysis was conducted across four simulated clinical scenarios: symptom analysis, disease treatment, herb inquiry, and failure cases. Baseline models included GLM-4-9B-Chat, DeepSeek-V2, HuatuoGPT-II (7B), and GLM-4-9B-Chat (freeze-tuning).
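The automatic metrics listed above can be sketched with a minimal, stdlib-only implementation. This is an illustrative simplification, not the paper's evaluation pipeline: it assumes whitespace tokenization, a single reference per candidate, and no smoothing for zero n-gram counts.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: geometric mean of clipped n-gram
    precisions up to max_n, times a brevity penalty (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    bp = (1.0 if len(candidate) > len(reference)
          else math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, cand[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)
```

In practice, evaluations of this kind use standard toolkits (e.g. sacrebleu or the rouge-score package) rather than hand-rolled metrics; the sketch only shows what the reported numbers measure.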
    Results QingNangTCM achieved the highest scores in BLEU-1/2/3/4 (0.425/0.298/0.137/0.064), ROUGE-1/2 (0.368/0.157), and METEOR (0.218), with a normalized composite score of 0.900 across the accuracy, coverage, and consistency dimensions. Although its ROUGE-L score (0.299) was lower than that of HuatuoGPT-II (7B) (0.351), it achieved expert-validated win rates of 86% for professionalism and 73% for safety. Qualitative analysis showed that the model largely follows the “symptom-syndrome-pathogenesis-treatment” reasoning chain, though occasional misclassifications and hallucinations persisted when handling rare medicinal herbs and complex syndrome combinations.
    Conclusion Combining domain-specific corpus construction with parameter-efficient prompt tuning enhances the reasoning behavior and domain adaptation of LLMs for TCM-related tasks. This work provides a technical framework for the digital organization and intelligent utilization of TCM knowledge, with potential value for supporting diagnostic reasoning and medical education.
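As a rough illustration of the parameter-efficient fine-tuning described in the Methods, the sketch below uses the Hugging Face peft library. P-Tuning v2 trains continuous prompts injected at every transformer layer, a deep-prompt scheme peft exposes as prefix tuning; the model id, prompt length, and task type here are illustrative assumptions, not the paper's exact configuration.

```python
# Configuration sketch, assuming the transformers and peft libraries.
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

# Load the backbone (the public GLM-4-9B-Chat checkpoint on Hugging Face).
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat", trust_remote_code=True
)

# Deep continuous prompts prepended at each layer; only these are trained.
config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,  # generative QA over the TCM corpus
    num_virtual_tokens=128,        # assumed prompt length
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # backbone weights stay frozen
```

Because only the prompt parameters receive gradients, this setup fits a 9B backbone on far less GPU memory than full fine-tuning, which is the practical appeal of the approach summarized above.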

