Classification of cold and hot medicinal properties of Chinese herbal medicines based on a graph convolutional network

  • Abstract:
    Objective  To develop a classification model based on a graph convolutional network (GCN) for efficiently classifying the cold and hot medicinal properties of Chinese herbal medicines (CHMs).
    Methods  After screening a dataset provided in the published literature, this study included 495 CHMs and their 8 075 compounds. Three molecular descriptors were used to represent the compounds: the molecular access system (MACCS), the extended connectivity fingerprint (ECFP), and the two-dimensional (2D) molecular descriptors computed by the RDKit open-source toolkit (RDKit_2D). A homogeneous graph with CHMs as nodes was constructed, and a GCN-based classification model for the cold and hot medicinal properties of CHMs was developed using the molecular descriptor information of the compounds as node features. Model performance was evaluated by accuracy and F1 score. The GCN model was compared with traditional machine learning approaches, namely decision tree (DT), random forest (RF), k-nearest neighbor (KNN), Naïve Bayes classifier (NBC), and support vector machine (SVM), and the MACCS, ECFP, and RDKit_2D molecular descriptors were compared with one another as features.
    Results  The GCN outperformed the traditional machine learning approaches. With MACCS as features, its accuracy and F1 score reached 0.836 4 and 0.845 3, respectively. Compared with the lowest-performing feature combination, OMER (the combination of MACCS, ECFP, and RDKit_2D only), the accuracy and F1 score increased by 0.869 0 and 0.812 0, respectively. The accuracy and F1 score of the baseline models were 0.505 1 and 0.501 8 for DT, 0.616 2 and 0.601 5 for RF, 0.676 8 and 0.624 3 for KNN, 0.616 2 and 0.607 1 for NBC, and 0.636 4 and 0.622 5 for SVM.
    Conclusion  By introducing molecular descriptors as features, this study verified that molecular descriptors and fingerprints play a key role in classifying the cold and hot medicinal properties of CHMs. The GCN model achieved excellent classification performance, providing an important algorithmic basis for in-depth study of the “structure-property” relationship of CHMs.
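
    The abstract does not spell out the GCN propagation rule or how the metrics were computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the commonly used symmetrically normalized propagation step ReLU(D^{-1/2}(A + I)D^{-1/2} X W) and a binary F1 score with one property treated as the positive class. The toy graph, the edge rule, the feature vectors, and all function names are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each herb node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(a_norm, features, weights):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Accuracy and binary F1 (assumption: `positive` marks one property)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Hypothetical toy graph: herbs 0 and 1 are linked, herb 2 is isolated.
    adj = [[0, 1, 0],
           [1, 0, 0],
           [0, 0, 0]]
    # Hypothetical 2-bit node features standing in for MACCS-style keys.
    features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights = [[1.0, 0.0], [0.0, 1.0]]  # identity, for readability only
    print(gcn_layer(normalize_adjacency(adj), features, weights))
    print(accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

    In a real model the weight matrix would be learned and several such layers would be stacked before a classification layer; the sketch only shows how neighboring herb nodes share descriptor information in one step.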

     
