Study on the effect of molecular fingerprints on the performance of machine learning model predicting HIV integrase bioactivity

Published in Vietnamese Pharmacy Journal, 2023

Recommended citation: Tieu Long Phan, Thanh An Pham, Bao Vy Doan Ngoc, Hoang Son Le Lai, Ngoc Tuyen Truong (2023). Study on the effect of molecular fingerprints on the performance of machine learning model predicting HIV integrase bioactivity. Vietnamese Pharmacy Journal (Accepted).

Summary: HIV-1 has caused a severe global pandemic since the 1980s by attacking the host immune system. Left untreated, HIV-1 infection can eventually lead to AIDS (acquired immunodeficiency syndrome), in which opportunistic infections make death inevitable. Discovering new HIV-1 antiviral drugs is therefore urgent. One of the most prominent approaches in Computer-Aided Drug Discovery is the construction of Quantitative Structure-Activity Relationship (QSAR) models, especially with the rise of Artificial Intelligence, including machine learning and deep learning. When machine learning is applied to build QSAR models, the quality of the dataset is a primary criterion for obtaining a well-performing binary predictor. Combining molecular fingerprints can optimize machine learning models, improve prediction accuracy, and save time and computational resources. In this research, 3110 HIV integrase inhibitors were extracted from the ChEMBL library and encoded with seven types of molecular fingerprints, producing seven datasets. A data mining workflow was then performed to create a training set and an external validation set. Several machine learning models were trained to determine which molecular fingerprint had the greatest impact on model performance. Once the best dataset and a suitable algorithm had been identified, the model was optimized and its generalization capability was evaluated on the external validation set. RDKit fingerprints were found to have a strong influence on model performance. After hyperparameter tuning, the XGBoost algorithm achieved an external validation precision of 0.91, an F1 score of 0.83, and a recall of 0.814. Other metrics, such as an area under the ROC curve of 0.96 and an accuracy of 0.93, were also promising. These results indicate that the model generalizes well and is reliable for virtual screening of potential HIV-1 integrase inhibitors.

Keywords: QSAR, molecular fingerprint, machine learning, HIV integrase, data-centric, average precision.
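For readers less familiar with the metrics quoted in the summary, the following is a minimal sketch of how precision, recall, F1 score, and accuracy are defined for a binary active/inactive classifier. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the study, whose actual evaluation code is not shown here.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy for 0/1 labels (1 = active)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical toy labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision measures how many predicted actives are truly active, while recall measures how many true actives are recovered. Both matter in virtual screening, where false positives waste experimental effort and false negatives discard potential inhibitors.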
