Mol2DSimi

Published:

Mol2DSimi

  • Calculate molecular fingerprints (12 types) and similarity based on reference compounds
  • Automated validation with several types of metrics: AUC, EF1%, EF5%, EF10%, F1 score, GH
  • Statistical test for decision making: which type of fingerprint is suitable or reference compound that enhance performance
  • Ensemble learning: stacking technique

Tanimoto similarity

\[T _{c}(A,B) = \frac{c}{a+b-c}\]
  • a: number of features present in molecule A
  • b: number of features present in molecule B
  • c: number of features shared by molecules A and B

Requirements

This module requires the following modules:

Installation

Clone this repository to use

Folder segmentation

Finally the folder structure should look like this:

Mol2DSimi (project root)
|__  README.md
|__  Mol2DSimi
|__  |__ Similarity.py
|    |__ Simivalid.py
|    |__ enrichment_factor.py
|    |__ significantplot.py
|__  Image (saved images)
|__  Mol2DSimi.ipynb 
|__  LICENSE
|......

Usage

import math
import numpy as np
import pandas as pd
from rdkit import Chem
import sys
sys.path.append('Mol2DSimi')
from Similarity import similarity_calculate
from enrichment_factor import Enrichment_Factor
from validation import  similarity_validation
from tqdm import tqdm # progress bar
tqdm.pandas()




# 1. Similarity Calculation
simi = similarity_calculate(data = data, query= query, smile_col="CanonSmiles", active_col='Active')
simi.fit()
simi.plot()

# 2. Valudation
valid = similarity_validation(data, active_col = 'Active', scores = 'tanimoto',plot_type = 'roc', figsize = (14,10), query =i )
valid.validation()
valid.visualize()

# 3. Compare fingerprints - automated pipeline
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import RepeatedStratifiedKFold, train_test_split
path = './Similarity/Data/Raw_data'
cv = RepeatedStratifiedKFold(n_repeats = 3, n_splits=10, random_state=42)
list_AUC = []


# query is a list of molecules format: CMF_019, AMG_986, BMS_986224 
for i in query:
    data = pd.read_csv(path+f'/{i.GetProp("_Name")}.csv')
    data = data[col]
    data_train, data_test = train_test_split(data, test_size=0.2, random_state=42, stratify = data.Active)
    for train_index, test_index in cv.split(data_train.drop(['Active'], axis =1), data_train['Active']):
        list_auc = []
        model = []
        test = data_train.iloc[test_index,:]
        for i in col[1:]:
            model.append(i)
            fpr, tpr, _ = roc_curve(test['Active'], test[i])
            roc_auc = round(auc(fpr, tpr),3)
            list_auc.append(roc_auc)
        list_AUC.append(list_auc)
AUC = pd.DataFrame(list_AUC, columns = model)
AMG_986 = AUC.iloc[30:60,:].reset_index(drop=True)
# post hoc
df_melt = pd.melt(AMG_986.reset_index(), id_vars=['index'], value_vars=AUC.columns)
df_melt.columns = ['index', 'Model', 'AUC']
pc =sp.posthoc_wilcoxon(df_melt, val_col='AUC', group_col='Model', p_adjust='holm')
plt.figure(figsize = (14,8))
plt.title("AUC-Wilcoxon - AMG_986", fontsize = 24, weight = 'semibold')
heatmap_args = {'linewidths': 0.25, 'linecolor': '0.5', 'clip_on': False, 'square': True, 'cbar_ax_bbox': [0.80, 0.35, 0.04, 0.3]}
sign_plot(pc, **heatmap_args)

Contributing

Please visit the Mol2DSimi repository. Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. Please make sure to update tests as appropriate.