2. kNN for (Q)SPR modeling ⚛️#
Goals of this exercise 🌟#
- We will learn how to construct a simple SPR model using kNN
- We will learn the importance of data normalization (scaling)
- We will review the concepts of training and test split and cross-validation
- We will review some of the performance metrics for assessing classification models
A quick reminder ✅#
Probably the simplest data-driven model that you can think of is k-nearest neighbours (kNN). It simply predicts future data as the average (or mode) of the “k” nearest neighbours of the queried point.
As simple as this idea might be, it works relatively well in various applications. One of them is the generation of (quantitative) structure–property relationship ((Q)SPR) models [SXG+03, YJQ+19]. Whether the word “Quantitative” is included or not depends on whether the model in question is a regression model or a classification model. Do you remember the difference?
The key question in kNN is: what do we consider a neighbour and what not? This tells us that we need to define some similarity or distance metric that allows us to distinguish neighbouring points from points that are far away.
Common distance metrics are based on mathematical norms. For example, the Euclidean distance between two points \(\mathbf{x}\) and \(\mathbf{y}\) in \(n\) dimensions is

\[
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\]
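To make the idea concrete, here is a minimal, purely illustrative NumPy sketch of a kNN classifier (later we will use scikit-learn’s implementation):
import numpy as np

def knn_predict(X_known, y_known, x_query, k=3):
    # Euclidean distance from the query point to every known point
    distances = np.sqrt(((X_known - x_query)**2).sum(axis=1))
    # Labels of the k nearest neighbours
    nearest_labels = y_known[np.argsort(distances)[:k]]
    # Majority vote (mode); for regression we would return the average instead
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]

X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_demo = np.array([0, 0, 1, 1])
knn_predict(X_demo, y_demo, np.array([0.95, 1.05]), k=3)   # the 3 nearest points vote -> 1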
Let’s exemplify the use of kNN by constructing an SPR model that predicts whether a molecule is mutagenic or not.
Mutagenicity is the property of substances to induce genetic mutation. It is one of the most important environmental, health and safety (EHS) properties to check when dealing with novel chemicals (e.g., drugs or solvents). In this case, we are going to use the data of mutagenicity on Salmonella typhimurium (Ames test). This dataset was collected by the Istituto di Ricerche Farmacologiche Mario Negri, merging experimental data from a benchmark dataset compiled by [HMS+09] from a collection of data made available by the Japan Health Ministry within their Ames (Q)SAR project.
Let’s first import some libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import numpy as np
%matplotlib inline
Get data 📚#
We have previously computed some molecular descriptors that will serve as input to our model. However, this is also an important step to consider when facing a problem like this: What are the important inputs to model mutagenicity? How do we know if these pre-computed features are enough for modeling mutagenicity? Can we generate relevant molecular features automatically? 🤔
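Regarding the last question: the columns in this file look like standard RDKit descriptors, so features like these can indeed be generated automatically from a SMILES string. A sketch, assuming a recent RDKit installation (not needed for the rest of this exercise):
from rdkit import Chem
from rdkit.Chem import Descriptors

# First molecule in the dataset (CAS 100-00-5)
mol = Chem.MolFromSmiles("O=[N+]([O-])c1ccc(cc1)Cl")

{
    "NumValenceElectrons": Descriptors.NumValenceElectrons(mol),
    "qed": Descriptors.qed(mol),
    "TPSA": Descriptors.TPSA(mol),
    "MolMR": Descriptors.MolMR(mol),
    "BalabanJ": Descriptors.BalabanJ(mol),
    "BertzCT": Descriptors.BertzCT(mol),
    "MolWt": Descriptors.MolWt(mol),
    "MolLogP": Descriptors.MolLogP(mol),
}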
Ok, let’s use pandas to import the data as a DataFrame…
if 'google.colab' in str(get_ipython()):
    df = pd.read_csv("https://raw.githubusercontent.com/edgarsmdn/MLCE_book/main/references/mutagenicity_kNN.csv")
else:
    df = pd.read_csv("references/mutagenicity_kNN.csv")
df
Unnamed: 0 | Id | CAS | SMILES | Status | Experimental value | Predicted value | NumValenceElectrons | qed | TPSA | MolMR | BalabanJ | BertzCT | MolWt | MolLogP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 100-00-5 | O=[N+]([O-])c1ccc(cc1)Cl | Training | 1 | 1 | 52 | 0.463602 | 43.14 | 38.1064 | 3.003401 | 244.429658 | 157.556 | 2.24820 |
1 | 1 | 2 | 100-01-6 | O=[N+]([O-])c1ccc(N)cc1 | Training | 1 | 1 | 52 | 0.359544 | 69.16 | 37.5088 | 3.003401 | 242.429658 | 138.126 | 1.17700 |
2 | 2 | 3 | 100-02-7 | O=[N+]([O-])c1ccc(O)cc1 | Training | 0 | 1 | 52 | 0.470728 | 63.37 | 34.7612 | 3.003401 | 241.674771 | 139.110 | 1.30040 |
3 | 3 | 4 | 100-11-8 | O=[N+]([O-])c1ccc(cc1)CBr | Training | 1 | 0 | 58 | 0.432586 | 43.14 | 45.7274 | 2.913802 | 257.648013 | 216.034 | 2.48970 |
4 | 4 | 5 | 100-12-9 | O=[N+]([O-])c1ccc(cc1)CC | Training | 0 | 0 | 58 | 0.479785 | 43.14 | 42.4744 | 2.913802 | 253.299498 | 151.165 | 2.15720 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5759 | 5759 | 5767 | 20395-16-8 | O=C1N(C(=O)N(C(=O)N1CC=C)CC2OC2)CC=C | Training | 1 | 0 | 102 | 0.485090 | 78.53 | 69.3560 | 2.668492 | 627.435628 | 265.269 | -1.05750 |
5760 | 5760 | 5768 | 34718-47-3 | O=C(C(Br)(Br)Br)Cl | Training | 1 | 1 | 42 | 0.495987 | 17.07 | 40.3720 | 3.791118 | 85.425922 | 315.186 | 2.59030 |
5761 | 5761 | 5769 | 43204-63-3 | N(CCBr)CCBr | Training | 1 | 1 | 44 | 0.568556 | 12.03 | 40.4577 | 2.447473 | 28.870765 | 230.931 | 1.36580 |
5762 | 5762 | 5770 | 52583-35-4 | N#Cc2cc(cc(c2(N=Nc1ccc(cc1(NC(=O)C))N(CCOC)CCO... | Training | 1 | 1 | 184 | 0.264581 | 185.59 | 125.3525 | 2.552977 | 1169.342047 | 485.457 | 3.84768 |
5763 | 5763 | 5771 | 188021-38-7 | O(c1c(cc(cc1Br)C)C(C)(C)C)CC=C | Training | 1 | 0 | 88 | 0.735392 | 9.23 | 73.2710 | 3.263016 | 388.457194 | 283.209 | 4.61982 |
5764 rows × 15 columns
The library pandas has many useful functions for data analytics. For example, we can print the type of the data we have…
df.dtypes
Unnamed: 0 int64
Id int64
CAS object
SMILES object
Status object
Experimental value int64
Predicted value object
NumValenceElectrons int64
qed float64
TPSA float64
MolMR float64
BalabanJ float64
BertzCT float64
MolWt float64
MolLogP float64
dtype: object
And have a look at the first rows of our data to see what it looks like
df.head()
Unnamed: 0 | Id | CAS | SMILES | Status | Experimental value | Predicted value | NumValenceElectrons | qed | TPSA | MolMR | BalabanJ | BertzCT | MolWt | MolLogP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 100-00-5 | O=[N+]([O-])c1ccc(cc1)Cl | Training | 1 | 1 | 52 | 0.463602 | 43.14 | 38.1064 | 3.003401 | 244.429658 | 157.556 | 2.2482 |
1 | 1 | 2 | 100-01-6 | O=[N+]([O-])c1ccc(N)cc1 | Training | 1 | 1 | 52 | 0.359544 | 69.16 | 37.5088 | 3.003401 | 242.429658 | 138.126 | 1.1770 |
2 | 2 | 3 | 100-02-7 | O=[N+]([O-])c1ccc(O)cc1 | Training | 0 | 1 | 52 | 0.470728 | 63.37 | 34.7612 | 3.003401 | 241.674771 | 139.110 | 1.3004 |
3 | 3 | 4 | 100-11-8 | O=[N+]([O-])c1ccc(cc1)CBr | Training | 1 | 0 | 58 | 0.432586 | 43.14 | 45.7274 | 2.913802 | 257.648013 | 216.034 | 2.4897 |
4 | 4 | 5 | 100-12-9 | O=[N+]([O-])c1ccc(cc1)CC | Training | 0 | 0 | 58 | 0.479785 | 43.14 | 42.4744 | 2.913802 | 253.299498 | 151.165 | 2.1572 |
and access rows by index
first_rows = df.iloc[:4]
first_rows
Unnamed: 0 | Id | CAS | SMILES | Status | Experimental value | Predicted value | NumValenceElectrons | qed | TPSA | MolMR | BalabanJ | BertzCT | MolWt | MolLogP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 100-00-5 | O=[N+]([O-])c1ccc(cc1)Cl | Training | 1 | 1 | 52 | 0.463602 | 43.14 | 38.1064 | 3.003401 | 244.429658 | 157.556 | 2.2482 |
1 | 1 | 2 | 100-01-6 | O=[N+]([O-])c1ccc(N)cc1 | Training | 1 | 1 | 52 | 0.359544 | 69.16 | 37.5088 | 3.003401 | 242.429658 | 138.126 | 1.1770 |
2 | 2 | 3 | 100-02-7 | O=[N+]([O-])c1ccc(O)cc1 | Training | 0 | 1 | 52 | 0.470728 | 63.37 | 34.7612 | 3.003401 | 241.674771 | 139.110 | 1.3004 |
3 | 3 | 4 | 100-11-8 | O=[N+]([O-])c1ccc(cc1)CBr | Training | 1 | 0 | 58 | 0.432586 | 43.14 | 45.7274 | 2.913802 | 257.648013 | 216.034 | 2.4897 |
We can access columns in the DataFrame by the column’s name
y_experimental = df['Experimental value']
y_experimental
0 1
1 1
2 0
3 1
4 0
..
5759 1
5760 1
5761 1
5762 1
5763 1
Name: Experimental value, Length: 5764, dtype: int64
If we would like to get the subset of data that is labeled as mutagenic (i.e., ‘Experimental value’ equal to 1), we could do it like this
mutagenic_data = df[df['Experimental value']==1]
mutagenic_data
Unnamed: 0 | Id | CAS | SMILES | Status | Experimental value | Predicted value | NumValenceElectrons | qed | TPSA | MolMR | BalabanJ | BertzCT | MolWt | MolLogP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 100-00-5 | O=[N+]([O-])c1ccc(cc1)Cl | Training | 1 | 1 | 52 | 0.463602 | 43.14 | 38.1064 | 3.003401 | 244.429658 | 157.556 | 2.24820 |
1 | 1 | 2 | 100-01-6 | O=[N+]([O-])c1ccc(N)cc1 | Training | 1 | 1 | 52 | 0.359544 | 69.16 | 37.5088 | 3.003401 | 242.429658 | 138.126 | 1.17700 |
3 | 3 | 4 | 100-11-8 | O=[N+]([O-])c1ccc(cc1)CBr | Training | 1 | 0 | 58 | 0.432586 | 43.14 | 45.7274 | 2.913802 | 257.648013 | 216.034 | 2.48970 |
6 | 6 | 7 | 100-13-0 | O=[N+]([O-])c1ccc(C=C)cc1 | Training | 1 | 0 | 56 | 0.477660 | 43.14 | 43.1874 | 3.000887 | 276.648462 | 149.149 | 2.23780 |
7 | 7 | 8 | 100-14-1 | O=[N+]([O-])c1ccc(cc1)CCl | Training | 1 | 1 | 58 | 0.389482 | 43.14 | 42.6534 | 2.913802 | 257.648013 | 171.583 | 2.33360 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5759 | 5759 | 5767 | 20395-16-8 | O=C1N(C(=O)N(C(=O)N1CC=C)CC2OC2)CC=C | Training | 1 | 0 | 102 | 0.485090 | 78.53 | 69.3560 | 2.668492 | 627.435628 | 265.269 | -1.05750 |
5760 | 5760 | 5768 | 34718-47-3 | O=C(C(Br)(Br)Br)Cl | Training | 1 | 1 | 42 | 0.495987 | 17.07 | 40.3720 | 3.791118 | 85.425922 | 315.186 | 2.59030 |
5761 | 5761 | 5769 | 43204-63-3 | N(CCBr)CCBr | Training | 1 | 1 | 44 | 0.568556 | 12.03 | 40.4577 | 2.447473 | 28.870765 | 230.931 | 1.36580 |
5762 | 5762 | 5770 | 52583-35-4 | N#Cc2cc(cc(c2(N=Nc1ccc(cc1(NC(=O)C))N(CCOC)CCO... | Training | 1 | 1 | 184 | 0.264581 | 185.59 | 125.3525 | 2.552977 | 1169.342047 | 485.457 | 3.84768 |
5763 | 5763 | 5771 | 188021-38-7 | O(c1c(cc(cc1Br)C)C(C)(C)C)CC=C | Training | 1 | 0 | 88 | 0.735392 | 9.23 | 73.2710 | 3.263016 | 388.457194 | 283.209 | 4.61982 |
3251 rows × 15 columns
Let’s have a look at the predictions by VEGA, which we’ll use to compare our results against:
df['Predicted value']
0 1
1 1
2 1
3 0
4 0
..
5759 0
5760 1
5761 1
5762 1
5763 0
Name: Predicted value, Length: 5764, dtype: object
We notice the data type says `object` even though it should be all `0` and `1`. Let’s try to find out why:
df['Predicted value'].drop_duplicates()
0 1
3 0
1844 Non Predicted
Name: Predicted value, dtype: object
df[ (df['Predicted value'] != "0") & (df['Predicted value'] != "1") ]
Unnamed: 0 | Id | CAS | SMILES | Status | Experimental value | Predicted value | NumValenceElectrons | qed | TPSA | MolMR | BalabanJ | BertzCT | MolWt | MolLogP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1844 | 1844 | 1846 | 16709-86-7 | C=C[Si](C)(C)CCl | Training | 0 | Non Predicted | 42 | 0.401439 | 0.00 | 38.399 | 3.575471 | 68.480406 | 134.682 | 2.19800 |
2194 | 2194 | 2197 | 2179-59-1 | C=CCSSCCC | Training | 0 | Non Predicted | 48 | 0.333839 | 0.00 | 45.404 | 2.616293 | 52.490225 | 148.296 | 2.96380 |
4118 | 4118 | 4125 | 624-92-0 | CSSC | Training | 0 | Non Predicted | 26 | 0.452840 | 0.00 | 27.030 | 1.974745 | 6.000000 | 94.204 | 1.62740 |
4180 | 4180 | 4187 | 6317-18-6 | N#CSCSC#N | Training | 0 | Non Predicted | 36 | 0.321595 | 47.58 | 31.275 | 2.768386 | 95.083765 | 130.197 | 1.37246 |
5633 | 5633 | 5641 | 7783-54-2 | FN(F)F | Training | 1 | Non Predicted | 26 | 0.383980 | 3.24 | 5.163 | 2.323790 | 8.000000 | 71.001 | 0.94190 |
5655 | 5655 | 5663 | 676-83-5 | CP(Cl)Cl | Training | 1 | Non Predicted | 26 | 0.426990 | 0.00 | 24.550 | 2.323790 | 10.754888 | 116.915 | 2.40570 |
For some inputs, the VEGA model outputs `Non Predicted`. As detailed in the documentation, it does this for particularly uncertain predictions. Let’s remove these rows from the dataset:
non_predicted = df[ (df['Predicted value'] != "0") & (df['Predicted value'] != "1") ].index
df_clean = df.drop(non_predicted)
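As an optional sanity check, we can confirm that only the six `Non Predicted` rows were dropped:
print(df.shape, '->', df_clean.shape)
df_clean['Predicted value'].drop_duplicates()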
Let’s now collect all the input features of our dataset and remove the unnecessary ones:
X = df_clean.drop(['Unnamed: 0', 'Id','CAS','SMILES','Status','Experimental value','Predicted value'],axis=1)
X
NumValenceElectrons | qed | TPSA | MolMR | BalabanJ | BertzCT | MolWt | MolLogP | |
---|---|---|---|---|---|---|---|---|
0 | 52 | 0.463602 | 43.14 | 38.1064 | 3.003401 | 244.429658 | 157.556 | 2.24820 |
1 | 52 | 0.359544 | 69.16 | 37.5088 | 3.003401 | 242.429658 | 138.126 | 1.17700 |
2 | 52 | 0.470728 | 63.37 | 34.7612 | 3.003401 | 241.674771 | 139.110 | 1.30040 |
3 | 58 | 0.432586 | 43.14 | 45.7274 | 2.913802 | 257.648013 | 216.034 | 2.48970 |
4 | 58 | 0.479785 | 43.14 | 42.4744 | 2.913802 | 253.299498 | 151.165 | 2.15720 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
5759 | 102 | 0.485090 | 78.53 | 69.3560 | 2.668492 | 627.435628 | 265.269 | -1.05750 |
5760 | 42 | 0.495987 | 17.07 | 40.3720 | 3.791118 | 85.425922 | 315.186 | 2.59030 |
5761 | 44 | 0.568556 | 12.03 | 40.4577 | 2.447473 | 28.870765 | 230.931 | 1.36580 |
5762 | 184 | 0.264581 | 185.59 | 125.3525 | 2.552977 | 1169.342047 | 485.457 | 3.84768 |
5763 | 88 | 0.735392 | 9.23 | 73.2710 | 3.263016 | 388.457194 | 283.209 | 4.61982 |
5758 rows × 8 columns
Exercise - manipulate a DataFrame ❗❗#
1. How many molecules in our dataset have a `qed` less than 0.5?
2. Which molecule has the largest molecular weight (`MolWt`)?
3. What is the average number of valence electrons (`NumValenceElectrons`) of the molecules in our dataset?
# Your code here
Training and test split ✂️#
Remember, our goal is to create a model that generalizes well to unseen data, not one that simply fits the seen data perfectly. To achieve this, when splitting the data we should ensure that the distribution of the test set is as close as possible to the expected distribution of future data. Hence, we usually want the training and test datasets to have similar distributions.
For example, let’s see how many molecules in our dataset are mutagenic vs. non-mutagenic.
y = df_clean['Experimental value'].to_numpy()
perc_mutagenic = y.sum()/len(y)*100
print('Percentage of mutagenic molecules :', perc_mutagenic)
print('Percentage of non-mutagenic molecules:', 100-perc_mutagenic)
Percentage of mutagenic molecules : 56.42584230635638
Percentage of non-mutagenic molecules: 43.57415769364362
In this case, the proportions of mutagenic and non-mutagenic data are fairly similar, so we will go ahead and split the data randomly. However, when you have a much more imbalanced dataset, how would you split the data better? You can look, for instance, at what stratified splitting is and why it is important.
In summary, when splitting your data, always think about the distribution of your splits!
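For reference, scikit-learn can do a stratified split directly: passing the labels to the stratify argument of train_test_split keeps the class proportions (roughly) equal in both splits. A minimal sketch (below we will simply use a random split, since our classes are fairly balanced):
from sklearn.model_selection import train_test_split

# Stratified split: class proportions in train/test mirror those in y
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)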
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Training points: ', len(y_train))
print('Test points:     ', len(y_test))
Training points: 4606
Test points:      1152
Exercise - splits distribution ❗❗#
Check the proportion of mutagenic molecules in your train and test sets. Was the random split good?
# Your code here
Feature scaling 📏#
It is always good practice to scale your data before starting to model. This helps the training process of many machine learning models, and it is especially true for kNN, which works on distances! The distance between two points is naturally affected by the scales of the input dimensions. Look, for example, at the Euclidean distance: if one dimension ranges from 0 to 10,000 and another ranges from 0 to 1, the former will potentially impact the distance value much more!
We do not want this to happen. Therefore, we need to scale all input features in order to give the same importance to all dimensions regardless of their original scale.
Here, we will use the method known as standardization: each feature \(x\) is transformed to \((x - \mu_x)/\sigma_x\), so that the data ends up with zero mean and unit variance.
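As a plain-NumPy sketch of what this transformation does (below we use scikit-learn’s StandardScaler, which additionally stores \(\mu_x\) and \(\sigma_x\) so that exactly the same transformation can later be applied to the test set):
import numpy as np

def standardize(X):
    # Column-wise standardization: subtract the mean and divide by the standard deviation
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma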
Of course, other scaling methods are available, and you might want to review them and check which one works better under which conditions.
from sklearn.preprocessing import StandardScaler
We initialize our scaler function
scaler = StandardScaler()
and fit it to our training data (i.e., compute the mean vector \(\mu_x\) and the standard deviation vector \(\sigma_x\)). Fitting the scaler only on the training set avoids leaking information from the test set into the scaling.
scaler.fit(X_train)
StandardScaler()
Now, let’s scale our data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_train
array([[ 6.91725031, -2.50154035, 6.01891181, ..., 3.43592554,
6.12598659, -1.84190261],
[-0.39909237, -1.36770143, 0.74694415, ..., -0.14444052,
-0.41184123, -0.84860786],
[ 0.34652217, 0.01697949, -0.31275131, ..., 0.18125274,
0.22156128, 0.73178752],
...,
[ 0.29992126, 2.01678401, 0.34399709, ..., 0.69463898,
0.66401444, -0.23869147],
[ 0.53292581, -0.04711992, -0.74958466, ..., 0.09406034,
0.34165461, 1.34128367],
[ 1.83775125, -2.00364986, 0.47608443, ..., 1.57463088,
6.0606014 , 2.62264517]])
Exercise - read documentation ❗❗#
What exactly are the mean and standard deviation vectors that we used to scale the data? Go to the sklearn documentation and learn how you can access them. Reading the documentation of a library is a super important skill to learn!
# Your code here
kNN model 🏘️#
from sklearn.neighbors import KNeighborsClassifier
We initialize the kNN model by specifying the parameter “k”. Later, we will review some ways to choose this parameter more systematically. For now, let’s set it to 3.
knn = KNeighborsClassifier(n_neighbors=3)
we train it
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=3)
we predict the test set
y_pred = knn.predict(X_test)
Let’s now evaluate our kNN model!
We will use several metrics for classification, for a quick reminder on them check the documentation.
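In short, with TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives:
- Accuracy \(= \frac{TP + TN}{TP + TN + FP + FN}\)
- Precision \(= \frac{TP}{TP + FP}\)
- Recall \(= \frac{TP}{TP + FN}\)
- F1 \(= 2\,\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}\)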
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
recall_score, f1_score, roc_auc_score, ConfusionMatrixDisplay)
Let’s first look at the confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
array([[327, 176],
[139, 510]], dtype=int64)
to see a prettier confusion matrix (in scikit-learn’s convention, rows correspond to the true labels and columns to the predicted labels)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x29162a1d308>
Now, let’s look at the other metrics
print('{:<10} {:<15}'.format('Accuracy:', accuracy_score(y_test, y_pred)))
print('{:<10} {:<15}'.format('Precision:', precision_score(y_test, y_pred)))
print('{:<10} {:<15}'.format('Recall:', recall_score(y_test, y_pred)))
print('{:<10} {:<15}'.format('F1:', f1_score(y_test, y_pred)))
Accuracy: 0.7265625
Precision: 0.7434402332361516
Recall: 0.785824345146379
F1: 0.7640449438202247
Comparison to VEGA model#
What do you think about the metrics? Is the kNN model performing well or poorly? Let’s compare it to the predictions of a kNN model trained on this same dataset and published as part of the VEGA platform. For their model, the VEGA developers chose k=4 with a similarity threshold of 0.7 (according to an internal similarity metric). Why is such a threshold important?
Also, the developers of the VEGA kNN model used the leave-one-out approach to assess the performance of their model. Why is this approach especially suitable when using the kNN algorithm?
VEGA model#
y_clean = df_clean['Experimental value'].astype(int)
y_vega_clean = df_clean['Predicted value'].astype(int)
cm = confusion_matrix(y_clean, y_vega_clean)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
print('{:<10} {:<15}'.format('Accuracy:', accuracy_score(y_clean, y_vega_clean)))
print('{:<10} {:<15}'.format('Precision:', precision_score(y_clean, y_vega_clean)))
print('{:<10} {:<15}'.format('Recall:', recall_score(y_clean, y_vega_clean)))
print('{:<10} {:<15}'.format('F1:', f1_score(y_clean, y_vega_clean)))
Accuracy: 0.7999305314345259
Precision: 0.8168631006346329
Recall: 0.8319482917820868
F1: 0.8243366880146386
Exercise - leave-one-out cross-validation ❗❗#
Can you now assess our previous kNN model configuration using the leave-one-out approach on the cleaned dataset?
from sklearn.model_selection import LeaveOneOut
X_clean = df_clean.drop(['Unnamed: 0', 'Id','CAS','SMILES','Status','Experimental value','Predicted value'],axis=1)
loo = LeaveOneOut()
print('Number of folds: ', loo.get_n_splits(X_clean))
Number of folds: 5758
y_pred_loo_our_model = np.zeros(y_clean.shape[0])
for i, (train_index, test_index) in enumerate(loo.split(X_clean)):
    # Get training data for the current fold
    X_train_loo = X_clean.iloc[train_index]
    y_train_loo = y_clean.iloc[train_index]

    # Get test data for the current fold
    X_test_loo = X_clean.iloc[test_index]

    # Train kNN
    # Your code here

    # Get prediction on the test molecule
    # Your code here

    # Store the prediction in `y_pred_loo_our_model`
    # Your code here
cm = confusion_matrix(y_clean, y_pred_loo_our_model)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
print('{:<10} {:<15}'.format('Accuracy:', accuracy_score(y_clean, y_pred_loo_our_model)))
print('{:<10} {:<15}'.format('Precision:', precision_score(y_clean, y_pred_loo_our_model)))
print('{:<10} {:<15}'.format('Recall:', recall_score(y_clean, y_pred_loo_our_model)))
print('{:<10} {:<15}'.format('F1:', f1_score(y_clean, y_pred_loo_our_model)))
Accuracy: 0.4357415769364363
Precision: 0.0
Recall: 0.0
F1: 0.0
C:\Users\DELL\Anaconda3\envs\ML4ChemEng\lib\site-packages\sklearn\metrics\_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Finding the best k in kNN 🔎#
What about the parameter k? How do we find the best k for our model? Why did the VEGA model developers use k=4? Let’s try to answer this…
k is a hyperparameter of the model and should be chosen using a validation set.
Important
Remember to reserve your test set exclusively for assessing your model! Never use it for training or hyperparameter tuning!
import numpy as np
from tqdm.notebook import tqdm
num_ks = np.arange(1, 100, 2).astype(int)
X_train_hyp, X_valid, y_train_hyp, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=0)
train_accuracy = []
valid_accuracy = []
for i in tqdm(range(len(num_ks))):
    knn = KNeighborsClassifier(n_neighbors=num_ks[i])
    knn.fit(X_train_hyp, y_train_hyp)

    pred_train = knn.predict(X_train_hyp)
    pred_valid = knn.predict(X_valid)

    train_accuracy.append(1-accuracy_score(y_train_hyp, pred_train))
    valid_accuracy.append(1-accuracy_score(y_valid, pred_valid))
plt.figure(figsize=(10,6))
plt.plot(num_ks, train_accuracy, 'bs--', label='Train')
plt.plot(num_ks, valid_accuracy, 'rx--', label='Validation')
plt.xlabel('k')
plt.ylabel('Misclassification rate')
plt.legend()
plt.show()
This graph is pretty similar to the one that we saw on slide 9 of Lecture 2. Here, we can see the expected general trend of the performance curves.
Which k do you think is the best?
Cross-validation#
- Problem: if the validation set is small and noisy, it might be misleading.
- Idea: increase the size of the validation set.
- Problem: this would reduce the size of the training set.
Then, let’s use all data for training and validation using k-fold cross-validation!
from sklearn.model_selection import cross_validate
num_ks = np.arange(1, 50, 1).astype(int)
train_misclassification = []
valid_misclassification = []
for i in tqdm(range(len(num_ks))):
    knn = KNeighborsClassifier(n_neighbors=num_ks[i])

    cv_dict = cross_validate(knn, X_train, y_train, cv=10,
                             scoring='accuracy', return_train_score=True)

    k_fold_train_scored = cv_dict['train_score']
    k_fold_valid_scored = cv_dict['test_score']

    train_misclassification.append(1-k_fold_train_scored.mean())
    valid_misclassification.append(1-k_fold_valid_scored.mean())
plt.figure(figsize=(10,6))
plt.plot(num_ks, train_misclassification, 'bs--', label='Train')
plt.plot(num_ks, valid_misclassification, 'rx--', label='Validation')
plt.xlabel('k')
plt.ylabel('Misclassification rate')
plt.legend()
plt.show()
print('k with minimum validation misclassification: ', num_ks[np.argmin(valid_misclassification)])
k with minimum validation misclassification: 17
plt.figure(figsize=(10,6))
plt.plot(num_ks, valid_misclassification, 'rx--', label='Validation')
plt.yscale('log')
plt.xlabel('k')
plt.ylabel('Misclassification rate')
plt.legend()
plt.show()
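For example, we could now refit the model on the whole training set using the k with the lowest validation misclassification and evaluate it once on the held-out test set (a short sketch using the variables defined above):
best_k = num_ks[np.argmin(valid_misclassification)]

knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train, y_train)

y_pred_best = knn_best.predict(X_test)
print('Test accuracy with k =', best_k, ':', accuracy_score(y_test, y_pred_best))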
You can now go back to your implementation of leave-one-out and train your kNN model using the best k that you found. How does it perform compared to the VEGA kNN? Did you get a better model? Why?
Challenge - kNN QSPR for predicting BCF 🥇#
Develop a kNN model for regression to predict the bioconcentration factor (BCF).
if 'google.colab' in str(get_ipython()):
    df_bcf = pd.read_csv("https://raw.githubusercontent.com/edgarsmdn/MLCE_book/main/references/BCF_training.csv")
else:
    df_bcf = pd.read_csv("references/BCF_training.csv")
# Your code here
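As a hint to get started: scikit-learn provides KNeighborsRegressor, which has the same interface as the classifier used above but predicts the average of the k nearest targets. A minimal sketch on synthetic data (the challenge itself still requires the full pipeline: selecting features, scaling, splitting and validating):
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Toy 1D regression problem, purely illustrative
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = np.sin(X_toy).ravel() + 0.1*rng.normal(size=50)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_toy, y_toy)
knn_reg.predict([[2.5]])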
References#
- HMS+09
Katja Hansen, Sebastian Mika, Timon Schroeter, Andreas Sutter, Antonius Ter Laak, Thomas Steger-Hartmann, Nikolaus Heinrich, and Klaus-Robert Müller. Benchmark data set for in silico prediction of Ames mutagenicity. Journal of chemical information and modeling, 49(9):2077–2081, 2009.
- Mur22
Kevin P Murphy. Probabilistic machine learning: an introduction. MIT press, 2022.
- SXG+03
Min Shen, Yunde Xiao, Alexander Golbraikh, Vijay K Gombar, and Alexander Tropsha. Development and validation of k-nearest-neighbor qspr models of metabolic stability of drug candidates. Journal of medicinal chemistry, 46(14):3013–3020, 2003.
- YJQ+19
Shuai Yuan, Zeren Jiao, Noor Quddus, Joseph Sang-Il Kwon, and Chad V Mashuga. Developing quantitative structure–property relationship models to predict the upper flammability limit using machine learning. Industrial & Engineering Chemistry Research, 58(8):3531–3537, 2019.