Preparation of the chemicals and AD-related proteins
We used the BindingDB database as the source of our training data. BindingDB is a public, web-accessible database of measured binding affinities,
focusing chiefly on the interactions of proteins considered to be candidate drug targets with ligands that are small, drug-like molecules.
For AD-related proteins, activity data were filtered to keep only activity end-points reported as half-maximal inhibitory concentration (IC50),
half-maximal effective concentration (EC50) or Ki values. To ensure that enough molecules were available for model building,
we retained only those targets with more than 200 biological activity data points. Following this procedure, 109061 compounds associated with
161 AD-related proteins remained, with 115257 activity end-points, which were used for model building.
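The filtering step above can be sketched roughly as follows. This is only an illustration: the record layout `(target_id, compound_id, activity_type, value_in_uM)` and the function name `filter_activities` are hypothetical and do not reflect the actual BindingDB export format.

```python
from collections import defaultdict

# Hypothetical record layout: (target_id, compound_id, activity_type, value_in_uM).
KEPT_TYPES = {"IC50", "EC50", "Ki"}
MIN_RECORDS = 200  # the text keeps targets with more than 200 activity data points


def filter_activities(records, min_records=MIN_RECORDS):
    """Keep IC50/EC50/Ki records, then keep targets with more than min_records."""
    by_target = defaultdict(list)
    for target, compound, kind, value in records:
        if kind in KEPT_TYPES:
            by_target[target].append((compound, kind, value))
    # Drop targets that do not have enough data for model building.
    return {t: rows for t, rows in by_target.items() if len(rows) > min_records}
```

With a lowered threshold for demonstration, `filter_activities(records, min_records=1)` would drop any target that has only a single IC50, EC50 or Ki record.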
Preparation of the positive and negative sets
For compounds with more than one activity value, we took the mean of those values as the final activity value.
A compound was considered active when its mean activity value was below 10 μM; compounds with values above 10 μM were considered inactive.
After this split, some AD-related proteins may have very few negative samples. To balance the numbers of positive
and negative samples for each such protein, we randomly selected compounds from other AD-related proteins
to serve as additional negative samples. The number of these selected negative samples together with the inactive samples
was kept approximately equal to the number of active samples for that protein.
The resulting positive and negative sets were used for subsequent model building.
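A minimal sketch of the labelling and balancing procedure is given below. The function names and input layout are hypothetical. Two details are assumptions not stated in the text: a mean of exactly 10 μM is treated as inactive, and compounds already measured against the target are excluded from the pool of borrowed negatives.

```python
import random
from collections import defaultdict
from statistics import mean

ACTIVE_CUTOFF_UM = 10.0  # threshold from the text, in micromolar


def label_compounds(measurements, cutoff=ACTIVE_CUTOFF_UM):
    """Average repeated measurements per compound and label each compound.

    measurements: iterable of (compound_id, value_in_uM) for one target.
    Returns {compound_id: True if active, False if inactive}.
    Assumption: a mean exactly equal to the cutoff counts as inactive.
    """
    values = defaultdict(list)
    for compound, value in measurements:
        values[compound].append(value)
    return {c: mean(v) < cutoff for c, v in values.items()}


def balance_negatives(labels, other_target_compounds, seed=0):
    """Top up the negative set with compounds drawn from other targets."""
    actives = {c for c, is_active in labels.items() if is_active}
    inactives = {c for c, is_active in labels.items() if not is_active}
    shortfall = len(actives) - len(inactives)
    # Assumption: candidates must not already be measured against this target.
    pool = sorted(set(other_target_compounds) - set(labels))
    rng = random.Random(seed)
    decoys = rng.sample(pool, min(max(shortfall, 0), len(pool)))
    return actives, inactives | set(decoys)
```

If a target already has at least as many inactive as active compounds, no compounds are borrowed and the measured inactives are returned unchanged.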
Model training and validation
A series of high-confidence QSAR models was built for the 161 proteins using the Naïve Bayes method with different fingerprint representations. Naïve Bayes was chosen for
predicting associations between AD-related proteins and chemicals because it offers both good performance
on noisy data sets and fast computation. To obtain the best model performance,
we compared 6 types of molecular fingerprints when building the prediction models: FP2, MACCS, Daylight,
ECFP2-2048, ECFP4-2048, and ECFP6-2048. To further improve predictive ability, we also combined all fingerprint models into an ensemble by averaging their outputs.
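To illustrate the idea of one Naïve Bayes model per fingerprint type with averaged outputs, here is a small self-contained sketch. The text does not say which Naïve Bayes variant was used, so the Bernoulli form with Laplace smoothing shown here is an assumption. Fingerprints are represented as sets of "on" bit indices, which is also a hypothetical choice; generating FP2, MACCS or ECFP bits would need a cheminformatics toolkit and is not shown.

```python
import math


class BernoulliNB:
    """Bernoulli Naive Bayes over fixed-length binary fingerprints (illustrative)."""

    def __init__(self, n_bits, alpha=1.0):
        self.n_bits = n_bits
        self.alpha = alpha  # Laplace smoothing strength (an assumption)

    def fit(self, fingerprints, labels):
        # fingerprints: list of sets of "on" bit indices; labels: list of 0/1.
        # Assumes both classes occur at least once in the training data.
        self.log_prior, self.log_on, self.log_off = {}, {}, {}
        for cls in (0, 1):
            rows = [fp for fp, y in zip(fingerprints, labels) if y == cls]
            self.log_prior[cls] = math.log(len(rows) / len(labels))
            counts = [0] * self.n_bits
            for fp in rows:
                for bit in fp:
                    counts[bit] += 1
            denom = len(rows) + 2 * self.alpha
            p_on = [(c + self.alpha) / denom for c in counts]
            self.log_on[cls] = [math.log(p) for p in p_on]
            self.log_off[cls] = [math.log(1.0 - p) for p in p_on]
        return self

    def predict_proba(self, fp):
        """Return P(active | fingerprint)."""
        score = {}
        for cls in (0, 1):
            s = self.log_prior[cls]
            for bit in range(self.n_bits):
                s += self.log_on[cls][bit] if bit in fp else self.log_off[cls][bit]
            score[cls] = s
        # Normalise in log space to avoid numerical underflow.
        top = max(score.values())
        exp0, exp1 = math.exp(score[0] - top), math.exp(score[1] - top)
        return exp1 / (exp0 + exp1)


def ensemble_proba(models, fingerprints_by_type):
    """Average P(active) over models keyed by fingerprint type."""
    probs = [models[name].predict_proba(fingerprints_by_type[name]) for name in models]
    return sum(probs) / len(probs)
```

In use, `models` would hold one fitted model per fingerprint type (for example keys such as `"MACCS"` or `"ECFP4-2048"`), and `fingerprints_by_type` would hold the same compound encoded with each fingerprint.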
For each model, we applied five-fold cross-validation and external validation to evaluate prediction performance. In five-fold cross-validation,
the data set is first split into 5 roughly equal-sized parts; the model is then fitted to four parts and the error rate is calculated on the remaining part.
The process is repeated 5 times so that every part serves once as the validation set. To assess the stability of the models, we repeated the cross-validation procedure
10 times and report the standard deviation of each statistic. For external validation, the data were split into two parts: compounds were clustered
and each cluster was assigned a number. Clusters with an odd number were assigned to the test set, and clusters with an even number were assigned to the training set.
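The two splitting schemes described above might be sketched as follows. The function names are hypothetical, and the clustering method is not specified in the text, so `cluster_ids` is assumed to be supplied by some earlier clustering step.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for one shuffled k-fold pass."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal-sized parts
    for i, valid in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, valid


def repeated_k_fold(n_samples, k=5, repeats=10):
    """Repeat k-fold cross-validation with a different shuffle each time."""
    for r in range(repeats):
        yield from k_fold_indices(n_samples, k=k, seed=r)


def split_by_cluster_parity(cluster_ids):
    """Even cluster numbers go to the training set, odd ones to the test set."""
    train = [i for i, c in enumerate(cluster_ids) if c % 2 == 0]
    test = [i for i, c in enumerate(cluster_ids) if c % 2 == 1]
    return train, test
```

With the defaults, `repeated_k_fold` yields 50 train/validation pairs (5 folds times 10 repeats), and the spread of a statistic across the 10 repeats can then be summarised with, for example, `statistics.stdev`.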
Models were built with the training set, and the test set was scored. Finally, a model was built with all data and scored against itself; the training set and the whole set
should give similar validation statistics. Performance statistics were reported for each model, including those commonly used for classification: accuracy,
sensitivity, specificity, AUC, and F-score. The AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate.
It provides an overall score without requiring a cut-off for distinguishing active from inactive compounds,
and indicates the ability of the model to prioritize active compounds over inactive ones.
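The reported statistics can be computed along the following lines. This is a sketch under two assumptions: the F-score is taken to be the F1 score (the text does not specify the variant), and the AUC is computed through its rank interpretation, that is, the probability that a randomly chosen active compound scores higher than a randomly chosen inactive one, with ties counted as one half. The function names are hypothetical.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_stats(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = _ratio(tp, tp + fn)  # true positive rate
    specificity = _ratio(tn, tn + fp)  # 1 - false positive rate
    precision = _ratio(tp, tp + fp)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


def roc_auc(y_true, scores):
    """AUC as P(score of a random active > score of a random inactive)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))
```

Note that `classification_stats` needs thresholded predictions, whereas `roc_auc` works directly on the continuous model scores, which is why no cut-off is required for the AUC. The pairwise loop is quadratic in the number of compounds and is meant for illustration only.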
Copyright @ 2012-2014 Computational Biology & Drug Design Group,
School of Pharmaceutical Sciences, Central South University.
All rights reserved.