However, all these trees suffered in their ability to generalize to novel kinases, with the P2kin for the various descriptions ranging only from 0.30 to 0.43.

Distribution of prediction errors in SVM, PLS and k-NN models

The performance of the SVM, PLS, and k-NN models exploiting z-scale descriptors of aligned sequences is further illustrated in Figure 3. The figure presents histograms of the prediction errors calculated in the outer loop of cross-validation for the 1/5 of the kinases that had been entirely excluded from the modelling. The distributions of errors in the SVM and PLS models are very similar. The cumulative plot demonstrates that in the SVM model the difference between predicted and observed pKd values lies in the range 0 to 0.25 logarithmic units for 57% of the kinase-inhibitor combinations.
For 75% of the combinations the differences fall below 0.5 logarithmic units, for 89% they are less than one logarithmic unit, and for 99% less than two logarithmic units. The corresponding fractions in the PLS model are 49%, 70%, 88%, and 98%. To interpret these results one should keep in mind that the total span of kinase-inhibitor activities exceeded five logarithmic units, namely from pKd 5 to 10.62, and that all non-interacting entities were assigned the numerical value pKd 4; hence mispredictions by more than six units are theoretically possible. For the k-NN model the pattern of error distribution is quite different. Here the prediction error was zero for more than one half of the non-interacting pairs.
However, 14% of the prediction errors exceed one logarithmic unit and 4% exceed two logarithmic units, indicating that the predictions of the k-NN model are less accurate than those obtained by SVM and PLS. In other words, activities for inhibitors interacting with overall quite similar kinases may vary a lot, and regression models can explain this better than the nearest-neighbour approach.

Dependence of modelling performance on the size of the dataset

Although the SVM, PLS, and k-NN models all showed good predictive ability, they were based on more than 12,000 data points. It is thus of obvious interest to know how robust the proteochemometric approach is when less data are available. We therefore assessed the relationship between the sparseness of the data matrix used and the performance of the model. To this end we created models using 60, 40, 20, and 10 percent of all data.
For example, when 10% of the data was used to calculate the P2kin value, the set of 317 kinases was randomly split into ten partitions of about equal size. Modelling was then performed using only one of these partitions at a time, and the nine remaining partitions were used to evaluate the model obtained. The procedure of splitting the dataset was iterated ten times in order to assure reproducibility of the results.
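
The 10% splitting scheme described above can be sketched in a few lines of Python. This is only an illustration of the partitioning logic, not the code used in the study: the function name `inverted_partition_splits`, the placeholder kinase identifiers, and the fixed random seed are all assumptions. It also assumes that the partitions were redrawn in each of the ten iterations, and it does not cover model fitting or the P2kin calculation.

```python
import random


def inverted_partition_splits(kinase_ids, n_partitions=10, n_repeats=10, seed=0):
    """Yield (repeat, train_ids, test_ids) tuples.

    In each repeat the kinases are shuffled and cut into n_partitions
    groups of about equal size. Each group in turn is used alone for
    training, and all remaining groups together form the test set.
    """
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    for repeat in range(n_repeats):
        shuffled = list(kinase_ids)
        rng.shuffle(shuffled)
        # Striding yields partitions whose sizes differ by at most one.
        partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]
        for i, train_ids in enumerate(partitions):
            test_ids = [
                kinase
                for j, part in enumerate(partitions)
                if j != i
                for kinase in part
            ]
            yield repeat, train_ids, test_ids


# Placeholder identifiers standing in for the 317 kinases.
kinases = [f"kinase_{i:03d}" for i in range(317)]
splits = list(inverted_partition_splits(kinases))
```

With ten partitions and ten repeats this gives 100 train/test pairs, each training on 31 or 32 kinases and evaluating on the rest. Note that the roles of the partitions are inverted relative to ordinary ten-fold cross-validation, where nine partitions would be used for training and one for testing.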