Internet Electronic Journal of Molecular Design - IEJMD, ISSN 1538-6414, CODEN IEJMAT
ABSTRACT - Internet Electron. J. Mol. Des. April 2002, Volume 1, Number 4, 203-218
Support Vector Machine Classification of the Carcinogenic Activity of Polycyclic
Aromatic Hydrocarbons
Ovidiu Ivanciuc
Internet Electron. J. Mol. Des. 2002, 1, 203-218
Abstract:
Structure-activity relationships (SAR) can be efficiently used to predict the carcinogenic
hazard of new chemicals, before producing them on a large scale or even before
synthesizing them. SAR models that detect potential carcinogens can also supplement
short-term tests of genotoxicity, long-term tests of carcinogenicity in rodents, or
epidemiological evidence in humans. The support vector machine (SVM) is an efficient
classification algorithm that can provide highly predictive SAR models for the
carcinogenic hazard. We have applied the SVM model to identify the carcinogenic
activity of 46 methylated and 32 non-methylated polycyclic aromatic hydrocarbons
(PAH). The PAH chemical structure was encoded by four theoretical descriptors
computed with PM3, namely the energy of the highest occupied molecular orbital EHOMO,
the energy of the lowest unoccupied molecular orbital ELUMO, the hardness HD, and the
difference between EHOMO and EHOMO-1 (the energy of the orbital just below the
HOMO). A wide range of SVM experiments were performed using the dot, polynomial,
radial basis function, neural, and anova kernels.
The results obtained for the classification of PAH carcinogenicity demonstrate that the
performance of the SVM depends strongly on the kernel type and on the parameters that
control the kernel shape. The best prediction results were obtained with the radial basis
function kernel with γ = 0.5, the anova kernel with γ = 0.5 and d = 1, and the anova
kernel with γ = 0.5 and d = 2. In the first case, 28 of the 34 carcinogenic compounds
and 40 of the 44 non-carcinogenic compounds were correctly classified.
SAR models for predicting the carcinogenic hazard can benefit from the use of
support vector machines, which determine a maximum-margin separating hyperplane
between carcinogenic and non-carcinogenic compounds. The solution of the SVM model is a
unique hyperplane which can be computed very quickly, but the classification results
depend heavily on the kernel type and structural descriptors. Extensive cross-validation
tests should be made to find the kernel with the optimum predictive power.
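
As an illustration of the quantities mentioned above, the following minimal Python sketch writes out the two best-performing kernel types and the classification rates implied by the reported counts (28 of 34 carcinogens, 40 of 44 non-carcinogens). The abstract does not state the kernel formulas, so the functional forms below are an assumption based on common SVM conventions, and the function names are hypothetical. The descriptor vectors would hold the four values EHOMO, ELUMO, HD, and the EHOMO/EHOMO-1 difference.

```python
import math


def rbf_kernel(x, y, gamma):
    # Assumed form: K(x, y) = exp(-gamma * ||x - y||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def anova_kernel(x, y, gamma, d):
    # Assumed form: K(x, y) = (sum_i exp(-gamma * (x_i - y_i)^2)) ** d
    return sum(math.exp(-gamma * (a - b) ** 2) for a, b in zip(x, y)) ** d


def classification_rates(tp, n_pos, tn, n_neg):
    # tp: correctly classified carcinogens out of n_pos carcinogens
    # tn: correctly classified non-carcinogens out of n_neg non-carcinogens
    return {
        "sensitivity": tp / n_pos,
        "specificity": tn / n_neg,
        "accuracy": (tp + tn) / (n_pos + n_neg),
    }


# Counts reported for the radial basis function kernel with gamma = 0.5.
rates = classification_rates(tp=28, n_pos=34, tn=40, n_neg=44)

if __name__ == "__main__":
    for name, value in rates.items():
        print(f"{name}: {value:.3f}")
```

With these counts, the sensitivity is 28/34 (about 0.82), the specificity is 40/44 (about 0.91), and the overall accuracy is 68/78 (about 0.87).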