Internet Electronic Journal of Molecular Design - IEJMD, ISSN 1538-6414, CODEN IEJMAT
ABSTRACT - Internet Electron. J. Mol. Des. December 2005, Volume 4, Number 12, 882-910
Highly Correlating Distance-Connectivity Based Topological Indices 3:
PCR and PC-ANN Based Prediction of the Octanol-Water Partition
Coefficient of Diverse Organic Molecules
Mojtaba Shamsipur, Raoof Ghavami, Bahram Hemmateenejad, and Hashem Sharghi
Internet Electron. J. Mol. Des. 2005, 4, 882-910
Abstract:
Recently, we proposed some new topological indices (Shamsipur
indices) based on the distance sum and connectivity of a molecular
graph for use in QSAR/QSPR studies. The aim of this study is to
examine the ability of the proposed Sh indices in a QSPR study of
the n-octanol/water partition coefficients (logP) of a diverse set of
organic compounds by means of principal component regression
(PCR) and principal component-artificial neural network (PC-ANN)
modeling methods combined with two factor selection
procedures, eigenvalue ranking (EV) and correlation
ranking (CR). Experimental values for the partition coefficient
ranging from -0.66 (methanol) to 8.16 (2,2',3,3',4,5,5',6,6'-PCB)
were collected from the literature for 379 organic compounds
with a wide variety of functional groups containing C, H, N, O,
and all halogens. Ten different Sh indices (Sh1 through Sh10) were
calculated for each molecule from different combinations of the
connectivity and distance sum vectors. The Sh topological
descriptor data matrix was subjected to principal component
analysis to reduce the dimensionality of the data set, and the
most significant factors or principal components (PCs) were
extracted. Both linear and nonlinear modeling methods were
employed for predicting the logP of an extensive set of organic
compounds including several structurally diverse groups of
compounds (alkanes, alkenes, alkynes, cycloalkanes, cycloalkenes,
aliphatic alcohols, ethers, esters, aldehydes, ketones, carboxylic
acids, amines, aromatic hydrocarbons, halogenated hydrocarbons
and some polychlorinated biphenyls (PCBs)). Principal component
regression and PC-ANN were used as linear and nonlinear
modeling methods, respectively. Principal component analysis of
the Sh data matrix showed that seven PCs could explain
99.97% of the variance in the Sh data matrix. The extracted PCs were
used as the predictor variables (inputs) for the PCR and ANN (PC-ANN)
models. The ANN model could explain 97.98% of the variance
in the logP data, while the value obtained from the PCR procedure
was 80.76%. In addition, linear (MLR) and nonlinear (MLR-ANN)
modeling with the original Sh indices was performed for
comparison. The respective squared correlation coefficients of
prediction obtained by the MLR, PCR, MLR-ANN and PC-ANN models
are 0.7431, 0.7857, 0.9377 and 0.9626, and the respective
standard errors are 0.783, 0.689, 0.361, and 0.281. Some newly
proposed topological indices (Sh indices) have been applied to
predict the partition coefficient of a large set of organic compounds.
The results of this project showed that factor selection by
correlation ranking gives superior results relative to those obtained
by eigenvalue ranking. PCR analysis of the data showed that the
proposed Sh indices could explain about 80% of the variation in the
logP data, while the variation explained by the ANN modeling
was more than 96%. These results confirm the suitability of the
indices in QSPR analysis of the lipophilicity data. The Sh indices
were calculated in a simple and fast manner and, in comparison
with some previously reported QSPR models, produced better results.
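
The difference between the two factor selection procedures can be
illustrated with a minimal sketch. It is not the authors' implementation:
the function names, the toy PC scores, and the toy response values are all
hypothetical, and the scores are assumed to be mean-centred and mutually
orthogonal, as PCA scores are. Eigenvalue ranking orders the PCs by the
variance they carry in the descriptor matrix; correlation ranking orders
them by their squared correlation with the response.

```python
# Sketch of eigenvalue ranking (EV) vs. correlation ranking (CR) for PCR.
# All numbers are made-up toy data, NOT values from the paper.

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def r_squared(u, v):
    """Squared Pearson correlation between two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv * suv / (suu * svv)

def ev_ranking(scores):
    """PC indices ordered by decreasing score variance (eigenvalue)."""
    return sorted(range(len(scores)),
                  key=lambda k: variance(scores[k]), reverse=True)

def cr_ranking(scores, y):
    """PC indices ordered by decreasing squared correlation with y."""
    return sorted(range(len(scores)),
                  key=lambda k: r_squared(scores[k], y), reverse=True)

def pcr_r2(scores, y, selected):
    """r^2 of a least-squares fit of y on the selected PCs.

    Assumes centred, mutually orthogonal score vectors, so each
    regression coefficient can be computed independently.
    """
    my = mean(y)
    pred = [my] * len(y)
    for k in selected:
        t = scores[k]
        b = sum(ti * (yi - my) for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
        pred = [p + b * ti for p, ti in zip(pred, t)]
    return r_squared(pred, y)

# Three hypothetical PCs (one list per PC), largest variance first.
scores = [
    [3.0, 3.0, -3.0, -3.0],
    [2.0, -2.0, 2.0, -2.0],
    [1.0, -1.0, -1.0, 1.0],
]
# Hypothetical response that depends mostly on the smallest-variance PC.
y = [3.3, 0.9, 1.1, 2.7]

ev_order = ev_ranking(scores)     # ordered by variance
cr_order = cr_ranking(scores, y)  # ordered by correlation with y

print("EV order:", ev_order, "r2 with first PC:", pcr_r2(scores, y, ev_order[:1]))
print("CR order:", cr_order, "r2 with first PC:", pcr_r2(scores, y, cr_order[:1]))
```

In this toy case the PC with the smallest variance carries most of the
information about the response, so a one-factor model chosen by correlation
ranking fits far better than one chosen by eigenvalue ranking.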