Novel consensus quantitative structure-retention relationship method in prediction of pesticides retention time in nano-LC

Document Type : Research Paper


Chemometrics Laboratory, Faculty of Chemistry, University of Mazandaran, Babolsar, Iran


In this study, quantitative structure-retention relationship (QSRR) methodology employed for modeling of the retention times of 16 banned pesticides in nano-liquid chromatography (nano-LC) column. Genetic algorithm-multiple linear regression (GA-MLR) method employed for developing global and consensus QSRR models. The best global GA-MLR model was established by adjusting GA parameters. Three descriptors of SpMax2_Bhp, Mor31u and, MATS6c appeared in this model. Consensus QSRR models developed as an average consensus model (ACM) and weighted consensus model (WCM) by a combination of a subset of the GA-MLR models. Comparison of statistical parameters of developed models indicated that an ACM which is combining of the best global QSRR model with four-descriptor sub-model can be selected as the best consensus QSRR model. CrippenLogP, RDF070m, Lop, and HASA1 descriptors appeared in four-descriptor sub-model. In ACM, the square of correlation coefficients (R2) was 0.973 and 0.939, and the SE was 0.49 and 0.40, for the training and test sets, respectively. The ACM was assessed by leave one out cross-validation ("Q2 cv" = 0.935) as well as internal validation. Descriptors which appeared in this model suggest electrostatic, steric and hydrophobic interactions play the main role in the chromatographic retention of studied pesticides in nano-LC conditions.



Pesticides are considered a vital component of modern farming playing a major role in food security and public health [1-3]. Food security is defined as access of all people to adequate and safe food [4]. Pesticides enhanced the quantity and quality of products that respectively lead to a high agricultural productivity and improving the product health [3]. Also, these chemicals directly improve the public health by mitigating some pathogens such as malaria [3, 5], lice [6], flea and tick [7], and Zika virus [3]. Despite their benefits, these compounds could pose unintended risks to the environment and human health [1]. The presence of pesticide residues in water, herbal plants, and grocery has raised serious concerns [8-10], for example, a significant increase in their concentration in the food sample is associated with cancer [11]. Thus, identification and quantification of pesticide residues in food chain are substantial for inspection of the potential health risk. Nowadays, gas chromatography-mass spectrometry (GC–MS) with electron impact (EI) ionization and liquid chromatography-tandem mass spectrometers (LC–MS / MS) combined with electrospray ionization (ESI) technique are frequently applied as multi‐residue methods for pesticides analysis [12, 13]. Certainly, for analysis of banned pesticides especially in a susceptible instance, a more sensitive technique is requisite.

Nano-liquid chromatography (nano-LC) is a miniaturized high-performance liquid chromato-graphy (HPLC) technique introduced as the heart of this gain in sensitivity [14]. Miniaturization is one of the actual trends in science and technology that appears in the separation techniques as microchip devices, nano-capillary electrophoresis and nano-LC [15]. The first tendency for miniaturization of chromatography systems was revealed as nanometer-sized silica spheres stationary phase in GC [16]. Gradually, demands of life science and rapid developments of LC case to a considerable attention paid to the expansion of miniaturized LC [15]. Nano-LC column was applied for the analysis of thyroid hormones [17], modified ribonucleosides [18], biomarkers [19], drugs [20], proteins [21, 22], doping control [23] and environmental pollutants [24, 25].

Quantitative structure-retention relationship (QSRR) method is a modeling approach which quantitatively relates the chromatographic retention parameters of chemicals to their structural features [26]. Whenever the retention parameters of compounds were measured on a nano-column, extended QSRR model can explain the basic intermolecular interactions that determine the behavior of chemicals in the down-scaling column. QSRR model was employed for prediction of the retention time of some pesticides in cereals and oilseeds by multiple linear regression (MLR) and partial least square regression (PLS) methods [27]. Also, QSRR modeling was utilized for comparing the behavior of peptides in four common reversed-phase liquid chromatography (RPLC) columns. Less complexity of stepwise multiple linear regression (SW-MLR) models led to their superiority to un-informative variable elimination partial least squares (UVE-PLS) models, despite the good prediction of both equations [28].

Consensus quantitative structure-property / activity relationship (consensus QSAR / QSPR) is a novel modeling approach which is used the average of multiple predictions from individual QSAR / QSPR models [29]. It is a method for collaborative decision-making that reduces the uncertainty of the selected model used for classification [30-32] and regression [33-35] analysis. Different consensus approaches enhanced performance over the global and local models. A global QSAR / QSPR model uses a large and diverse training set covering a wide range of chemical space while a local model focuses on analog subsets [36]. The consensus models developed by combining the individual models (sub-models) differ in training subset [37], molecular descriptor [33, 34, 36], modeling method [30] or only vary in some statistical parameter with considering their admission threshold [29]. A consensus regression model includes average consensus model (ACM) and weighted consensus model (WCM) which respectively desired property was predicted by simply or weighted averaging of sub-models [35, 36]. Genetic algorithm-multiple linear regression (GA-MLR) modeling method defines an explicit functional relationship between independent variables and desired property by optimizing forms and coefficients of equation simultaneously [29]. Consensus GA-MLR model was developed for prediction of OH tropospheric degradation of volatile organic compounds (VOCs) [33] and soil sorption partition coefficient (log Koc) of organic non-ionic compounds [34].

The main goal of this work is to develop a QSRR model to predict retention time of 16 banned pesticides in food sample on nano-LC column. GA-MLR method was employed for developing the global and consensus models, and their corresponding accuracy and prediction power were compared.


Data Set

The retention times of 16 banded pesticides in food sample (organic apples and organic apple puree baby food) that their ultra-sensitive analysis was done on nano-LC-DBDI system was taken from [38]. The nano-LC experiments were performed in an Ekspert nano- LC 400 (AB Sciex, Darmstadt, Germany) column packed with reverse phase C18 (3 μm × 75 μm ×15 cm ). The names of these compounds and their experimental nano-LC retention times are indicated in Table 1. As can be seen in this table the retention value is varied in the range of 8.01 min (Simazine) to 17.52 min (Phoxim). Compounds in the data set randomly divided into the training set (12 compounds) and test set (4 compounds) that are respectively used for developing the models and evaluating their predictability.

Descriptors Calculation and Selection

The purpose of the QSRR model is to quantitatively correlate the structural features of studied chemicals to their nano-LC retention parameters by using theoretical molecular descriptors. In the first step of QSRR modeling, the structures of all studied chemicals were drawn and optimized by the semi-empirical AM1 method using Hyperchem program (version 7). Afterward, molecular descriptors were calculated for 16 pesticides by Padel [39], Dragon [40], and Codessa [41] softwares. Then, a pre-processing step involved exclusion constant, near constant, and collinear descriptor (r> 0.9) was employed in QSARINS (QSAR-INSUBRIA) software [29].


Models Development and Validation

The present study investigates the use of global and consensus GA-MLR models for developing a QSRR model to predict the nano-LC retention times of some banned pesticides. Also, the developed model can create an insight into the molecular interactions between bonded phases and solute molecules on the nano-scale column. The following steps of QSRR modeling were implemented in QSARINS software [29] as an new software for the development, analysis, and validation of global and consensus GA-MLR models. In this implementation, form and coefficients of GA-MLR models are simultaneously optimized by adjusting GA parameters (maximum number of descriptors = 4, generation per size = 1000, population size = 100, and mutation rate = 0.5). In each implementation, several GA-MLR models are generated varied in the number and type of descriptors or statistical parameters. A multi-criteria decision making (MCDM) was employed for the selection of best global model and sub-models of the consensus model. In MCDM, a large subset of fitness parameters including fitting criteria, internal validation criteria, and model applicability domain (AD) simultaneously were computed and considered as features of GA-MLR models for ranking. The equation of best global QSRR model based on MCDM is shown in Table 2. Three descriptors of MATS6cSpMax2_Bhp, and Mor31u appeared in this model with good statistical parameters (R2 training = 0.929, R2 test = 0.807 and SE training = 0.81, SE test = 0.78). The value of variation in inflation variance (VIF <10) of selected descriptors in Table 2 indicates that there is no multicollinearity among them. This value is calculated from Equation (1).


where R is correlation coefficient of multiple regression between each descriptor of the model with others [42].

Any individual QSRR model might be over-emphasized some aspects and underestimate or ignore other important features. It seems reasonable that a consensus QSRR model, which can be derived by calculating an average of sub-models, predicts the data better than the majority of individual models [34]. Several consensus models were developed by combining various sub-models selected by MCDM. Both ACM and WCM were applied for developing consensus QSRR models. Finally, an ACM which is combination of two sub-models was obtained as the best consensus QSRR model that one of them (three-descriptor sub-model) is the same best global GA-MLR model (Table 2). The predicted values of retention time with ACM are listed in Table1.

Comparison of statistical parameters of developed models indicated that an ACM which is the combination of the best global QSRR model with a four-descriptor sub-model can be selected as the best consensus QSRR model. The statistical parameters of sub-models, WCM and ACM, are shown in Table 3. Comparison between these parameters confirmed the superiority of ACM over other models. The plot of the ACM predicted values versusexperimental values of retention time are shown in Fig. 1 which indicates a good correlation among them (R2training = 0.973, R2test = 0.939 and SEtraining = 0.49, SEtest = 0.40). Moreover, the residuals of predicted values were plotted against the experimental values, as shown in Fig. 2. The random distribution of residuals around the zero line indicates that there is no any systematic error in the best-developed consensus QSRR model.

Model Validation

The fitting criteria say nothing about the stability of the developed QSRR model. Therefore, leave-one-out (LOO) cross-validation procedure is used to check the robustness of both sub-models, WCM and ACM. In LOOmethod the retention time of each pesticide in the data set is removed and the model is expanded on the remained chemicals, then the resulting model is employed to predict the retention time of the removed compound. This procedure is repeated for all chemicals in data set and then the cross-validated correlation coefficient (Q2cv) is calculated by Equation (2):


In this equation, yiy0i and ymean are the predicted, experimental and mean values of the experimental retention time for i th pesticide, respectively [43]. The calculated values of Q2cv for leave-one-out of selected sub-models and consensus QSRR models, shown in Table 3, are higher than 0.5 that indicated the robustness of all models.

Applicability Domain Analysis

The applicability domain of ACM models is evaluated by leverage analysis expressed by William plot. AD considers the competition features of GA-MLR models for selection as the best global equation or sub-model of consensus models [44]. It is proposed to prefer a model with the lowest number of bad prediction (Y-outlier) and chemicals far from the training structural domain (X-outlier). In the first step of obtaining William plot, the leverage (hat) value is calculated for each compound as follows:


where his the leverage of i th compound in the descriptor space, xi is the descriptor raw vector, X is the matrix of the descriptor, and superscript T refers to the transpose of the vector or matrix [44]. Then, the standardized residuals of predicted Y-values were plotted against their calculated hat values. In this graph, the Y-outlier indicates the compounds with standardized residuals higher than ±3, while X-outlier concerns to those with a leverage value higher than warning hat (h*). The value of h* in global models is constant at that n is the number of compounds in training set and p is the number of descriptors in the model, while in consensus models, p is the maximum number of descriptors that is defined in GA parameters setting step for each implementation. William plot for ACM is shown in Fig. 3 that defines an applicability domain with all studied chemicals.


Theoretical molecular descriptors are distin-guished by physicochemical properties as well as a mathematical tool or algorithm employed for their calculation. Three descriptors of SpMax2_BhpMor31u and, MATS6cappeared in a three-descriptor sub-model while CrippenLogPRDF070mLop, and HASA1 descriptors were emerged in a four-descriptor sub-model. The values of mean effect of each descriptor in the sub-models are calculated and represented in the last column of Table 2. These values indicate the important order of descriptors in these models. SpMax2_Bhp is a Barysz matrix type descriptor in which the maximum absolute eigenvalue of Baryszmatrix for n = 2 was weighted by polarizability [39]. The Barysz distance matrix is defined as a weighted distance matrix that simultaneously accounts the presence of multiple bonds and heteroatoms in the chemicals [39, 45]. In Barysz distance matrix diagonal elements correspond to related values of the atomic charge, polarizability and H-bond abilities which, respectively, are represented as the number 1, 2 and 3 in symbolic presentation of descriptor (e.g., polarizability in SpMax2_Bhp). Mor31u descriptor belongs to 3D-MoRSE (“molecule representation of structures based on electron diffraction”) type descriptor. [40]. In Mor31u a spectral representation of chemicals along a topological distance between two atoms (lag) 31 is un-weighted. These atoms are the vertices of the graph that a topological distance between them defined as a lag. MATS6c is a Moran autocorrelation topological structure (MATS) type descriptor containing a Moran type coefficient along the lag 6 weighted by Gasteiger charge [40]. CrippenLogP is a Crippen’s log P that is related to hydrophobicity [39]. RDF070m descriptor interpreted the probability distribution of atoms in a spherical volume with a radius of 7 Å as a function of atomic masses. A radial distribution function (RDF) descriptor is calculated for all atoms of a molecule to consider its interatomic distances [40, 45]. The lopping centric index (Lop) descriptor is a topological type descriptor defined as the mean information content which is derived from the pruning partition of acyclic graphs [40, 45]. The hydrogen bonding acceptor ability (HASA1) descriptor belonging to the charged partial surface area (CPSA) type descriptor reflects the solvent accessible surface area of H-bonding acceptor atoms [41].

Overall, the important structural connectivity information of the studied chemicals on the nano-LC retention time appeared alone or weighed by physicochemical properties of polarizability, Gasteiger charge, and atomic mass. Also, the significant surface properties emerged as Crippen Log P and HASA1 descriptors. Crippen Log P is related to hydrophobicity while HASA1 describes the charge related surface properties responsible for H-bonding formation. In RP-HPLC a solute that its atoms are incapable of formation of H-bonds is unwilling to present in water solvent (hydrophobic) [46]. Hence, steric, electrostatic, and hydrophobic parameters of the studied pesticides are the predominant interactions that are responsible for their retention time on the miniaturized RP-HPLC column.


The present study investigates the use of global and consensus GA-MLR models for developing a QSRR model to predict the nano-LC retention time of some banned pesticides. The best global QSRR model was established by adjusting GA parameters. Three descriptors of SpMax2_BhpMor31u, and MATS6c appeared in global GA-MLR model. Consensus QSRR models were developed as ACM and WCM by the combination of a subset of the GA-MLR models. The statistical parameters confirmed the superiority of ACM over other QSRR models. CrippenLogPRDF070mLop, and HASA1 descriptors appeared in four-descriptor sub-model. Considering descriptors that appeared in ACM model, it was concluded that electrostatic, steric and hydrophobic interactions play the main role in the chromatographic retention of the studied pesticides in nano-LC conditions.


The authors declare that there is no conflict of interests regarding the publication of this paper.

6. Franklin LU, Cunnington GD, Young DE. Terpene based pesticide treatments for killing terrestrial arthropods including, amongst others, lice, lice eggs, mites and ants. Google Patents; 2000.
14. Remco LRE-JS. Nano LC: Principles, Evolution, and State-of-the-Art of the Technique. 2011.
22. Li B, Takahashi D, Kawamura Y, Uemura M. Plasma Membrane Proteomics of Arabidopsis Suspension-Cultured Cells Associated with Growth Phase Using Nano-LC-MS/MS. In: Mock H-P, Matros A, Witzel K, editors. Plant Membrane Proteomics: Methods and Protocols. New York, NY: Springer New York; 2018. p. 185-94.
26. Mardia K. V., Kent J. T., Bibby J. M. Multivariate analysis: Academic Press, London.; 1979.
27. LI M-p, ZHANG S-w. Quantitative Structure-Retention Relationship of Gas Chromatography for Organophosphorus Pesticide Residues in Cereals and Oilseeds [J]. Journal of Shanxi University (Natural Science Edition). 2011; 3: 026
40. Todeschini R, al. e. DRAGON Web version 3.0: Milano, Italy; 2003.
41. Katritzky A, Lobanov V, Karelson M. CODESSA: Training Manual University of Florida Gainesville. FL Google Scholar. 1995.
43. Wold S, L. Eriksson, S. Clementi. Statistical Validation of QSAR Results. Chemometric Methods in Molecular Design1995. p. 309-38.
44. Atkinson AC. Plots, transformations and regression; an introduction to graphical methods of diagnostic regression analysis. 1985.
45. Todeschini R, Consonni V. Handbook of molecular descriptors: John Wiley & Sons; 2008.