Electrical Conductivity as a General Predictor of Multiple Parameters in Tigris River Based on Statistical Regression Model

Surface water samples from different locations along the Tigris River within Baghdad city were analyzed to assess their suitability for drinking purposes. Correlation coefficients among the measured parameters were determined, and linear regression equations were developed to predict the concentrations of water quality constituents that correlate significantly with electrical conductivity (EC). Five regression models were produced and validated using EC as the predictor of total hardness (TH), calcium (Ca), chloride (Cl), sulfate (SO4), and total dissolved solids (TDS). The five models showed good to excellent prediction ability for these parameters, which is a useful starting point for establishing a laboratory rule of thumb against which observations can be checked. The linear regression approach to predicting surface water quality parameters can be applied to any other location.


INTRODUCTION
There is strong evidence that electrical conductivity is the most appropriate variable for predicting or explaining the values of the dependent variables (Joarder et al., 2008). Sustainable development is closely linked to securing water of adequate quantity and quality to maintain environmental and health systems. Water is essential for human, economic, and social development, yet pressure on it is growing through rising international demand and mismanagement of water resources (Bhaduri et al., 2016).
In Iraq, river water is used for domestic, agricultural, and industrial purposes, and its quality can be regarded as a function of its physicochemical properties. Continuous monitoring of a large number of quality parameters is crucial for maintaining water quality efficiently. Nevertheless, routine monitoring of all the parameters is a complex and challenging task, even when adequate personnel and laboratory resources are available. Hence, an alternative approach based on correlation and regression has recently been used to develop empirical relationships among physicochemical parameters (Bhandari and Nayal, 2008).
Electrical conductivity (EC) and total dissolved solids (TDS) are correlated, and both are used to describe the extent of salinity. The TDS test is more complicated than the EC test; nonetheless, it is imperative because TDS is superior to EC in characterizing groundwater quality, particularly under the influence of seawater intrusion. These circumstances create the need for research on TDS/EC correlation and regression (Rusydi, 2018). Bhandari and Nayal (2008) studied a broad spectrum of water quality correlations; among them, EC was correlated with total hardness, sulfate, and total dissolved solids, with r equal to 0.89, 0.806, and 0.94, respectively. Kumar and Sinha (2010) showed that EC has a noteworthy correlation with several quality parameters, including total hardness, calcium, total dissolved solids, total solids, and sulfate, with correlation coefficients (r) over 0.9.
In the present study, statistical regression modeling is adopted to produce five predictive models in which electrical conductivity is the independent variable (predictor) and the five criterion (dependent) variables are total hardness, calcium, chloride, sulfate, and total dissolved solids.

MATERIAL AND METHODS
Case study
The stretch of the Tigris River within Baghdad city's boundaries is considered; it is the only source of drinking water for the city. The river's length within the city is about 65 km, and it divides the capital into two parts, Al-Karkh (west bank of the river) and Al-Rusafa (east bank of the river). Nine monitoring stations were initially constructed near the water treatment plant (WTP) intakes (Khudair, 2013). Each station has its own laboratory, and the observations taken from these stations are considered independent. The case study is shown in Fig. 1, where Al-Karkh Station (S1), East Tigris Station (

Data collection and analysis
In the early stage of the statistical analysis, 15 different parameters were considered. The data were collected from the Baghdad Mayoralty in coordination with the different WTPs and cover the period 2014-2019. After reviewing the data and the cross-correlations between the parameters, electrical conductivity showed a good correlation with total hardness (TH), calcium (Ca), chloride (Cl), sulfate (SO4), and total dissolved solids (TDS). Hence, the dataset was prepared, and its descriptive statistics are reported in Table 1 to characterize the data. All tests were performed according to standard procedures (APHA, 2017). Table 1 shows that the mean value of TDS exceeded the WHO standard, while all other parameters are within the acceptable limits of the World Health Organization.

Methodology
Correlation coefficients are a common method for characterizing the association between two variables. Correlation analysis is appropriate when both variables have random measurement errors. The variables are equally important and interchangeable (both variables are correlated, but this does not mean that they are governed by a predictor/predicted relation). The Pearson sample correlation coefficient between x (first variable) and y (second variable) is given by

$$ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \qquad (1) $$

where $\bar{x}$ and $\bar{y}$ are the sample means and $n$ is the number of paired observations.
In this study, a simple linear regression model was adopted. Although data fit is a useful criterion for evaluating a model, model building should not be concerned only with obtaining the optimal mathematical fit (i.e., the maximum adjusted R^2 or the minimum residual standard deviation) to the estimation data. The model is

$$ Y = C_0 + C_1 X \qquad (2) $$

where Y and X represent the dependent and independent variables, while C0 and C1 represent the y-intercept and the slope of the line, respectively. The collection of new data is the preferred method for validating the model. In many cases, this is neither practical nor possible, and a procedure that simulates the collection of new data is needed. A sensible way forward is to split the data at hand into two groups. The first, called the estimation dataset, is used to estimate the model parameters. The remaining data points, called the validation dataset, are used to measure the accuracy of the model's predictions.
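The two calculations described above, the Pearson correlation coefficient and the least-squares estimates of C0 and C1, can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library (the study itself used SPSS); the function names `pearson_r` and `fit_line` and the sample readings are hypothetical and are not taken from the study's dataset.

```python
import math


def pearson_r(x, y):
    """Pearson sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares estimates (c0, c1) for the line y = c0 + c1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    c1 = sxy / sxx                 # slope
    c0 = mean_y - c1 * mean_x      # y-intercept
    return c0, c1


# Hypothetical EC and TDS readings, for illustration only (not study data).
ec = [620.0, 710.0, 805.0, 890.0, 1010.0]
tds = [400.0, 455.0, 520.0, 570.0, 650.0]

r = pearson_r(ec, tds)
c0, c1 = fit_line(ec, tds)
print(f"r = {r:.3f}; TDS ~ {c0:.1f} + {c1:.3f} * EC")
```

A positive slope together with a positive r indicates a direct relationship between the predictor and the predicted variable.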

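The data-splitting approach is used here: the dataset is split using a random sampling technique. The split is assessed using criteria such as the root mean square error (RMSE) of the validation dataset, given by:

$$ RMSE_{val} = \sqrt{\frac{1}{n_{val}}\sum_{i=1}^{n_{val}}\left(Y_i-\hat{Y}_i\right)^2} \qquad (3) $$

where the subscript val refers to the validation dataset, $Y_i$ is the observed value, $\hat{Y}_i = C_0 + C_1 X_i$ is the value predicted by the fitted model, and $n_{val}$ is the number of validation observations.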
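The data-splitting procedure can be illustrated with a minimal sketch: shuffle the paired observations, hold out a validation subset, fit the line on the remainder, and compute the RMSE on the held-out points only. Everything in the block is an assumption made for illustration: the data are synthetic (generated from a made-up linear relation with noise), the sample size of 600 with 60 held out only mirrors the proportions reported in this study, and the helper names are hypothetical.

```python
import math
import random


def split_dataset(pairs, n_val, seed=0):
    """Randomly split (x, y) pairs into estimation and validation sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_val:], shuffled[:n_val]


def fit_line(pairs):
    """Least-squares estimates (c0, c1) for y = c0 + c1 * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    c1 = sxy / sxx
    return mean_y - c1 * mean_x, c1


def rmse(pairs, c0, c1):
    """Root mean square error of the line's predictions on the given pairs."""
    squared = [(y - (c0 + c1 * x)) ** 2 for x, y in pairs]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic (x, y) pairs from an assumed relation y = 20 + 0.6 x + noise.
# These numbers are invented for the sketch and are not study data.
rng = random.Random(42)
data = []
for _ in range(600):
    x = rng.uniform(500.0, 1200.0)
    data.append((x, 20.0 + 0.6 * x + rng.gauss(0.0, 15.0)))

estimation, validation = split_dataset(data, n_val=60)
c0, c1 = fit_line(estimation)            # parameters from estimation data only
rmse_val = rmse(validation, c0, c1)      # error on unseen validation data
print(f"C0 = {c0:.2f}, C1 = {c1:.3f}, RMSE_val = {rmse_val:.2f}")
```

Fixing the seed makes the split reproducible, so the same validation subset can be reused for each of the dependent variables.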

RESULTS AND DISCUSSION
To perform the linear regression analysis of the five parameters (TDS, TH, Ca, Cl, and SO4) against EC, the dataset of more than 600 observations was separated into two datasets: the calibration dataset (used to create the models) and the validation dataset (used to check the models' performance for future prediction). In this study, the validation dataset was drawn at random from the whole dataset (60 observations were set aside for validation). The statistical program SPSS was used to perform the regression analysis. The summary of the models is shown in Table 2. These models make physical sense, as the positive signs of C1 and r signify a direct relationship between electrical conductivity and the other parameters (for instance, when EC increases, Cl increases). These results are analogous to those of Joarder et al. (2008).
The physical explanation of the positive sign is that the EC of water increases as the concentration of positive and negative ions increases (Peterson et al., 2013). The predicted vs. observed plot is a powerful tool for assessing a model's prediction ability, especially when the calibration and validation datasets are distinguished. Figs. 2, 3, 4, 5, and 6 show the prediction ability of the five models described above. For the TDS-EC model, the points are concentrated along the diagonal line, indicating a very accurate model, while the SO4-EC model is less accurate, as its points are more widely scattered about the diagonal line. Overall, the five models are acceptable and of high practical importance, as they allow several parameters to be estimated from electrical conductivity alone. Because the points are scattered about the diagonal line, the data are taken to meet the assumptions of normally distributed errors and constant residual variance. RMSEval is acceptable for all models, as it is much lower than the standard deviation (STDV) of each variable.
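The comparison between RMSEval and the standard deviation of the observed variable can be expressed as a simple ratio; values well below 1 mean that the model predicts better than using the mean of the observations. The sketch below assumes the sample standard deviation (n - 1 denominator, as computed by Python's `statistics.stdev`); the function names and the sample values are hypothetical, and no particular cutoff for "much lower" is implied by the study.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_to_stdev_ratio(observed, predicted):
    """RMSE divided by the sample standard deviation of the observations."""
    return rmse(observed, predicted) / statistics.stdev(observed)


# Hypothetical validation values, for illustration only (not study data).
observed = [410.0, 455.0, 520.0, 575.0, 640.0]
predicted = [402.0, 461.0, 515.0, 583.0, 634.0]

ratio = rmse_to_stdev_ratio(observed, predicted)
print(f"RMSE / STDV = {ratio:.2f}")
```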