A Multi-variables Multi -sites Model for Forecasting Hydrological Data Series

A multivariate multisite hydrological data forecasting model was derived and checked using a case study. The philosophy is to use simultaneously the cross-variable correlations, cross-site correlations and the time lag correlations. The case study is of two variables, three sites, the variables are the monthly rainfall and evaporation; the sites are Sulaimania, Dokan, and Darbandikhan.. The model form is similar to the first order auto regressive model, but in matrices form. A matrix for the different relative correlations mentioned above and another for their relative residuals were derived and used as the model parameters. A mathematical filter was used for both matrices to obtain the elements. The application of this model indicates it's capability of preserving the statistical characteristics of the observed series. The preservation was checked by using (t-test) and (F-test) for the monthly means and variances which gives 98.6% success for means and 81% success for variances. Moreover for the same data two well-known models were used for the sake of comparison with the developed model. The single-site single-variable auto regressive first order and the multi-variable single-site models. The results of the three models were compared using (Akike test) which indicates that the developed model is more successful ,since it gave minimum (AIC) value for Sulaimania rainfall, Darbandikhan rainfall, and Darbandikhan evaporation, while Matalas model gave minimum (AIC) value for Sulaimania evaporation and Dokan rainfall, and Markov AR (1) model gave minimum (AIC) value for only Dokan evaporation).However, for these last cases the (AIC) given by the

multivariate variation. The last two cited research are those among the little work done on forecasting models of multi-sites multivariate types. However these model are rather complicated, and/or do not model the process of cross site and cross variables correlation simultaneously, which as mentioned above the real physical case that may exist. Hence researches are further required to develop a simplified multisite multivariate model. In this research a new straightforward multisite-multivariate approach is proposed to develop such a model that describe the cross variables and cross sites correlation structure in the forecasting of multi variables at multi sites simultaneously. This model was applied to a case study of monthly data of two hydrological variables, rainfall and evaporation at three sites located north Iraq, Sulaimania, Dokan, and Darbandikhan.

THE MODEL DEVELOPMENT
The multivariate multisite model developed herein, utilizes single variable lag correlations, cross variables lag-correlations, and cross sites correlations. In order to illustrate the model derivation consider Fig.1 shown. This figure illustrates the concept of two variables, two sites and first order model. This simple form is used to simplify the derivation of the model. However, then the model could be easily generalized using the same concept. For instant, Fig. 2 is a schematic diagram for a multivariate multisite model of two variables, three sites and first order time. The concept is that if there will be two-variables, two sites, and one time step (first order), then there will exist (8) nodal points. Four of these represent the known variable, i.e. values at time (t-1); the other four are the dependent variables, i.e. the values at time (t). As mentioned before Fig. 1 shows a schematic representation of the developed multisite multivariate model and will be abbreviated hereafter as MVMS (V, S ,O),where V: stands for number of variables in each site , S: number of sites , and O : time order, hence figure (1) can be designated as MSMV (2,2,1), while Fig. 2 MVMS (2,3,1). This model can be extended further to (v-variables) and / or (s-sites) and / or (o-time) orders as will be shown later .The model concept assume that each variable dependent stochastic component at time t can be expressed as a function of the independent stochastic component for all other variables at time (t), and those dependent component for all variables at time (t-1) at all sites. The expression is weighted by serial correlation coefficient, cross-site cross-correlation coefficient, cross-variable cross coefficient and cross-site, cross-variable correlation coefficient. In addition to that; the independent stochastic components are weighted by the residuals of all types of correlations. These residual correlations are expressed using the same concept of autoregressive first order model (Markov chain). Further modification of this model is to use relative correlation matrix parameters by using correlation values relative to total sum of correlation for each variable, and the total sum of residuals as a mathematical filter ,as will be shown later. A model matrix equation for first order time lag, O=1, number of variables=V, and number of sites=S, could be put in the following form: Where : where: ρ 1,1 = ρ [(x1, x1), (s1, s1), (t, t-1) ]= population serial correlation coefficient of variable 1 with itself at site 1 at site 1, for time lagged 1 ρ 1,2 = ρ [(x1, x2), (s1, s1), (t, t-1) ]= population cross correlation coefficient of variable 1 at site 1 with variable 2 at site 1, for time lagged 1 ρ 1,3 = ρ [(x1, x1), (s1, s2), (t, t-1) ]= population cross correlation coefficient of variable 1 at site 1 with variable 1 at site 2, for time lagged 1 ρ 1,4 = ρ [(x1, x2), (s1, s2), (t, t-1) ]= population cross correlation coefficient of variable 1 at site 1 with variable 2 at site 2, for time lagged 1 ρ 1,5 = ρ [(x1, x1), (s1, s3), (t, t-1) ]= population cross correlation coefficient of variable 1 at site 1 with variable 1 at site 3, for time lagged 1 ρ 1,6 = ρ [(x1, x2), (s1, s3),(t,t-1) ]= population cross correlation coefficient of variable 1 at site 1 with variable 2 at site 3, for time lagged 1,the definition continues… , finally ρ 6,6 = ρ [(x2, x2), (s3, s3), (t, t-1) ]= population serial correlation coefficient of variable 2 at site 3 with variable 2 at site 3, for time lagged 1.
The designated (ρ i,j ) is used for simplifying .That is variables at site 1 ,as 1, and 2,for this model ( in general to 1,2,…v),then for variables at site 2,as 3 ,and 4 (in general from v+1 to 2v and so on) hence (r1,v+1) stands for the correlation between variable 1 at site 1,and variable 1 at site 2 and so on. ϵ: is the Stochastic dependent component.
ξ: is the Stochastic independent component. σ i,j : are the residual of the correlation coefficient ρ i,j .
The matrix, Eq.
(2), replacing ρ i,j values by the corresponding relative values in the matrix Eq.(6), and σ i,j with the corresponding relative values σr i,j in the matrix Eq.(7) . The model can be generalized to any number of variables and number of sites.

THE CASE STUDY AND APPLICATION OF THE MODEL.
In order to apply the new developed (MVMS) model explained above the Sulaimania Governorate was selected as a case study. Sulaimania Governorate is located north of Iraq with total area of (17,023 km2) and population, 2009. 1,350,000. The city of Sulaimania is located (198) km north east from Kurdistan Regional capital (Erbil) and (385) km north from the Federal Iraqi capital (Baghdad). It is located between (33/43-20/46) longitudinal parallels, eastwards and 31/36-32/44 latitudinal parallels, westwards. Sulaimania is surrounded by the Azmar Range, Goizja Range and the Qaiwan Range from the north east, Baranan Mountain from the south and the Tasluje Hills from the west. The area has a semi-arid climate with very hot and dry summers and very cold winters. Barzanji, 2003. The variables used in the model among other meteorological recoded data are (rainfall and evaporation) for monthly model as a two main variables that are expected to be useful for catchment management and runoff calculation. Data were taken from three meteorological stations (sites) inside and around Sulaimania city, which are Sulimania, Dokan dam, and Darbandikhan dam meteorological stations. Dokan dam metrological station is located (61 km) north east, and Darbandikhan dam metrological station is located (55 km) south east of Sulaimania city. While Dokan dam meteorological station is located (114 km) north east of Darbandikhan dam metrological station .The sites coordinates are given in Table 1, Barzinji ,2003.The Satellite image of the locations of the three stations showed in Fig.3. The model was applied to the data of the case study described above. The length of record for the two variables and the three stations is (27) years, . The data for the first (22),  years were used for model building, while the left last 5 years data were used for verification, (2006)(2007)(2008)(2009)(2010). It is worth to mention that the data are on monthly basis. Moreover since the analysis includes the rainfall as a variable which has zero values for June, July, August and September, in the selected area of the case study, these months are excluded from the analysis. Hence the model was built for the continuous period from October to May. In order to give a general view for the data used the descriptive statistics (Mean, Standard deviation Sd, Coefficient of Skewness Cs, Coefficient of kurtosis Ck, Maximum Max, Minimum Min) were calculated for rainfall and evaporation of Sulaimania, Dokan dam, and Darbandikhan dam meteorological stations and are shown in Table 2. Before proceeding with the modeling process the data series should be checked for their homogeneity . The split sample test suggested by Yevjevich, 1972, was applied for this purpose for each data series to test the homogeneity both in mean and standard deviation values. Different sizes of the subsamples were used for dividing the data sample into two subsamples with (n1,and n2) as number of years for subsample one and subsample 2 respectively. That is (n1:n2) as (1,26),(2,25), (3,24), and so on. The split sample test result on estimated t-values that was compared with the critical t-value. If the t-value estimated is greater than the critical t-value then the data series is considered as non-homogeneous, Yeijevich, 1972 , and thus this nonhomogeneity should be removed. The results of this test had showed that there are some different subsamples splitting (n1:n2) values that exhibit non-homogeneity exist, however these cases that gives the maximum t-test values were considered for each of the 6-data series. Table.3 shows these results, which indicates that non-homogeneity is exist in Sulaimania evaporation, Dokan rainfall, and Derbendikhan evaporation data series, while the series of the other variables are homogeneous. To remove this non-homogeneity the method suggested by Yeijevich,1972 was used that using the following equation: (12) Where, H i,j : is the homogenized series at year i,month j of the first sub-sample (old). X i,j : is the original series at year i, month j, of the first sub-sample . A1, B1: are the linear regression coefficients of the annual means. A2,B2 : are the linear regression coefficients of the annual standard Deviations. Mean2,Sd2 : are the overall mean and standard deviation of the second sub-sample. This implies that the data is normalized according to the second sub-sample, i.e., the most recent one which is the correct way for forecasting. Table.4 shows the values of the of Mean2,Sd2,A1,B1,A2,and B2, for the three non-homogeneous series. The homogenized data were then retested to make sure that the transformation applied in Eq.(12), had removed the non-homogeneity. Table.5 shows these results which ensure that the data series are all now homogeneous. The next step in the modeling process is to check and remove the trend component in the data if exist. This was done by finding the linear correlation coefficient(r) of the annual means of the homogenized series, and the t-value related to it. If the t-value estimated is located in the r=0 hypothesis rejecting area t> + or -critical t-value of 2.83 then trend exist otherwise it is not. The following equation is used to estimate the T-values.
Where Table 6 shows these results, which indicate the absence of the trend component in all of the data series of the six variables. Before proceeding into the modeling process the data should be normalized to reduce the skewness coefficient to zero. The well-known Box-Cox transformation Box and Jenkin , 1976 was used for this purpose as presented in the following equation: Where: µ : is the power α : is the shifting parameter. XN : is the normalized series. Table.7 shows the coefficients of the normalization transformation of all of the six series. The shifting parameter is selected to ensure avoiding any mathematical problem that may occur due to the fraction value of the power µ. The power value is found by trial and error so as to select its value that reduce the skewness to almost zero value. Table 8 shows the statistical properties of the series before and after normalization, which indicate that the skewness coefficients are reduced to almost zero a property of the normal data. The next step in the modeling process is to remove the periodic component to obtain the stochastic dependent component of the series, which is done by using Eq.(15), as follows: = Where: ϵ i,j : is the obtained dependent stochastic component for year i, month j. Xb j : is the monthly mean of month j of the normalized series XN. Sd j : is the monthly standard Deviation of month j of the normalized series XN. Table 9 shows the monthly means and monthly standard deviations of the normalized data series XN. The ϵ i,j obtained series are then used to estimate the Lag-1 serial and cross correlation coefficients ρ i,j , and σ i,j of matrix Eqs. (6) and (7) respectively, which then used to estimate ρr i,j and σr i,j using Eqs.(9), and (11), respectively..

RESULTS AND DISCUSSION
The developed model above is used for data forecasting, recalling that the estimated parameters above are observed using the 22 years data series . This model will be used to forecast data for the next 5-years (2006)(2007)(2008)(2009)(2010) since the data available are up to 2010, that could be compared with the observed series available for these years, for the purpose of model validation.
The forecasting process was conducted using the following steps: 1. Generation of an independent stochastic component ( ) using normally distributed generator, for 5 years,i.e., (5*12) values.
3. Reversing the standardization process by using the same monthly means and monthly standard deviations which were used for each variable to remove periodicity using Eq. (15) after rearranging.

Applying the inverse power normalization transformation (Box and Cox) for calculating unnormalized variables using normalization parameters for each variable and Eq.(14).
In most forecasting situation, accuracy is treated as the overriding criterion for selecting a model. In many instance the word "accuracy" refers to "goodness of fit," which in turn refers to how well the forecasting model is able to reproduce the data that are already known. The model validation is done by using the following steps: 1. Checking if the developed monthly model resembles the general overall statistical characteristics of the observed series.
2. Checking if the developed monthly model resembles monthly means, monthly standard deviations using t-test for the means and F-test for the standard deviation.
Where: n: is the number of the total forecasted values . K: number of parameters of the model plus 1. Rss: is the sum of square error between the forecasted value and the corresponding observed value.
For each site and variable three sets of data are generated. The overall statistical characteristics are compared with those observed, for each of the generated series. Table 10 shows these comparisons. For all variables and sites the generated sets resemble the statistical characteristics not exactly with the same values of the observed series but sometimes larger or smaller but within an acceptable range. Table 11 shows the t-test and F-test summary for all of the variables and sites. As it is obvious from the results of these tables, that the generated series succeed in (ttest) for all of the monthly means, except for two months for Sulaimania rainfall, i.e. overall succeed percent of (98.6%). This indicates that the model is successfully resembled the monthly means values, with excellent accuracy. Based on (F-test) which seek the variance differences between the observed and generated series; the success percentage ranking of the generated series was: the best being for Sulaimania rainfall (96%), followed by Darbandikhan evaporation (88%), Darbandikhan rainfall (83%), Dokan evaporation (83%), Dokan rainfall (71%), and finally Sulaimania evaporation (67%). The overall success percentage was (81%). These results of the F-test indicate that the model was successfully resembled the monthly standard deviations, with a very good accuracy. As mentioned above for purpose of the comparison of the model performance with the available forecasting models, the Akaike , 1974 test was used. Before that six single variable single site models were developed, one for each variable, and three single variable multi-site models, Matalas ,1967 one for each site. These models were then used for forecasting monthly data for the same period (2006)(2007)(2008)(2009)(2010), forecasted by the developed model. Table.12 shows the Akaike test results for all of the forecasted variables, in each sites, obtained using these model and those obtained by the developed model. It is obvious that the developed model had produced for most of the cases the lowest test value, i.e, the better performance. Even though for some cases it has higher test value than the other models, but for these cases it is observed that a very little differences are exist between these test values and the minimum obtained one.

CONCLUSIONS
From the analysis done in this research, the following conclusion could be deduced: 1-The model parameters can be easily estimated and do not require any extensive mathematical manipulation. 2-The model can preserve the overall statistical properties of the observed series with high accuracy.
3-The model can preserve the monthly means of the observed series with excellent accuracy, evaluated using the t-test with overall success (98.6%). 4-The model can preserve the monthly standard deviations of the observed series with a very good accuracy, evaluated using the F-test with overall success (81%). 5-The comparison of the model performance with the single variable single site and the multi-site single variable models, using the Akaike tset had proved that the developed model had proved better performance in the most cases. Moreover for those less cases where other models had the better performance; the test value of the developed model is slightly higher than the minimum value.