1 Introduction

Energy is needed to sustain the rapid pace of economic development. Developed countries need energy to keep up with this accelerated pace, while the least developed countries depend on it to meet their basic needs. In the latter, a critical shortage of energy, primarily of electricity, exists and threatens to persist, to the detriment of the populations living there [1]. Countries like Nepal, which have no natural reserves of fossil fuel of their own, depend increasingly on other countries even to meet their basic energy needs. A switch to renewable energy sources can therefore help address these problems. Biogas is one such renewable source; it uses animal and human waste for energy generation.

The fragile Himalayan environment of Nepal is suffering from rapid deforestation and soil erosion. Total energy consumption in the year 2008/2009 was about 9.3 million tons of oil equivalent (401 million GJ), of which 87% was derived from traditional resources, 12% from commercial sources and less than 1% from alternative sources [2].

Non-commercial energy forms, largely fuel wood and other biomass, are still the main source of direct energy use for a large proportion of the population of Nepal. According to the 2011 census, 64% of households use firewood for cooking [3]. The rural-urban differential in this dependence is reflected in the fact that 59.03% of households in rural areas and 5% in urban areas depend on firewood for cooking. Cow dung is the cooking fuel in 10.3% of households, with a rural-urban differential of 10% and 0.29%, respectively. Dependence on kerosene, a commercial fuel, for cooking is 0.387% in rural areas and 0.638% in urban areas, so that only about 1% of households in Nepal cook with kerosene. According to the WECS energy report 2010, fuel wood is the largest energy source in Nepal, providing 77% of the total energy demand in the year 2008/2009. Of this, 99.2% is consumed by the residential sector, 0.2% by the industrial sector and 0.6% by the commercial sector. Agricultural residue and animal dung contribute about 4% and 6%, respectively. The share of petroleum fuel in the national energy consumption profile is about 8%. The other commercial sources are coal and electricity, which together contribute about 4% of the total energy supply. In a nutshell, the overall energy consumption of Nepal is dominated by traditional non-commercial forms of energy such as fuel wood, agricultural residue and animal waste. The shares of traditional biomass, commercial energy and renewable energy are 87%, 12% and 1%, respectively. The share of traditional fuel has declined slightly over the years, from 91% in 1995/1996 to 88% in 2004/2005 and 87% in 2008/2009. Commercial and renewable sources together supplied the remaining 13% in 2008/2009.

Generation, verification, rectification and prediction of data lead to evidence-based study. This approach can be applied to data from any discipline and is thus an interdisciplinary route to the solution of a research problem, making the research objective and less open to dispute. Halvorsen et al. [4] applied multinomial logistic regression to explore the association between general practitioners' preferences for time spent on preventive health care activities and age, gender and practice characteristics. Chambers and Whitehead [5] applied multinomial logistic regression to model the willingness to pay for a wolf protection plan. Similarly, Angeles et al. [6] modeled contraceptive use with multinomial logistic regression. Crompton and Wu [7] forecast energy consumption in China using a Bayesian vector autoregressive methodology. Cai and Jiang [8] analyzed the rural-urban differential in energy consumption in China using analysis of variance techniques for parametric and nonparametric data. Bhattacharyya [9] studied access to energy among India’s poor, whereas Zhang et al. [10] present an overview of energy consumption patterns from available data and an analysis of some relevant aspects of energy policy in rural China. Petrides and Furnham [11] analyzed several dimensions of human emotional intelligence with exploratory factor analysis.

In this paper, the socioeconomic benefits of energy are studied on the basis of collected data. It is shown that the benefits in developing countries far outweigh any directly attributable advantage. The novelty lies in the fact that, unlike in many papers, the use of statistics is not superficial: extensive statistical methods are developed and used to draw in-depth conclusions and inferences. Categorical data are generated from a survey of 700 households. Nepal, like many countries in the developing world, suffers from limited and scarce data and lacks a strong backbone of good-quality official data. Remote geographical location, lack of awareness among stakeholders and lack of incentives are the main causes. Such data and the results of such studies are needed by policy makers and planners for setting and achieving development goals and initiatives [12]. In the absence of accurate measurement instruments in such countries, methods for the generation and prediction of categorical data were considered appropriate to fill this knowledge void. Section 2 explains Data Sources and Methods, and Sect. 3 presents Exploratory Data Analysis. Section 4 explains Models and Estimation, and Sect. 5 presents Results and Discussion. The Conclusion section gives concluding remarks with possible questions for future research.

2 Data sources and methods

2.1 Samples and data collection

This study is based on the combined results of sample surveys of 700 rural households: 300 households of national grid energy users and 400 households of biogas users. The survey areas are numbered on the map of Nepal shown in Fig. 1. Area 2, Kavre district, is the survey area for the 300 households, and areas 1, 3 and 4, representing Bhaktapur, Simara and Sarlahi, respectively, are the survey areas for the 400 households. The survey of 300 households was conducted to obtain a detailed energy consumption profile. Both surveys use the same base questionnaire, a structured questionnaire with multiple-choice answers that yields categorical data. These categorical data can be analyzed on an ordinal scale. Because of the large sample size, they are treated as continuous data by appeal to the central limit theorem. The questionnaire was pretested. The survey of biogas consumers collected data on their energy consumption pattern before and after the installation of a biogas plant in their household. Their responses to the "before" questions helped assess their needs as normal users and potential biogas users. The "after" questions helped assess the change in their energy consumption pattern after they switched to biogas. The family structure and the location of the 700 households are shown in Table 1.

Fig. 1 Survey areas

Table 1 Respondent’s profile

2.2 Statistical methods

Factor analysis: It is a statistical technique that studies the interrelationships among variables in an effort to find a new set of factors, fewer in number than the original variables. These factors are common to the original variables and identify the hidden relationships between them. For categorical data, exploratory factor analysis based on the principal components method with varimax rotation produces stable results. This method does not require normally distributed data, but it does require elliptical symmetry of the data. The Kaiser–Meyer–Olkin measure of sampling adequacy (KMO) indicates the factorability of the data; KMO > 0.5 indicates adequate factorability of the correlation matrix. The common factor analytic model assumes that a variable consists of common and unique parts. Therefore, the factor model can be considered as

$$X = \varLambda F + U$$
(1)

where X = [X1, X2,…, Xp]′, F = [f1, f2, …, fq]′, U = [e1, e2, …, ep]′.

X is the data vector, F is the vector of common factors and U is the vector of unique factors. The elements of F and U cannot be observed directly. The matrix of factor loadings is given in Eq. (2), where the \(\lambda_{ij}\) (i = 1, 2, …, p; j = 1, 2, …, q) are unknown.

$$\varLambda = \begin{bmatrix} \lambda_{11} & \lambda_{12} & \ldots & \lambda_{1q} \\ \lambda_{21} & \lambda_{22} & \ldots & \lambda_{2q} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{p1} & \lambda_{p2} & \ldots & \lambda_{pq} \end{bmatrix}$$
(2)

It is assumed that e1, e2, …, ep are mutually independent and are also independent of the elements of F, that is, \(\text{cov} \left( {U, \;F^{\prime}} \right)\, = \, 0\). Also, it is assumed that

$$E\left[ UU^{\prime} \right] = \begin{bmatrix} \psi_{1} & 0 & 0 & \ldots & 0 \\ 0 & \psi_{2} & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & \psi_{p} \end{bmatrix}$$
(3)
$$E\left( F \right) = 0,\quad V\left( F \right) = I$$
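
The assumptions above imply a simple structure for the covariance matrix of X. As a brief sketch, assume that X is centered and that V(F) is the q × q identity matrix, and write \(\varSigma\) for the covariance matrix of X and \(\varPsi\) for the diagonal matrix in Eq. (3); the symbols \(\varSigma\) and \(\varPsi\) are introduced here only for illustration. Expanding Eq. (1) and using cov(U, F′) = 0 gives

$$\varSigma = E[XX'] = \varLambda E[FF']\varLambda' + \varLambda E[FU'] + E[UF']\varLambda' + E[UU'] = \varLambda \varLambda' + \varPsi$$

so that the variance of each observed variable splits into a communality and a unique part, \(\text{var}(X_{i}) = \sum_{j = 1}^{q} \lambda_{ij}^{2} + \psi_{i}\).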

Multinomial logistic regression: It is used to predict categorical placement in or the probability of category membership on a dependent variable based on multiple independent variables. It is a simple extension of binary logistic regression that allows for more than two categories of the dependent or outcome variable. Multinomial logistic regression uses maximum likelihood estimation to evaluate the probability of categorical membership.

Let J denote the number of categories of the multinomial response variable Y, and let {π1, π2, …, πJ} denote the response probabilities, which sum to 1. Logit models for multinomial responses pair each category with a baseline category [13]. When the last category (J) is the baseline, the baseline category logits are

$$\log \left( {\pi_{j} /\pi_{J} } \right),\quad j = 1, \ldots ,J - 1$$
(4)

Given that the response falls in the category j or J, this is the log odds that the response is j.

The baseline category logit model with predictor x is

$$\log \left( {\pi_{j} /\pi_{J} } \right) = \alpha_{j} + \beta_{j} x,\quad j = 1, \ldots ,J - 1$$
(5)

The model has J − 1 equations, each with its own parameters, so the effects vary according to the category paired with the baseline. When J = 2, the model simplifies to a single equation for log(π1/π2) = logit(π1), which is ordinary logistic regression for binary responses.

Each predictor has an associated odds ratio, denoted Exp(B). Exp(B) is greater than 1 when the predictor increases the logit, equal to 1 when the predictor has no influence on the logit and less than 1 when the predictor decreases the logit.
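
The baseline-category logit model of Eq. (5) and the reading of Exp(B) can be illustrated with a minimal Python sketch. The function names and the coefficient values below are hypothetical and are not estimates from the survey; the sketch only shows how J − 1 linear predictors are turned into J response probabilities with category J as the baseline.

```python
import math


def baseline_category_probs(alphas, betas, x):
    """Return [pi_1, ..., pi_J] for a baseline-category logit model.

    alphas, betas: intercepts and slopes for categories 1..J-1.
    Category J is the baseline, so its linear predictor is fixed at 0.
    """
    logits = [a + b * x for a, b in zip(alphas, betas)] + [0.0]
    m = max(logits)  # subtract the maximum for numerical stability
    expv = [math.exp(v - m) for v in logits]
    total = sum(expv)
    return [v / total for v in expv]


def odds_ratio(beta):
    """Exp(B): factor by which the odds of category j versus J change per unit of x."""
    return math.exp(beta)


# Hypothetical coefficients for J = 3 categories (illustration only).
alphas = [0.5, -0.2]
betas = [0.8, -0.4]
probs = baseline_category_probs(alphas, betas, x=1.0)
```

With these inputs the probabilities sum to 1 and log(π1/π3) recovers α1 + β1x, while a positive slope gives Exp(B) > 1 and a negative slope gives Exp(B) < 1.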

Multivariate analysis of variance for two-way classification: This is the multivariate generalization of the analysis of variance. It tests the equality of the mean vectors of several populations by partitioning the total covariance matrix into two components, the between-populations and the within-populations covariance matrices. The procedure is a special case of the multivariate general linear hypothesis for testing the parameters of the model. Each response in the model consists of the p values observed on p variables. Let Y be the observed n × p matrix of p response variables on each of n objects. These response variables depend on other variables Xi, with (q − 1) independent variables observed on each of the n objects. Then, the multivariate regression model is written as

$$Y = XB + U$$
(6)

where X is of order n × q, B is a q × p matrix of unknown regression parameters and U is a matrix of unobserved random disturbances. When the matrix X in Eq. (6) is a random matrix, the model is called a multivariate regression model. The elements of X may instead be 0 and 1, indicating the absence or presence, respectively, of some factors. Then, Eq. (6) is called a general linear model and we have a case of multivariate analysis of variance.
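
As a rough illustration of how a 0/1 matrix X arises in a two-way classification, the sketch below builds an intercept column, indicator columns for the non-reference levels of two factors and their interaction columns. Reference-level coding, the function name and the example codes are assumptions made for illustration; the paper does not state which coding was used.

```python
def two_way_design_matrix(factor_a, factor_b):
    """Build a 0/1 design matrix with intercept, main effects and interaction.

    The first (smallest) level of each factor is taken as the reference level.
    """
    levels_a = sorted(set(factor_a))
    levels_b = sorted(set(factor_b))
    rows = []
    for a, b in zip(factor_a, factor_b):
        da = [1 if a == lev else 0 for lev in levels_a[1:]]  # indicators for A
        db = [1 if b == lev else 0 for lev in levels_b[1:]]  # indicators for B
        inter = [x * y for x in da for y in db]              # A x B interaction
        rows.append([1] + da + db + inter)
    return rows


# Hypothetical codes: time category (1-3) and firewood source (1-2) for six households.
time_cat = [1, 1, 2, 2, 3, 3]
source = [1, 2, 1, 2, 1, 2]
X = two_way_design_matrix(time_cat, source)
```

Each row of X then has q = 6 entries here: one intercept, two time indicators, one source indicator and two interaction terms.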

3 Exploratory data analysis

The primary data for this study come from the survey of 700 households, which generated more than 350 multivariate categorical data items. The questionnaire offered multiple-choice answers, producing categorical data that can be classified on an ordinal scale. The categorical variables later used in factor analysis are listed in Table 2. These variables measure the amount of energy consumed. For example, as seen from Table 2, the variables time spent in the collection of firewood and amount of firewood consumed reflect the quantity of energy consumed. Owing to the large sample size, these ordinal categorical data are treated as continuous by appeal to the central limit theorem, so that the results of the statistical analysis can be interpreted physically. Classifying the data in this way reduces ambiguity in the responses of the energy users. Data on type of house and type of toilet, which are proxy asset indicators, are used to study the contribution of socioeconomic status to energy consumption. The impact of family size and family composition, through the indicators adult women making up more than 25% of the family and number of residents older than 15 years, is also analyzed.

Table 2 Details of the variables for 700 households used in factor analysis

The mean values of these variables, along with the results of factor analysis, are in Table 3. Owing to the large sample size and by appeal to the central limit theorem, these values can be interpreted physically. For example, the mean time spent in the collection of firewood is 3.22 for all 700 households, implying that a household spends on average around 45 min per day collecting firewood. This time is directly proportional to the human energy spent in fuel wood collection. The distance travelled per day for the collection of firewood, the time spent per day in this process and the source of the firewood are indicators of the energy a household spends to obtain energy from a non-renewable source. The human energy invested in the collection of firewood could instead be diverted to income-generating activities. The benefits of biogas to women and the use of the time saved by biogas in specific income-generating activities are shown in Figs. 2 and 3. As seen from Fig. 2, women in 396 of the 400 households say that the switch to biogas saves their time, and 349 of these became involved in income-generating activities. In 26 households, the spare time is used for involvement in children’s education. As shown in Fig. 3, 331 households are engaged in farming and 243 in rearing livestock. These choices are not mutually exclusive, as many families chose farming as their first choice and livestock rearing as their second. Similarly, many women chose time saved as their first choice and clean atmosphere as their second. Daily wages of farm workers in the normal season amount to USD 8 for work from 11:00 A.M. to 4:00 P.M., that is, USD 8 for 5 h or roughly USD 1.6 per hour. The distance covered for collection is longer and the sources of firewood are farther away for households without biogas. Further, the percentage of adult women and adult members in a rural family plays an important role in decisions on the choice of a renewable energy source. The variables describing the social and economic landscape of biogas users are in Table 4. These variables also form part of the biogas consumer profile database.

Table 3 Result of factor analysis of 700 households
Fig. 2 Benefits of biogas to women

Fig. 3 Income-generating activity

Table 4 Overview of biogas consumer profile database

4 Models and estimation

4.1 Factor analysis of energy consumption profile of normal users and biogas users

Source of firewood, time spent in the collection of firewood, type of house, location and type of latrine used, number of residents older than 15 years and adult women making up more than 25% of the total family size are used to assess and cross-validate the energy pattern of rural families. The last two variables help validate that families with a large share of adult members in general, and of women in particular, spend more time and cover longer distances in the collection of firewood. Further, such households are less likely to switch to alternative sources of energy. These variables, with their values on the ordinal scale, are in Table 2. They are analyzed using factor analysis, conducted for the 700 households of energy users in general and for the 400 households of biogas users. Exploratory factor analysis is done using the principal component analysis (PCA) method with varimax rotation. The measures of sampling adequacy are above 0.60; since KMO > 0.5, the sample is adequate for factor analysis. The determinant of the correlation matrix is 0.02, which is greater than 0.00001 and indicates that there is no multicollinearity. Similarly, the Bartlett test of sphericity rejects the null hypothesis that the correlation matrix is an identity matrix at a very high level of significance. The factor loadings for normal users and biogas users are given in Table 3.
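
The Bartlett test of sphericity mentioned above depends only on the determinant of the correlation matrix, the sample size n and the number of variables p. A minimal sketch of the usual statistic is given below. The function name is hypothetical, and the call uses the determinant (0.02) and sample size (700) quoted in the text together with p = 7, which is assumed from the "seven variables" mentioned in Sect. 4.3; the resulting number is an illustration and not a value reported by the authors.

```python
import math


def bartlett_sphericity(det_r, n, p):
    """Return Bartlett's chi-square statistic and its degrees of freedom.

    det_r: determinant of the p x p correlation matrix
    n: number of observations; p: number of variables
    """
    if not 0.0 < det_r <= 1.0:
        raise ValueError("determinant of a correlation matrix must lie in (0, 1]")
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * math.log(det_r)
    df = p * (p - 1) // 2
    return chi2, df


# Inputs from the text (det = 0.02, n = 700); p = 7 is an assumption.
chi2, df = bartlett_sphericity(det_r=0.02, n=700, p=7)

# Rule of thumb quoted in the text: determinant above 0.00001.
no_multicollinearity = 0.02 > 0.00001
```

A determinant close to 1 drives the statistic towards 0 (no evidence against an identity correlation matrix), whereas a small determinant such as 0.02 with n = 700 gives a very large statistic relative to its degrees of freedom.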

4.2 Multinomial regression of benefit of biogas users

A consumer profile database of biogas consumers was constructed, storing information on 476 variables. An overview of this database is in Table 4. The impact of the benefits of renewable energy in general, and of biogas in particular, is highlighted and quantified. Multinomial logistic regression is fitted to the variables related to the benefits of biogas. The descriptive statistics of these variables and their categorical values are shown in Table 5. The mean, mode and median can be interpreted physically by appeal to the central limit theorem. For example, Variable A, the distance travelled for the collection of firewood before the plant, takes the values 1, 2, 3, 4 and 5 on the categorical scale. Its mean value is 4.08, implying that a household on average covered 200–500 m for the collection of firewood. The variable saves time is regressed, using multinomial logistic regression, on distance travelled before, distance travelled after, female benefitting from reduced fuel expenses, male benefitting from reduced fuel expenses, male benefitting from reduced pollution, female benefitting from reduced pollution and amount of firewood saved per month. The results of this regression are in Tables 6 and 7. As seen from these tables, the results are highly significant.

Table 5 Descriptive statistics
Table 6 Model fitting information
Table 7 Likelihood ratio tests

4.3 Multivariate analysis of variance (MANOVA)

Multivariate analysis of variance is conducted with the three extracted factor scores as dependent variables and with time taken for the collection of firewood and source of firewood, both after the installation of the plant, as independent variables. After the installation of the biogas plant, the source of firewood and the interaction between the time taken for collection and the source of firewood differ significantly (p values of 0.008 and 0.001, respectively) with respect to the three factors identified by factor analysis, namely status, infrastructure and time for fuel. These three factors, extracted from seven variables, capture the role of social and economic status in the energy consumption pattern of a rural household.

5 Results and discussion

Eight observed variables related to various aspects of the energy consumption of rural households are reduced to three unobserved factors using factor analysis. Thus, the hidden structure in the data of 700 households (300 households of national grid energy users and 400 households of biogas users) is revealed by three factors, which explain 78% of the total variability. As seen from the factor analysis results in Table 3, the three factors are infrastructure (basic), status (type of house) and time (spent on collection of fuel), and they explain 42%, 14% and 21% of the total variance, respectively. Exploratory factor analysis with varimax rotation is used here as it is not based on the assumption of a normal distribution. By appeal to the central limit theorem, the sample sizes of 700 rural households (national grid and biogas users) and 400 biogas users are taken to be large enough to treat these data as continuous. Similarly, the 400 households of biogas consumers yield three factors, namely resources (time and money), infrastructure (basic) and size of the family. These factors explain 66% of the total variability of the data. The factor loadings are also given in Table 3. The biogas consumers were asked questions on their energy consumption pattern before and after they installed the biogas plant; the factor analysis is based on the "before" questions.

The variables used in logistic regression are given in Table 5. We denote distance travelled for the collection of firewood before the construction of the plant by A, distance travelled after by B, saves time by C, reduced fuel expenses (female) by D, reduced fuel expenses (male) by E, reduced pollution (male) by F, reduced pollution (female) by G and firewood saving by H. Although these are ordinal and nominal data, measures of central tendency such as the mean, mode and median give an idea of the average and most popular responses, and the standard deviation gives an idea of the consistency of the responses. By appeal to the central limit theorem, these data are treated as continuous. Multinomial logistic regression is fitted to these variables: saves time (C) is regressed on distance travelled before, distance travelled after, female benefitting from reduced fuel expenses, male benefitting from reduced fuel expenses, male benefitting from reduced pollution, female benefitting from reduced pollution and amount of firewood saved per month.

The likelihood ratio tests for the regression of saves time (the dependent variable) on the ordinal variables distance travelled after, distance travelled before and firewood saved show highly significant results. Similarly, reduced fuel expenses reported by men and reduced pollution reported by women also show significant results. Through multinomial logistic models, the benefits have been quantified. The results can be summarized as follows.

The results of fitting the multinomial regression model are given in Tables 6 and 7. The regression is highly significant, with a Chi-square value of 56. The likelihood ratio tests in Table 7 also show significant results at the 5% level of significance for most of the independent variables. The accuracy of the model, measured as the agreement between observed and predicted categories, is 62.8%. This is quite high for data based on human responses, since the results are not obtained under closely controlled laboratory conditions. Although the known sources of error were traced and eliminated as far as possible in this survey, studies based on human responses are likely to be influenced by psychological factors. Hence, an accuracy of 62.8% is a good result.
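
The two summary measures used above can be sketched in a few lines, assuming the usual definitions: the model chi-square is the drop in −2 log-likelihood from the intercept-only model to the fitted model, and the accuracy is the share of households whose predicted category matches the observed one. The function names and all input values are hypothetical. The two −2 log-likelihood values are chosen only so that their difference equals the reported chi-square of 56 and are not taken from Table 6.

```python
def likelihood_ratio_chi2(neg2ll_null, neg2ll_full):
    """Chi-square statistic: drop in -2 log-likelihood from the null to the fitted model."""
    return neg2ll_null - neg2ll_full


def classification_accuracy(observed, predicted):
    """Share of cases whose predicted category equals the observed category."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for o, p in zip(observed, predicted) if o == p)
    return hits / len(observed)


# Hypothetical values for illustration only (not the paper's data).
chi2 = likelihood_ratio_chi2(neg2ll_null=400.0, neg2ll_full=344.0)
acc = classification_accuracy([1, 2, 2, 3, 1], [1, 2, 3, 3, 2])
```

In this toy example three of the five predicted categories match the observed ones, so the accuracy is 0.6; the paper's figure of 62.8% would come from the same kind of cross-tabulation of observed against predicted categories.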

Table 8 shows that the improvement in the lives of biogas users outweighs the direct advantage of merely cooking on gas stoves. This improvement is quantified here in terms of odds ratios. For those who travelled up to 100 m for the collection of firewood before the construction of the plant, the odds of reporting that biogas saved their time increased by factors of 19,730,000, 23,230,000 and 21,670,000 for up to 60 min, 1–3 h and 3–5 h, respectively. Similarly, for those who covered 100–200 m, the odds of responding that it saved their time increased by factors of 1.945 and 1.582 for up to 60 min and 1–3 h, respectively. These comparisons are with households that had to cover more than 500 m for the collection of firewood before the construction of the plant. Similarly, those who do not have to travel at all for the collection of firewood after the construction of the plant are 3.478 and 4.706 times more likely to respond that biogas saved 3–5 h and more than 5 h, respectively, than those who still cover more than 500 m. These are all Exp(B) values, also called odds ratios. The odds ratios in this table differ from 1, showing a significant impact of the use of biogas. The number of households covering more than 500 m fell from 229 before the construction of the plant to 88 after it, a considerable reduction in the distance travelled. Less distance travelled implies more time saved, which in turn implies more money saved, as the time saved is used in income-generating activities by these households. So a switch to renewable energy is not only an adaptation strategy to climate change but can also raise the socioeconomic status of areas hit by scarcity of fossil fuel. Similarly, the odds that women who credit biogas with reducing the effects of pollution also support its time-saving benefit are very highly significant. This feature is reflected more prominently in women, as they suffer most from the pollution of traditional cooking stoves.

Table 8 Quantification of impact of biogas in terms of time saved-odds ratio

Multivariate analysis of variance is done with the factor scores obtained from the three factors. The aim is to study the effect of the time spent in the collection of firewood and of the source of firewood on these factor scores for the biogas users. The source of firewood (SourceAfter) and the interaction between time spent in the collection of firewood and source of firewood (AFTIMES * SourceAfter) are significant, with p values of 0.008 and 0.001, respectively, for Wilks' Lambda. Thus the source of firewood and its interaction with the time taken for collection differ significantly with respect to the three factors.
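
Wilks' Lambda compares the within-group scatter of the response vectors with their total scatter. The sketch below computes it for a simplified one-way layout as det(W)/det(T), where W and T are the within-group and total sums of squares and cross-products matrices. This is a simplification of the two-way analysis with interaction used in the paper, and the function names and data are hypothetical.

```python
def _det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[piv][c]) < 1e-12:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            det = -det  # a row swap flips the sign
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det


def _mean(rows):
    """Column means of a list of equal-length vectors."""
    return [sum(r[i] for r in rows) / len(rows) for i in range(len(rows[0]))]


def _sscp(rows, center):
    """Sums of squares and cross-products of rows about a center vector."""
    p = len(center)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [row[i] - center[i] for i in range(p)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s


def wilks_lambda(groups):
    """Wilks' Lambda = det(W) / det(T) for a one-way layout.

    groups: list of groups, each a list of response vectors (e.g. factor scores).
    """
    all_rows = [r for g in groups for r in g]
    p = len(all_rows[0])
    total = _sscp(all_rows, _mean(all_rows))
    within = [[0.0] * p for _ in range(p)]
    for g in groups:
        s = _sscp(g, _mean(g))
        for i in range(p):
            for j in range(p):
                within[i][j] += s[i][j]
    return _det(within) / _det(total)


# Hypothetical two-dimensional scores for two well-separated groups.
g1 = [[0, 0], [1, 0], [0, 1], [1, 1]]
g2 = [[10, 10], [11, 10], [10, 11], [11, 11]]
lam = wilks_lambda([g1, g2])
```

Values of Lambda near 0 indicate that the group mean vectors differ strongly, while a value of 1 means that the group means coincide; the sketch does not compute the associated p value.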

6 Conclusion

Three major multivariate analysis techniques, namely factor analysis, multivariate analysis of variance and multinomial logistic regression, are applied to the data collected from 400 households of biogas users and 300 households of national grid electricity users. The results quantify social and economic benefits that outweigh the directly attributable advantage of the energy, which is its use as a cooking gas. Categorical data generated from the primary data collected in the sample surveys are shown to be suitable for the study of intangible factors and of the impact and merit of energy use among rural households.

The factor analysis shows that three hidden factors, namely status (type of house), infrastructure (basic) and time for fuel, govern the energy consumption dynamics. These factors explain 78% of the variability in the energy consumption pattern of rural households. Multivariate analysis of variance tests the significance of differences in the factor scores computed from these three factors with respect to the independent variables time for the collection of firewood and source of firewood for the biogas users. Multinomial regression is used to predict and quantify the benefit of biogas in terms of time saved. Time saved is money saved, and this time has been used in income-generating activities. The large odds ratios obtained from the multinomial regression quantify the great benefits obtained from biogas plants. These benefits are greater for women than for men.

A similar survey with exact quantification is planned as an extension of this work. In future, the data collection team will carry the instruments required for exact measurement along with the questionnaire. This will be one of several strategies employed to improve the accuracy of the collected continuous ratio data.