Parametric and Nonparametric estimation of inequality measurements : the case of Ecuador

Ramiro Mejia

Student number: 11805412

Date of final version: August 15, 2018

Master’s programme: Econometrics

Specialisation: Free track

Supervisor: dr E.(Eleni) Aristodemou

Second reader: dhr. dr. K.J. (Kees Jan) van Garderen

Faculty of Economics and Business

Amsterdam School of Economics

Statement of Originality

This document is written by Student Ramiro Mejia, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

1 Introduction

2 Literature Review

2.1 Inequality Measures

2.1.1 Lorenz Curve

2.1.2 Gini Coefficient

2.2 Estimation of the Income distribution

2.2.1 Parametric Approaches of estimation of income distribution

2.2.2 Non-parametric Approaches of estimation of income distribution

3 Model

3.1 Parametric estimation

3.2 Non-Parametric estimation

4 Data

4.1 Data characteristics

4.2 Distribution of Income of Ecuadorian households

5 Results

5.1 Non-parametric results

5.2 Parametric results

6 Conclusion

A Program

Bibliography

Chapter 1

Introduction

One of the main socioeconomic problems that Latin American countries face today is income inequality. There are significant gaps among the people living in the region, especially in income, access to basic services, education, and even access to financial markets. It is often claimed that the region is one of the most unequal in the world. The income distribution affects the allocation of human and physical capital and, through it, the growth rate of the economy. The performance of an economy ought to be evaluated not only through economic indicators such as growth, production, and employment but also in terms of the reduction of poverty and of unfair socioeconomic disparities (Gasparini and Lustig, 2010).

Figure 1.1 shows the Gini coefficient, which measures income inequality, for various countries around the world and compares it across regions for the year 2011. The figure shows that Latin America is the most unequal region, surpassing even Africa. Inequality is pervasive and has economic and social costs. It reduces access to opportunities for the lower income groups and can diminish the effect of economic growth on poverty reduction. Many Latin American states followed a model of clientelism¹ associated with high degrees of inequality. Wealthy individuals and corporations have a strong influence over governments, while individuals in the low-income group are in general excluded and have no power over political decisions. This model has affected the way public policy is made, and it is especially disadvantageous for the most impoverished individuals (De Ferranti et al., 2004).

Inequality can be the source of a feeling of unfairness among the population. High inequality can therefore help to explain many of the socioeconomic problems, political tensions, and unstable periods in Latin America, especially during the twentieth century. Consequently, achieving accurate measures of income inequality and reducing it effectively must be a priority for the economic and social development of the countries in the region.

¹ Stokes (2009) defined clientelism as the exchange of material goods for electoral support.

Figure 1.1: Gini coefficient by regions 2011

Source: Gasparini and Lustig (2011)

Income disparities have been persistent in Latin America for centuries. High inequality is embedded in institutions that were established in the colonial era and have survived political and socioeconomic changes until the present day. The origin of contemporary inequality in Latin American institutions can be found in the region's Spanish colonial past, in particular in the relationships between the colonists and the colonized population. In Latin America, European settlements were concentrated in highly populated areas rich in natural resources. To exploit the labor of the indigenous population and of African slaves, the colonists developed institutions to control land, politics, and labor. One example of a Spanish institution designed to exploit the labor force was the encomienda, under which the indigenous people had to give the colonist tribute and labor services in exchange for being converted to Christianity. The Spanish settlers also adapted an Inca institution known as the mita, under which men paid tribute to the Inca government in the form of forced labor. These institutions generated wealth for the Spanish empire and made the settlers' descendants rich, but they also turned Latin America into one of the most unequal regions (Acemoglu and Robinson, 2012).

After independence from Spain, the domestic elites of the newly formed countries continued to mold institutions and policies to preserve their status (De Ferranti et al., 2004). As Engerman and Sokoloff (2002) claimed, the institutions formed during the colonial and post-colonial eras were compatible with rent-seeking but incompatible with economic growth. In contrast, North America experienced a different colonization process. The colonizers faced an alternative scenario in which there were no opportunities to exploit resources such as gold quickly. They also found only a small indigenous population in the area, which could not be forced to work. Colonizers were given incentives through institutions to own the land and work it themselves (Acemoglu and Robinson, 2012), so the evolution of inequality was different.

In contrast, some authors such as Williamson (2009) claimed that the levels of inequality in Latin America rose during the period of development of the region in the nineteenth century. He suggests that inequality increased with the conquest but was similar to that of other regions of the world at similar stages of development during that time. This author also claimed that inequality remained stable during the sixteenth century, whereas the revolutions and the economic recession increased the levels of inequality at the end of the nineteenth century.

Modern inequality analysis (based on indices) relies on micro-data from national household surveys, which have been available in Latin American countries since the 1970s. Before this period information was scarce and subject to many methodological problems, which makes the analysis of inequality for periods before the 1970s difficult (Gasparini and Lustig, 2010).

In the 1970s inequality went down or remained constant in most Latin American countries, owing to the economic growth that the region experienced; one example is the oil-exporting countries, which went through an increase in income and consumption (Ocampo et al., 2014). The next decade, the 1980s, is known as the lost decade of Latin American development, also in distributional terms. Income inequality rose between the 1980s and 1990s due to the political and economic situation, marked by high inflation, external shocks such as changes in commodity prices, higher costs of external financing, macroeconomic disequilibrium, and adjustment programs that hurt individuals in the lower income groups the most. In most countries of Latin America income inequality began to decline in the late 1990s and the early 2000s (Gasparini and Lustig, 2010); however, inequality has remained largely constant despite the economic expansion in the region. Inequality is persistent even though welfare has increased and poverty has fallen in the region (Sarabia et al., 2014). Table 1.1 shows the persistence of inequality through the Gini coefficient in several countries of the region during the last 40 years. Although the countries of the region have experienced economic growth, inequality does not seem to vary considerably.

Table 1.1: Average Gini Coefficient per decade, Latin American Countries

Country 1980 1990 2000 2010
Argentina 42.96 47.9 48.62 41.88
Bolivia 52.76 55.88 46.56
Brazil 59.14 58.96 56.12 52.22
Chile 56.2 55.76 50.38 47.54
Colombia 55.7 55.44 52.7
Costa Rica 42.86 46.16 49.4 48.64
Dominican Republic 49.16 49.24 50.3 45.86
Ecuador 50.5 53.9 52.56 46.22
El Salvador 52.32 48.52 41.88
Guatemala 58.96 54.4 48.3
Haiti 41.1
Honduras 59.5 54.62 56.3 52.58
Mexico 48.9 49.2 48.32 44.98
Nicaragua 55.9 48.54 46.2
Panama 58.9 57.64 54.74 51.18
Paraguay 52.12 52.92 49.28
Peru 55.36 50.54 44.3
Puerto Rico
Uruguay 43 45.76 41.02
Venezuela 51.44 47.1 49.72

Source: World Bank (2018)

Figure 1.2: Evolution Gini Coefficient Ecuador 1998-2018

Source: World Bank (2018)

Like the rest of the region, Ecuador went through a period of economic instability during the 1980s and 1990s. During the 1990s, governments attempted to reduce inequality mainly through transfer policies. However, as Ponce and Vos (2012) argued, inequality seems to be more closely associated with movements in macroeconomic conditions (such as exports or inflation), and the most vulnerable groups are those at the low end of the income distribution.

The Ecuadorian economy during the 1990s was characterized by the liberalization of the financial and trade sectors. The high inflation of the 1980s was brought under control with macroeconomic stabilization policies. During the first half of the decade, there was a slight increase in household income inequality, from 0.45 to 0.47 between 1990 and 1996.

Macroeconomic policies also affected the Ecuadorian job market. In general, the adjustment policies failed to produce strong employment growth. The new jobs created in the modern and formal sector of the economy benefited mainly skilled workers, while the majority of the labor supply was absorbed by the informal sector and by agriculture. This widened the wage gap between formal and informal sector workers as well as between skilled and unskilled workers.

A series of adverse shocks, including floods caused by the El Niño phenomenon and falling prices of oil (the main export of Ecuador), led the country to a banking and economic crisis in 1999, which finally drove the government to dollarize the economy. This economic situation, together with high inflation, decreasing real wages (which affected the lower income group more), and exchange rate depreciation, caused more unemployment and pushed the Gini coefficient to 0.59 in 2001. Despite this discouraging economic scenario, the country experienced an economic recovery in the following years. Regarding inequality, the economy is still characterized by high informality in the labor market. It is important to mention that cash transfer programs also helped reduce poverty and inequality. The Gini coefficient for nation-wide inequality in per capita household incomes dropped from 0.60 to 0.51 in 2010 (Ponce and Vos, 2012) and further to 0.462 in 2017 (INEC, 2017b). This behavior is shown in Figure 1.2. For the years between 2001 and 2003, there is no information on the Gini coefficient. However, the figure shows that after the peak reached following the economic crisis, the Gini coefficient began to fall.

To reduce inequality, the first step is to measure it accurately. The most popular measures of inequality are the Gini coefficient, which measures the degree of inequality in a data sample, and the Lorenz Curve, which represents the distribution of income. Several functional forms have been considered to model the distribution of income. The most common are the Pareto and the Lognormal distributions (Sarabia, 2008). However, contributions such as Singh and Maddala (1976) and McDonald (1984) have tried to model the distribution of income while taking into account the theoretical limitations of these distributions, such as their treatment of heavy tails. A parametric approach to estimating the income distribution has the advantage that the estimates can be more accurate, since convergence is faster. However, the researcher may overlook several characteristics of the income distribution, such as tails and extreme values, that may be relevant for the estimates (Lubrano, 2017).

Given this background, the primary purpose of the dissertation is to characterize the income distribution of Ecuadorian households, and in particular to determine which parametric functional form best fits the distribution of household income. To achieve this, several parametric functional forms will be fitted to the observed data. To assess the goodness of fit, a graphical and statistical analysis will be performed to determine which distribution best fits the observed data. The Gini coefficient will also be estimated for each of the proposed distributions. In addition, a non-parametric estimate of the income distribution will be made and compared with the parametric results.

The remainder of this thesis is organized as follows. Chapter 2 contains a literature review of the parametric and nonparametric techniques for estimating the income distribution. Chapter 3 describes the parametric and nonparametric models used to fit the observed income data. Chapter 4 describes the survey data used in the estimation and depicts the distributional disparities among the Ecuadorian population. Finally, Chapters 5 and 6 present the results and the conclusions of the research.

Chapter 2

Literature Review

2.1 Inequality Measures

Due to the characteristics of the income distribution in Latin America, the degree of income disparity cannot be measured adequately with a statistic such as the population mean. Such an analysis would ignore the shape of the tails of the observed income distribution and would not take into account unusually high or low incomes (Gasparini and Lustig, 2010). One widely used measure is the Gini Coefficient, introduced by Corrado Gini in 1912, which is a measure of economic inequality directly related to the Lorenz Curve.

2.1.1 Lorenz Curve

The Lorenz Curve (LC) is a crucial tool for measuring the distribution of income. Finding an adequate functional form for the curve is of practical and theoretical importance (Sarabia et al., 2010). Several ways of specifying the Lorenz Curve have been developed and studied. Generally, it is constructed from a theoretical distribution function, and it is assumed that income has a density function. The Lorenz Curve plots the fraction of total income earned by a given share of the population when individuals are ranked by the size of their incomes. Gastwirth (1971) defined the Lorenz Curve based on the inverse of the theoretical distribution function. The LC can be used to represent both continuous and discrete variables.

Following Gastwirth (1971), let x be a random variable with Cumulative Distribution Function (CDF)² denoted by $F(x)$. Here x represents the income of a specific member of a population, and $F(x)$ is the proportion of the population that receives an income less than or equal to x. $F^{-1}(t)$ represents the inverse of the distribution $F(x)$ and is defined in Equation 2.5.

$$F^{-1}(t) = \inf\{x : F(x) \ge t\} \tag{2.5}$$

According to Gastwirth (1971), the definition of the inverse of the CDF in Equation 2.5 covers both discrete and continuous distributions. The Lorenz Curve for a random variable x with CDF $F(x)$ and finite mean $\mu = \int_0^1 F^{-1}(t)\,dt$ can be defined as:

$$L_F(p) = \mu^{-1}\int_0^p F^{-1}(t)\,dt, \qquad 0 \le p \le 1 \tag{2.6}$$

$L_F(p)$ is the fraction of the total income of the population that corresponds to the population up to the percentile p of income. Another approach to defining the Lorenz Curve uses the following two equations. First, the percentile p is determined:

$$p = F(x) = \int_0^x f(t)\,dt \tag{2.7}$$

where x is an income level. Then, the Lorenz Curve can be expressed as

$$L_F(p) = \frac{1}{\mu}\int_0^x t\,f(t)\,dt \tag{2.8}$$

² The Probability Density Function (pdf) of a continuous random variable x with support S can be defined as an integrable function $f(x)$ that is non-negative for all x within the support. The area under the density function over the support is equal to 1, that is,

$$\int_S f(x)\,dx = 1 \tag{2.1}$$

If the pdf satisfies Equation 2.1, then the probability that the variable x belongs to an interval A is given by

$$P(x \in A) = \int_A f(x)\,dx \tag{2.2}$$

The Cumulative Distribution Function (CDF) is directly related to the pdf. It can be defined as

$$F(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{for } -\infty < x < \infty \tag{2.3}$$

where $F(x)$ is a non-decreasing continuous function for a continuous random variable (Penn State University, 2018).

An Empirical Distribution Function (EDF) is defined as

$$F_N(x) = \frac{1}{N}\sum_{i=1}^{N} 1_{x_i \le x} \tag{2.4}$$

where $1_{x_i \le x}$ is an indicator function that is equal to one if $x_i \le x$ and zero otherwise.

The value of the EDF at x is obtained by counting the number of observations in the sample that are less than or equal to x, and then dividing this number by the total number of observations N in the sample (Taboga, 2010).


Figure 2.1: Lorenz Curve

Source: Author
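As a minimal sketch of the definition in Equation 2.6 applied to a sample, the empirical Lorenz Curve can be obtained by sorting incomes and accumulating income shares. The function name `lorenz_curve` and the sample values are illustrative assumptions and are not taken from the thesis program.

```python
def lorenz_curve(incomes):
    """Return the points (p_k, L(p_k)), k = 0..N, of the empirical Lorenz Curve.

    Assumption: incomes are non-negative and their total is positive.
    """
    xs = sorted(incomes)            # rank individuals by income
    n = len(xs)
    total = float(sum(xs))
    if n == 0 or total <= 0:
        raise ValueError("need a non-empty sample with positive total income")
    points = [(0.0, 0.0)]
    cumulative = 0.0
    for k, x in enumerate(xs, start=1):
        cumulative += x
        # poorest k/n of the population hold cumulative/total of the income
        points.append((k / n, cumulative / total))
    return points


sample = [10, 20, 30, 40]           # hypothetical incomes
curve = lorenz_curve(sample)
for p, share in curve:
    print(f"p = {p:.2f}  L(p) = {share:.2f}")
```

Every point lies on or below the 45-degree line, and the curve always ends at (1, 1).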

2.1.2 Gini Coefficient

The Gini Coefficient can be approached in two different ways. First, it can be calculated from the Lorenz Curve of the income distribution. An example of the curve is depicted in Figure 2.1.

As mentioned before, the Lorenz Curve plots on the vertical axis the cumulative percentage of income corresponding to the poorest p% of the population. The farther the Lorenz Curve lies from the line of perfect equality, the more unequal the distribution is (Gasparini and Lustig, 2010).

From this approach, the estimated value of the Gini coefficient is equal to twice the area between the Lorenz Curve and the 45-degree line that represents an egalitarian distribution of the income (Catalano et al., 2009).

So far, it has been assumed that x is a random variable representing income, and the Lorenz Curve was defined in Equation 2.6. The Egalitarian Lorenz Curve can be specified as

$$L_E(p) = p, \qquad 0 \le p \le 1 \tag{2.9}$$

Using the Egalitarian LC and the LC, the Gini coefficient can be defined as

$$G_F = 2\int_0^1 \left[L_E(p) - L_F(p)\right]dp \tag{2.10}$$

Since $\int_0^1 L_E(p)\,dp = 1/2$, Equation 2.10 can also be expressed as

$$G = 1 - 2\int_0^1 L_F(p)\,dp \tag{2.11}$$
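The relation between the area under the Lorenz Curve and the Gini coefficient in Equation 2.11 can be illustrated with a short sketch. The empirical Lorenz Curve is piecewise linear, so the trapezoid rule integrates it exactly. The helper names and the sample incomes below are assumptions made for the illustration.

```python
def lorenz_points(incomes):
    """Points (p_k, L(p_k)) of the empirical Lorenz Curve of a sample."""
    xs = sorted(incomes)
    total = float(sum(xs))
    cumulative, points = 0.0, [(0.0, 0.0)]
    for k, x in enumerate(xs, start=1):
        cumulative += x
        points.append((k / len(xs), cumulative / total))
    return points


def gini_from_lorenz(points):
    """G = 1 - 2 * (area under the Lorenz Curve), by the trapezoid rule."""
    area = 0.0
    for (p0, l0), (p1, l1) in zip(points, points[1:]):
        area += (p1 - p0) * (l0 + l1) / 2.0
    return 1.0 - 2.0 * area


equal = gini_from_lorenz(lorenz_points([5, 5, 5, 5]))        # egalitarian case
unequal = gini_from_lorenz(lorenz_points([10, 20, 30, 40]))  # hypothetical sample
print(equal, unequal)
```

When all incomes are equal the Lorenz Curve coincides with the egalitarian line and the coefficient is zero.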

The second approach to calculating the Gini coefficient is built on the variability of a statistical distribution or a probability distribution (Dorfman, 1979). Gini (1912) based the coefficient on the average of the absolute difference between a pair of observations. That definition leads to

$$G = 1 - \frac{1}{\mu}\int_0^{x^*} \left(1 - F(x)\right)^2 dx \tag{2.12}$$

where $F(x)$ is the CDF of income, $\mu$ is the finite mean, and $x^*$ is the upper limit of the distribution.
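As a sketch of how Equation 2.12 connects to the mean absolute difference, both can be evaluated on a sample. Here the CDF is replaced by the empirical distribution function, which is a step function, so the integral reduces to a finite sum. The function names and data are illustrative assumptions, and incomes are assumed to be non-negative.

```python
def gini_from_cdf(incomes):
    """G = 1 - (1/mu) * integral_0^{x*} (1 - F(x))^2 dx, with F the EDF."""
    xs = sorted(incomes)
    n = len(xs)
    mu = sum(xs) / n
    integral, previous = 0.0, 0.0
    for k, x in enumerate(xs):
        # on [previous, x) exactly k observations lie below, so F = k / n
        integral += (1.0 - k / n) ** 2 * (x - previous)
        previous = x
    return 1.0 - integral / mu


def gini_mean_difference(incomes):
    """Half the average absolute difference between pairs, relative to the mean."""
    n = len(incomes)
    mu = sum(incomes) / n
    total = sum(abs(a - b) for a in incomes for b in incomes)
    return total / (2.0 * n * n * mu)


sample = [10, 20, 30, 40]           # hypothetical incomes
g_cdf = gini_from_cdf(sample)
g_diff = gini_mean_difference(sample)
print(g_cdf, g_diff)
```

The two functions agree on this sample, which is what the equivalence of the two approaches to the Gini coefficient implies.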

Frequently, the Lorenz Curve is estimated by a parametric approach. Numerous density functions have been considered to model the income distribution; the Pareto and the Lognormal distributions are the two most commonly used (Sarabia, 2008). Other income distributions that have been used are summarized in Table 2.1, together with the resulting Gini index.

The Weibull³ distribution is used as well for modeling and fitting different data sets of the income distribution (Mirzaei et al., 2018). One common feature of all these distributions is that they are unimodal (Lubrano, 2017).

The advantage of using a parametric approach to estimate the Lorenz Curve is that its theoretical properties are usually satisfied and asymptotic convergence rates are faster than root-N (Zhang et al., 2016).

Lubrano (2017) stated that a parametric approach could lead to better precision of the estimates. However, the researcher might miss several features of the income distribution, because a functional form is imposed on the observed data. Estimating the Lorenz Curve based on a misspecified distribution may lead to inconsistent results if the real income data has outliers or is skewed (Luo, 2013). This problem might arise in Latin American survey data due to the high degree of inequality and measurement errors.

³ A random variable X follows a standard Weibull distribution if its cumulative distribution function is

$$G(x) = 1 - e^{-(\beta x)^{\alpha}}, \qquad x > 0 \tag{2.13}$$

with quantile function

$$Q_G(u) = \frac{1}{\beta}\left\{-\log(1-u)\right\}^{1/\alpha}, \qquad u \in (0, 1) \tag{2.14}$$

where $\alpha > 0$ and $\beta > 0$.

Table 2.1: Different parametric Distribution functions

Exponential
Lorenz Curve: $L(p) = p + (1 + \mu/\sigma)^{-1}(1-p)\log(1-p)$
Gini Index: $G = \dfrac{\sigma}{2(\mu + \sigma)}$

Classical Pareto
Lorenz Curve: $L(p) = 1 - (1-p)^{1 - 1/\alpha}$
Gini Index: $G = \dfrac{1}{2\alpha - 1}$

Singh-Maddala
Lorenz Curve: $L(p) = I_z(1 + 1/\alpha,\; q - 1/\alpha)$, where $z = 1 - (1-p)^{1/q}$
Gini Index: $G = 1 - \dfrac{\Gamma(q)\,\Gamma(2q - 1/\alpha)}{\Gamma(q - 1/\alpha)\,\Gamma(2q)}$

Dagum
Lorenz Curve: $L(p) = I_z(q + 1/\alpha,\; 1 - 1/\alpha)$, where $z = p^{1/q}$
Gini Index: $G = \dfrac{\Gamma(q)\,\Gamma(2q + 1/\alpha)}{\Gamma(2q)\,\Gamma(q + 1/\alpha)} - 1$

Lognormal
Lorenz Curve: $L(p) = \Phi(\Phi^{-1}(p) - \sigma)$
Gini Index: $G = 2\Phi\!\left(\dfrac{\sigma}{\sqrt{2}}\right) - 1$

Classical Gamma
Lorenz Curve: $(p, L(p)) = \left(\dfrac{\gamma(\alpha, x/\sigma)}{\Gamma(\alpha)},\; \dfrac{\gamma(\alpha+1, x/\sigma)}{\Gamma(\alpha+1)}\right)$
Gini Index: $G = \dfrac{\Gamma(\alpha + 1/2)}{\sqrt{\pi}\,\Gamma(\alpha + 1)}$

$\sigma$ and $\alpha$ are scale and shape parameters, p corresponds to the percentile, and q is a second shape parameter of the Singh-Maddala and Dagum distributions. $I_z(\cdot,\cdot)$ is the incomplete beta function ratio, $\gamma(\cdot,\cdot)$ the lower incomplete gamma function, and $\Phi$ the standard normal CDF.

Source: Sarabia (2008)
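Several of the closed-form Gini expressions in Table 2.1 can be evaluated with the Python standard library alone. The sketch below is illustrative: the function names are assumptions, the lognormal case uses the identity $2\Phi(\sigma/\sqrt{2}) - 1 = \operatorname{erf}(\sigma/2)$, and the gamma case uses $G = \Gamma(\alpha + 1/2)/(\sqrt{\pi}\,\Gamma(\alpha + 1))$.

```python
import math


def gini_lognormal(sigma):
    # 2*Phi(sigma/sqrt(2)) - 1 with Phi(z) = (1 + erf(z/sqrt(2)))/2
    return math.erf(sigma / 2.0)


def gini_pareto(alpha):
    # classical Pareto; a finite mean requires alpha > 1
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1")
    return 1.0 / (2.0 * alpha - 1.0)


def gini_gamma(alpha):
    # log-gamma is used to avoid overflow for large shape parameters
    log_ratio = math.lgamma(alpha + 0.5) - math.lgamma(alpha + 1.0)
    return math.exp(log_ratio) / math.sqrt(math.pi)


def gini_singh_maddala(alpha, q):
    # requires q > 1/alpha so that the mean is finite
    if q <= 1.0 / alpha:
        raise ValueError("q must exceed 1/alpha")
    log_ratio = (math.lgamma(q) + math.lgamma(2.0 * q - 1.0 / alpha)
                 - math.lgamma(q - 1.0 / alpha) - math.lgamma(2.0 * q))
    return 1.0 - math.exp(log_ratio)


print(gini_lognormal(1.0), gini_pareto(2.0), gini_gamma(1.0),
      gini_singh_maddala(1.0, 2.0))
```

Two special cases serve as sanity checks: the gamma distribution with shape 1 is the exponential distribution, whose Gini index is 1/2, and the Singh-Maddala expression with $\alpha = 1$ reduces to $q/(2q - 1)$.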

Alternatively, nonparametric approaches to estimating the Lorenz Curve give the researcher more flexibility: they can accommodate negative and zero incomes that may appear in the data, and they may capture essential characteristics of the data that are lost when a restricted functional form is used. Slottje (1990) discussed the advantages and disadvantages of parametric and non-parametric methods in constructing measurements of income inequality using grouped data. He found that parametric specifications can lead to higher information content, but that using a possibly misspecified functional form might be costly. He suggested that the indices should be estimated using parametric methods and then checked using non-parametric techniques. Along the same lines, Luo (2013) pointed out that one of the disadvantages of nonparametric approaches might be a slower asymptotic convergence rate.

2.2 Estimation of the Income distribution

2.2.1 Parametric Approaches of estimation of income distribution

The theoretical approaches to the income distribution have changed since Pareto's (1897) seminal contribution, in which he proposed a model for fitting the distribution of income based on an income threshold that is surpassed by only a certain number of individuals. Estimating an appropriate measure of inequality constitutes a critical statistical and theoretical problem; the objective must be to find an appropriate statistical distribution for the observed data (Basmann et al., 1990). The functional forms used to approximate the observed income distribution may face problems. For example, some functional forms might fit a lower segment of the observed income distribution well but fit its tails poorly. It is also stated that some functional forms lack the flexibility needed for multidimensional analysis of the income distribution.

Basmann et al. (1990) claimed that one could avoid this problem by using a non-parametric approach. However, such a method would be sufficient from a statistical point of view but insufficient for explaining economic policy. Over the years, several important works have influenced many studies of the income distribution: Atkinson (1970), Kakwani and Podder (1973), Singh and Maddala (1976), and Gastwirth (1972).

As mentioned before, two of the functions most often used to estimate the income distribution are the Pareto and the Lognormal. These functions have drawbacks: for example, the Pareto form does not fit the data well at lower income levels, and the Lognormal does not fit the data well towards the upper end of the distribution. Singh and Maddala (1976) derived a function for the distribution of incomes that generalizes the Pareto and the Weibull distributions, based on an analysis of hazard rates and failure rates.

Also, Gastwirth (1972) proposed an approach to obtain upper and lower bounds of the Gini index from data that are grouped in intervals when the mean income in each interval is known. With this method, the Gini index can be accurately estimated without fitting any curves to the data whenever the data are appropriately grouped. This method proved to be accurate. Nevertheless, some problems could appear if the sample size is large.

Schader and Schmid (1994) suggested that with income data that are not available at the individual level, such as data in grouped form (group means and group frequencies), it is not possible to calculate an exact value of the Gini coefficient of inequality. They compared several parametric specifications of the Lorenz Curve with the approach suggested by Gastwirth (1972). Gastwirth's approach is considered entirely non-parametric, since no assumption about the shape of the income distribution is made. They concluded that most of the results based on parametric approaches are erratic, whereas they found the nonparametric approach reliable and straightforward.

Basmann et al. (1990) suggested a general functional form for the Lorenz Curve based on Kakwani and Podder (1976) for grouped observations, using a least-squares method to provide consistent estimators of the parameters of the Lorenz Curve. They found that this technique could produce good results for empirical data. Other studies that have relied on the work of Kakwani and Podder (1973) are Rasche (1980), who proposed a general form for the Lorenz Curve based on the Pareto distribution, and Gupta (1984), who criticized Rasche's work and claimed that the nonlinearity in the parameters makes their estimation by the linear least squares technique impossible. Instead, he applied linear least squares estimation to a log-linear form of the Lorenz Curve.

Other examples of income distributions based on parametric functional forms include the work of Rohde (2009), who derived an implicit distribution function and a cumulative distribution function for a single-parameter Lorenz Curve; Sarabia et al. (2010) claimed that this method is a re-parameterization of the model proposed by Aggarwal (1984). Furthermore, Wang and Smyth (2013) suggested a bi-parametric form for the Lorenz Curve that obtains different curvatures of the curve using convex combination models.

Moreover, Sarabia et al. (2014) estimated the income distribution in Latin America using limited information; they applied the approaches of Singh and Maddala (1976) and Dagum (1977) to model national and regional income distributions. Also, Ortega et al. (1991) accurately estimated Gini inequality indexes using the Pareto Lorenz Curve for grouped data for Spain and several other countries. Helene (2010) proposed a new function, linear in the parameters, to estimate Lorenz Curves. She estimated the data uncertainties from the yearly variation of income for the Brazilian distribution. McDonald (1984) presented two generalized beta distributions with four parameters, namely the GB1 and GB2 distributions, and showed that they fit the data on nominal U.S. family income for the period between 1970 and 1980.

A different approach was considered by Villasenor and Arnold (1989), who studied an explicit expression for the Lorenz Curve and the density of the income distribution. They considered a subfamily of the general Lorenz Curve, the quadratic curves, and in particular the elliptical⁴ ones. This approach worked well with unimodal distributions but did not fit bimodal and multimodal distributions well. Ogwang and Rao (2000) proposed “hybrid” Lorenz Curves to avoid the possible drawbacks of traditional Lorenz Curve models, such as a poor fit over parts of the income distribution, and they showed that this type of curve performs better with respect to the data. They suggested two categories of hybrid models, the additive and the multiplicative ones. The former is obtained by a convex linear combination of the traditional models. The latter is obtained by taking weighted products.

⁴ The general quadratic form $ax^2 + bxy + cy^2 + dx + ey + f = 0$ can satisfy the properties of a Lorenz Curve. The quadratic Lorenz Curves with c = 1 that satisfy the general quadratic form have f = 0 and e = −(a + b + d + 1), because the curve passes through (0, 0) and (1, 1). Solving for y, the curve is

$$y = L_2(x) = \left\{-(bx + e) - (\alpha x^2 + \beta x + e^2)^{1/2}\right\}/2 \tag{2.17}$$

where $\alpha = b^2 - 4a$ and $\beta = 2be - 4d$.

2.2.2 Non-parametric Approaches of estimation of income distribution

The non-parametric approach to estimating an income distribution involves kernel density estimators, which do not impose strong assumptions or constraints on the distribution function. However, one disadvantage of this approach is that its accuracy depends mainly on the bandwidth selection (Boccanfuso et al., 2013).

Most parametric models rely on the Maximum Likelihood method to estimate the corresponding parameters. To estimate a parametric distribution, this method generally relies on numerical search algorithms rather than on an analytical solution for the maximum. The researcher therefore has to choose the values at which these algorithms start with care: if the algorithm starts far from the global maximum, it may converge to a local maximum or not converge at all. Other problems that may arise when estimating by maximum likelihood are a gradient close to zero, nearly identical successive parameter values, and a very high number of iterations.

A less restrictive approach to estimating the income distribution is the kernel cumulative distribution function (CDF) estimator, which estimates a distribution F at a point x as

$$\hat{F}_h(x) = \frac{1}{N}\sum_{t=1}^{N} K\!\left(\frac{x - x_t}{h}\right) \tag{2.18}$$

where K is a cumulative kernel, that is, the CDF of a density symmetric around zero, and h is the bandwidth.

The bandwidth selection is a relevant feature of the kernel estimator. A large value of h yields an over-smoothed estimator with small variance but large bias, while a small value of h yields an under-smoothed estimator with high variance and small bias (Boccanfuso et al., 2013).

Boccanfuso et al. (2013) note that data-driven selection seeks a balance between bias and variance by choosing the bandwidth that minimizes the MSE of the estimate. One approach to select an optimal bandwidth h is the least squares cross-validation of Bowman et al. (1998), which uses the information in the sample and minimizes the integrated squared error with respect to the empirical distribution function (EDF):

$$\mathrm{CVF}(h) = \frac{1}{N} \sum_{t=1}^{N} \int \left[\mathbf{1}(x_t \le x) - \hat{F}_{-t}(x)\right]^2 dx \qquad (2.19)$$

where $\hat{F}_{-t}(x)$ is defined as

$$\hat{F}_{-t}(x) = \frac{1}{N-1} \sum_{j \ne t} K\left(\frac{x - x_j}{h}\right) \qquad (2.20)$$
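Equations 2.18 to 2.20 can be sketched in Python as follows. This is a minimal illustration, not the procedure used in the thesis: the Gaussian cumulative kernel, the Riemann-sum approximation of the integral, the candidate bandwidths and the toy incomes are all assumptions made for the example.

```python
import math

def norm_cdf(u):
    """Standard normal CDF, used here as the cumulative kernel K (an assumption)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def kernel_cdf(x, data, h):
    """Equation 2.18: average of K((x - x_t) / h) over the sample."""
    return sum(norm_cdf((x - xt) / h) for xt in data) / len(data)

def loo_kernel_cdf(x, data, t, h):
    """Equation 2.20: kernel CDF computed without observation t."""
    total = sum(norm_cdf((x - xj) / h) for j, xj in enumerate(data) if j != t)
    return total / (len(data) - 1)

def cv_criterion(data, h, grid):
    """Equation 2.19, with the integral approximated by a Riemann sum on an even grid."""
    step = grid[1] - grid[0]
    total = 0.0
    for t, xt in enumerate(data):
        for x in grid:
            indicator = 1.0 if xt <= x else 0.0
            total += (indicator - loo_kernel_cdf(x, data, t, h)) ** 2 * step
    return total / len(data)

# Toy incomes (hypothetical values, not ENEMDU data)
incomes = [30.0, 55.0, 80.0, 110.0, 140.0, 170.0, 220.0, 280.0, 390.0, 860.0]
grid = [10.0 * k for k in range(0, 121)]           # 0, 10, ..., 1200
candidates = [5.0, 10.0, 20.0, 40.0, 80.0, 160.0]  # hypothetical bandwidths
best_h = min(candidates, key=lambda h: cv_criterion(incomes, h, grid))
print("bandwidth chosen by cross-validation:", best_h)
```

A grid search over a handful of candidates keeps the sketch short; a numerical optimizer over h would be the natural replacement on real data.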

Bowman et al. (1998) demonstrated that the bandwidth selected in this way converges in probability to the asymptotically optimal value. Boccanfuso et al. (2013) mentioned that the cross-validation method is more straightforward than the parametric maximum likelihood method to obtain the income distribution. The same authors also state that the correct choice of the bandwidth is of greater importance than the selection of the kernel function. In their study, Boccanfuso et al. (2013) considered income distribution estimation for poverty analysis in computable general equilibrium models. They found that Kernel estimators have a similar mean squared error (MSE).

The Gini index can be calculated with no constraints on the data, as Abdelkrim and Duclos (2007) proposed. They defined the Gini index as:

$$\hat{I} = 1 - \frac{\hat{\xi}}{\hat{\mu}}, \qquad \hat{\xi} = \sum_{i=1}^{N} \left[\frac{V_i^2 - V_{i+1}^2}{V_1^2}\right] y_i \qquad (2.21)$$

where $w_i$ is the weight for the observation i. It can be stated that $V_i = \sum_{h=i}^{N} w_h$.
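Equation 2.21 can be sketched in Python as follows. The sketch assumes that incomes are sorted in ascending order and that $V_{N+1} = 0$, since this is the ordering under which the formula returns 0 for equal incomes and a value between 0 and 1 otherwise; the function name and the toy inputs are hypothetical.

```python
def gini_weighted(incomes, weights=None):
    """Gini index following Equation 2.21 (assumes ascending incomes, V_{N+1} = 0)."""
    if weights is None:
        weights = [1.0] * len(incomes)
    pairs = sorted(zip(incomes, weights))      # sort observations by income
    y = [p[0] for p in pairs]
    w = [p[1] for p in pairs]
    n = len(y)
    # V[i] = w[i] + ... + w[n-1]; the extra entry V[n] = 0 closes the recursion
    V = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        V[i] = V[i + 1] + w[i]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / V[0]    # weighted mean income
    xi = sum((V[i] ** 2 - V[i + 1] ** 2) / V[0] ** 2 * y[i] for i in range(n))
    return 1.0 - xi / mu

print(gini_weighted([1.0, 3.0]))                 # two-person toy example
print(gini_weighted([5.0, 5.0, 5.0]))            # equal incomes
print(gini_weighted([1.0, 3.0], [2.0, 2.0]))     # same example with weights
```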

Xi Chen (2000) proposed a gamma kernel estimator for density functions on the support [0, ∞) that uses gamma probability densities as kernels, replacing the symmetric density of the standard kernel estimator. This estimator is non-negative, free of boundary bias and reaches an optimal rate of convergence for the mean integrated squared error (MISE). Scaillet (2001) presented alternative kernel estimators based on Inverse Gaussian (IG) and Reciprocal Inverse Gaussian (RIG) probability density functions. These estimators hold the same properties as the gamma kernel proposed by Xi Chen (2000). The shape of these kernels is allowed to change with the location of the data point. Bouezmarni and Scaillet (2005) claimed that their variance decreases as the location where the smoothing is made moves away from the boundary. This is an advantage because empirical distributions usually have long tails with sparse data. Bouezmarni and Scaillet (2005) also considered asymmetric kernel density estimators and proved that these estimators converge in probability to infinity at x = 0 when the density is unbounded.
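The gamma kernel idea can be sketched in a few lines of Python. The sketch assumes the estimator takes the form $\hat{f}(x) = N^{-1} \sum_i K_{x/b+1,\,b}(X_i)$, where $K_{k,b}$ is the gamma density with shape k and scale b and b is a smoothing parameter. This form, the function names, the value of b and the toy data are assumptions made for illustration and are not taken from the thesis.

```python
import math

def gamma_pdf(t, shape, scale):
    """Gamma density with the given shape and scale, evaluated at t (shape >= 1)."""
    if t < 0.0:
        return 0.0
    if t == 0.0:
        return 1.0 / scale if shape == 1.0 else 0.0
    log_pdf = ((shape - 1.0) * math.log(t) - t / scale
               - shape * math.log(scale) - math.lgamma(shape))
    return math.exp(log_pdf)

def gamma_kernel_density(x, data, b):
    """Density estimate at x >= 0; the kernel shape x / b + 1 changes with x."""
    shape = x / b + 1.0
    return sum(gamma_pdf(xi, shape, b) for xi in data) / len(data)

# Toy incomes (hypothetical values, not ENEMDU data)
incomes = [30.0, 55.0, 80.0, 110.0, 140.0, 170.0, 220.0, 280.0, 390.0, 860.0]
for x in (0.0, 50.0, 200.0, 800.0):
    print(x, gamma_kernel_density(x, incomes, b=20.0))
```

Because every kernel is a density on [0, ∞), no probability mass is assigned to negative incomes, which is the boundary property described above.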

Cowell (2000) proposed a semi-parametric approach to estimate the Lorenz Curve. The study argued that Lorenz Curves are sensitive to data contamination in the tails of the distribution. To overcome this problem, the approach combines empirical estimation with a parametric Pareto model for the upper tail of the distribution. This gains flexibility, and the resulting robust curve was shown to restore the initial ordering of the data even in the presence of extreme values.

Other approaches have been proposed to estimate inequality in a non-parametric way as well. Zhang et al. (2016) proposed a flexible nonparametric estimator of the Lorenz Curve that keeps its theoretical properties, such as convexity and monotonicity. They used the linear spline method to give a parametric representation of the Lorenz Curve. Finally, they applied this technique to calculate the Gini coefficient based on the non-parametric Lorenz Curve for a U.S. household survey. Luo (2013) proposed kernel estimators for the low-income fraction of the population and for a generalized Lorenz Curve, and used a cross-validation technique to determine the bandwidth of the kernel estimators. Luo (2013) found that these estimators are asymptotically normal and have better finite sample properties than parametric estimators.

There are also alternative measures of income inequality that have been proposed. For instance, Zenga (2007) proposed to measure inequality based on an inequality curve that compares the mean income of the poorest with the mean income of the wealthiest individuals in the population. Ostasiewicz and Mazurek (2013) compared the Gini coefficient and the Zenga index and found that the latter increases more rapidly for low values of variation. They also found that the Zenga index is sometimes higher than or equal to the Gini coefficient. Greselin and Pasquazzi (2013) compared non-parametric versus parametric estimation of the Gini coefficient and the Zenga index for a cross-regional study on Switzerland. They considered a Dagum⁵ model to compare the results of both measurements and found that the accuracy of confidence intervals improves when a parametric model fits the data.

⁵ Dagum developed an alternative to the Pareto and Lognormal distributions to summarize income and wealth data. Dagum's distribution is used to study income distributions with heavy tails (Kleiber, 2007).

Chapter 3

Model

The objective of the present dissertation is to estimate the income distribution of Ecuadorian households using parametric and non-parametric techniques. To obtain a parametric estimate of the income distribution, the Maximum Likelihood method will be used to estimate the parameters of different functional forms fitted to the observed data. For the non-parametric estimation of income, a kernel density will be estimated using two approaches for the bandwidth selection: Silverman (1986) and the Maximum Likelihood Cross-Validation (MLCV). After these estimations, the Gini index obtained under each functional form will be presented and compared.

3.1

Parametric estimation

As mentioned before, it can be advantageous to fit a functional form to the raw data to estimate the income distribution, mainly because parametric estimators converge at the root-N rate, which is faster than the rate attained by non-parametric estimators. Then, it is possible to calculate the Lorenz Curve and the inequality indexes (such as the Gini index) by indirect means. To do this, the parameters of a specific functional form should be estimated to fit the observed data. For ungrouped data, the parameters of a functional form can be derived using maximum likelihood techniques (Cowell, 2000).

In this dissertation, income data from Ecuadorian households will be fitted to different parametric distributions using maximum likelihood techniques to estimate their respective parameters. As Jenkins and Van Kerm (2015) mentioned, the likelihood function for an empirical sample is defined as the product of the densities of the observations, and it is usually maximized in logarithmic form, where the product becomes a sum.

Therefore, the following likelihood function is maximized and fitted with the observed data. In this specific case, x is the variable of interest that corresponds to the income of households, and θ is the vector that contains the parameters of each one of the functional forms.

$$L_N(\theta) = f(x, \theta) = \prod_{i=1}^{N} f(x_i, \theta) \qquad (3.1)$$

Table 3.1: Probability density functions

Distribution     pdf
Pareto           $f(x, a) = \dfrac{a\,x_0^{a}}{x^{a+1}}$
Lognormal        $f(x, \sigma, \mu) = \dfrac{1}{x\sqrt{2\pi}\,\sigma} \exp\left(-\dfrac{(\ln x - \mu)^2}{2\sigma^2}\right)$
Dagum            $f(x, a, p, b) = \dfrac{a p\,(b/x)^{a}}{x\left(1 + (b/x)^{a}\right)^{p+1}}$
Singh-Maddala    $f(x, a, b, q) = \dfrac{a q\,x^{a-1}}{b^{a}\left(1 + (x/b)^{a}\right)^{1+q}}$
GB2              $f(x, a, p, b, q) = \dfrac{a\,x^{ap-1}}{b^{ap} B(p, q)\left(1 + (x/b)^{a}\right)^{p+q}}$

f(x, θ) is the underlying probability density function (PDF) of each of the parametric distributions specified in Table 3.1. The parameters of each distribution are estimated so that the fitted distribution matches the observed data.
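For the Lognormal row of Table 3.1 the maximum of Equation 3.1 has a closed form, which makes it a convenient illustration. The Python sketch below is an assumption-laden example rather than the estimation routine of the thesis: the function names and toy incomes are hypothetical, and the other distributions in the table would require a numerical optimizer instead.

```python
import math

def lognormal_mle(x):
    """Closed-form ML estimates (mu, sigma) for the Lognormal density."""
    logs = [math.log(v) for v in x]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / n)
    return mu, sigma

def lognormal_loglik(x, mu, sigma):
    """Logarithm of Equation 3.1 evaluated with the Lognormal density."""
    return sum(
        -math.log(v * sigma * math.sqrt(2.0 * math.pi))
        - (math.log(v) - mu) ** 2 / (2.0 * sigma ** 2)
        for v in x
    )

# Toy incomes (hypothetical values, not ENEMDU data)
incomes = [30.0, 55.0, 80.0, 110.0, 140.0, 170.0, 220.0, 280.0, 390.0, 860.0]
mu_hat, sigma_hat = lognormal_mle(incomes)
print("mu:", mu_hat, "sigma:", sigma_hat)
print("log-likelihood:", lognormal_loglik(incomes, mu_hat, sigma_hat))
```

Moving either parameter away from the estimates lowers the log-likelihood, which is a quick check that the closed-form values are indeed the maximizers.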

To evaluate the fit of the parametric distributions, the sum of squared errors (SSE) and the sum of absolute errors (SAE) are used to assess the goodness of fit of each parametric distribution against the observed data. This approach is used by Boccanfuso et al. (2013) and Bandourian et al. (2002). The SSE is defined as:

$$SSE = \sum_{i=1}^{N} \left(\frac{n_i}{N} - p_i(\hat{\theta})\right)^2 \qquad (3.2)$$

where $p_i(\hat{\theta})$ is the estimated probability mass of f(x) to the left of the point $x_i$, that is, the fitted CDF evaluated at $x_i$. $n_i$ is the number of observations to the left of $x_i$, so $n_i/N$ is the fraction of the data to the left of $x_i$ (this is the EDF evaluated at the point $x_i$). Finally, $\hat{\theta}$ is the vector of estimated parameters. $n_i$ is

$$n_i = \sum_{t=1}^{N} I(x_t < x_i) \qquad (3.3)$$

The sum of absolute errors (SAE) is:

$$SAE = \sum_{i=1}^{N} \left|\frac{n_i}{N} - p_i(\hat{\theta})\right| \qquad (3.4)$$
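Equations 3.2 to 3.4 can be sketched in Python as follows. The example compares the EDF with a fitted Lognormal CDF; the function names and toy incomes are hypothetical, and the Lognormal is used only because its CDF and ML estimates are available in closed form.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Lognormal CDF, playing the role of p_i in Equation 3.2."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def fit_measures(data, cdf):
    """Return (SSE, SAE) between the EDF and a fitted CDF (Equations 3.2 and 3.4)."""
    n = len(data)
    sse = 0.0
    sae = 0.0
    for xi in data:
        n_i = sum(1 for xt in data if xt < xi)   # Equation 3.3
        diff = n_i / n - cdf(xi)
        sse += diff ** 2
        sae += abs(diff)
    return sse, sae

# Toy incomes (hypothetical values, not ENEMDU data)
incomes = [30.0, 55.0, 80.0, 110.0, 140.0, 170.0, 220.0, 280.0, 390.0, 860.0]
logs = [math.log(v) for v in incomes]
mu_hat = sum(logs) / len(logs)
sigma_hat = math.sqrt(sum((l - mu_hat) ** 2 for l in logs) / len(logs))
sse, sae = fit_measures(incomes, lambda v: lognormal_cdf(v, mu_hat, sigma_hat))
print(f"SSE = {sse:.4f}, SAE = {sae:.4f}")
```

The double loop is quadratic in the sample size; on a survey-sized sample, sorting the data once and using ranks would give the same $n_i$ much faster.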

3.2

Non-Parametric estimation

For the non-parametric method, the kernel technique is chosen to estimate the distribution of income, using the techniques described in Section 2.2.2. Usually, the estimation of the kernel density is influenced by the selection of the bandwidth (Boccanfuso et al., 2013). Two different approaches are analyzed to estimate the bandwidth.

The first one is based on Silverman (1986), where the bandwidth h is defined as:

$$h = \frac{0.9\,m}{N^{1/5}} \qquad (3.5)$$

where m is:

$$m = \min\left(\sigma_x, \frac{iqr_x}{1.349}\right) \qquad (3.6)$$

$\sigma_x$ is the standard deviation and $iqr_x$ is the interquartile range of the distribution of income.

Another approach, an extension of the cross-validation described in Equation 2.19, is the Maximum Likelihood Cross-Validation (MLCV). Habbema and Van den Broek (1974) and Duin (1976) proposed this method, in which the likelihood of the leave-one-out density estimates $\hat{f}_{h,i}(X_i)$ is maximized to find the optimal bandwidth (Guidoum, 2015). The leave-one-out estimate is defined as:

$$\hat{f}_{h,i}(X_i) = \frac{1}{(N-1)h} \sum_{j \ne i} K\left(\frac{X_j - X_i}{h}\right) \qquad (3.7)$$

And the MLCV of h is defined as:

$$MLCV(h) = N^{-1} \sum_{i=1}^{N} \log\left[\sum_{j \ne i} K\left(\frac{X_j - X_i}{h}\right)\right] - \log\left[(N-1)h\right] \qquad (3.8)$$
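Both bandwidth selectors can be sketched in Python as follows. The sketch assumes a Gaussian kernel, reads the dispersion term of the rule of thumb as the sample standard deviation, and maximizes the MLCV criterion over a small grid of hypothetical candidates. Python's quartile convention may also differ from the one used by the statistical package in the thesis, so the numbers are illustrative only.

```python
import math
import statistics

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth: h = 0.9 * m / N**(1/5), m = min(sd, iqr / 1.349)."""
    n = len(data)
    sd = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    m = min(sd, (q3 - q1) / 1.349)
    return 0.9 * m / n ** 0.2

def gaussian_kernel(u):
    """Standard normal density, used as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def mlcv(data, h):
    """Mean log leave-one-out density, the criterion maximized by MLCV."""
    n = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        s = sum(gaussian_kernel((xj - xi) / h) for j, xj in enumerate(data) if j != i)
        if s <= 0.0:                     # an isolated point underflows to zero
            return float("-inf")
        total += math.log(s)
    return total / n - math.log((n - 1) * h)

# Toy incomes (hypothetical values, not ENEMDU data)
incomes = [30.0, 55.0, 80.0, 110.0, 140.0, 170.0, 220.0, 280.0, 390.0, 860.0]
candidates = [20.0, 40.0, 80.0, 160.0, 320.0]     # hypothetical grid
h_silverman = silverman_bandwidth(incomes)
h_mlcv = max(candidates, key=lambda h: mlcv(incomes, h))
print("Silverman bandwidth:", h_silverman)
print("MLCV bandwidth:", h_mlcv)
```

The guard against a zero kernel sum matters in practice: for very small bandwidths an isolated high income has no neighbours within reach, and the criterion is then treated as minus infinity rather than raising an error.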

The EDF corresponding to the household income is calculated through Equation 2.4. To obtain the fit measures described in Equations 3.2 and 3.4, the difference between the EDF and each of the parametric CDFs (obtained through ML) was computed. The CDFs of these functions were calculated based on the pdfs of Table 3.1. For a complete derivation of the distributions refer to Jenkins and Van Kerm (2015).

Likewise, the EDF was used in the same way to assess the fit of the kernel functions calculated through the methods described in Equations 2.18, 3.5 and 3.8.

Chapter 4

Data

4.1

Data characteristics

The present dissertation uses data from the National Survey of Employment, Unemployment, and Sub-employment (Encuesta Nacional de Empleo, Desempleo y Subempleo, ENEMDU), published by the National Institute of Statistics and Census (INEC, 2017a), which is the institution in charge of producing official statistics in Ecuador. The objective of this survey is to gather information on the Ecuadorian labor market, monthly in the main cities and quarterly in the entire country (INEC, 2017b). The survey contains information about economic activity, the sources of income of the population, and the characteristics and situation of the active and inactive working residents of Ecuador.

The relevant variable for the study will be the income per capita of a household, which is the total amount of revenue of every member of a household divided by the number of its members. For the model estimations and the descriptive statistics, the ENEMDU survey of December 2017 will be taken as it is the most complete (concerning the surveyed population) and also the most recent one.

To avoid possible measurement problems, some data processing was carried out. This process focused on finding incongruent information about income, particularly in the way an individual reports its sources. Other possible measurement errors are negative income values or cases in which an individual reported consumption higher than the income obtained, including possible loans. Observations with these types of errors were excluded from the estimation; they were few (992 observations out of a total sample of 110,283).

Since an individual can report income from sources other than working activities, all sources of earnings, such as income from capital, transfers, donations, and properties, have to be taken into account as well. Finally, the household income is computed, and the per capita measure is computed for each household.

Table 4.1: Mean per capita household income distribution by deciles in US Dollars

Deciles of income                Area
distribution            Urban     Rural     Total
1                       32.99     30.15     29.42
2                       60.56     60.44     60.49
3                       85.92     85.66     85.78
4                      110.23    110.45    110.33
5                      138.36    138.06    138.24
6                      172.18    171.52    171.93
7                      215.83    215.49    215.72
8                      279.41    277.86    279.01
9                      392.18    385.01    390.72
10                     862.69    849.65    860.78
Total                  288.32    152.46    234.03
Source: ENEMDU (2017)

4.2

Distribution of Income of Ecuadorian households

As depicted in Table 4.1, the mean per capita household income by deciles gives a first glance of the income distribution of Ecuador. There is no big difference between the deciles of the urban and rural areas; however, the gap between these two areas shows up in the total mean income. This analysis is based on the mean income of the household. However, it gives a general idea of the inequality in the country, especially considering the income gap between the richest and poorest deciles.

Table 4.2 depicts the mean household income of the five biggest cities of Ecuador. The mean income is close to the minimum wage (USD 375), except for Guayaquil and Machala. This result could be related to the fact that the most important economic activity (and also formal activity) in Ecuador is located in these cities, so the households are more likely to have a higher income compared with the rest of the country.

Table 4.2: Mean per capita household income five biggest cities

City         Mean Income
Cuenca       368.57
Machala      257.15
Guayaquil    264.25
Quito        390.20
Ambato       373.64
Total        328.04
Source: ENEMDU (2017)

Figure 4.1 depicts the share of total household income held by the three wealthiest deciles against that of the seven poorest deciles of the income distribution. The wealthiest 30% of the population concentrated 65.25% of the total income, while the poorest 70% held 34.75% of the total income.
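A rough cross-check of these shares can be made from the Total column of Table 4.1. The Python sketch below assumes that every decile contains the same share of the population, so that a decile's income share is proportional to its mean; small differences from the reported figures are to be expected, since the table entries are rounded and the survey weighting is not reproduced here.

```python
# Mean per capita household income by decile, Total column of Table 4.1 (USD)
decile_means = [29.42, 60.49, 85.78, 110.33, 138.24,
                171.93, 215.72, 279.01, 390.72, 860.78]

total = sum(decile_means)
top3_share = sum(decile_means[-3:]) / total      # three wealthiest deciles
bottom7_share = sum(decile_means[:7]) / total    # seven poorest deciles

print(f"top 30%: {top3_share:.1%}, bottom 70%: {bottom7_share:.1%}")
```

Under this assumption the three wealthiest deciles hold roughly 65% of total income, in line with the figure reported above.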

Figure 4.2 presents the Ecuadorian income distribution. The Figure shows that the distribution is concentrated in the low values of income, with only a few people with incomes higher than USD 6000. The shape of the distribution might suggest that household income follows a Lognormal or Beta distribution. It would be expected that the parametric distributions do not fit the data well for the lowest levels of income, especially in the case of the Pareto Distribution.

Given this general view of the income distribution, parametric and non-parametric techniques are used to estimate the corresponding densities. These results are shown in detail in the next chapter.


Figure 4.1: Share of Income of the sample

Source: ENEMDU (2017)

Figure 4.2: Income distribution of household


Chapter 5

Results

5.1

Non-parametric results

First, the results of the non-parametric income distribution will be presented. An estimate of the kernel density for the income of households is made. As mentioned in Chapter 3, two approaches are implemented to select the optimal bandwidth. The first one is the Silverman (1986) proposal for the optimal bandwidth detailed in Section 3.2. Given the survey data of Ecuadorian income, the optimal bandwidth is h∗ = 12.52. Meanwhile, with the Maximum Likelihood Cross-Validation (MLCV) technique the optimal bandwidth is h∗ = 4.94.

In Figure 5.1, the kernel density estimates using both bandwidths are represented. No substantial difference is shown; however, with the MLCV bandwidth the kernel density is less smooth.

In Figure 5.2, the kernel density (with MLCV bandwidth) is compared with the observed distribution of Ecuadorian household income. A good approximation is obtained in the right tail; nevertheless, in the left tail, the figure shows that for values around USD 100 of per capita household income the kernel density does not adapt well. A taller tail for the distribution would be needed to fit the data better. Also, in Table 5.1 the goodness of fit measures are presented. The MLCV technique to obtain the bandwidth of the kernel density shows a better fit to the observed data and is used from here on in the analysis.

Table 5.1: Goodness of fit Measures Non-Parametric technique

             SSE     SAE
Silverman    7.12    844.64
MLCV         1.19    324.69
Source: ENEMDU (2017)

Figure 5.1: Kernel estimation of income distribution

Source: ENEMDU (2017)

Figure 5.2: Kernel estimation of income distribution and Observed Data

Figure 5.3: Lognormal and Singh-Maddala Distribution

Source: ENEMDU (2017)

5.2

Parametric results

This section presents the results of the Maximum Likelihood estimation for the parametric distributions defined in Table 3.1. These functional forms are fitted to the observed income distribution.

In Figure 5.4 the kernel density is plotted against the fitted Pareto Distribution, and the observed income data of Ecuadorian households are represented by the red dots. The Pareto is one of the distributions most commonly used to fit income data (Sarabia, 2008). Nevertheless, the Pareto Distribution does not fit the data well, in particular the observations located in the left tail, which correspond to the low-income levels, as illustrated by Singh and Maddala (1976). Consequently, an estimate of the Gini coefficient that assumes this distribution will be inaccurate.

Figure 5.3 depicts the fitted parametric densities of the Lognormal and Singh-Maddala distributions. The latter is a generalization of the Pareto distribution, but it takes into account the fit of the left tail. For both distributions a good fit is observed, comparable to the kernel density of Figure 5.2.

Figure 5.4: Kernel and Pareto Distribution

Source: ENEMDU (2017)

Furthermore, the fit of other parametric densities is studied and shown in Figure 5.5: the Dagum distribution, which considers the presence of heavy tails (Kleiber, 2007), and the Generalized Beta 2 (GB2) distribution, which is a Beta distribution with four parameters. Both estimated densities fit the data well, as do the Lognormal and the Singh-Maddala distributions.

In general, the graphical analysis shows that the proposed parametric distributions have a good fit. To a certain extent, the values of the left tail behave similarly in all the distributions, with the apparent exception of the Pareto distribution.

Figure 5.6 depicts the Dagum, GB2, Lognormal and Singh-Maddala densities against the observed data, while the Pareto distribution is excluded from the analysis. In the left tail, which corresponds to the low-income values, it is shown again that all the distributions studied except the Pareto behave similarly. However, a more detailed goodness of fit analysis is necessary, and it is presented below.

As mentioned before, two measures of goodness of fit are analyzed: the Sum of Squared Errors (SSE) and the Sum of Absolute Errors (SAE), following the recommendations of D’Agostino (1986) and McDonald (1984).

Figure 5.5: Dagum and Generalized Beta 2 density

Source: ENEMDU (2017)

Figure 5.6: Parametric densities with better fit

Table 5.2: Goodness of fit Measures

Distribution     Loglikelihood    SSE        SAE         SE
Pareto           -787053.05       5430.06    20793.26    2.94
Lognormal        -691101.16       3.78       852.54      0.81
Singh-Maddala    -690969.02       9.35       547.66      0.95
Dagum            -691172.35       4.43       615.43      1.46
GB2              -690632.33       2.58       451.10      0.72
Source: ENEMDU (2017)

The results shown in Table 5.2 confirm the visual analysis mentioned before. The Goodness of fit Measures were computed using the EDF of Equation 2.4 and the CDF of each of the parametric distributions specified in Table 3.1, following Equations 3.2 and 3.4. The Pareto is the parametric distribution that performs the worst among the ones considered in the analysis. Among the distributions depicted in Figure 5.6, the GB2 distribution is the one that fits the data best. The Singh-Maddala distribution presents the second highest SSE (9.35), although its SAE (547.66) is lower than those of the Dagum (615.43) and the Lognormal (852.54). The Table shows that the errors (SSE) of the Lognormal and the Dagum distribution are close to each other. The standard errors computed for each of the distributions and the corresponding estimated parameters depict a similar result: the Generalized Beta is the parametric distribution with the best fit, followed by the Lognormal. When comparing these results with the non-parametric ones (Table 5.1), the Silverman approach has a lower SSE than the Pareto and Singh-Maddala parametric distributions, but the other parametric distributions fit better. The opposite happens with the MLCV, which shows the best fit among all the compared densities.
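The comparison can be made explicit by ranking the densities on each measure. The short Python sketch below only re-orders the SSE and SAE values reported in Tables 5.1 and 5.2; the dictionary layout and labels are choices made for this illustration.

```python
# (SSE, SAE) pairs as reported in Tables 5.1 and 5.2
fit = {
    "Silverman kernel": (7.12, 844.64),
    "MLCV kernel": (1.19, 324.69),
    "Pareto": (5430.06, 20793.26),
    "Lognormal": (3.78, 852.54),
    "Singh-Maddala": (9.35, 547.66),
    "Dagum": (4.43, 615.43),
    "GB2": (2.58, 451.10),
}

by_sse = sorted(fit, key=lambda name: fit[name][0])   # best (lowest) first
by_sae = sorted(fit, key=lambda name: fit[name][1])

print("ranked by SSE:", by_sse)
print("ranked by SAE:", by_sae)
```

With these values both measures agree on the best fit (MLCV kernel, then GB2) and on the worst (Pareto), but they order the middle group differently, so statements about relative fit should name the measure they rely on.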

The Gini coefficients of the non-parametric and parametric approaches are shown in Table 5.3. The non-parametric Gini is calculated using the method described in Equation 2.21, and the parametric Gini is calculated as described in Table 2.1 for each of the distributions.

Another visual check of the goodness of fit of the distributions is proposed. In Figure 5.7 a distributional diagnostic plot is presented, where the quantiles of the actual distribution of household income are plotted against the quantiles of the parametric distributions estimated by maximum likelihood. As the plots show, the Generalized Beta distribution is the one with the best fit, which is similar to what was found in Table 5.2.

Table 5.3: Gini Coefficients

Distribution              Gini Coefficient
Non Parametric (MLCV)     0.483
Lognormal                 0.492
Singh-Maddala             0.497
GB2                       0.489
Dagum                     0.513
Source: ENEMDU (2017)

Table 5.4: Goodness of fit Measures

                 2015                   2016
Distribution     SSE        SAE         SSE        SAE
Pareto           5424.77    20943.61    5452.29    21195.01
Singh-Maddala    10.32      607.95      12.99      486.07
Lognormal        4.93       868.96      2.88       1003.88
Dagum            4.60       634.91      3.00       519.47
GB2              3.85       548.73      2.72       466.88
Source: ENEMDU (2017)

Figure 5.7: Q Q plots of parametric distributions

Source: ENEMDU (2017)

Finally, the same procedure is carried out for the years 2015 and 2016 to check whether there have been changes in the income distribution of households. As depicted in Figures 5.8 and 5.9, the income distribution seems to follow a pattern similar to that of the year 2017. The results of Table 5.4 confirm that the GB2 distribution is the one with the best fit. Furthermore, the Gini coefficients shown in Table 5.5 for these years are similar in magnitude. It is important to mention that the techniques used so far offer an alternative approach⁶ for the calculation of the Gini coefficient compared to the one used by the National Institute of Statistics and Census of Ecuador. The results for the Gini coefficient are similar; however, for both the parametric and the nonparametric technique the Gini was higher (0.49 and 0.48 respectively) for the year 2017 than the value obtained by the Institute (0.46) for the same year.

⁶ The Gini coefficient is calculated using the formula:

$$G = 1 + \frac{1}{N} - \frac{2}{\bar{y} N^2} \sum_{i=1}^{N} y_i (N + 1 - i) \qquad (5.1)$$
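Equation 5.1 can be sketched in Python as follows. The sketch assumes unweighted data with incomes sorted in ascending order, so that i = 1 is the lowest income; the function name and the toy inputs are hypothetical.

```python
def gini_sorted(incomes):
    """Gini coefficient from Equation 5.1 (assumes ascending order, no weights)."""
    y = sorted(incomes)
    n = len(y)
    mean = sum(y) / n
    # Each income is multiplied by (N + 1 - i), with i its ascending rank
    weighted = sum(yi * (n + 1 - i) for i, yi in enumerate(y, start=1))
    return 1.0 + 1.0 / n - 2.0 * weighted / (mean * n ** 2)

print(gini_sorted([1.0, 3.0]))             # two-person toy example
print(gini_sorted([5.0, 5.0, 5.0]))        # equal incomes
```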

Figure 5.8: Observed Income distribution for the years 2015 and 2016

Source: ENEMDU (2017)

Figure 5.9: Parametric distributions with better fit for years 2015 and 2016

Table 5.5: Gini Coefficients

                 Gini Coefficient
Distribution     2015     2016
LogNormal        0.491    0.495
Singh-Maddala    0.498    0.500
Dagum            0.514    0.515
GB2              0.510    0.512
Source: ENEMDU (2017)

Chapter 6

Conclusion

Inequality is an important problem in the countries of Latin America. Gaps in income have a negative effect on the allocation of resources in both human and physical capital, which creates fewer opportunities for people in the lowest levels of the income distribution. Inequality is rooted in the economic and social institutions of the region, so reducing these disparities must be a priority to improve the performance of an economy.

Finding appropriate measures of inequality should be the first step to reduce it. Characterizing the income distribution of Ecuadorian households provides an approximation to a functional form that has statistical advantages, such as the ones mentioned in previous chapters. Survey data might have problems such as outliers and heavy tails; thus, fitting a parametric distribution that takes these features into account is relevant.

Throughout the years, several functional forms have been used to model the distribution of income. Since the seminal contribution of Pareto (1897), other parametric functional forms have been considered to overcome theoretical problems related to the fit of the tails of the distribution. Singh and Maddala (1976), McDonald (1984) and Dagum (1977) proposed different functional forms with the purpose of obtaining a better fit of the income distribution. According to Sarabia (2008), the most common approaches are the Pareto and the Lognormal distributions. However, this dissertation has shown that other distributions fit the data better.

The literature related to the functional form of the income distribution is scarce for Latin America, especially for Ecuador. Sarabia et al. (2014) estimated the distribution of income in Latin America using the approaches of Singh and Maddala (1976) and Dagum (1977); however, they did not present any alternative distribution or a comparison between different functional forms. For Ecuador, the distribution of income does not follow a Pareto distribution; this should be considered in particular when estimating the Gini coefficient.

The income distribution of Ecuadorian households has a high concentration in the low-income group (left tail of the distribution), which is a characteristic that the Pareto Distribution did not capture. It was shown that other parametric distributions fitted the data well and overcame these problems; this is the case of the Generalized Beta, Lognormal and Singh-Maddala distributions. The first one is the one that best adapted to the data.

Furthermore, a non-parametric analysis of the income distribution is presented. The kernel density technique is used to estimate the income distribution. Since the kernel estimate is particularly sensitive to the choice of the bandwidth, two approaches are given: the first one based on Silverman (1986), and the second one through Maximum Likelihood Cross-Validation (MLCV). The MLCV density is less smooth than Silverman’s. An analysis of the goodness of fit of both kernel approaches was made, and MLCV proved to have a better fit to the observed data. In terms of SAE, the Silverman (1986) approach showed a worse fit than the parametric GB2, Dagum, and Singh-Maddala; however, with the bandwidth attained by Likelihood Cross-Validation the best goodness of fit is reached. For the calculation of the non-parametric Gini inequality index, the MLCV bandwidth was used.

This thesis offered a general overview and analysis of the income distribution and inequality measures for Ecuador. However, further research can be done on this topic. A more complex non-parametric estimator of the Gini coefficient can be computed and compared with the parametric results. Mirzaei et al. (2017) proposed an estimator of the Gini coefficient based on U statistics with a good asymptotic performance. This technique may be used with Ecuadorian income data to obtain a robust value of the Gini index. Additional research can compare the income distributions of different Latin American countries to determine whether there is a parametric distribution common to the region.

Finally, it can be investigated how the Ecuadorian income distribution evolves through a period of time. Also, it could be useful to analyze the public policies for poverty reduction such as cash transfers, to determine how the income distribution and inequality change throughout years.


Appendix A

Program

/* Thesis MSc Econometrics University of Amsterdam

Title: Parametric and Nonparametric estimation of inequality measurements, the case of Ecuador

Ramiro Mejia #11805412 */

*------------------------------------------------------------------------------
* Income computation (INEC) Ecuador
*------------------------------------------------------------------------------
egen id_hogar = group(area ciudad conglomerado zona sector panelm vivienda hogar)
tempvar x y

* Detects incongruent income
gen `x' = 0
replace `x' = 1 if (p66 == 999999 | p63 == 999999) | ///
    (p70b == 999999 & p69 == 999999 & p71b == . & p72b == . & p73b == . & p74b == . & p76 == .)
* If the individual does not inform in his labor activity his revenue as an
* independent or dependent worker. (categories are mutually excluded)
replace `x' = 2 if (p69 == 999999 & p70b == 999999) ///
    & (p71b != . | p72b != . | ///
    p73b != . | p74b != . | p76 != .)
* If the individual does not inform about both of his sources of income
recode p63 p64b p65 p66 p67 p68b p69 p70b p71b p72b p73b p74b p76 (999999 = .)

* Other missing-value codes also flag the observation as incongruent
local miss "999 9999 99999"
foreach k of varlist p63 p64b p65 ///
    p66 p67 p68b p69 p70b p71b p74b p72b p73b p76 {
    foreach c of local miss {
        replace `x' = 1 if `k' == `c'
    }
}

*==============================================================================*
* Labor Income
*==============================================================================*
* Principal Activity
replace p65 = -p65
egen ind = rowtotal(p63 p64b p65), missing
egen asal = rowtotal(p66 p67 p68b), missing
egen ila1 = rowtotal(ind asal), missing
* Second Activity
egen ila2 = rowtotal(p69 p70b), missing
* Labor Income
egen ila = rowtotal(ila1 ila2), missing
replace ila = ila2 if ila1 < 0
* "ineg" variable that indicates when an individual spends more than he earns.
gen ineg = 1 if ila1 < 0 & ila == .
replace ineg = . if `x' == 1
*==============================================================================*
* Non-labor Income
*==============================================================================*
egen icap = rowtotal(p71b), missing
* Capital income
egen ipens = rowtotal(p72b), missing
egen ilocal = rowtotal(p73b), missing
egen iextr = rowtotal(p74b), missing
egen isocial = rowtotal(p76), missing
egen itrans = rowtotal(ipens ilocal iextr isocial), missing
* Transfer income
egen inla = rowtotal(icap itrans), missing
********************************************************************************
* Set every income component to missing for incongruent observations
local vars "ind asal ila icap ipens ilocal iextr isocial itrans inla"
foreach v of varlist `vars' {
    replace `v' = . if `x' == 1
}

*==============================================================================*
* Individual Income *
*==============================================================================*
egen ii = rowtotal(ila inla), missing
replace ii = . if `x' == 1
* If inconsistent, make missing
replace ii = inla if `x' == 2
recode ii (0 = .)

*==============================================================================*
* Family Income *
*==============================================================================*
sort id_hogar
loc var "ila icap ipens ilocal iextr isocial itrans inla"
foreach var of varlist `var' {
    * Household total of each component; a zero total is treated as missing
    egen `var'f = total(`var'), by(id_hogar)
    replace `var'f = . if `var'f == 0
}
* Family Income
egen ih = total(ii), by(id_hogar)
replace ih = . if ih == 0

*==============================================================================*
* Family Income per capita *
*==============================================================================*
* Household size: count of members per household
gen `y' = 1
egen hsize = total(`y'), by(id_hogar)
* Income per capita
gen ipcf = ih/hsize
********************************************************************************
* Ln of Income
gen lipcf = ln(ipcf)


* Variable labels, following the survey manual
label var ind "Ingreso laboral principal independiente"
label var asal "Ingreso laboral principal asalariado"
label var ila1 "Ingreso laboral principal"
label var ila2 "Ingreso laboral secundario"
label var icap "Ingreso individual -capital"
label var ipens "Ingreso individual -pensiones-jubilaciones"
label var ilocal "Ingreso individual -transferencias locales"
label var iextr "Ingreso individual -transferencias externas"
label var isocial "Ingreso individual -beneficios sociales"
label var itrans "Ingreso individual -transferencias"
label var inla "Ingreso individual -no laboral"
label var ila "Ingreso individual -laboral"
label var ii "Ingreso total individual"
label var ilaf "Ingreso familiar laboral"
label var icapf "Ingreso familiar -capital"
label var ipensf "Ingreso familiar -pensiones-jubilaciones"
label var ilocalf "Ingreso familiar -transferencias locales"
label var iextrf "Ingreso familiar -transferencias externas"
label var isocialf "Ingreso familiar -beneficios sociales"
label var itransf "Ingreso familiar -transferencias"
label var inlaf "Ingreso familiar no laboral"
label var ih "Ingreso total familiar"
label var ipcf "Ingreso per cápita familiar"
label var lipcf "Logaritmo del ipcf"
label var hsize "Tamaño del hogar"
label var ineg "Gasta más de lo que gana"
saveold ENEMDU151, replace


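*==============================================================================*
* Estimation *
*==============================================================================*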

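use ENEMDU151, clear
cap drop _*
set seed 12345

* Survey design: primary sampling units and expansion weights
egen upm = group(area ciudad conglomerado zona sector)
svyset upm [pw=fexp]

* Keep per capita incomes below 2000 (this also drops missing values)
keep if ipcf < 2000
save temp_2, replace

* Empirical distribution function (EDF)
use temp_2, clear
cap drop _*
contract ipcf
drop if ipcf == .
qui summ _freq
* Store the total count so it survives later commands
local total = r(sum)
ren _freq _freq_out
gen edf_out = _freq_out/`total'
sort ipcf
gen cfreq = sum(_freq_out)
gen ecdf_out = cfreq/`total'
save temp_1, replace

* Attach the EDF to every observation
use temp_2, clear
merge m:1 ipcf using temp_1, nogen
gen sample = _N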
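The `contract` and running-sum steps above build an empirical CDF from frequency counts. A minimal Python sketch of the same idea follows; the function name `empirical_cdf` and the income values are hypothetical, and the sketch is unweighted, like the `contract` call in the listing.

```python
from collections import Counter
from itertools import accumulate

def empirical_cdf(values):
    """Return distinct values with their relative and cumulative frequencies."""
    counts = Counter(v for v in values if v is not None)  # drop missing values
    n = sum(counts.values())
    points = sorted(counts)
    edf = [counts[x] / n for x in points]                  # relative frequency
    ecdf = [c / n for c in accumulate(counts[x] for x in points)]
    return points, edf, ecdf

incomes = [100.0, 250.0, 100.0, None, 400.0, 250.0, 100.0]  # hypothetical data
points, edf, ecdf = empirical_cdf(incomes)
for x, p, f in zip(points, edf, ecdf):
    print(x, round(p, 4), round(f, 4))
```

Tied incomes share one row, so the cumulative frequency at a value counts all observations less than or equal to it.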

/*
* Pareto
paretofit ipcf, cdf("pareto_cdf") pdf("pareto_pdf")
gen a = e(ba)
gen x0 = e(x0)
* Pareto standard deviation; abs(a-2) keeps the term under the root non-negative
gen sd_pareto = sqrt((a)*x0^2 / ((abs(a-2)*(a-1)^2)))
gen sse_cempirical_pareto = (ecdf_out-pareto_cdf)^2
gen sae_cempirical_pareto = abs(ecdf_out-pareto_cdf)


* Singh-Maddala
smfit ipcf, cdf("sm_cdf") pdf("sm_pdf") stats svy
gen sd_sm = e(sd)
gen se_sm = sd_sm/(sqrt(sample))
gen sse_cempirical_sm = (ecdf_out-sm_cdf)^2
gen sae_cempirical_sm = abs(ecdf_out-sm_cdf)

* Log Normal
lognfit ipcf, cdf("logn_cdf") pdf("logn_pdf") stats
gen sd_logn = e(sd)
gen se_logn = sd_logn/sqrt(sample)
gen sse_cempirical_logn = (ecdf_out-logn_cdf)^2
gen sae_cempirical_logn = abs(ecdf_out-logn_cdf)

* Dagum Distribution
dagumfit ipcf, cdf("dagum_cdf") pdf("dagum_pdf") stats svy
gen sd_dagum = e(sd)
gen se_dagum = sd_dagum/(sqrt(sample))
gen sse_cempirical_dagum = (ecdf_out-dagum_cdf)^2
gen sae_cempirical_dagum = abs(ecdf_out-dagum_cdf)

* Generalized Beta (GB2) Distribution
gb2fit ipcf, cdf("gb2_cdf") pdf("gb2_pdf") stats svy
gen sd_gb2 = e(sd)
gen se_gb2 = sd_gb2/(sqrt(sample))
gen sse_cempirical_GB2 = (ecdf_out-gb2_cdf)^2
gen sae_cempirical_GB2 = abs(ecdf_out-gb2_cdf)

* Results: sum of squared errors, sum of absolute errors, standard errors
foreach x of varlist sse_c* {
    qui summ `x'
    dis "`x': `=r(sum)'"
}
foreach x of varlist sae_c* {
    qui summ `x'
    dis "`x': `=r(sum)'"
}
foreach x of varlist se_* {
    dis "`x' :" `x'
}
*/
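The goodness-of-fit criteria in the block above are the sum of squared errors (SSE) and the sum of absolute errors (SAE) between the empirical CDF and a fitted CDF, both summed over observations. The Python sketch below illustrates them for a lognormal fit. It is an illustration under stated assumptions, not the author's routine: the lognormal is fitted with the closed-form maximum-likelihood estimates (mean and population standard deviation of log income) as a stand-in for the `lognfit` call, and the data and function names are hypothetical.

```python
from bisect import bisect_right
from math import log
from statistics import NormalDist, fmean, pstdev

def fit_lognormal_cdf(values):
    """Return the CDF of a lognormal fitted by maximum likelihood."""
    logs = [log(v) for v in values]
    dist = NormalDist(fmean(logs), pstdev(logs))  # ML: mean and population sd of logs
    return lambda x: dist.cdf(log(x))

def fit_errors(values, cdf):
    """Return (SSE, SAE) between the empirical CDF and cdf, summed over observations."""
    ordered = sorted(values)
    n = len(ordered)
    # ECDF at x is the share of observations less than or equal to x.
    gaps = [bisect_right(ordered, x) / n - cdf(x) for x in ordered]
    return sum(g * g for g in gaps), sum(abs(g) for g in gaps)

incomes = [80.0, 120.0, 120.0, 200.0, 350.0, 900.0]  # hypothetical, strictly positive
sse, sae = fit_errors(incomes, fit_lognormal_cdf(incomes))
print(f"SSE = {sse:.4f}, SAE = {sae:.4f}")
```

Smaller values indicate a closer fit; because both criteria are sums rather than means, they are only comparable across models evaluated on the same sample.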


cap drop _*
keep if ipcf < 2000

* Kernel density evaluated at the observed incomes
qui kdensity ipcf, at(ipcf) bw(4.9) generate(k_ipcf k_pdf_ipcf)
save temp_32, replace

* Calculate the cdf of kernel
use temp_32, clear
collapse (mean) k_pdf_ipcf, by(k_ipcf)
sort k_ipcf
gen k_cdf_ipcf = sum(k_pdf_ipcf)
keep k_ipcf k_cdf_ipcf
save temp_cdf, replace

use temp_32, clear
merge m:1 k_ipcf using temp_cdf
gen sse_cempirical_kernel = (ecdf_out-k_cdf_ipcf)^2
gen sae_cempirical_kernel = abs(ecdf_out-k_cdf_ipcf)

foreach x of varlist sse_cempirical_kernel sae_cempirical_kernel {
    qui summ `x'
    dis "`x': `=r(sum)'"
}
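The listing above obtains the kernel CDF as a running sum of the density values at the distinct incomes. A running sum of density heights matches a CDF only when each height is weighted by the spacing between evaluation points, so an alternative worth checking is numerical integration. The Python sketch below shows one such alternative, the trapezoid rule on an evenly spaced grid. It is an assumption-laden illustration: it uses a Gaussian kernel, an illustrative bandwidth, and hypothetical data, none of which are taken from the thesis, and it assumes the mass to the left of the grid is negligible.

```python
from math import exp, pi, sqrt

def gaussian_kde(sample, bandwidth):
    """Return a Gaussian kernel density estimator for the sample."""
    norm = 1.0 / (len(sample) * bandwidth * sqrt(2.0 * pi))
    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return density

def cdf_on_grid(density, grid):
    """Integrate the density over an increasing grid with the trapezoid rule."""
    heights = [density(x) for x in grid]
    cdf = [0.0]  # assumes negligible mass to the left of grid[0]
    for i in range(1, len(grid)):
        width = grid[i] - grid[i - 1]
        cdf.append(cdf[-1] + 0.5 * (heights[i - 1] + heights[i]) * width)
    return cdf

sample = [100.0, 120.0, 150.0, 200.0, 320.0]   # hypothetical incomes
bandwidth = 25.0                               # illustrative value only
grid = [float(x) for x in range(0, 501, 5)]    # wide enough to cover both tails
cdf = cdf_on_grid(gaussian_kde(sample, bandwidth), grid)
print(round(cdf[-1], 4))
```

With a grid that covers both tails the last value is close to one, which is a quick sanity check that the integrated density behaves like a CDF.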


Bibliography

Abdelkrim, A. and Duclos, J. (2007). DASP: Distributive Analysis Stata Package. World Bank.

Acemoglu, D. and Robinson, J. A. (2012). Why Nations Fail: The Origins of Power, Prosperity and Poverty. Crown, New York, 1st edition.

Aggarwal, V. (1984). On optimum aggregation of income distribution data. Sankhyā: The Indian Journal of Statistics, Series B (1960-2002), 46(3):343–355.

Atkinson, A. B. (1970). On the measurement of inequality. Journal of Economic Theory, 2(3):244–263.

Bandourian, R., Turley, R., and McDonald, J. (2002). A Comparison of Parametric Models of Income Distribution across Countries and over Time. LIS Working papers 305, LIS Cross-National Data Center in Luxembourg.

Basmann, R., Hayes, K., Slottje, D., and Johnson, J. (1990). A general functional form for approximating the Lorenz curve. Journal of Econometrics, 43(1):77–90.

Boccanfuso, D., Richard, P., and Savard, L. (2013). Parametric and nonparametric income distribution estimators in CGE micro-simulation modeling. Economic Modelling, 35(C):892–899.

Bouezmarni, T. and Scaillet, O. (2005). Consistency of asymmetric kernel density estimators and smoothed histograms with application to income data. Econometric Theory, 21(02):390–412.

Bowman, A. W., Jones, M. C., and Gijbels, I. (1998). Testing monotonicity of regression. Journal of Computational and Graphical Statistics, 7(4):489–500.

Catalano, M., Leise, T., and Pfaff, T. (2009). Measuring resource inequality: The Gini coefficient. Numeracy, 2.

Cowell, F. (2000). Measurement of inequality. Handbook of Income Distribution, 1:87–166.
