

Counterfactuals about U.N. Peacekeeping

This section illustrates the workings of WHATIF with the empirical example in Section 2.4 of King & Zeng (2006), which evaluates counterfactuals about the causal impact of U.N. peacekeeping operations on peacebuilding success.

The factual data set has 124 observations (including two with missing values) on ten covariates as well as on the key causal variable, untype4, which is a dummy variable. The counterfactual data set is the observed covariate data set with untype4 replaced by 1 - untype4. We list-wise delete the two counterfactuals that are not fully observed, leaving 122. We then save the two data sets, one factual and the other counterfactual, as text files in our current working directory and name them `peacef.txt' and `peacecf.txt', respectively. The first five rows of `peacef.txt' look like:

  decade wartype   logcost wardur factnum factnumsq    trnsfcap untype4 treaty
1      5       1 14.917450     72       4        16    5.735545       0      0
2      4       0 15.671810    168       6        36    9.730863       0      0
3      5       1  6.907755     24       2         4   12.626030       0      0
4      5       1 12.971540     24       2         4 -112.000000       0      1
5      3       1  9.210340    216       2         4    4.275317       0      0
    develop       exp
1  132.8466 0.1217277
2  132.0000 0.1163292
3 1533.0000 0.0610000
4 2216.6080 0.1294513
5 1295.0000 0.1420000
Similarly, the first five rows of `peacecf.txt' look like:
  decade wartype   logcost wardur factnum factnumsq    trnsfcap 1-untype4
1      5       1 14.917450     72       4        16    5.735545         1
2      4       0 15.671810    168       6        36    9.730863         1
3      5       1  6.907755     24       2         4   12.626030         1
4      5       1 12.971540     24       2         4 -112.000000         1
5      3       1  9.210340    216       2         4    4.275317         1
  treaty   develop       exp
1      0  132.8466 0.1217277
2      0  132.0000 0.1163292
3      0 1533.0000 0.0610000
4      1 2216.6080 0.1294513
5      0 1295.0000 0.1420000
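
The preparation steps described above can be sketched in base R. This is only an illustration on a made-up three-variable data frame, not the actual peacekeeping data; the object names cf.path and peacecf.check are hypothetical, and the file is written to a temporary directory rather than the working directory.

```r
# Toy stand-in for the factual data (made-up values, not the real data set).
peacef <- data.frame(
  wardur  = c(72, 168, NA, 24),
  treaty  = c(0, 0, 0, 1),
  untype4 = c(0, 0, 1, 0)
)

# Counterfactual data: same covariates, with the treatment dummy flipped.
peacecf <- peacef
peacecf$untype4 <- 1 - peacecf$untype4

# List-wise delete the counterfactuals that are not fully observed.
peacecf <- na.omit(peacecf)

# Save as a text file and read it back to confirm the round trip.
# tempdir() is used here only to avoid cluttering the working directory.
cf.path <- file.path(tempdir(), "peacecf.txt")
write.table(peacecf, file = cf.path)
peacecf.check <- read.table(cf.path, header = TRUE)
```

In this toy example the third row is dropped because wardur is missing, and the remaining three counterfactuals all have untype4 equal to 1.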

The function whatif can be called in two alternative ways to analyze these counterfactuals. First, typing:

  > my.result <- whatif(data = "peacef.txt", cfact = "peacecf.txt")
tells whatif to load the datasets `peacef.txt' and `peacecf.txt' from our working directory. Second, typing:
  > my.result <- whatif(data = peacef, cfact = peacecf)
tells whatif to use the R objects peacef and peacecf loaded into memory prior to the function call. These objects must be either non-character matrices or data frames containing the observed covariate data and the counterfactual data, respectively; in this case, they are data frames. Alternatively, peacef may be either a Zelig or other R model output object (e.g., a model output object returned by a call to glm).

The resulting output object my.result is a five-element list (six-element if the option ``return.distance=T'' is used), each element of which we now describe. The first is simply the call. The second is a logical vector named in.hull, which contains the results of the convex hull test. Each element can have a value of either FALSE, indicating that the corresponding counterfactual is not in the convex hull of the observed data and thus requires extrapolation, or TRUE, indicating the opposite. To see the values of in.hull, we type:

  > my.result$in.hull
For this example, the values are all FALSE.

The third element of the output list, geom.var, is the geometric variability of the observed data, which we retrieve by typing:

  > my.result$geom.var
In this case, it is 0.110 when rounded to three significant digits. King and Zeng offer the geometric variability as a rule-of-thumb threshold: counterfactuals whose distance to the observed covariate data is less than this value can be regarded as reasonably near the data. By default, whatif calculates the pairwise Gower distances ($G^{2}$) between each counterfactual and each data point in order to determine which counterfactuals are nearby the data; alternatively, whatif will calculate the pairwise (squared) Euclidean distances if the parameter distance is set to "euclidian" as follows:
  > my.result <- whatif(data = peacef, cfact = peacecf, distance = "euclidian")
However, this option is only appropriate for quantitative data; since some of our variables are qualitative, we use the default Gower's distance measure.
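
To make the distance measure concrete, the following base R sketch computes a Gower distance between one counterfactual and each row of a small data matrix as the average, over variables, of the absolute difference divided by the variable's range. The function name gower.sq is hypothetical, and scaling by the range of the observed data alone is an assumption of this sketch; it is not the package's internal code.

```r
# Gower distance between a single point x and every row of X.
# Assumption: each variable is scaled by its range in the observed data X,
# and no variable in X is constant (which would give a zero range).
gower.sq <- function(x, X) {
  rng <- apply(X, 2, max) - apply(X, 2, min)
  abs.diff <- abs(sweep(X, 2, x, "-"))     # |X[i, k] - x[k]|
  rowMeans(sweep(abs.diff, 2, rng, "/"))   # average over the K variables
}

# Three observed data points on two variables (made-up values).
X <- matrix(c(0, 1, 2,
              0, 0, 1), ncol = 2)
x <- c(1, 1)          # one counterfactual

d <- gower.sq(x, X)   # 0.75 0.50 0.25
```

For a dummy variable the range is 1, so its contribution is simply 0 when the two values match and 1 when they differ, which is why this measure remains meaningful for qualitative variables.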

Note that the matrix containing these distances can be large in size and is not returned by default. To return the distance matrix, set the parameter return.distance to TRUE.

The fourth element of the output object, sum.stat, is a numeric vector, each element of which is the proportion of data points nearby the corresponding counterfactual. The values can be seen by typing:

  > my.result$sum.stat
The output looks like:
            1           2           3           4           5           6 
  0.008196721 0.008196721 0.008196721 0.008196721 0.008196721 0.008196721 
            7           8           9          10          11          12
  0.008196721 0.008196721 0.008196721 0.008196721 0.008196721 0.008196721 
  ...
          121         122
  0.008196721 0.016393443
The numerical summary reported on page 14 of King and Zeng (2006) is the average of sum.stat over all counterfactuals, which we can obtain using the command
  > mean(my.result$sum.stat)
In this case, the average is 1.3 percent. This statistic is reported for your convenience by the function summary.

Note that, by default, `nearby' is defined as having a distance to the counterfactual less than or equal to the geometric variability of the observed data. The default can be changed by setting a value for the parameter nearby, which acts as a multiplier on the geometric variability. For example, to instead set the nearby criterion at twice the geometric variability, we would type:

  > my.result <- whatif(data = peacef, cfact = peacecf, nearby = 2)
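
The nearby criterion can be illustrated with a short base R sketch. The distance matrix below is made up, and the function prop.nearby is a hypothetical helper written for this illustration rather than part of the package; it simply computes, for each counterfactual, the share of data points whose distance is less than or equal to nearby times the geometric variability.

```r
# Made-up distances: rows are counterfactuals, columns are data points.
dist.mat <- matrix(c(0.05, 0.20, 0.30, 0.40,
                     0.08, 0.10, 0.15, 0.50),
                   nrow = 2, byrow = TRUE)
geom.var <- 0.11   # rounded geometric variability reported in the text

# Proportion of data points within nearby * geom.var of each counterfactual.
prop.nearby <- function(dist.mat, geom.var, nearby = 1) {
  rowMeans(dist.mat <= nearby * geom.var)
}

p1 <- prop.nearby(dist.mat, geom.var)               # 0.25 0.50
p2 <- prop.nearby(dist.mat, geom.var, nearby = 2)   # 0.50 0.75
```

Doubling the threshold can only keep or increase the proportion for each counterfactual, as the two results show.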

The fifth element of the output object, cum.freq, stores information on the cumulative frequency distribution of the distances between a counterfactual and the observed covariate data. To access the cumulative frequency distribution for the default set of Gower distances (from 0 to 1 in increments of 0.05) between the first counterfactual and the data points, for example, we type:

  > my.result$cum.freq[1, ]
This prints the distribution to the screen:
            0        0.05         0.1        0.15         0.2        0.25         
  0.000000000 0.000000000 0.008196721 0.081967213 0.262295082 0.483606557          
          0.3        0.35         0.4        0.45         0.5        0.55
  0.680327869 0.844262295 0.950819672 0.991803279 0.991803279 1.000000000
          0.6        0.65         0.7        0.75         0.8        0.85         
  1.000000000 1.000000000 1.000000000 1.000000000 1.000000000 1.000000000
          0.9        0.95           1
  1.000000000 1.000000000 1.000000000
Alternatively, we can change the default set of Gower distances by using the parameter freq. For example, to calculate a cumulative frequency distribution solely for the Gower distances of 0, 0.5, and 1.0, we type:
  > my.result <- whatif(data = peacef, cfact = peacecf, freq = c(0, 0.5, 1.0))
Now the cumulative frequency distribution for the first counterfactual looks as follows:
  > my.result$cum.freq[1, ]
          0       0.5         1 
  0.0000000 0.9918033 1.0000000
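
As a rough illustration of what one row of cum.freq represents, the base R sketch below computes, for a vector of distances between one counterfactual and the data points, the proportion of distances at or below each cutoff. The helper cum.freq.row and the distances are hypothetical, and the use of "less than or equal to" at each cutoff is an assumption of this sketch rather than a documented detail of the package.

```r
# Cumulative frequency of distances d at each cutoff in freq.
# Assumption: a distance counts toward a cutoff when it is <= the cutoff.
cum.freq.row <- function(d, freq = seq(0, 1, by = 0.05)) {
  out <- sapply(freq, function(cutoff) mean(d <= cutoff))
  names(out) <- freq
  out
}

# Made-up distances from one counterfactual to four data points.
d <- c(0.12, 0.20, 0.35, 0.60)

cf <- cum.freq.row(d, freq = c(0, 0.5, 1))   # 0.00 0.75 1.00
```

With the default cutoffs the same helper returns 21 values, one for each distance from 0 to 1 in steps of 0.05.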

We now turn to the auxiliary functions included in the WHATIF package. The first is plot, which produces figures that graph the cumulative frequency distribution of the distances similar to Figure 3 in King and Zeng (2006). This function takes as its input a whatif output object. To plot the default cumulative frequency distributions for all counterfactuals to the screen, type:

  > plot(my.result)
Plotting 122 distributions on the same graph will not be very helpful, however. A particular frequency distribution or combination of frequency distributions can be plotted by setting the parameter numcf to equal the desired values. For example, to plot only the cumulative frequency distribution for the first counterfactual, we type:
  > plot(my.result, numcf = 1)
We also have the option of smoothing the raw cumulative frequencies, which can be plotted either on their own or in addition to the raw data. The parameter controlling this option is type. To plot both the raw and LOWESS smoothed cumulative frequency distributions for the first two counterfactuals, for example, we type:
  > plot(my.result, numcf = c(1, 2), type = "b")
where "b" stands for `both'. Alternatively, assigning the value "l" to type would plot only the smoothed frequencies. To save the graph as an encapsulated postscript file for later use instead of printing it to the screen, we set the parameter eps equal to TRUE:
  > plot(my.result, numcf = c(1, 2), type = "b", eps = TRUE)
The graph is saved to our working directory.
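
For readers who want to see what the raw and smoothed curves involve, the following base R sketch applies lowess to the cumulative frequencies printed earlier for the first counterfactual and writes the figure to an encapsulated postscript file. It is not the package's plotting code: the smoother settings are R's lowess defaults, the object names are hypothetical, the file goes to a temporary directory, and the final value at distance 1 is taken to be 1 because the distribution is cumulative.

```r
# Cutoffs and cumulative frequencies for the first counterfactual,
# copied from the printed output (value at distance 1 assumed to be 1).
freq <- seq(0, 1, by = 0.05)
cf1 <- c(0, 0, 0.008196721, 0.081967213, 0.262295082, 0.483606557,
         0.680327869, 0.844262295, 0.950819672, 0.991803279, 0.991803279,
         rep(1, 10))

# LOWESS smooth of the raw cumulative frequencies (default span).
sm <- lowess(freq, cf1)

# Write both the raw points and the smoothed curve to an EPS file.
eps.path <- file.path(tempdir(), "cumfreq_cf1.eps")
postscript(eps.path, onefile = FALSE, horizontal = FALSE,
           paper = "special", width = 6, height = 6)
plot(freq, cf1, type = "p", xlab = "Distance",
     ylab = "Cumulative % of data within distance")
lines(sm)
dev.off()
```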

Not surprisingly, the function summary summarizes the most important information produced by the function whatif. The output object, a list, contains this information, which may also be printed to the screen. For example, typing:

  > summary(my.result)
displays the total number of counterfactuals evaluated; the number of counterfactuals that are in the convex hull of the observed covariate data; the percentage of data points nearby each counterfactual averaged over all counterfactuals; and a table that contains both the results of the convex hull test and the percentage of data points nearby the counterfactual for each counterfactual. Alternatively, typing:
  > my.result.sum <- summary(my.result)
saves the summary information as the object my.result.sum, which can be printed to the screen by typing either:
  > print(my.result.sum)
or:
  > my.result.sum
at the command prompt.

Finally, the package WHATIF includes two print functions. To print the output object returned by whatif to the screen, type either:

  > print(my.result)
or the name of the output object at the command prompt. Not printed by these calls are the matrices of distances and cumulative frequencies. These large objects can be printed by setting the parameters print.dist (if the distance matrix was returned) and print.freq equal to TRUE, respectively. For example, to print the entire output object except for the matrix of Gower distances to the screen, we type:
  > print(my.result, print.freq = TRUE)
The other print function controls the printing of the output object from the function summary.



Gary King 2010-08-12