Details

z.out <- zelig(formula, model, data, by = NULL, ...)
The zelig() command estimates a selected statistical model given the specified data. You may name the output object (z.out above) anything you desire. You must include three required arguments, in the following order:
1. formula takes the form y ~ x1 + x2, where y is the dependent variable and x1 and x2 are the explanatory variables, and y, x1, and x2 are contained in the same dataset. The + symbol means "inclusion," not "addition." You may include interaction terms in the form x1*x2 without having to compute them in prior steps or include the main effects separately. For example, R treats the formula y ~ x1*x2 as y ~ x1 + x2 + x1:x2, where x1:x2 denotes the interaction term. To prevent R from automatically including the separate main effect terms, use the I() function, thus: y ~ I(x1 * x2).
2. model lets you choose which statistical model to run. You must put the name of the model in quotation marks, in the form model = "ls", for example. See Section for a list of currently supported models.
3. data specifies the data frame containing the variables called in the formula, in the form data = mydata. Alternatively, you may input multiply imputed datasets in the form data = mi(data1, data2, ...). If you are working with matched data created using MatchIt, you may create a data frame within the zelig() statement by using data = match.data(...). In all cases, the data frame or MatchIt object must have been previously loaded into the working memory.
4. by (an optional argument which is by default NULL) allows you to choose a factor variable (see Section ) in the data frame as a subsetting variable. For each of the unique strata defined in the by variable, zelig() does a separate run of the specified model. The variable chosen should not be in the formula, because there will be no variance in the by variable in the subsets. If you have one data set for all 191 countries in the UN, for example, you may use the by option to run the same model 191 times, once on each country, all with a single zelig() statement. You may also use the by option to run models on MatchIt subclasses.
5. The output object, z.out, contains all of the options chosen, including the name of the data set. Because data sets may be large, Zelig does not store the full data set, but only the name of the dataset. Every time you use a Zelig function, it looks for the dataset with the appropriate name in working memory. (Thus, it is critical that you do not change the name of your data set, or perform any additional operations on your selected variables between calling zelig() and setx(), or between setx() and sim().)
6. If you would like to view the regression output at this intermediate step, type summary(z.out) to return the coefficients, standard errors, t-statistics, and p-values. We recommend instead that you calculate quantities of interest; creating z.out is only the first of three steps in this task.
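The formula rules and the by argument described above can be checked in base R without Zelig. The following is a minimal sketch, assuming a hypothetical data frame toy with columns y, x1, x2, and g (none of these names come from the manual), and using lm() as a stand-in for whichever model would be passed to zelig(). It is not Zelig's internal code.

```r
# Hypothetical toy data; the column names are illustrative only.
set.seed(1)
toy <- data.frame(
  x1 = rnorm(60),
  x2 = rnorm(60),
  g  = factor(rep(c("a", "b", "c"), each = 20))
)
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(60)

# y ~ x1 * x2 expands to both main effects plus the x1:x2 interaction.
full_terms <- attr(terms(y ~ x1 * x2), "term.labels")
print(full_terms)

# Wrapping the product in I() keeps only the product as a single regressor.
product_terms <- attr(terms(y ~ I(x1 * x2)), "term.labels")
print(product_terms)

# Roughly what a by variable does: fit the same model once per stratum.
# lm() is a stand-in here, not the model-fitting code Zelig uses.
fits <- lapply(split(toy, toy$g), function(d) lm(y ~ x1 + x2, data = d))
coef_by_group <- t(sapply(fits, coef))
print(coef_by_group)
```

Note that g does not appear in the formula: within each stratum it is constant, which is why the by variable should be left out of the model.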
x.out <- setx(z.out, fn = list(numeric = mean, ordered = median, others = mode), data = NULL, cond = FALSE, ...)
The setx() command lets you choose values for the explanatory variables, with which sim() will simulate quantities of interest. There are two types of setx() procedures:
- You may perform the usual unconditional prediction (by default, cond = FALSE), by explicitly choosing the values of each explanatory variable yourself or letting setx() compute them, either from the data used to create z.out or from a new data set specified in the optional data argument. You may also compute predictions for all observed values of your explanatory variables using fn = NULL.
- Alternatively, for advanced uses, you may perform conditional prediction (cond = TRUE), which predicts certain quantities of interest by conditioning on the observed value of the dependent variable. In a simple linear regression model, this procedure is not particularly interesting, since the conditional prediction is merely the observed value of the dependent variable for that observation. However, conditional prediction is extremely useful for other models and methods, including the following:
  - In a matched sampling design, the sample average treatment effect for the treated can be estimated by computing the difference between the observed dependent variable for the treated group and their expected or predicted values of the dependent variable under no treatment.
  - With censored data, conditional prediction will ensure that all predicted values are greater than the censored observed values.
  - In ecological inference models, conditional prediction guarantees that the predicted values are on the tomography line and thus restricted to the known bounds.
  - The conditional prediction in many linear random effects (or Bayesian hierarchical) models is a weighted average of the unconditional prediction and the value of the dependent variable for that observation, with the weight being an estimable function of the accuracy of the unconditional prediction. When the unconditional prediction is highly certain, the weight on the value of the dependent variable for this observation is very small, hence reducing inefficiency; when the unconditional prediction is highly uncertain, the relative weight on the unconditional prediction is very small, hence reducing bias. Although the simple weighted average expression no longer holds in nonlinear models, the general logic still holds and the mean square error of the measurement is typically reduced.
  In these and other models, conditioning on the observed value of the dependent variable can vastly increase the accuracy of prediction and measurement.
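The weighted-average logic for linear random effects models can be illustrated with a few lines of arithmetic. This is a sketch only: the normal-normal shrinkage form and the variance values are assumptions chosen for illustration, and the function name conditional_prediction is hypothetical. Real models estimate the weight from the data.

```r
# Precision-weighted average of an unconditional prediction and the
# observed outcome. The weighting formula is an illustrative assumption.
conditional_prediction <- function(uncond, y_obs, var_uncond, var_obs) {
  # The weight on the observed value grows with the uncertainty of the
  # unconditional prediction.
  w_obs <- var_uncond / (var_uncond + var_obs)
  (1 - w_obs) * uncond + w_obs * y_obs
}

# Highly certain unconditional prediction: the result stays near 10.
certain <- conditional_prediction(uncond = 10, y_obs = 14,
                                  var_uncond = 0.01, var_obs = 1)

# Highly uncertain unconditional prediction: the result moves toward 14.
uncertain <- conditional_prediction(uncond = 10, y_obs = 14,
                                    var_uncond = 100, var_obs = 1)

print(c(certain = certain, uncertain = uncertain))
```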
The setx() arguments for unconditional prediction are as follows:
1. z.out, the zelig() output object, must be included first.
2. You can set particular explanatory variables to specified values. For example:
```
> z.out <- zelig(vote ~ age + race, model = "logit", data = turnout)
> x.out <- setx(z.out, age = 30)
```
  setx() sets the variables not explicitly listed to their mean if numeric, their median if ordered factors, and their mode if unordered factors, logical values, or character strings. Alternatively, you may specify one explanatory variable as a range of values, creating one observation for every unique value in the range:
```
> x.out <- setx(z.out, age = 18:95)
```
  This creates 78 observations, with age set to 18 in the first observation, 19 in the second observation, and so on up to 95 in the 78th observation. The other variables are set to their default values, but this may be changed by setting fn, as described next.
3. Optionally, fn is a list that lets you choose a different function to apply to explanatory variables of class
  - numeric, which is mean by default,
  - ordered factor, which is median by default, and
  - other variables, which consist of logical variables, character strings, and unordered factors, and are set to their mode by default.
  While any function may be applied to numeric variables, mean will default to median for ordered factors, and mode is the only available option for other types of variables. In the special case, fn = NULL, setx() returns all of the observations.
4. You cannot perform other math operations within the fn argument, but can use the output from one call of setx to create new values for the explanatory variables. For example, to set the explanatory variables to one standard deviation below their mean:
```
> X.sd <- setx(z.out, fn = list(numeric = sd))     
> X.mean <- setx(z.out, fn = list(numeric = mean)) 
> x.out <- X.mean - X.sd
```
5. Optionally, data identifies a new data frame (rather than the one used to create z.out) from which the setx() values are calculated. You can use this argument to set values of the explanatory variables for hold-out or out-of-sample fit tests.
6. The cond argument is always FALSE for unconditional prediction.
If you wish to calculate risk ratios or first differences, call setx() a second time to create an additional set of values for the explanatory variables. For example, continuing from the example above, you may create an alternative set of explanatory variable values one standard deviation above their mean:
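The default rules in the list above (mean for numeric variables, median for ordered factors, mode for everything else) can be mimicked by hand. The sketch below is an approximation in base R, not setx() itself; the data frame toy and the helpers stat_mode, ordered_median, and default_value are hypothetical names introduced for illustration.

```r
# Hypothetical data; column names are illustrative only.
toy <- data.frame(
  age  = c(22, 35, 47, 51, 68),
  educ = factor(c("low", "mid", "mid", "high", "high"),
                levels = c("low", "mid", "high"), ordered = TRUE),
  race = factor(c("white", "white", "other", "white", "other"))
)

# Most frequent value; ties go to the first entry in table() order.
stat_mode <- function(x) names(which.max(table(x)))

# Median level of an ordered factor, computed on the integer codes
# (the upper middle level is used when the count is even).
ordered_median <- function(x) levels(x)[ceiling(median(as.integer(x)))]

default_value <- function(x) {
  if (is.numeric(x)) {
    mean(x)
  } else if (is.ordered(x)) {
    ordered_median(x)
  } else {
    stat_mode(x)
  }
}

defaults <- lapply(toy, default_value)
str(defaults)

# One profile per age in a range, other variables held at their defaults.
profiles <- data.frame(age = 18:95, educ = defaults$educ, race = defaults$race)
print(nrow(profiles))
```

The range 18:95 yields one row per integer age, which matches the 78 observations described for setx(z.out, age = 18:95).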
```
> x.alt <- X.mean + X.sd
```
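The arithmetic behind profiles one standard deviation below and above the mean can be reproduced on plain column summaries. This sketch assumes a hypothetical data frame toy with numeric columns only, and it operates on ordinary numeric vectors rather than on setx() output objects.

```r
# Hypothetical numeric explanatory variables.
toy <- data.frame(
  age    = c(22, 35, 47, 51, 68),
  income = c(2.1, 3.4, 5.0, 4.2, 7.3)
)

col_mean <- sapply(toy, mean)
col_sd   <- sapply(toy, sd)

x_low  <- col_mean - col_sd   # one standard deviation below the mean
x_high <- col_mean + col_sd   # one standard deviation above the mean

print(rbind(low = x_low, mean = col_mean, high = x_high))
```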
The required arguments for conditional prediction are as follows:
1. z.out, the zelig() output object, must be included first.
2. fn, which equals NULL to indicate that all of the observations are selected. You may only perform conditional inference on actual observations, not the mean of observations or any other function applied to the observations. Thus, if fn is missing, but cond = TRUE, setx() coerces fn = NULL.
3. data, the data for conditional prediction.
4. cond, which equals TRUE for conditional prediction.
Additional arguments, such as any of the variable names, are ignored in conditional prediction since the actual values of that observation are used.
s.out <- sim(z.out, x = x.out, x1 = NULL, num = c(1000, 100), bootstrap = FALSE, bootfn = NULL, ...)
The sim() command simulates quantities of interest given the output objects from zelig() and setx(). This procedure uses only the assumptions of the statistical model. The sim() command performs either unconditional or conditional prediction depending on the options chosen in setx().
The arguments are as follows for unconditional prediction:
1. z.out, the model output from zelig().
2. x, the output from the setx() procedure performed on the model output.
3. Optionally, you may calculate first differences by specifying x1, an additional setx() object. For example, using the x.out and x.alt, you may generate first differences using:
```
> s.out <- sim(z.out, x = x.out, x1 = x.alt)
```
4. By default, the number of simulations, num, equals 1000 (or 100 simulations if bootstrap is selected), but this may be decreased to increase computational speed, or increased for additional precision.
5. Zelig simulates parameters from classical maximum likelihood models using the asymptotic normal approximation to the log-likelihood. This is the same assumption as used for frequentist hypothesis testing (which is of course equivalent to the asymptotic approximation of a Bayesian posterior with improper uniform priors). See King, Tomz, and Wittenberg (2000). For Bayesian models, Zelig simulates quantities of interest from the posterior density, whenever possible. For robust Bayesian models, simulations are drawn from the identified class of Bayesian posteriors.
6. Alternatively, you may set bootstrap = TRUE to simulate parameters using bootstrapped data sets. If your dataset is large, bootstrap procedures will usually be more memory intensive and time-consuming than simulation using the asymptotic normal approximation. The type of bootstrapping (including the sampling method) is determined by the optional argument bootfn, described below.
7. If bootstrap = TRUE is selected, sim() will bootstrap parameters using the default bootfn, which re-samples from the data frame with replacement to create a sampled data frame of the same number of observations, and then re-runs zelig() (inside sim()) to create one set of bootstrapped parameters. Alternatively, you may create a function outside the sim() procedure to handle different bootstrap procedures. Please consult help(boot) for more details.
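The two simulation strategies in the list above can be sketched in base R. This is an illustration under stated assumptions, not Zelig's implementation: lm() stands in for the fitted model, the data frame toy and the profiles x_lo and x_hi are hypothetical, and mvrnorm() comes from the MASS package that ships with standard R distributions. The draw counts of 1000 and 100 mirror the defaults given for num.

```r
library(MASS)  # provides mvrnorm()

set.seed(42)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)

fit <- lm(y ~ x, data = toy)

# Asymptotic normal approach: draw coefficient vectors from a multivariate
# normal centered on the estimates, using the estimated covariance matrix.
sims <- mvrnorm(1000, mu = coef(fit), Sigma = vcov(fit))

# Two profiles of explanatory values (the leading 1 is the intercept).
x_lo <- c(1, -1)
x_hi <- c(1,  1)

# First difference in the expected value, one value per simulated draw.
fd_normal <- as.vector(sims %*% (x_hi - x_lo))

# Bootstrap approach: resample rows with replacement and refit each time.
boot_coef <- t(replicate(100, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = toy[idx, , drop = FALSE]))
}))
fd_boot <- as.vector(boot_coef %*% (x_hi - x_lo))

# Mean and 95% interval of the simulated first differences.
print(rbind(
  normal    = c(mean = mean(fd_normal), quantile(fd_normal, c(0.025, 0.975))),
  bootstrap = c(mean = mean(fd_boot),   quantile(fd_boot,   c(0.025, 0.975)))
))
```

With well-behaved data of this size the two approaches should give similar summaries; the bootstrap refits the model once per draw, which is why it costs more time and memory on large datasets.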
For conditional prediction, sim() takes only two required arguments:
1. z.out, the model output from zelig().
2. x, the conditional output from setx().
3. Optionally, for duration models, cond.data, which is the data argument from setx(). For models for duration dependent variables (see Section ), sim() must impute the uncensored dependent variables before calculating the average treatment effect. Inputting the cond.data allows sim() to generate appropriate values.
Additional arguments are ignored or generate error messages.

Gary King 2011-11-29