Data Preparations

YOURCAST operates on time series cross-sectional data indexed by (1) a time period such as a year, (2) a grouped continuous variable such as an age group, and (3) a spatial or geographic variable such as a geographic region or country. To fix ideas, we refer to these as time, age, and geography, respectively, although the indexes may represent other quantities in other applications. (Either the age or geography index, but not both, may be dropped if desired.) We require a single dependent variable, such as mortality rates, that is the same and has the same meaning for all units (see §8.4 for an exception). Covariates may differ in number, meaning, and content across both age and geography.

Thus, YOURCAST analyzes a set of data sets, each defined for one cross-sectional unit indexed by age and/or geography. Inside the data set corresponding to each cross-sectional unit is a time series with measures on the dependent variable and the covariates observed in this cross section. An example would be an annual time series (say 1952-1996) with the dependent variable of mortality rates and several covariates, all within the cross-section of 15-20 year olds in Uganda. All cross-sections should have the same time indices (1952-1996), possibly with some different overlapping observation periods, the same dependent variable, and covariates that are the same, completely different, or overlapping from cross-section to cross-section.

All the individual cross-section data sets (each containing a time series) must be in a single subdirectory on disk as fixed width text files (.txt), comma separated value files (.csv), or Stata data files (.dta). (Alternatively, they may be in memory, in your workspace.) Each file must be named with one string in three parts: an alphanumeric tag of the user's choice, a geography code of between zero (for no geography index) and four digits, and an age group code of between zero (for no age group index) and four digits. For example, if you have observations on cancer deaths for age group 45 (which might represent 45-50 year olds) for U.S. citizens (e.g., geography code 2450), you might choose the tag "cancer". A file extension is added as well, so if the data are in a plain text file, putting these elements together gives the file name cancer245045.txt.
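The naming rule above can be sketched in Python as follows. This is only an illustration and not part of YourCast: the function name `cross_section_filename` is hypothetical, and the sketch assumes that codes are passed as digit strings exactly as they should appear in the file name.

```python
def cross_section_filename(tag, geo_code="", age_code="", ext="txt"):
    """Concatenate tag + geography code + age code, then add the extension.

    Hypothetical helper: an empty string means that index is dropped.
    """
    if not tag.isalnum():
        raise ValueError("tag must be alphanumeric")
    for name, code in (("geography", geo_code), ("age", age_code)):
        if code and not code.isdigit():
            raise ValueError(f"{name} code must contain digits only")
        if len(code) > 4:
            raise ValueError(f"{name} code may have at most four digits")
    if not geo_code and not age_code:
        # Either the age or the geography index may be dropped, but not both.
        raise ValueError("at least one of geo_code and age_code is required")
    return f"{tag}{geo_code}{age_code}.{ext}"


print(cross_section_filename("cancer", "2450", "45"))  # cancer245045.txt
```

Passing the codes as strings rather than integers keeps any leading zeros intact, which matters because the position of each digit in the name carries meaning.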

So we can understand your coding scheme, include an extra file in the same directory called tag.index.code, where "tag" is the actual alphanumeric tag you chose (not the word t-a-g). For the example above, the filename would be cancer.index.code. The contents of the file should be zero to four copies of the letter g followed by zero to four copies of the letter a, one letter for each digit of the geography and age codes, respectively. In the example above, the entire contents of the file are: ggggaa.
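To illustrate how the index code resolves a file name into its parts, here is a minimal Python sketch. It is not YourCast's own parsing code; the function names `parse_index_code` and `split_filename` are hypothetical, and the sketch assumes the digits for geography and age sit at the very end of the name, just before the extension.

```python
import re
from pathlib import Path


def parse_index_code(text):
    """Return (number of geography digits, number of age digits)."""
    match = re.fullmatch(r"(g{0,4})(a{0,4})", text.strip())
    if match is None:
        raise ValueError("index code must be 0-4 letters g then 0-4 letters a")
    n_geo, n_age = len(match.group(1)), len(match.group(2))
    if n_geo + n_age == 0:
        raise ValueError("age and geography cannot both be dropped")
    return n_geo, n_age


def split_filename(filename, index_code):
    """Split e.g. 'cancer245045.txt' into (tag, geography code, age code)."""
    n_geo, n_age = parse_index_code(index_code)
    stem = Path(filename).stem          # file name without the extension
    cut = len(stem) - (n_geo + n_age)   # where the tag ends
    tag, digits = stem[:cut], stem[cut:]
    if cut <= 0 or not digits.isdigit():
        raise ValueError(f"{filename!r} does not match code {index_code!r}")
    return tag, digits[:n_geo], digits[n_geo:]


print(split_filename("cancer245045.txt", "ggggaa"))  # ('cancer', '2450', '45')
```

Without the index code, a name such as cancer245045.txt would be ambiguous, since the six digits could be divided between geography and age in more than one way.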

Optionally, you may also add files that contain labels for each of the time, age, and geography codes. If these are included, they will make text and graphics output easier to interpret (and they may be useful documentation for you separate from YourCast). The files are tag.T.names for time periods, tag.A.names for age groups, and tag.G.names for geographic regions, where again "tag" is your chosen alphanumeric tag. The contents of each file should be ASCII text with all valid numerical codes in the first column and a corresponding label in the second column. Include column labels in the first row. So for geography, the second column might be country names and the columns would be labeled "region" and "name". If the codes are interpretable as is, as is often the case for age groups and time periods, then you can omit the corresponding file.
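A label file of this kind could be produced with a short Python sketch such as the one below. Several details are assumptions rather than documented requirements: the text above does not say how the two columns are delimited, so the sketch uses a tab; the helper name `write_names_file` is hypothetical; and the code 9999 and its label are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_names_file(directory, tag, index_letter, labels, header):
    """Write tag.<T|A|G>.names with a header row, then one code/label row each.

    Assumption: columns are tab-separated (the delimiter is not specified).
    """
    if index_letter not in ("T", "A", "G"):
        raise ValueError("index_letter must be 'T', 'A', or 'G'")
    path = Path(directory) / f"{tag}.{index_letter}.names"
    with open(path, "w", newline="", encoding="ascii") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        for code in sorted(labels):
            writer.writerow([code, labels[code]])
    return path


# 2450 is the example geography code used earlier; 9999 is made up.
geo_labels = {"2450": "United States", "9999": "Made-up region"}
with tempfile.TemporaryDirectory() as tmp:
    names_path = write_names_file(tmp, "cancer", "G", geo_labels, ("region", "name"))
    file_name = names_path.name
    written = names_path.read_text(encoding="ascii")

print(file_name)
print(written, end="")
```

Opening the file with the ASCII encoding makes the script fail early if a label contains a character outside ASCII, which matches the requirement that the file contents be ASCII text.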

Finally, if you wish to smooth over geographic regions, which the map and bayes methods allow, you must also include a file called tag.proximity.txt, where "tag" is your chosen alphanumeric tag and the ".txt" extension indicates a text file; comma separated value files (.csv) and Stata data files (.dta) may also be used. Each row of the proximity file has three columns, consisting of the geographic codes for two regions followed by a score indicating the proximity of those two regions; please include column labels. The larger the proximity score, the more proximate that pair of regions is in the prior; a zero element means the two geographic areas are unrelated, and the diagonal is ignored. For convenience, geographic regions that are unrelated (and would have zero entries in the symmetric matrix) may be omitted from the proximity file. In addition, the proximity file may include rows corresponding to geographic regions not included in the present analysis.
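The three-column layout can be sketched in Python as follows, here producing comma separated text. This is an illustration under stated assumptions, not YourCast code: the function names and the column labels `geo1`, `geo2`, and `proximity` are hypothetical, the codes other than 2450 and all scores are made up, and the sketch does not settle whether each pair should be listed in both directions.

```python
import csv
import io


def proximity_rows(scores):
    """Turn {(code_a, code_b): score} into rows, dropping uninformative pairs.

    Zero scores may be omitted (unrelated regions) and the diagonal is ignored,
    so neither is written out.
    """
    rows = []
    for (code_a, code_b), score in sorted(scores.items()):
        if code_a == code_b or score == 0:
            continue
        rows.append((code_a, code_b, score))
    return rows


def to_csv_text(rows, header=("geo1", "geo2", "proximity")):
    """Render the rows as CSV text with column labels in the first row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


example_scores = {
    ("2450", "2090"): 1.0,  # related pair, kept
    ("2450", "4010"): 0,    # unrelated pair, may be omitted
    ("2450", "2450"): 5.0,  # diagonal entry, ignored
}
rows = proximity_rows(example_scores)
csv_text = to_csv_text(rows)
print(csv_text, end="")
```

Storing only the nonzero pairs keeps the file short when most regions are unrelated, which is the convenience the paragraph above describes.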



Gary King 2010-09-14