
Inputs

control
Specifies a control file to load, giving filenames and binary classifications for the texts. The file should contain three sep-separated (or whitespace-delimited, if sep is NULL as in the default case) columns: one headed ``filename'' providing a list of filenames; one headed ``truth'' providing the classifications for a subset and missing values (NA or ``.'') for the others; and a third headed ``trainingset'' with a 1 for each element of the training set and a 0 for each element of the test set. When trainingset=1, truth should not be missing, or the observation will be deleted. The function will compute the distribution of documents across categories for all documents with trainingset=0 (if truth is not missing for some of these observations, it will not be used during estimation but will be used for printing and graphics on output to compare to the estimates). Defaults to ``control.txt''.

Alternatively, one can provide a data frame in the same three-column format. This will be written to readmetmpctrl.txt in the working directory during program operation.
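
As an illustration of the three-column layout described above, the following is a minimal sketch in R, not code taken from ReadMe itself. The file names doc1.txt through doc4.txt and the objects ctrl, control_path and check are hypothetical; the file is written to a temporary directory so that nothing in the working directory is touched.

```r
# Hypothetical file names; replace with the paths of your own texts.
ctrl <- data.frame(
  filename    = c("doc1.txt", "doc2.txt", "doc3.txt", "doc4.txt"),
  truth       = c(1, 0, NA, 1),   # NA marks an unknown classification
  trainingset = c(1, 1, 0, 0),    # 1 = training set, 0 = test set
  stringsAsFactors = FALSE
)

# Training-set rows must have a non-missing truth value.
stopifnot(!any(is.na(ctrl$truth[ctrl$trainingset == 1])))

# Write a whitespace-delimited control file (the layout expected when sep is NULL).
control_path <- file.path(tempdir(), "control.txt")
write.table(ctrl, control_path, quote = FALSE, row.names = FALSE)

# Read the file back to confirm that the three headed columns survived.
check <- read.table(control_path, header = TRUE, stringsAsFactors = FALSE)
```

Either the data frame ctrl or the path in control_path could then be supplied as the control input.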

stem
Should the Porter stemmer be used to stem the individual words? Please note: the Porter stemmer relies on case-insensitivity and will only function properly when ignore.case is set to TRUE. See details at http://www.tartarus.org/~martin/PorterStemmer/. Defaults to TRUE.

strip.tags
Indicates whether HTML/XML/SGML tags, ``head'' content, and JavaScript should be stripped from the input. Defaults to TRUE.

table.file
Path of the file in which the table of word frequencies should be stored. Defaults to ``tablefile.txt''. The user must have read and write access to this file, and any prior contents of the file will be overwritten.

threshold
A floating-point number between 0 and 1. Only words occurring in more than threshold times the number of texts (and in less than 1-threshold times the number of texts) will be included as features. To include all words, set threshold to 0. Defaults to 0.01, which includes all words occurring in more than 1% (and less than 99%) of texts.
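
The filtering rule can be sketched in R on a toy example. This is only an illustration of the rule as worded above, not ReadMe's internal implementation; the matrix dtm, its words, and the use of strict inequalities at both ends are assumptions, and the package's handling of boundary cases may differ.

```r
# Toy binary document-term matrix: rows are texts, columns are words.
dtm <- matrix(c(1, 1, 0, 0,
                1, 0, 0, 1,
                1, 1, 0, 0,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("the", "vote", "zzz", "tax")))

threshold <- 0.2

# Share of texts in which each word occurs at least once.
share <- colMeans(dtm > 0)

# Keep words that are neither too rare nor too common (assumed strict bounds).
keep <- share > threshold & share < 1 - threshold
kept_words <- names(share)[keep]
```

Here ``the'' occurs in every text and ``zzz'' in none, so both are dropped, while ``vote'' (half of the texts) and ``tax'' (a quarter) are kept.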

pyexe
Path to the Python interpreter. If NULL, ReadMe will first search the system path and then, if on Windows, the default installation directories. Set this variable if ReadMe is unable to locate your Python interpreter or if you wish to use an interpreter other than the one on your system path. Defaults to NULL.

python3
Python versions 3.0 and greater require that different syntax be passed to Python, so set this to TRUE if you are using Python 3.0 or greater. Defaults to FALSE.

sep
String variable indicating column separators in control file. Defaults to NULL, in which case whitespace separates columns.

printit
Boolean variable indicating whether or not progress of text processing module should be output to screen. Defaults to TRUE.

fullfreq
Boolean variable indicating whether data returned should be frequency data giving the number of times a word occurs rather than the usual binary (word occurs or not). Defaults to FALSE, meaning binary data are returned.
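
The difference between the two kinds of data can be sketched in R as follows. The matrix counts, its file names, and its words are hypothetical, and the layout of the data actually returned by the function is not shown here; the sketch only contrasts frequency values with their binary counterpart.

```r
# Hypothetical word counts: rows are texts, columns are words.
counts <- matrix(c(3, 0, 1,
                   0, 2, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1.txt", "doc2.txt"),
                                 c("tax", "vote", "war")))

# Frequency data record how often each word occurs (as with fullfreq = TRUE).
# Binary data record only whether it occurs (as with the default, fullfreq = FALSE).
binary <- (counts > 0) * 1
```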



Gary King 2011-07-12