
Details

When a text file is used as the control file, it should be a comma-delimited table in which each line refers to a text to be included in the test set or training set. The first column, filename, is the path to the text (either absolute or relative paths will work). The second column, truth, defines the category of each text. The third column, trainingset, contains a binary value which indicates whether the text should be included in the training set.

If the trainingset value is 0 or is omitted, ReadMe will treat the text as part of the test set. The truth value is typically omitted for texts in the test set. If a truth value is included for a test set text, it will not be used during estimation; it will be used only in the printed and graphical output, where it is compared with ReadMe's estimate of the category distribution.

Note that there is no numerical significance to the values used in the truth column to identify the categories; these values serve only as labels.

Consider the following example control file:

filename,truth,trainingset
/users/m/readme/example/file1.txt,1,1
/users/m/readme/example/file2.txt,2,1
/users/m/readme/example/file3.txt,2,1
/users/m/readme/example/file4.txt,3,1
/users/m/readme/example/file5.txt,,
/users/m/readme/example/file6.txt,,
/users/m/readme/example/file7.txt,,
/users/m/readme/example/file8.txt,,
/users/m/readme/example/file9.txt,,

ReadMe always disregards the first line of the control file, which can be used to label the columns. In this example, ReadMe will use the text documents in file1.txt through file4.txt as training texts, and will compute the distribution across the categories 1, 2, and 3 for the remaining texts (file5.txt through file9.txt).
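The rules above can be illustrated with a short sketch in Python. This is not ReadMe's own code (ReadMe is an R package); it only mimics the documented behavior so that a control file can be checked before use. The function name split_control_rows is hypothetical, and treating any trainingset value other than 1 as "test set" is an assumption that goes slightly beyond the documented 0-or-omitted rule.

```python
import csv
import io


def split_control_rows(control_text):
    """Split control-file rows into (training, test) lists of (filename, truth).

    Hypothetical helper for illustration; it follows the documented rules but
    is not part of ReadMe.
    """
    reader = csv.reader(io.StringIO(control_text))
    next(reader, None)  # the first line is always disregarded (column labels)
    training, test = [], []
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Pad short rows so omitted truth/trainingset fields read as empty.
        filename, truth, flag = (row + ["", ""])[:3]
        # truth is kept as a string label; it has no numerical meaning.
        truth = truth.strip() or None
        if flag.strip() == "1":
            training.append((filename.strip(), truth))
        else:
            # 0 or omitted -> test set (assumption: any other value too)
            test.append((filename.strip(), truth))
    return training, test


example = """filename,truth,trainingset
/users/m/readme/example/file1.txt,1,1
/users/m/readme/example/file2.txt,2,1
/users/m/readme/example/file3.txt,2,1
/users/m/readme/example/file4.txt,3,1
/users/m/readme/example/file5.txt,,
/users/m/readme/example/file6.txt,,
/users/m/readme/example/file7.txt,,
/users/m/readme/example/file8.txt,,
/users/m/readme/example/file9.txt,,
"""

training, test = split_control_rows(example)
print(len(training), "training texts;", len(test), "test texts")
```

Applied to the example control file, the sketch places file1.txt through file4.txt in the training list and the remaining five texts in the test list.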

When working with large control files, it may be useful to build and manage the control file in a spreadsheet program and then export the resulting file in the comma-delimited CSV format (both Microsoft Excel and OpenOffice.org Calc support this feature). Likewise, on systems which support a UNIX shell, the ls -1 command can be used to build a list of the texts in a given folder, which can then be copied and pasted into a spreadsheet or directly into a text control file.
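As an alternative to copying the output of ls -1 by hand, the listing and the control file can be produced in one step by a small script. The following is a minimal sketch in Python, not a feature of ReadMe; the function name write_control_file and its labels argument (a mapping from file name to category label for the hand-coded texts) are hypothetical. It assumes that every labeled text belongs in the training set and that every unlabeled text belongs in the test set.

```python
import csv
from pathlib import Path


def write_control_file(text_dir, control_path, labels=None):
    """Write a control file listing every file in text_dir.

    Hypothetical helper: labels maps a file name to its category label.
    Labeled files get trainingset = 1; unlabeled files are left blank and
    so would be treated as test-set texts.
    """
    labels = labels or {}
    # List the folder before opening the output, so the control file itself
    # is not picked up on a first run when it is written into text_dir.
    paths = sorted(p for p in Path(text_dir).iterdir() if p.is_file())
    with open(control_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # Header line; ReadMe disregards the first line of the control file.
        writer.writerow(["filename", "truth", "trainingset"])
        for path in paths:
            truth = labels.get(path.name, "")
            writer.writerow([str(path), truth, 1 if truth != "" else ""])
    return len(paths)
```

A call such as write_control_file("example", "control.txt", labels={"file1.txt": "1"}) would list every file in the example folder and mark only file1.txt as a training text; the folder and file names here are placeholders.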



Gary King 2011-07-12