lsm - Latent Semantic Mapping tool
lsm lsm_command [command_options] map_file
[input_files]
The Latent Semantic Mapping framework is a language independent,
Unicode based technology that builds maps and uses them to classify
texts into one of a number of categories.
lsm is a tool to create, manipulate, test, and dump Latent
Semantic Mapping maps. It is designed to provide access to a large subset of
the functionality of the Latent Semantic Mapping API, mainly for rapid
prototyping and diagnostic purposes, but possibly also for simple shell
script based applications of Latent Semantic Mapping.
lsm provides a variety of commands (lsm_command in
the Synopsis), each of which often has a wealth of options (see the Command
Options below). Command names may be abbreviated to unambiguous
prefixes.
- lsm create
map_file input_files
- Create a new LSM map from the specified input_files.
- lsm update
map_file input_files
- Add the specified input_files to an existing LSM map.
- lsm evaluate
map_file input_files
- Classify the specified input_files into the categories of the LSM
map.
- lsm cluster
[--k-means=N | --agglomerative=N] [--apply]
- Compute clusters for the map, and, if the --apply option is
specified, transform the map accordingly. Multiple levels of clustering
may be applied for faster performance on large maps, e.g.
lsm cluster --k-means=100 --each --agglomerative=100 --agglomerative=1000 my.map
first computes 100 clusters using (fast) k-means clustering,
computes 100 subclusters for each first stage cluster using
agglomerative clustering, and finally reduces those 10000 clusters to
1000 using agglomerative clustering.
- lsm dump
map_file [input_files]
- Without input_files, dumps all words in the map with their counts.
With input_files, dump, for each file, the words that appear in the
map, their counts in the map, and their relative frequencies in the input
file.
- lsm info
map_file
- Bypass the Latent Semantic Mapping framework to extract and print
information about the file and perform a number of consistency checks on
it. (NOT IMPLEMENTED YET)
This section describes the command_options that are
available for the lsm commands. Not all commands support all of these
options; each option is only supported for commands where it makes sense.
However, when a command has one of these options you can count on the same
meaning for the option as in other commands.
- --append-categories
- Directs the update command to put the data into new categories
appended after the existing ones, instead of adding the data to the
existing categories.
- --categories
count
- Directs the evaluate command to only list the top count
categories.
- --category-delimiter
delimiter
- Specify the delimiter to be used to between categories in the
input_files passed to the create and update
commands.
- group
- Categories are separated by a `;' argument.
- file
- Each input_file represents a separate category. This is the default
if the --category-delimiter option is not given.
- line
- Each line represents a separate category.
- string
- Categories are separated by the specified string.
- --clobber
- When creating a map, overwrite an existing file at the path, even if it's
not an LSM map. By default, create will only overwrite an existing
file if it's believed to be an LSM map, which guards against frequent
operator errors such as:
lsm create /usr/include/*.h
- --dimensions
dim
- Direct the create and update commands to use the given
number of dimensions for computing the map (Defaults to the number of
categories). This option is useful to manage the size and computational
overhead of maps with large number of categories.
- --discard-counts
- Direct the create and update commands to omit the raw word /
token counts when writing the map. This results in a map that is more
compact, but cannot be updated any further.
- --hash
- Direct the create and update commands to write the map in a
format that is not human readable with default file manipulation tools
like cat or hexdump. This is useful in applications such as
junk mail filtering, where input data may contain naughty words and where
the contents of the map may tip off spammers what words to avoid.
- --help
- List an overview of the options available for a command. Available for all
commands.
- --html
- Strip HTML codes from the input_files. Useful for mail and web
input. Available for the create, update, evaluate,
and dump commands.
- --junk-mail
- When parsing the input files, apply heuristics to counteract common
methods used by spammers to disguise incriminating words such as:
Zer0 1nt3rest l0ans Substituting letters with digits
W E A L T H Adding spaces between letters
m.o.r.t.g.a.g.e Adding punctuation between letters
Available for the create, update,
evaluate, and dump commands.
- --pairs
- If specified with the create command when building the map, store
counts for pairs of words as well as the words themselves. This can
increase accuracy for certain classes of problems, but will generate
unreasonably large maps unless the vocabulary is fairly limited.
- --stop-words
stop_word_file
- If specified with the create command, stop_word_file is
parsed and all words found are excluded from texts evaluated against the
map. This is useful for excluding frequent, semantically meaningless
words.
- --sweep-cutoff
threshold
- --sweep-frequency
days
- Available for the create and update commands. Every
specified number of days (by default 7), scan the map and remove
from it any entries that have been in the map for at least 2 previous
scans and whose total counts are smaller than threshold.
threshold defaults to 0, so by default the map is not scanned.
- --text-delimiter
delimiter
- Specify the delimiter to be used to between texts in the
input_files passed to the create, update,
evaluate, and dump commands.
- file
- Each input_file represents a separate text. This is the default if
the --text-delimiter option is not given.
- line
- Each line represents a separate text.
- string
- Texts are separated by the specified string.
- --triplets
- If specified with the create command when building the map, store
counts for triplets and pairs of words as well as the words themselves.
This can increase accuracy for certain classes of problems, but will
generate unreasonably large maps unless the vocabulary is fairly
limited.
- --weight
weight
- Scale counts of input words for the create and update
commands by the specified weight, which may be a positive or
negative floating point number.
- --words
- Directs the evaluate or cluster commands to apply to words,
instead of categories.
- --words=count
- Directs the evaluate command to list the top count words,
instead of categories.
- "lsm evaluate --html --junk-mail ~/Library/Mail/V2/MailData/LSMMap2
msg*.txt"
- Simulate the Mail.app junk mail filter by evaluating the specified
files (assumed to each hold the raw text of one mail message) against the
user's junk mail map.
- "lsm dump ~/Library/Mail/V2/MailData/LSMMap2"
- Dump the words accumulated in the junk mail map and their counts.
- "lsm create --category-delimiter=group c_vs_h *.c ';' *.h"
- Create an LSM map trained to distinguish C header files from C source
files.
- "lsm update --weight 2.0 --cat=group c_vs_h ';' ../xy/*.h"
- Add some additional header files with an increased weight to the
training.
- "lsm create --help"
- List the options available for the lsm create command.