TORQUE(1) | General Commands Manual | TORQUE(1) |
torque - launches one or more child processes, each of which performs a
series of bandwidth-intensive operations; after completion, torque reports
the bandwidth actually achieved by each operation during a period when all
operation streams were executing simultaneously.
torque [Global Parameters] [[Local Parameters] Action1] [[Local Parameters]
Action2] [[Local Parameters] Action3] [...]
Global Parameters (affect all actions performed):
Local Parameters (only affect the immediately following action):
Actions (specific action to perform):
torque exercises a computer system in ways that mimic normal operation,
while retaining as much simplicity as possible to aid in debugging. This is
achieved by launching one or more child processes performing a series of
operations (usually high bandwidth) and, after completion, reporting the
bandwidth achieved by each operation during a period where all operation
streams were executing simultaneously in their steady-state behavior. Note
that torque only measures its own processes and does not measure bandwidth
of any other executing process.
This tool is currently in use for measuring system bandwidth,
testing for interactions between different sub-systems, reproducing
problems, power analysis, thermal analysis, signal sensitivity analysis, and
more. torque contains a number of simple tests that
are used together in any combination to exercise memory, video, and/or any
component that supports a file system. Each separate test is run in its own
process, and each action listed above specifies one or more tests to execute
simultaneously. Common processor memory access patterns such as bcopy,
bzero, load, and store are supported as well as different ways to access the
file system. All test runtime parameters can be entered using a
configuration file or as a command line parameter.
When executed, torque reports the most common system characteristics such
as processor speed and memory size. In addition, many more configuration
details are left behind in the file sysctl.current. During an execution, the
first two things displayed are always the torque version number and the
command line used for execution, to make it as easy as possible to rerun the
test and reproduce results at a later time.
Since torque uses a large number of operational parameters, the command line
parameters are broken into three groups: global test parameters, local test
parameters, and actions/tests. All tests have default configurations and can
be executed with just an action parameter. For example, in "torque -V 7" the
action parameter is "-V 7" (run video test 7), requesting execution of a
video system memory read test.
To change test characteristics such as working-set size, transaction size, number of transactions, or duration, place local test parameters on the command line before the action; local test parameters apply only to the action that immediately follows them. For the video read test a common sequence is
torque -n 256 -i 1 -V 6
to make sure there are exactly 256 transactions (-n 256) issued once (-i 1). The local parameters "-n 256 -i 1" only apply to the video read test and not to any other test. If the user wishes to execute a stream of memory reads while performing the video read test, an example command line is:
torque -p 1 -n 8 -i 20 -MB 4 -load -n 256 -V 6
In this case the load test has a working-set size of 4 x 8 = 32 MBytes (-n 8 -MB 4) that is executed twenty times (-i 20). The video test still uses the same parameters mentioned above but is executed ten times (default) since "-i 1" was not entered on the command line. Note that the "-i 20" does not apply to the video test in this example.
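The arithmetic behind this example can be checked with a short sketch. The Python below is illustrative only and is not part of torque; the function name and dictionary keys are hypothetical. It assumes, as in the text, that -MB sets the transaction size in MBytes, -n the number of transactions, and -i the number of iterations, and that the total traffic per process is the product of the three (which agrees with the "Bytes Transferred" line in the sample output later in this page).

```python
def load_test_sizes(transaction_mb, transactions, iterations):
    """Return working-set size and total traffic (in MBytes) for one process."""
    working_set_mb = transaction_mb * transactions   # -MB times -n
    total_mb = working_set_mb * iterations           # repeated -i times
    return {"working_set_mb": working_set_mb, "total_mb": total_mb}


# Local parameters of the load test in the example: -n 8 -i 20 -MB 4
sizes = load_test_sizes(transaction_mb=4, transactions=8, iterations=20)
print(sizes)
```

For the example above this gives a 32 MByte working set and 640 MBytes moved by the load process.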
Global parameters can be entered anywhere on the command line and control functions that affect all of the tests. For example
torque -n 256 -i 1 -V 6 -nh
removes the header output for each test, reducing the amount of text added to the screen after execution.
-aio <n>
torque makes sure a single process does not go over this limit, but it is up
to the user to make sure multiple processes do not exceed the system's
maximum outstanding AIO limit.
-bcopy
-bzero
-files <n>
Random access read/write test to a large number of files. To adjust the read vs. write mix and the maximum file size, set -percR and -mfileMB respectively. The files test represents the use of a web server: thousands of files are read and written in random increments as the web site is accessed. The accesses are of fixed size (transfer size) but random in file choice and file location.
-load
-ls
-memStreams <label> <lookahead> <size> <bw> <startTime> <runTime> <mfile>
-r
-rand
-rmw
-rw
-Sadd
-scan <fn>
It is preferred to use this command when scanning using the raw file I/O interface (e.g. /dev/rdisk0). This action can also scan through the standard file interface if the file location supplied is not in /dev. This action is designed to use the raw file I/O interface to perform transactions with equal spacing across an entire disk drive/file. When testing across an entire hard drive, this test provides bandwidth information for the different tracks on the physical disk by using the "sample 0" modifier. This is useful since accesses to inner hard drive tracks tend to have one half the bandwidth of accesses to the outer tracks. Note that this command automatically wires down the memory buffers and the wiring requires root access (see -wire). The easiest way to provide root access is by using the sudo command.
-scanfile <fn>
Scans using the standard file I/O interface, performing the same function as -scanDisk. The only real difference is that the memory is not wired by default when using the -scanfile command. This command behaves identically to -scan if the -wire local parameter is set.
-Scopy
-spinWait
-Sscale
-store
-streams <label> <lookahead> <size> <bw> <startTime> <runTime> <mfile>
Instead of trying to discover the maximum bandwidth capability
of a path to memory or I/O, the stream/HDTV test attempts to hold one or
more streams to a particular bandwidth. In addition, streams can be
dependent upon each other. The goal is to simulate how video streams are
used. For example, the rrw option represents a video stream composed of
combining two video feeds. This means that the two read streams are
consumed to produce the one write stream. The option rwr consists of a
read stream that is then written back with a second read stream reading
the written results. This may happen on a video feed that is being saved
and watched at the same time. All combinations of rw, wr, rwr, and rrw
include dependencies. If the goal is to create streams without
dependencies, then just specify multiple streams using r and w. If it is
not possible to meet the specified bandwidth, then torque reports the
achieved bandwidth.
-Striad
-V <n>
n=1; Read Pixels (W): proc. reads VRAM, writes system memory
n=2; sync image copy (W): Video DMA to system memory
n=3; async image copy (W): Video DMA to system memory
n=4; test 3 with glFlush() (W): Video DMA to sys. memory
n=5; sync PBO copy (W): Video PBO DMA to system memory
n=6; async PBO copy (W): Video PBO DMA to system memory
n=7; async image copy (R): DMA system memory to VRAM
n=8; sync image copy (R): DMA to VRAM (Added glFlush())
n=9; sync image copy (R): DMA to VRAM (Added glFinish())
n=10; async image copy (R): DMA system memory to VRAM
n=11; sync image copy (R): DMA to VRAM (Added glFlush())
n=12; sync image copy (R): DMA to VRAM (Added glFinish())
The first video test consists of the processor reading the VRAM, the next five video tests (2 through 6) involve copying data from the VRAM to system memory, and the seventh through ninth tests copy data from system memory to VRAM. Tests 10 through 12 are identical to tests 7 through 9, but the texture is rotated between three different textures every transaction. This was necessary since the newest video cards are only performing the first transfer on tests 7 through 9 and reporting unbelievable bandwidths. The hope is that the card is now smart enough to buffer the texture, but this has not yet been proven. If tests 7 through 9 report unbelievable bandwidths, use the results from tests 10 through 12 instead. The default transfer size is 1.32 MBytes, but this can be changed using -IH and -IW.
-w
-aliasFile filename
-bo
-c <configuration filename>
Specifies the configuration file for torque; Torque.config is the default. A
configuration file may contain any option or argument that may be entered on
the command line of torque. This is very useful for commands that are
required for every execution of torque, such as -f. Note that command line
arguments are consumed before configuration file commands.
-child n
-example
-f <test file>
-g
Pauses torque using getchar. No testing starts until a key is pressed. This
is useful for finding the PID and attaching shark to the process before
testing begins. See shark documentation for details. A reference to the
shark documentation can be found in the SEE ALSO section.
-h
torque Global Usage/Help
-ha
torque Tests/Actions Usage/Help
-help
torque Usage and Example
-hg
torque Global Usage/Help
-hl
torque Local Usage/Help
-nh
-printAlias
-quit
Quits torque as soon as the command line parameters and configuration file
are parsed. Sometimes it is useful to see how the parameters are parsed
before testing begins. This option allows the checking of parameters without
having to wait for results.
-seed <n>
-slow
-smkey <key>
-sp
Runs torque as a single process. Normally torque forks off one process for
every test and leaves the main process behind to gather results. This option
is useful for a single test where the main process should perform the
testing. For example, it is much easier to use this option for debugging.
-v
-version
-vl <n>
torque supports three levels of verbosity. Using -vl 1 is equivalent to -v,
while using -vl 2 or -vl 3 provides so much detail that in some cases stdout
shows information about every transaction. This level of verbosity tends to
reduce the effectiveness of tests, but provides for detailed debugging. This
should never be used by a typical user.
-affinityNumber <n>
-affinityNumberDiff <n>
-affinityParent <n>
-B <Bytes>
-bp <p>
-bpKey <p>
-bp1 <p>
-bp2 <p>
-bt <p>
-btr
-cfiles
-checkpoint
-dir <path>
-display
-execChild
-extraReadB <Bytes>
-extraReadKB <KBytes>
-extraReadMB <MBytes>
-fbc
torque run may be dependent upon a previous execution. When enabling the
fbc, make sure to execute twice, once to warm up the fbc and once to get
results.
-GB <GBytes>
-i <number of iterations>
-IH <h>
-IW <w>
-KB <KBytes>
-linesize <bytes of cache line>
torque automatically sets the cache line size for the processor under test,
but this parameter can override torque and replace the value with whatever
the user desires. This is really the same as setting -stride.
-m
-MB <MBytes>
-mfile4KB <4 KBytes Chunks>
-mfileGB <GBytes>
-mfileKB <KBytes>
-mfileMB <MBytes>
-mfileXfer <transfers>
torque does not check that the file size is large enough to perform the
test. In the case of writes, accessing an offset greater than the file size
just increases the file size. In the case of reads, an error is returned and
displayed to stdout for every transaction when the offset exceeds the file
size.
-n <file transfers>
-noResetParentAffinity
-noVideoInitLoops
-offsetB <Bytes>
-offsetKB <KBytes>
-offsetMB <MBytes>
-p <process count>
-percR <p>
-s <start with offset into file list>
-sa <interval>
-sample <interval>
torque
automatically reports an aggregate bandwidth for each test stream. In
addition, each stream may take up to 1024 transaction samples, allowing
periodic capture of bandwidth between intervals. Since the time at the
beginning and end of the sample is also given, bandwidth during intervals
can be calculated, but torque currently only calculates the bandwidth during
each sample. This is especially important if a file is not sequentially
located on the hard drive, as the outer edge of the hard disk can be about
twice as fast as the inner edge (-scan). It is also a good way to detect
bursty behavior due to a bad hard drive, choppy file placement, or other
system effects. A sample is time stamped after a specified number of
transactions. Check the number of transfers requested to keep the number of
samples below 1024, as that is the maximum that can be collected. For a test
performing 2048 transfers, the interval must be set to greater than one or
only the first half of the test is sampled. Sometimes it is useful to set
the interval higher, say 255 (2048 transfers implies 8 samples), to reduce
the amount of data reported. Note that torque automatically stops taking
samples after 1024, so any extra are lost. Remember these are samples;
therefore the argument "-sample 3" measures every fourth transaction.
Warning: if the transactions being measured are small and a lot of samples
are taken, sampling can reduce the accuracy of the average bandwidth
measurements performed by torque.
-stride <stride>
torque may not perform as expected due to processor and cache line edge
constraints. Note the default stride is automatically set to a system's
cache line size; therefore the default is dependent upon what system the
test is executed on.
-T usec
-touch
-touch4K
-vectorB <stride>
-vectorKB <stride>
-vectorMB <stride>
-wire
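The sample-count arithmetic described under -sample can be checked before a long run with a small sketch. The Python below is not part of torque, and its function names are hypothetical. It assumes that "-sample k" time stamps every (k + 1)th transaction, which is inferred from the two examples in the text ("-sample 3" measures every fourth transaction; 2048 transfers with an interval of 255 gives 8 samples), and that samples beyond 1024 are dropped.

```python
MAX_SAMPLES = 1024  # per-stream limit stated in the -sample description


def samples_taken(transfers, interval):
    """Estimate collected samples, assuming one sample per (interval + 1) transfers."""
    wanted = transfers // (interval + 1)
    return min(wanted, MAX_SAMPLES)   # anything past the limit is lost


def smallest_safe_interval(transfers):
    """Smallest interval that keeps the sample count below MAX_SAMPLES."""
    interval = 0
    while transfers // (interval + 1) >= MAX_SAMPLES:
        interval += 1
    return interval


print(samples_taken(2048, 255))
print(smallest_safe_interval(2048))
```

Under these assumptions, 2048 transfers with "-sample 255" yield 8 samples, and the smallest interval that keeps 2048 transfers below the limit is 2, matching the "greater than one" guidance in the text.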
This section details how to use torque and understand the returned results.
The goal of torque is to exercise desired portions of the computer system
exactly as specified and report on the results. To provide maximum
flexibility, every operation is detailed through user specified parameters.
To prevent very long command lines, every parameter has a default that may
be overridden by the user. The user parameters can be provided through the
command line or through a specification file. The default specification file
provided with torque is called Torque.config and is read in automatically if
it exists unless overridden with the -c option.
Below is a simple two-process load test that can be run using either of the two equivalent command lines shown. The test output follows the two command lines. Note that this test was executed on a single-processor system, so the total bandwidth is the same (with a small deviation) whether one or two processes are used. With one processor, the two processes of a two-process test take turns running and context switch with each other. Therefore you get the same total bandwidth, but the test takes twice as long to complete.
torque -p 2 -n 8 -i 10 -MB 4 -load
or
torque -p 1 -n 8 -i 10 -MB 4 -load -p 1 -n 8 -i 10 -MB 4 -load
torque, version: 2.0(1014)-17
torque -p 2 -n 8 -i 10 -MB 4 -load
Wed Aug 2 11:02:52 PDT 2006
hw.machine: Power Macintosh
hw.model: PowerBook3,4
Ethernet Address: 00:03:93:c6:73:12
1000 hw.cpufrequency (MHz)
133 hw.busfrequency (MHz)
32 hw.cachelinesize (Bytes)
32 hw.l1icachesize (KByte)
32 hw.l1dcachesize (KByte)
256 hw.l2cachesize (KByte)
1024 hw.memsize (MByte)
1 hw.physicalcpu
1 hw.logicalcpu
18 hw.cputype
11 hw.cpusubtype
torque (time in ms)
We waited for 2 processes
transaction size = 4194304 (4096K), (4M)
configuration file = Torque.config
number of transactions = 8
Largest File Size = 40 MBytes
Bytes Transferred = 320 MBytes/process
Number of processes = 2
Number of iterations = 10
-p 2 -n 8 -i 10 -MB 4 -load
proc,Start,Finish,Diff,Xfers,BW(MB/s),TS(KB),IO/sec,Test,File,PID
0, 1372, 2275, 902, 80, 354.8, 4096, 88, Load, NA, 707
1, 1436, 2326, 890, 80, 359.6, 4096, 89, Load, NA, 708
BW, , , , , , , , , , , Load , , , , , , , , , , , , , , , Total
BW:, , , , , , , , , , , 714, , , , , , , , , , , , , , , 714
714: Total Bandwidth Consumed (MBytes/sec)
There are three stages to torque execution: setup, testing, and reporting.
In the setup stage the system configuration is reported, the test processes
are created, memory (both shared and private) is allocated, and then
everything waits behind a barrier semaphore until all processes are ready to
begin testing. This ensures that all tests start at the same time. The
information printed during setup consists of the torque version number, the
command line used to execute torque, the date, the machine name, machine
model, ethernet address, and relevant machine statistics. The ethernet
address is provided as a way to verify which individual machine the test was
executed on.
During testing there is no information printed to the terminal, as a
printf/cout is very system intensive and may change the measured results.
This means that during testing there is no feedback to let the user know
that everything is progressing properly. When planning to perform long
tests, run shorter versions first to make sure the test is progressing
properly before starting a long test. The section "Measuring Multiple
Simultaneous Tests" on page 28 of the torque documentation (a pointer to the
documentation is located in the SEE ALSO section below) details how testing
is performed to make sure that all tests are executing simultaneously during
the measurement interval. Of course, if a user tries to run more tests than
the machine has resources to support, such as two memory tests on a single
processor as performed above, torque does nothing to prevent it.
The last step of execution is to report the results of testing. This is done in three sections: individual test information, table of bandwidths, and summary/totals. Each group of tests, one group for every action, details statistics on items like transfer size, number of transactions, etc. Appended at the end is the portion of the command line that was relevant to the group of tests described.
The table of bandwidths has a number of columns:
proc: Process/test number
Start: Start time of measurement in milliseconds
Finish: Finish time of measurement in milliseconds
Diff: Measurement duration in milliseconds (Finish - Start)
Xfers: Number of transfers during measurement interval
BW(MB/s): Measured bandwidth in MBytes/second
TS(KB): Transfer size of each transfer
IO/sec: IOmeter-like reporting (best to ignore).
Test: Test type. This may also include a number, such as the display number
for the video tests or the file accessed for the hard drive tests
File: File name if applicable
PID: Process Id for the testing process
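As a cross-check on how the columns relate, the sketch below recomputes the bandwidth of the first row of the sample output from its Xfers, TS(KB), and Diff columns. It is illustrative Python, not part of torque; the helper names are hypothetical, and it assumes 1 MByte = 1024 KBytes and that the reported bandwidth is bytes moved divided by the Diff duration.

```python
COLUMNS = ["proc", "Start", "Finish", "Diff", "Xfers",
           "BW(MB/s)", "TS(KB)", "IO/sec", "Test", "File", "PID"]


def parse_row(line):
    """Split one comma-separated result row into a dict keyed by column name."""
    values = [field.strip() for field in line.split(",")]
    return dict(zip(COLUMNS, values))


def recompute(row):
    """Return (MBytes/sec, transfers/sec) derived from the raw columns."""
    seconds = float(row["Diff"]) / 1000.0                        # Diff is in ms
    megabytes = int(row["Xfers"]) * int(row["TS(KB)"]) / 1024.0  # KBytes to MBytes
    return megabytes / seconds, int(row["Xfers"]) / seconds


row = parse_row("0, 1372, 2275, 902, 80, 354.8, 4096, 88, Load, NA, 707")
bw, transfers_per_sec = recompute(row)
print(f"{bw:.1f} MB/s, {transfers_per_sec:.1f} transfers/s")
```

For this row, 80 transfers of 4096 KBytes in 902 ms works out to about 354.8 MBytes/sec, which agrees with the BW(MB/s) column.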
Though this is a great way to catalog the results for one execution, it can
be very hard to combine into a table of multiple executions. There are also
caveats, such as a bcopy that performs 1 MByte/sec of bcopy but actually
results in 2 MBytes/sec of system bandwidth (each byte is both read and
written). A comma separated list of individual system bandwidths for each
test is included to make it easy to combine multiple executions of torque in
a single spreadsheet. Only the names of tests that were executed are
included in the comma separated list to keep the list from getting too long.
Lastly, torque reports the total system bandwidth consumed. An easy way to
extract just the comma separated bandwidths is to redirect multiple test
outputs to a file and then use grep to grab the bandwidth results line. You
may have to add "-a" to grep, since some commands like "date" sometimes
produce output that makes grep think the output file is binary.
grep -a BW: filename
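When the extracted lines are headed for a spreadsheet, the same filtering can be sketched in Python. This is an illustrative helper, not part of torque; the function names are hypothetical. It assumes the bandwidth results line begins with "BW:" as in the sample output, and it decodes the file leniently so that stray non-text bytes do not stop the scan (the same problem "-a" solves for grep). The sample string is an abbreviated form of the two BW lines shown earlier.

```python
def bandwidth_rows(text):
    """Return the non-empty comma-separated fields of every 'BW:' line in text."""
    rows = []
    for line in text.splitlines():
        if line.startswith("BW:"):
            fields = [field.strip() for field in line.split(",")]
            rows.append([field for field in fields[1:] if field])
    return rows


def read_output(path):
    """Read a saved output file, replacing any bytes that are not valid text."""
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8", errors="replace")


sample = "BW, , , Load , , , Total\nBW:, , , 714, , , 714\n"
print(bandwidth_rows(sample))
```

Each returned row holds one execution's per-test bandwidths followed by the total, so rows from several runs can be written out one per line.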
Please send your comments, suggestions and bug reports to: perftools-feedback@group.apple.com
/Developer/ADC Reference Library/documentation/CHUD and /Developer/ADC Reference Library/documentation/CHUD/TorqueUserGuide.pdf
February 21, 2008 |