If you have a multi-core desktop machine, access to a server with
multiple CPUs or perhaps even a supercomputer, this note may be of
interest to you in your Chandra analysis. If not, feel free to hit
DELETE now.
In a previous note and paper
http://asc.harvard.edu/chandra-users/0428.html
http://arxiv.org/abs/astro-ph/0510688
we described how PVM has been used on a network of workstations to
parallelize X-ray modeling and analysis in ISIS. While successful,
to date that approach requires a fair amount of coding on the part
of the end user. It also leaves unexplored the fact that more and
more desktop machines today contain multiple CPUs or cores, and that
many data centers operate multi-CPU servers or Beowulf clusters,
while very few widely used astronomy software packages exploit these
extra processors for common analysis tasks.
We are therefore pleased to announce a new method by which users
may, with relative ease, exploit the emerging multicore reality
for S-Lang-based modeling and analysis in ISIS. This capability
comes from the -openmp switch in version 1.9.2 of the SLIRP code
generator, now available at
http://space.mit.edu/cxc/slirp
For example, consider
unix% slirp -openmp par.h
where par.h contains
double cos(double x);
double sin(double x);
double log(double x);
double atof(const char *s);
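Conceptually, the -openmp switch wraps each call in a compiled C loop
that an OpenMP-aware compiler can split across threads. A simplified
sketch of such a wrapper for cos() (for illustration only; this is
not SLIRP's literal output) is:

/* Illustrative OpenMP-vectorized wrapper, not SLIRP's actual code.
   The pragma asks the compiler to divide the loop iterations among
   the threads requested via OMP_NUM_THREADS. */
#include <math.h>

void vec_cos(const double *in, double *out, int n)
{
   int i;
#pragma omp parallel for
   for (i = 0; i < n; i++)
      out[i] = cos(in[i]);
}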
The generated module will be vectorized for parallelization by
OpenMP-aware compilers such as Intel 9.1, Sun Studio 9 or later,
and prerelease versions of GCC 4.2 or 4.3. Here's a sample run
of ISIS utilizing 4 of 8 750 MHz CPUs on a Solaris 5.9 machine:
tdc@cfa% export OMP_NUM_THREADS=4; isis
isis> x = [-PI*100000: PI*100000: .05]
isis> tic; c = cos(x); toc
6.43994
isis> import("par")
This import replaces the built-in cos(), sin(), etc. functions
with parallelized versions. Because the names are unchanged, the body
of an analysis script stays the same regardless of whether it executes
in a sequential or parallel context. Repeating the test
from above
isis> tic; pc = cos(x); toc
1.62847
reveals that on these CPUs our parallelized cos() is nearly 4X faster
for the given array size, while yielding the same numerical result:
isis> length( where(c != pc) )
0
The benefits extend to more involved expressions, too, such as
isis> tic; () = sin(x) + cos(x); toc
14.0862
isis> import("par")
isis> tic; () = sin(x) + cos(x); toc
4.13241
Similar constructs are relatively common in analysis scripts, such
as ISIS models implemented in pure S-Lang, where the core of the
model may be computed scores, hundreds, or even thousands of times
(e.g. during confidence analysis). Speedups like those shown above
would accumulate into significant differences over the life of such
a long-running computation. A more detailed and topical example,
taken directly from experience at our institute, is appended below.
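To make the model scenario concrete before we get there, here is a
minimal sketch of such a pure S-Lang ISIS model, whose core is exactly
the expression timed above (the model name "toy" and its single
normalization parameter are illustrative):

define toy_fit(lo, hi, par)   % bin-edge grids and parameter array
{
   variable x = 0.5 * (lo + hi);        % bin midpoints
   return par[0] * (sin(x) + cos(x));   % threaded after import("par")
}
add_slang_function("toy", ["norm"]);

Every evaluation of the model during a fit or confidence run would
then reap the parallel speedup automatically.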
What makes this new capability exciting is not that it guarantees
some amazing factor-of-N speedup on N CPUs, because it doesn't.
Speedup is highly dependent upon the structure and size of the
problem, as well as the speed of the CPUs utilized; optimal (i.e.
linear) speedups are not the norm.
Rather, the importance is in how little the end-user needs to do,
in terms of learning about threading or other forms of parallel
programming, or rewriting algorithms and scripts, in order to gain
at least *some* speedup. Plots of speedup as a function of array
size for several arithmetic operations, on the two machines used in
the examples here, are given in
http://space.mit.edu/cxc/slirp/multicore.pdf
They suggest relatively small inflection points -- where arrays are
big enough to gain at least some speedup from threading to multiple
CPUs -- and that faster processors tend to require larger arrays.
The presentation also discusses several limitations in the approach.
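One way to locate that inflection point on your own hardware is to
time a single operation over a range of array sizes, once in a plain
session and once after import("par"). A rough sketch (the sizes are
arbitrary; toc returns the elapsed seconds it prints at the prompt):

variable n, x, t;
foreach n ([1e5, 1e6, 1e7])       % array sizes to probe
{
   x = [1:n];                     % test array of length n
   tic; () = cos(x); t = toc;     % time one evaluation
   vmessage("n=%g  time=%g secs", n, t);
}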
Although you can use the parallelization features of SLIRP right now,
we're in the process of developing a module much like that outlined
above. The aim is to make it possible for users to parallelize their
analysis for multicore use simply by adding something like
require("parallel")
to the beginning of their scripts. Please contact me if you think
you might benefit from this work, or have any thoughts, criticisms,
or even offers of help!
Regards,
Michael S. Noble
----------------------------------------------------------------------
Vector Parallelization Example #2
Consider a volume of 320*320*320 real-valued voxels representing
Doppler velocity mappings of silicon II infrared emission observed
with the Spitzer IRS. The data was stored in ASCII format and was
given to me by a colleague so we could view it in the volview() 3D
visualizer (space.mit.edu/cxc/software/slang/modules/volview).
Since I/O on 130 MB ASCII datasets can be cumbersome and slow, and
because we would undoubtedly be repeating the visualization task a
number of times, I first converted the volume to the high-performance
HDF5 binary format. This involved some 320^3 calls to the atof()
function, which converts string data to double.
Now, unlike the trig functions used above, the atof() function in
S-Lang is not vectorized; the only way it can be used to convert an
array of strings is by looping over each element, and normally the
best way to do that in S-Lang is with array_map(). For example,
with a fake and much smaller 3D volume (100^3 voxels)
linux% isis
isis> avol = array_map(String_Type, &sprintf, "%d", [1:100*100*100])
the time to convert it to Double_Type using array_map() is
isis> tic; dvol = array_map(Double_Type, &atof, avol); toc
13.7544
This was executed on my dual 1.8 GHz Athlon desktop, running Debian
3.1 with 2 GB of RAM. Importing our vector-parallel module as above
and repeating the test
isis> tic; pdvol = atof(avol); toc
0.144219
shows an astounding speedup of 95X, and once again the check
isis> length(where(dvol != pdvol))
0
yields the same result. The reason for this vastly superlinear speedup
is that, in addition to utilizing both CPUs on my desktop, the SLIRP
version of atof() is vectorized to operate directly upon arrays, at the
speed of compiled C. All OpenMP-enabled wrappers generated by SLIRP are
vectorized in this manner. Even without multiple CPUs the vectorized
atof() is considerably faster than the non-vectorized version in S-Lang.
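Putting the pieces together, the entire ASCII-to-HDF5 conversion
described above reduces to just a few lines. (A sketch only: the
input file name and its one-value-per-line layout are assumptions,
as is h5_write(), a writing companion to the h5_read() call below.)

variable fp = fopen("si2vel.dat", "r");
variable lines = fgetslines(fp);   % read ASCII values into a string array
() = fclose(fp);
variable vol = atof(lines);        % vectorized, threaded string-to-double
reshape(vol, [320, 320, 320]);     % restore the cube dimensions
h5_write("si2vel.h5", vol);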
Finally, suppose you wanted to log-scale the Si II Doppler velocities:
isis> si2 = h5_read("si2vel.h5")
isis> tic; () = log(si2); toc
3.82157
isis> import("par")
isis> tic; () = log(si2); toc
2.09266
Here the gain is a more modest 1.8X on the two CPUs since, unlike
atof(), the log() function in S-Lang is already vectorized; the
speedup comes purely from threading.